High-level synthesis (HLS) – the automatic compilation of a software program into a custom hardware design – is an increasingly important technology. It’s attractive to software engineers because it lets them harness the computational power and energy-efficiency of custom hardware devices such as FPGAs. It’s also attractive to hardware designers because it allows them to enter their designs at a higher level of abstraction than Verilog or VHDL.
As such, high-level synthesis is being increasingly relied upon. But is it reliable? Yann Herklotz, Zewei Du, Nadesh Ramanathan and I have a paper next week at FCCM 2021 about our attempts to find out.
We follow a fairly well-trodden path. Several projects over the last decade or so have aimed to test compilers by fuzzing – that is, by generating random source files, feeding each one into the compiler-under-test, and checking whether it is compiled correctly. But those projects have focused on conventional compilers that generate assembly code for processors. In our work, we apply a similar technique to HLS tools.
The ideas can be carried over without many changes – after all, HLS tools still take C programs as input, just like conventional compilers. We changed a few things: we restricted the random C generator to account for the fact that HLS tools don’t support the full C language, we added random HLS-specific directives (e.g. “pipeline this loop”) to the generated programs, and we tweaked the probabilities of the random generator to avoid generating programs that made the synthesis process unbearably long. To check that the HLS tool had compiled each random C program correctly, we executed its output using a Verilog simulator and compared the result against the output of an executable that we compiled using GCC.
We generated 6700 random C programs and fed them to four widely used HLS tools: LegUp, Xilinx Vivado HLS, the Intel HLS Compiler, and Bambu. We found that all four tools could be made either to crash or to generate wrong code. In total, 1191 of our generated programs failed in at least one of the four tools, as shown in the picture below.
We performed test-case reduction on some of the failures, which identified 8 unique bugs. Several of these have been confirmed by the HLS tool developers, and one (shown in the code snippet below) has already been fixed.
Finally, we compared how many bugs we could find in a few different versions of the same HLS tool. The diagram below shows the results of that experiment on three versions of Xilinx Vivado HLS. We see that the 2018.3 version of the tool has the most failures, though it is also striking that there are some test-cases that only fail in the most recent version.
In conclusion: this project has shown that compiler fuzzing techniques are effective at uncovering bugs in HLS tools. With HLS tools being increasingly relied upon, we hope that our work demonstrates the need for these tools to be engineered more rigorously.