author     Yann Herklotz <git@yannherklotz.com>    2021-04-05 16:22:55 +0100
committer  Yann Herklotz <git@yannherklotz.com>    2021-04-05 16:23:04 +0100
commit     3c232b2279b5c71ef8aa5a88987d5f2ac9d88016 (patch)
tree       4f6cd8668b8022f28286b553a264ded653e289ac
parent     6a7ebeb13cd708e5b9e2a09400255ccbd3f0242a (diff)
Make the paper fit again
-rw-r--r--  conclusion.tex    8
-rw-r--r--  conference.bib   17
-rw-r--r--  eval.tex         15
-rw-r--r--  intro.tex         1
-rw-r--r--  main.tex          1
-rw-r--r--  method.tex       11
6 files changed, 32 insertions, 21 deletions
diff --git a/conclusion.tex b/conclusion.tex
index c5f1f6a..22aef6c 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -1,17 +1,15 @@
\section{Conclusion}
-We have shown how an existing fuzzing tool can be modified so that its output is suitable for HLS, and then used it in a campaign to test the reliability of four modern HLS tools. In total, we found at least \numuniquebugs{} unique bugs across all the tools, including both crashes and miscompilations.
-Further work could be done on supporting more HLS tools, especially those that claim to prove that their output is correct before terminating, such as Catapult-C~\cite{mentor20_catap_high_level_synth}. % This could give an indication of how effective these proofs are, and how often they are actually able to complete their equivalence proofs during compilation in a feasible timescale.
-Conventional compilers have become quite resilient to fuzzing over the last decade, so recent work on fuzzing compilers has had to employ increasingly imaginative techniques to keep finding new bugs~\cite{karine+20}. In contrast, we have found that HLS tools -- at least, as they currently stand -- can be made to exhibit bugs even using the relatively basic fuzzing techniques that we employed in this project.
+We have shown how an existing fuzzing tool can be modified so that its output is suitable for HLS, and then used it in a campaign to test the reliability of four modern HLS tools. In total, we found at least \numuniquebugs{} unique bugs across all the tools, including both crashes and miscompilations.
+Further work could be done on supporting more HLS tools, especially those that claim to prove that their output is correct before terminating, such as Catapult C~\cite{mentor20_catap_high_level_synth}. % This could give an indication of how effective these proofs are, and how often they are actually able to complete their equivalence proofs during compilation in a feasible timescale.
-As HLS is becoming increasingly relied upon, it is important to make sure that HLS tools are also reliable. We hope that this work further motivates the need for rigorous engineering of HLS tools, whether that is by validating that each output the tool produces is correct or by proving the HLS tool itself correct once and for all.
+Conventional compilers have become quite resilient to fuzzing over the last decade, so recent work on fuzzing compilers has had to employ increasingly imaginative techniques to keep finding new bugs~\cite{karine+20}. In contrast, we have found that HLS tools can be made to exhibit bugs even using the relatively basic fuzzing techniques employed in this project. We hope that this work further motivates the need for rigorous engineering of HLS tools, whether that is by validating that each output the tool produces is correct or by proving the HLS tool itself correct once and for all.
\section*{Acknowledgements}
We thank Alastair F. Donaldson for helpful feedback.
We acknowledge financial support from the Research Institute on Verified Trustworthy Software Systems (VeTSS), which is funded by the UK National Cyber Security Centre (NCSC).
-
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
diff --git a/conference.bib b/conference.bib
index fd9d0ad..f5a0382 100644
--- a/conference.bib
+++ b/conference.bib
@@ -268,3 +268,20 @@ location = {undefined},
school = {University of Toronto},
year = {2015}
}
+
+@inproceedings{chen16_empir_compar_compil_testin_techn,
+ author = {Chen, Junjie and Hu, Wenxiang and Hao, Dan and Xiong, Yingfei and Zhang, Hongyu and Zhang, Lu and Xie, Bing},
+ title = {An Empirical Comparison of Compiler Testing Techniques},
+ year = {2016},
+ isbn = {9781450339001},
+ publisher = {Association for Computing Machinery},
+ address = {New York, NY, USA},
+ url = {https://doi.org/10.1145/2884781.2884878},
+ doi = {10.1145/2884781.2884878},
+ abstract = {Compilers, as one of the most important infrastructure of today's digital world, are expected to be trustworthy. Different testing techniques are developed for testing compilers automatically. However, it is unknown so far how these testing techniques compared to each other in terms of testing effectiveness: how many bugs a testing technique can find within a time limit. In this paper, we conduct a systematic and comprehensive empirical comparison of three compiler testing techniques, namely, Randomized Differential Testing (RDT), a variant of RDT---Different Optimization Levels (DOL), and Equivalence Modulo Inputs (EMI). Our results show that DOL is more effective at detecting bugs related to optimization, whereas RDT is more effective at detecting other types of bugs, and the three techniques can complement each other to a certain degree. Furthermore, in order to understand why their effectiveness differs, we investigate three factors that influence the effectiveness of compiler testing, namely, efficiency, strength of test oracles, and effectiveness of generated test programs. The results indicate that all the three factors are statistically significant, and efficiency has the most significant impact.},
+ booktitle = {Proceedings of the 38th International Conference on Software Engineering},
+ pages = {180–190},
+ numpages = {11},
+ location = {Austin, Texas},
+ series = {ICSE '16}
+}
diff --git a/eval.tex b/eval.tex
index a1b9b28..0be2416 100644
--- a/eval.tex
+++ b/eval.tex
@@ -3,7 +3,7 @@
We generate \totaltestcases{} test-cases and provide them to four HLS tools: Vivado HLS, LegUp HLS, Intel i++, and Bambu.
We use the same test-cases across all tools for fair comparison (except the HLS directives, which have tool-specific syntax).
We were able to test three different versions of Vivado HLS (v2018.3, v2019.1 and v2019.2).
-We tested one version of Intel i++ (included in Quartus Lite 18.1), LegUp (4.0) and Bambu (v0.9.7). We tested any reduced LegUp test-cases on LegUp 9.2 before reporting them.
+We tested one version of Intel i++ (included in Quartus Lite 18.1), one version of LegUp (4.0), and two versions of Bambu (v0.9.7 and v0.9.7-dev).
% Three different tools were tested, including three different versions of Vivado HLS. We were only able to test one version of LegUp HLS (version 4.0), because although LegUp 7.5 is available, it is GUI-based and not amenable to scripting. However, bugs we found in LegUp 4.0 were reproduced manually in LegUp 7.5.
% LegUp and Vivado HLS were run under Linux, while the Intel HLS Compiler was run under Windows.
@@ -18,7 +18,7 @@ We tested one version of Intel i++ (included in Quartus Lite 18.1), LegUp (4.0)
\definecolor{timeout}{HTML}{ef4c4c}
\begin{figure}
\centering
- \begin{tikzpicture}[scale=0.61]
+ \begin{tikzpicture}[scale=0.61,yscale=0.9]
\draw (-7.2,7.0) rectangle (7.2,0.7);
\fill[vivado,fill opacity=0.5] (0.9,4.4) ellipse (3.3 and 1.5);
\fill[intel,fill opacity=0.5] (-4.5,4.8) ellipse (2.0 and 1.3);
@@ -51,7 +51,7 @@ We tested one version of Intel i++ (included in Quartus Lite 18.1), LegUp (4.0)
\end{figure}
Figure~\ref{fig:existing_tools} shows an Euler diagram of our results.
-We see that 918 (13.7\%), 167 (2.5\%), 83 (1.2\%) and 26 (0.4\%) test-cases fail in Bambu, LegUp, Vivado HLS and Intel i++ respectively. One of the bugs we reported to the Bambu developers was fixed during our testing campaign, so we also tested the development branch of Bambu (0.9.7-dev) with the bug fix, and found only 17 (0.25\%) failing test-cases.
+We see that 918 (13.7\%), 167 (2.5\%), 83 (1.2\%) and 26 (0.4\%) test-cases fail in Bambu, LegUp, Vivado HLS and Intel i++, respectively. The bugs we reported to the Bambu developers were fixed during our testing campaign, so we also tested the development branch of Bambu (0.9.7-dev) with the bug fixes, and found that only 17 (0.25\%) failing test-cases remained.
Although i++ has a low failure rate, it has the highest time-out rate (540 test-cases) due to its remarkably long compilation time. No other tool had more than 20 time-outs.
Note that the absolute numbers here do not necessarily correspond to the number of bugs in the tools, because a single bug in a language feature that appears frequently in our test suite could cause many failures.
Moreover, we are reluctant to draw conclusions about the relative reliability of each tool by comparing the number of failures, because these numbers are so sensitive to the parameters of the randomly generated test suite we used. In other words, we can confirm the \emph{presence} of bugs, but cannot deduce the \emph{number} of them (nor their importance).
@@ -107,7 +107,6 @@ int main() {
\begin{minted}{c}
static int b = 0x10000;
static volatile short a = 0;
-
int main() {
a++;
b = (b >> 8) & 0x100;
@@ -165,7 +164,7 @@ int main() {
LegUp HLS & miscompile & online* & reported \\
Intel i++ & miscompile & Fig.~\ref{fig:eval:intel:mismatch} & reported \\
Bambu HLS & miscompile & Fig.~\ref{fig:eval:bambu:mismatch} & reported, confirmed, fixed \\
- Bambu HLS & miscompile & online* & reported, confirmed \\
+ Bambu HLS & miscompile & online* & reported, confirmed, fixed \\
\bottomrule
\end{tabular} \\
\vphantom{\large A}*See \url{https://ymherklotz.github.io/fuzzing-hls/} for detailed bug reports
@@ -183,7 +182,7 @@ int main() {
\definecolor{ribbon6}{HTML}{fdb462}
\begin{figure}
\centering
- \begin{tikzpicture}[xscale=1.25]
+ \begin{tikzpicture}[xscale=1.25,yscale=0.85]
\draw[white, fill=ribbon1] (-1.0,4.1) -- (0.0,4.1) to [out=0,in=180] (2.0,4.1) to [out=0,in=180] (4.0,4.1) -- (6.0,4.1) -- %(7.55,3.325) --
(6.0,2.55) -- (4.0,2.55) to [out=180,in=0] (2.0,2.55) to [out=180,in=0] (0.0,2.55) -- (-1.0,2.55) -- cycle;
\draw[white, fill=ribbon2] (-1.0,2.55) -- (0.0,2.55) to [out=0,in=180] (1.8,1.8) -- (2.2,1.8) to [out=0,in=180] (4.0,1.55) -- (6.0,1.55) -- %(7.3,0.9) --
@@ -220,9 +219,9 @@ int main() {
Besides studying the reliability of different HLS tools, we also studied the reliability of Vivado HLS over time. Figure~\ref{fig:sankey_diagram} shows the results of giving \vivadotestcases{} test-cases to Vivado HLS v2018.3, v2019.1 and v2019.2.
Test-cases that pass and fail in the same tools are grouped together into a ribbon.
-For instance, the topmost ribbon represents the 31 test-cases that fail in all three versions of Vivado HLS. Other ribbons can be seen weaving in and out; these indicate that bugs were fixed or reintroduced in the various versions. We see that Vivado HLS v2018.3 had the most test-case failures (62).
+For instance, the topmost ribbon represents the 31 test-cases that fail in all three versions of Vivado HLS. Other ribbons can be seen weaving in and out; these indicate that bugs were fixed or reintroduced in the various versions.
Interestingly, the blue ribbon shows that there are test-cases that fail in v2018.3, pass in v2019.1, and then fail again in v2019.2!
-As in our Euler diagram, the numbers do not necessary correspond to the number of actual bugs, though we can observe that there must be at least six unique bugs in Vivado HLS, given that each ribbon corresponds to at least one unique bug.
+As in our Euler diagram, the numbers do not necessarily correspond to the number of actual bugs, though we can observe that there must be at least six unique bugs in Vivado HLS, given that each ribbon corresponds to at least one unique bug. This method of identifying unique bugs is similar to the ``correcting commits'' metric introduced by Chen et al.~\cite{chen16_empir_compar_compil_testin_techn}.
%\AD{This reminds me of the correcting commits metric from Junjie Chen et al.'s empirical study on compiler testing. https://xiongyingfei.github.io/papers/ICSE16.pdf. Could be worth making the connection. }
diff --git a/intro.tex b/intro.tex
index 0ceab4e..246c5d5 100644
--- a/intro.tex
+++ b/intro.tex
@@ -53,6 +53,7 @@ int main() {
The example above demonstrates the effectiveness of fuzzing. It seems unlikely that a human-written test-suite would discover this particular bug, given that it requires several components all to coincide before the bug is revealed. If the loop is unrolled, or the seemingly random value of \code{b} is simplified, or the array is declared with fewer than six elements (even though only two are accessed), then the bug goes away.
Yet this example also raises the question: do bugs found by fuzzers really \emph{matter}, given that they are usually found by combining language features in ways that are vanishingly unlikely to happen `in the real world'~\cite{marcozzi+19}? This question is especially pertinent for our particular context of HLS tools, which are well-known to have restrictions on the language features they support. Nevertheless, although the \emph{test-cases} we generated do not resemble the programs that humans write, the \emph{bugs} that we exposed using those test-cases are real, and \emph{could also be exposed by realistic programs}.
+
%Moreover, it is worth noting that HLS tools are not exclusively provided with human-written programs to compile: they are often fed programs that have been automatically generated by another compiler.
Ultimately, we believe that any errors in an HLS tool are worth identifying because they have the potential to cause problems -- either now or in the future. And problems caused by HLS tools going wrong (or indeed any sort of compiler for that matter) are particularly egregious, because it is so difficult for end-users to identify whether the fault lies with their design or the HLS tool.
diff --git a/main.tex b/main.tex
index f5bc5c5..5732d81 100644
--- a/main.tex
+++ b/main.tex
@@ -61,7 +61,6 @@ As such, HLS tools are increasingly relied upon. But are they trustworthy?
We have subjected four widely used HLS tools -- LegUp, Xilinx Vivado HLS, the Intel HLS Compiler and Bambu -- to a rigorous fuzzing campaign using thousands of random, valid C programs that we generated using a modified version of the Csmith tool. For each C program, we compiled it to a hardware design using the HLS tool under test and checked whether that hardware design generates the same output as an executable generated by the GCC compiler. When discrepancies arose between GCC and the HLS tool under test, we reduced the C program to a minimal example in order to zero in on the potential bug. Our testing campaign has revealed that all four HLS tools can be made to generate wrong designs from valid C programs and one tool could be made to crash; this underlines the need for these increasingly trusted tools to be more rigorously engineered.
Out of \totaltestcases{} test-cases, we found \totaltestcasefailures{} programs that caused at least one tool to fail, out of which we were able to discern at least \numuniquebugs{} unique bugs.
-
\end{abstract}
\input{intro}
diff --git a/method.tex b/method.tex
index d5fb319..b37a26a 100644
--- a/method.tex
+++ b/method.tex
@@ -124,7 +124,7 @@ To prepare the programs generated by Csmith for HLS testing, we modify them in t
%rstrst\AD{Did any reduced test-case involve these HLS-specific features?} \JW{The LegUp bug in Figure 4 requires NO\_INLINE -- does that count? If so, perhaps we could append to the Figure 4 caption: `thus vindicating our strategy of adding random HLS directives to our test-cases'.}
-The second modification has to do with the top-level function. Each program generated by Csmith ends its execution by printing a hash of all its variables' values, hoping that miscompilations will be exposed through this hash value. Csmith's built-in hash function leads to infeasibly long synthesis times, so we replace it with a simple XOR-based one.
+The second modification has to do with the top-level function. Each program generated by Csmith ends its execution by printing a hash of all its variables' values, hoping that miscompilations will be exposed through this hash value. Csmith's built-in hash function leads to infeasibly long simulation times, so we replace it with a simple XOR-based one.
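As an illustration of the kind of lightweight checksum meant here, the sketch below folds each checked value into a running XOR-and-rotate accumulator before printing it. It is only a sketch under stated assumptions: the names \code{hash\_state} and \code{hash\_value} are hypothetical rather than taken from Csmith or from the scripts used in this work.
\begin{minted}{c}
/* Sketch of a simple XOR-based hash; hash_state and hash_value are
   illustrative names only. */
#include <stdio.h>
#include <stdint.h>

static uint32_t hash_state = 0;

/* Fold one value into the running hash: XOR it in, then rotate left
   by one bit so that repeated values still perturb the state. */
static void hash_value(uint32_t val)
{
    hash_state ^= val;
    hash_state = (hash_state << 1) | (hash_state >> 31);
}

int main(void)
{
    uint32_t a = 3, b = 0x10000;
    hash_value(a);
    hash_value(b);
    /* The final checksum is the single value compared between the
       GCC-compiled executable and the RTL simulation. */
    printf("checksum = %x\n", hash_state);
    return 0;
}
\end{minted}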
Finally, we generate a synthesisable testbench that executes the main function of the original C program, and a tool-specific script that instructs the HLS tool to create a design project and then build and simulate the design.
@@ -181,9 +181,8 @@ Finally, we generate a synthesisable testbench that executes the main function o
%Figure~\ref{fig:method:toolflow} shows the three stages of testing, depicted as the testing environment in the dashed area.
For each HLS tool in turn, we compile the C program to RTL and then simulate the RTL.
We also compile the C program using GCC and execute it.
-Although each HLS tool has its own built-in C compiler that could be used to obtain the reference output, we prefer to obtain the reference output ourselves in order to rely less on the tool being tested.
-
-To ensure that our testing scales to large programs, we enforce several time-outs: we set a 5-minute time-out for C execution and a 2-hour time-out for C-to-RTL synthesis and RTL simulation. Time-outs are not counted as bugs.
+%Although each HLS tool has its own built-in C compiler that could be used to obtain the reference output, we prefer to obtain the reference output ourselves in order to rely less on the tool being tested.
+To ensure that our testing scales to large programs, we enforce several time-outs: we set a 5-minute time-out for C execution and a 2-hour time-out for C-to-RTL synthesis and RTL simulation.
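The following is a minimal sketch of such a differential-testing driver with time-outs, assuming the coreutils \code{timeout} utility and a hypothetical tool-specific \code{run\_hls.sh} script that synthesises and simulates the design; it is not the harness actually used in this work.
\begin{minted}{c}
/* Sketch of a differential test with time-outs; run_hls.sh and the
   file names are placeholders. */
#include <stdio.h>
#include <stdlib.h>

/* Run a shell command under a time limit (in seconds) using the
   coreutils `timeout` wrapper and return its exit status. */
static int run_with_timeout(const char *cmd, int seconds)
{
    char buf[512];
    snprintf(buf, sizeof buf, "timeout %d %s", seconds, cmd);
    return system(buf);
}

int main(void)
{
    /* Reference output: compile the test-case with GCC and execute it
       under the 5-minute time-out. */
    if (system("gcc test.c -o test_ref") != 0)
        return 1;
    if (run_with_timeout("./test_ref > ref.out", 300) != 0)
        return 1;

    /* HLS flow: synthesise the same C to RTL and simulate it, under
       the 2-hour time-out. */
    if (run_with_timeout("./run_hls.sh test.c > hls.out", 7200) != 0)
        return 1;

    /* A difference between the two outputs marks the test-case as a
       candidate miscompilation to be passed on to reduction. */
    return system("diff -q ref.out hls.out") == 0 ? 0 : 2;
}
\end{minted}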
%% JW: This paragraph is not really needed because we repeat the sentiment in the next subsection anyway.
%There two types of bugs that we can encounter in this testing setup: programs that cause the HLS tool to crash during compilation (e.g. an unhandled assertion violation or a segmentation fault), and programs where the software execution and the RTL simulation do not return the same value. Programs that cause either type of error are given to the reduction stage, which aims to minimise the programs and (hopefully) identify the root cause(s).
@@ -200,7 +199,7 @@ To ensure that our testing scales to large programs, we enforce several time-out
\subsection{Reducing buggy programs}\label{sec:method:reduce}
-Once we discover a program that crashes the HLS tool or whose C/RTL simulations do not match, we systematically reduce it to its minimal form using the \creduce{} tool~\cite{creduce}, in the hope of identifying the root cause. This is done by successively removing or simplifying parts of the program, checking that the bug remains at each step.
+Once we discover a program that crashes the HLS tool or whose C/RTL simulations do not match, we systematically reduce it to its minimal form using the \creduce{} tool~\cite{creduce}, in the hope of identifying the root cause. This is done by successively removing or simplifying parts of the program, checking that the bug remains at each step. We also check at each stage of the reduction process that the reduced program remains within the subset of the language that is supported by the HLS tools; without this check, we found that \creduce{} kept zeroing in on programs that were outside this subset and hence did not represent real bugs.
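As a rough sketch of what such a reduction check can look like (C-Reduce treats an exit status of 0 as ``keep this candidate''), the program below strings the checks together. The helpers \code{check\_subset.sh} and \code{diff\_test.sh}, standing for the subset check and the C/RTL comparison, are hypothetical placeholders rather than scripts from this project.
\begin{minted}{c}
/* Sketch of a C-Reduce interestingness test; check_subset.sh and
   diff_test.sh are illustrative placeholders. */
#include <stdlib.h>

int main(void)
{
    /* Discard candidates that are no longer valid C at all. */
    if (system("gcc -c candidate.c -o /dev/null") != 0)
        return 1;

    /* Discard candidates that drift outside the C subset supported by
       the HLS tools, so the reducer cannot converge on programs that
       are legal C but not synthesisable. */
    if (system("./check_subset.sh candidate.c") != 0)
        return 1;

    /* Keep the candidate only if the crash or C/RTL mismatch is still
       observed; diff_test.sh is assumed to exit non-zero while the
       bug is still present. */
    return system("./diff_test.sh candidate.c") != 0 ? 0 : 1;
}
\end{minted}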
%\YH{Edit this section some more}
@@ -214,8 +213,6 @@ Once we discover a program that crashes the HLS tool or whose C/RTL simulations
%\creduce{} is an existing reducer for C and C++ and runs the reduction steps in parallel to converge as quickly as possible. It is effective because it reduces the input while preserving semantic validity and avoiding undefined behaviour.
%It has various reduction strategies, such as delta debugging passes and function inlining, that help it converge rapidly to a test-case that is small enough to understand and step through.
-We also check at each stage of the reduction process that the reduced program remains within the subset of the language that is supported by the HLS tools; without this check, we found that \creduce{} kept zeroing in on programs that were outside this subset and hence did not represent real bugs.
-
%However, the downside of using \creduce{} with HLS tools is that we are not in control of which lines and features are prioritised for removal.
%As a consequence, we can easily end up with \creduce{} producing programs that are not synthesisable, despite being valid C.
%This is because the semantics of C for HLS tools might be a bit different to the semantics for C of a standard C compiler that \creduce{} was built for.