author     Yann Herklotz <git@yannherklotz.com>   2021-01-18 16:18:54 +0000
committer  Yann Herklotz <git@yannherklotz.com>   2021-01-18 16:18:54 +0000
commit     24d259e150e844ec842a6df77c4b7f3a9ec9bfa0 (patch)
tree       2d07539a5d78abbf9059f9462c28d46fa8228314
parent     dc0eb6c626a4068a9d28da5caafce7c39c3fd6ea (diff)
Reduction in sections
-rw-r--r--   eval.tex      16
-rw-r--r--   intro.tex      8
-rw-r--r--   main.tex       4
-rw-r--r--   related.tex    2
4 files changed, 16 insertions, 14 deletions
diff --git a/eval.tex b/eval.tex
index 661ae74..46b58a0 100644
--- a/eval.tex
+++ b/eval.tex
@@ -1,9 +1,9 @@
\section{Evaluation}\label{sec:evaluation}
-We generate \totaltestcases{} test-cases and provide them to three HLS tools: Vivado HLS, LegUp HLS and Intel i++.
+We generate \totaltestcases{} test-cases and provide them to four HLS tools: Vivado HLS, LegUp HLS, Intel i++ and Bambu.
We use the same test-cases across all tools for fair comparison (except the HLS directives, which have tool-specific syntax).
We were able to test three different versions of Vivado HLS (v2018.3, v2019.1 and v2019.2).
-We tested one version of Intel i++ (version 18.1), and one version of LegUp (4.0).
+We tested one version each of Intel i++ (v18.1), LegUp (v4.0) and Bambu (v0.9.7).
LegUp 7.5 is GUI-based and therefore we could not script our tests.
However, we were able to manually reproduce all the bugs found in LegUp 4.0 in LegUp 7.5.
@@ -49,7 +49,7 @@ However, we were able to manually reproduce all the bugs found in LegUp 4.0 in L
\end{figure}
Figure~\ref{fig:existing_tools} shows a Venn diagram of our results.
-We see that 167 (2.5\%), 83 (1.2\%) and 26 (0.4\%) test-cases fail in LegUp, Vivado HLS and Intel i++ respectively.
+We see that 918 (13.7\%), 167 (2.5\%), 83 (1.2\%) and 26 (0.4\%) test-cases fail in Bambu, LegUp, Vivado HLS and Intel i++ respectively.
Despite i++ having the lowest failure rate, it has the highest time-out rate (540 test-cases), because of its remarkably long compilation time.
% We remark that although the Intel HLS Compiler had the smallest number of confirmed test-case failures, it had the most time-outs (which could be masking additional failures)
Note that the absolute numbers here do not necessarily correspond to the number of bugs in the tools, because a single bug in a language feature that appears frequently in our test suite could cause many programs to crash or fail.
@@ -129,7 +129,7 @@ As in our Venn diagram, the absolute numbers here do not necessary correspond to
\subsection{Some specific bugs found}
-We now describe two more of the bugs we found: one crash bug in LegUp and one miscompilation bug in Vivado HLS. As in Example~\ref{ex:vivado_miscomp}, each bug was first reduced automatically using \creduce{}, and then reduced further manually to achieve the minimal test-case. Although we did find test-case failures in Intel i++, the long compilation times for that tool meant that we did not have time to reduce any of the failures down to an example that is minimal enough to present here.
+We now describe three more of the bugs we found: a crash bug in LegUp, a miscompilation bug in Intel i++, and a miscompilation bug in Bambu. As in Example~\ref{ex:vivado_miscomp}, each bug was first reduced automatically using \creduce{}, and then reduced further manually to obtain a minimal test-case.
\begin{example}[A crash bug in LegUp]
The program shown below leads to an internal compiler error (an unhandled assertion in this case) in LegUp 4.0 and 7.5.
@@ -176,13 +176,13 @@ int main() {
static int b = 0x10000;
static volatile short a = 0;
-int result() {
+int main() {
a++;
b = (b >> 8) & 0x100;
return b;
}
\end{minted}
-\caption{Miscompilation bug in Bambu HLS. As the value of \texttt{b} is shifted to the right by 8, the output should be \texttt{0x100}. However, the actual output is 0 in Bambu.}\label{fig:eval:bambu:mismatch}
+\caption{Miscompilation bug in Bambu. Since \texttt{b} (initially \texttt{0x10000}) is shifted right by 8 and masked with \texttt{0x100}, the output should be \texttt{0x100}; however, Bambu outputs 0.}\label{fig:eval:bambu:mismatch}
\end{figure}
%\begin{example}[A miscompilation bug in Vivado HLS]
@@ -200,8 +200,8 @@ int result() {
Figure~\ref{fig:eval:intel:mismatch} shows a miscompilation bug that was found in Intel i++. Intel i++ does not seem to notice the assignment to 3 in the previous for loop, or tries to perform some optimisations that seem to analyse the array incorrectly and therefore results in a wrong value being returned.
\end{example}
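
The reduced Intel i++ test-case itself is not included in this hunk; the sketch below is a purely hypothetical illustration of the pattern described above (a loop that stores the value 3 into an array, followed by a read that a faulty array analysis could get wrong), not the program of Figure~\ref{fig:eval:intel:mismatch}.

\begin{minted}{c}
// Hypothetical illustration only; not the reduced test-case of the figure.
static int arr[2] = {0, 0};

int main() {
  for (int i = 0; i < 2; i++)
    arr[i] = 3;          // the store that the optimiser reportedly misses
  return arr[1];         // a correct tool returns 3; a faulty array analysis
                         // could instead yield the stale initial value 0
}
\end{minted}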
-\begin{example}[A miscompilation bug in Bambu HLS]
-Figure~\ref{fig:eval:bambu:mismatch} shows the bug
+\begin{example}[A miscompilation bug in Bambu]
+Figure~\ref{fig:eval:bambu:mismatch} shows a miscompilation bug in Bambu, where the value returned in \texttt{b} is incorrectly affected by the increment operation on \texttt{a}.
\end{example}
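
To make the cross-check against GCC concrete, the kernel from the figure above can be wrapped in a small self-checking harness along the following lines; this is a sketch that follows the renaming convention described in the introduction (the synthesised function is called \texttt{result}, and the printing \texttt{main} acts as the testbench), not part of the reduced test-case itself.

\begin{minted}{c}
#include <stdio.h>

static int b = 0x10000;
static volatile short a = 0;

// The kernel from the figure above, with main renamed back to result so
// that a checking main can call it (mirroring the testbench convention).
int result() {
  a++;                   // volatile increment; should not influence b
  b = (b >> 8) & 0x100;  // (0x10000 >> 8) & 0x100 == 0x100
  return b;
}

int main() {
  printf("0x%x\n", result());  // GCC prints 0x100; Bambu's design yields 0
  return 0;
}
\end{minted}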
%%% Local Variables:
diff --git a/intro.tex b/intro.tex
index 4d3f311..11fc016 100644
--- a/intro.tex
+++ b/intro.tex
@@ -30,7 +30,7 @@ In this paper, we bring fuzzing to the HLS context.
\begin{example}[A miscompilation bug in Vivado HLS]
\label{ex:vivado_miscomp}
-Figure~\ref{fig:vivado_bug1} shows a program that produces the wrong result during RTL simulation in Xilinx Vivado HLS v2018.3, v2019.1 and v2019.2.\footnote{This program, like all the others in this paper, includes a \code{main} function, which means that it compiles straightforwardly with GCC. To compile it with an HLS tool, we rename \code{main} to \code{result}, synthesise that function, and then add a new \code{main} function as a testbench that calls \code{result}.} The bug was initially revealed by a randomly generated program of around 113 lines, which we were able to reduce to the minimal example shown in the figure. This bug was also reported to Xilinx and confirmed to be a bug.\footnote{https://bit.ly/3mzfzgA}
+Figure~\ref{fig:vivado_bug1} shows a program that produces the wrong result during RTL simulation in Xilinx Vivado HLS v2018.3, v2019.1 and v2019.2.\footnote{This program, like all the others in this paper, includes a \code{main} function, which means that it compiles straightforwardly with GCC. To compile it with an HLS tool, we rename \code{main} to \code{result}, synthesise that function, and then add a new \code{main} function as a testbench that calls \code{result}.} The bug was initially revealed by a randomly generated program of around 113 lines, which we were able to reduce to the minimal example shown in the figure. This bug was also reported to Xilinx and confirmed to be a bug.\footnote{Link to Xilinx bug report redacted for review.}% \footnote{https://bit.ly/3mzfzgA}
The program repeatedly shifts a large integer value \code{x} right by the values stored in array \code{arr}.
Vivado HLS returns \code{0x006535FF}, but the result returned by GCC (and subsequently confirmed manually to be the correct one) is \code{0x046535FF}.
\end{example}
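
As a concrete illustration of the renaming convention described in the footnote above, a generated program is adapted roughly as follows; the body of \texttt{result} here uses made-up values and only mimics the shape of the program in Figure~\ref{fig:vivado_bug1}, not the program itself.

\begin{minted}{c}
#include <stdio.h>

// Synthesised function: the Csmith-generated main, renamed to result.
// The body is a placeholder with made-up values, only mimicking the
// described shape (x repeatedly shifted right by values from an array).
int result() {
  unsigned int x = 0x12345678u;          // made-up value
  int arr[6] = {1, 2, 3, 1, 2, 3};       // made-up shift amounts
  for (int i = 0; i < 6; i++)
    x = x >> arr[i];
  return x;
}

// New main, added as a testbench that simply calls the synthesised function;
// its output is compared against the output of the GCC-compiled program.
int main() {
  printf("0x%08X\n", result());
  return 0;
}
\end{minted}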
@@ -52,14 +52,16 @@ int main() {
The example above demonstrates the effectiveness of fuzzing. It seems unlikely that a human-written test-suite would discover this particular bug, given that it requires several components all to coincide -- a for-loop, shift-values accessed from an array with at least six elements, and a rather random-looking value for \code{x} -- before the bug is revealed!
-Yet this example also begs the question: do bugs found by fuzzers really \emph{matter}, given that they are usually found by combining language features in ways that are vanishingly unlikely to happen `in the real world'~\cite{marcozzi+19}. This question is especially pertinent for our particular context of HLS tools, which are well-known to have restrictions on the language features that they handle. Nevertheless, we would argue that although the \emph{test-cases} we generated do not resemble the programs that humans write, the \emph{bugs} that we exposed using those test-cases are real, and \emph{could also be exposed by realistic programs}. Moreover, it is worth noting that HLS tools are not exclusively provided with human-written programs to compile: they are often fed programs that have been automatically generated by another compiler. Ultimately, we believe that any errors in an HLS tool are worth identifying because they have the potential to cause problems, either now or in the future. And problems caused by HLS tools going wrong (or indeed any sort of compiler for that matter) are particularly egregious, because it is so difficult for end-users to identify whether the fault lies with the tool or with the program it has been given to compile.
+Yet this example also raises the question: do bugs found by fuzzers really \emph{matter}, given that they are usually found by combining language features in ways that are vanishingly unlikely to happen `in the real world'~\cite{marcozzi+19}? This question is especially pertinent for our particular context of HLS tools, which are well-known to have restrictions on the language features that they handle. Nevertheless, although the \emph{test-cases} we generated do not resemble the programs that humans write, the \emph{bugs} that we exposed using those test-cases are real, and \emph{could also be exposed by realistic programs}.
+%Moreover, it is worth noting that HLS tools are not exclusively provided with human-written programs to compile: they are often fed programs that have been automatically generated by another compiler.
+Ultimately, we believe that any errors in an HLS tool are worth identifying because they have the potential to cause problems, either now or in the future. And problems caused by HLS tools going wrong (or indeed any sort of compiler for that matter) are particularly egregious, because it is so difficult for end-users to identify whether the fault lies with their design or the HLS tool.
\subsection{Our approach and results}
Our approach to fuzzing HLS tools comprises three steps.
First, we use Csmith~\cite{yang11_findin_under_bugs_c_compil} to generate thousands of valid C programs from within the subset of the C language that is supported by all the HLS tools we test. We also augment each program with a random selection of HLS-specific directives. Second, we give these programs to four widely used HLS tools: Xilinx Vivado HLS~\cite{xilinx20_vivad_high_synth}, LegUp HLS~\cite{canis13_legup}, the Intel HLS Compiler, which is also known as i++~\cite{intel20_sdk_openc_applic} and finally Bambu~\cite{pilato13_bambu}. Third, if we find a program that causes an HLS tool to crash, or to generate hardware that produces a different result from GCC, we reduce it to a minimal example with the help of the \creduce{} tool~\cite{creduce}.
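
The crash/mismatch check in the third step can be sketched as a small driver along the following lines; the wrapper scripts \texttt{compile\_gcc.sh} and \texttt{run\_hls.sh} are hypothetical stand-ins for the tool-specific compile-and-simulate invocations, each assumed to print the observed output of the test-case on standard output and to exit non-zero only on a crash or time-out.

\begin{minted}{c}
#include <stdio.h>
#include <string.h>

// Sketch of the differential check: run the GCC-compiled executable and the
// HLS tool's simulated design on the same test-case and compare their output.
// compile_gcc.sh and run_hls.sh are hypothetical wrappers, not real tool CLIs.
static int run(const char *cmd, char *out, size_t len) {
  FILE *p = popen(cmd, "r");
  if (!p) return -1;
  if (!fgets(out, (int)len, p)) out[0] = '\0';
  out[strcspn(out, "\n")] = '\0';   // strip trailing newline for comparison
  return pclose(p);
}

int main(int argc, char **argv) {
  const char *test = (argc > 1) ? argv[1] : "test.c";
  char cmd[256], gcc_out[128], hls_out[128];

  snprintf(cmd, sizeof cmd, "./compile_gcc.sh %s", test);
  if (run(cmd, gcc_out, sizeof gcc_out) != 0) return 2;    // reference run failed

  snprintf(cmd, sizeof cmd, "./run_hls.sh %s", test);      // synthesise + simulate
  if (run(cmd, hls_out, sizeof hls_out) != 0) {
    printf("CRASH/TIMEOUT: %s\n", test);
    return 1;
  }
  if (strcmp(gcc_out, hls_out) != 0) {
    printf("MISMATCH: %s (gcc=%s, hls=%s)\n", test, gcc_out, hls_out);
    return 1;                                              // candidate for reduction
  }
  return 0;                                                // outputs agree
}
\end{minted}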
-Our testing campaign revealed that all three tools could be made to crash while compiling or to generate wrong RTL. In total, \totaltestcases{} test cases were run through each tool out of which \totaltestcasefailures{} test cases failed in at least one of the tools. Test case reduction was then performed on some of these failing test cases to obtain at least \numuniquebugs{} unique failing test cases.
+Our testing campaign revealed that all four tools could be made to generate an incorrect design. In total, \totaltestcases{} test cases were run through each tool, of which \totaltestcasefailures{} failed in at least one tool. Test case reduction was then performed on some of these failing test cases to obtain at least \numuniquebugs{} unique failing test cases.
To investigate whether HLS tools are getting more or less reliable over time, we also tested three different versions of Vivado HLS (v2018.3, v2019.1, and v2019.2). We found far fewer failures in versions v2019.1 and v2019.2 compared to v2018.3, but we also identified a few test-cases that only failed in versions v2019.1 and v2019.2, which suggests that some new features may have introduced bugs.
diff --git a/main.tex b/main.tex
index 50ef7e2..2761aad 100644
--- a/main.tex
+++ b/main.tex
@@ -23,7 +23,7 @@
%\usepackage{balance}
\newcommand\totaltestcases{6700}
-\newcommand\totaltestcasefailures{272}
+\newcommand\totaltestcasefailures{1178}
\newcommand\numuniquebugs{8}
\newcommand\vivadotestcases{3645}
@@ -58,7 +58,7 @@ Email: \{zewei.du19, yann.herklotz15, n.ramanathan14, j.wickerson\}@imperial.ac.
High-level synthesis (HLS) is becoming an increasingly important part of the computing landscape, even in safety-critical domains where correctness is key.
As such, HLS tools are increasingly relied upon. But are they trustworthy?
-We have subjected three widely used HLS tools -- LegUp, Xilinx Vivado HLS, the Intel HLS Compiler and Bambu -- to a rigorous fuzzing campaign using thousands of random, valid C programs that we generated using a modified version of the Csmith tool. For each C program, we compiled it to a hardware design using the HLS tool under test and checked whether that hardware design generates the same output as an executable generated by the GCC compiler. When discrepancies arose between GCC and the HLS tool under test, we reduced the C program to a minimal example in order to zero in on the potential bug. Our testing campaign has revealed that all three HLS tools can be made either to crash or to generate wrong code when given valid C programs, and thereby underlines the need for these increasingly trusted tools to be more rigorously engineered.
+We have subjected four widely used HLS tools -- LegUp, Xilinx Vivado HLS, the Intel HLS Compiler and Bambu -- to a rigorous fuzzing campaign using thousands of random, valid C programs that we generated using a modified version of the Csmith tool. For each C program, we compiled it to a hardware design using the HLS tool under test and checked whether that hardware design generates the same output as an executable generated by the GCC compiler. When discrepancies arose between GCC and the HLS tool under test, we reduced the C program to a minimal example in order to zero in on the potential bug. Our testing campaign has revealed that all four HLS tools can be made either to crash or to generate wrong code when given valid C programs, and thereby underlines the need for these increasingly trusted tools to be more rigorously engineered.
Out of \totaltestcases{} test cases, we found \totaltestcasefailures{} programs that failed in at least one tool, out of which we were able to discern at least \numuniquebugs{} unique bugs.
\end{abstract}
diff --git a/related.tex b/related.tex
index 8887e52..b2fd442 100644
--- a/related.tex
+++ b/related.tex
@@ -1,6 +1,6 @@
\section{Related Work}
-The only other work of which we are aware on fuzzing HLS tools is that by Lidbury et al. \cite{lidbury15_many_core_compil_fuzzin}, who tested several OpenCL compilers, including an HLS compiler from Altera (now Intel). They were only able to subject that compiler to superficial testing because so many of the test-cases they generated led to it crashing. In comparison to our work: where Lidbury et al. generated target-independent OpenCL programs that could be used to test HLS tools and conventional compilers alike, we specifically generate programs that are tailored for HLS (e.g. with HLS-specific pragmas) with the aim of testing the HLS tools more deeply. Another difference is that where we test using sequential C programs, they test using highly concurrent OpenCL programs, and thus have to go to great lengths to ensure that any discrepancies observed between compilers cannot be attributed to the inherent nondeterminism of concurrency.
+The only other work of which we are aware on fuzzing HLS tools is that by Lidbury et al. \cite{lidbury15_many_core_compil_fuzzin}, who tested several OpenCL compilers, including an HLS compiler from Altera (now Intel). They were only able to subject that compiler to superficial testing because so many of the test-cases they generated led to it crashing. In comparison to our work: where Lidbury et al. generated target-independent OpenCL programs that could be used to test HLS tools and conventional compilers alike, we specifically generate programs that are tailored for HLS (e.g. with HLS-specific pragmas and only including supported constructs) with the aim of testing the HLS tools more deeply. Another difference is that where we test using sequential C programs, they test using highly concurrent OpenCL programs, and thus have to go to great lengths to ensure that any discrepancies observed between compilers cannot be attributed to the inherent nondeterminism of concurrency.
Other stages of the FPGA toolchain have been subjected to fuzzing. Herklotz et al.~\cite{verismith} tested several FPGA synthesis tools using randomly generated Verilog programs. Where they concentrated on the RTL-to-netlist stage of hardware design, we focus our attention on the earlier C-to-RTL stage.