author    Yann Herklotz <git@yannherklotz.com>  2021-03-30 19:39:57 +0100
committer Yann Herklotz <git@yannherklotz.com>  2021-03-30 19:39:57 +0100
commit    adc0afcec6fe025f85fbfdfdfc5ef522fa760d98 (patch)
tree      22243be49716dc6c98b5e6b38a67c6c862ef9f1c
parent    0f40e13fab830957ac055e076055280cdb82efff (diff)
download  fccm21_esrhls-adc0afcec6fe025f85fbfdfdfc5ef522fa760d98.tar.gz
          fccm21_esrhls-adc0afcec6fe025f85fbfdfdfc5ef522fa760d98.zip
Update text
-rw-r--r--  eval.tex    24
-rw-r--r--  intro.tex    2
-rw-r--r--  main.tex     2
-rw-r--r--  method.tex   8
4 files changed, 15 insertions, 21 deletions
diff --git a/eval.tex b/eval.tex
index f1543e4..312ed25 100644
--- a/eval.tex
+++ b/eval.tex
@@ -3,8 +3,7 @@
We generate \totaltestcases{} test-cases and provide them to four HLS tools: Vivado HLS, LegUp HLS, Intel i++, and Bambu.
We use the same test-cases across all tools for fair comparison (except the HLS directives, which have tool-specific syntax).
We were able to test three different versions of Vivado HLS (v2018.3, v2019.1 and v2019.2).
-We tested one version of Intel i++ (version 18.1), LegUp (4.0) and Bambu (v0.9.7).
-LegUp 7.5 is GUI-based so we could not script our tests; however, we were able to manually reproduce all the bugs found in LegUp 4.0 in LegUp 7.5.
+We tested one version of Intel i++ (included in Quartus Lite 18.1), LegUp (4.0), and Bambu (v0.9.7). Any reduced LegUp test-cases were additionally checked against LegUp 9.2 before being reported.
% Three different tools were tested, including three different versions of Vivado HLS. We were only able to test one version of LegUp HLS (version 4.0), because although LegUp 7.5 is available, it is GUI-based and not amenable to scripting. However, bugs we found in LegUp 4.0 were reproduced manually in LegUp 7.5.
% LegUp and Vivado HLS were run under Linux, while the Intel HLS Compiler was run under Windows.
@@ -52,22 +51,20 @@ LegUp 7.5 is GUI-based so we could not script our tests; however, we were able t
\end{figure}
Figure~\ref{fig:existing_tools} shows an Euler diagram of our results.
-We see that 918 (13.7\%), 167 (2.5\%), 83 (1.2\%) and 26 (0.4\%) test-cases fail in Bambu, LegUp, Vivado HLS and Intel i++ respectively.
-\JW{Somewhere around here mention that Bambu originally had M failures, but after a single bugfix, it went down to N failures. Maybe mention that we would have extended the same courtesy to the other tools had they released fixed versions of their tools promptly?}
-Despite i++ having the lowest failure rate, it has the highest time-out rate (540 test-cases), because of its remarkably long compilation time.
-% We remark that although the Intel HLS Compiler had the smallest number of confirmed test-case failures, it had the most time-outs (which could be masking additional failures)
+We see that 918 (13.7\%), 167 (2.5\%), 83 (1.2\%) and 26 (0.4\%) test-cases fail in Bambu, LegUp, Vivado HLS and Intel i++ respectively. However, one of the bugs in Bambu was fixed while we were testing the tool, so we also tested the development branch of Bambu (0.9.7-dev) with that fix, and found only 17 (0.25\%) remaining failing test-cases.
+Despite i++ having the lowest failure rate, it has the highest time-out rate (540 test-cases), because of its remarkably long compilation time, whereas each of the other tools had fewer than 20 test-cases time out.
Note that the absolute numbers here do not necessarily correspond to the number of bugs in the tools, because a single bug in a language feature that appears frequently in our test suite could cause many programs to crash or fail.
Moreover, we are reluctant to draw conclusions about the relative reliability of each tool by comparing the number of test-case failures, because these numbers are so sensitive to the parameters of the randomly generated test suite we used. In other words, we can confirm the \emph{presence} of bugs, but cannot deduce the \emph{number} of them (nor their importance).
We have reduced several of the failing test-cases in an effort to identify particular bugs, and our findings are summarised in Table~\ref{tab:bugsummary}. We emphasise that the bug counts here are lower bounds -- we did not have time to go through the arduous test-case reduction process for every failure.
Figures~\ref{fig:eval:legup:crash}, \ref{fig:eval:intel:mismatch}, and~\ref{fig:eval:bambu:mismatch} present three of the bugs we found. As in Example~\ref{ex:vivado_miscomp}, each bug was first reduced automatically using \creduce{}, and then further reduced manually to achieve the minimal test-case.
-\begin{figure}[t]
+\begin{figure}
\begin{minted}{c}
int a[2][2][1] = {{{0},{1}},{{0},{0}}};
int main() { a[0][1][0] = 1; }
\end{minted}
-\caption{This program leads to an internal compiler error (an unhandled assertion in this case) in LegUp 4.0 and 7.5. It initialises a 3D array with zeroes and then assigns to one element. The bug only appears when function inlining is disabled (\texttt{NO\_INLINE}).}
+\caption{This program leads to an internal compiler error (an unhandled assertion in this case) in LegUp 4.0. It initialises a 3D array with zeroes and then assigns to one element. The bug only appears when function inlining is disabled (\texttt{NO\_INLINE}).}
\label{fig:eval:legup:crash}
\end{figure}
%An assertion error counts as a crash of the tool, as it means that an unexpected state was reached by this input.
@@ -86,7 +83,7 @@ int main() { a[0][1][0] = 1; }
%\end{example}
-\begin{figure}[t]
+\begin{figure}
\begin{minted}{c}
static volatile int a[9][1][7];
int main() {
@@ -104,7 +101,7 @@ int main() {
\caption{This program miscompiles in Intel i++. It should return 2 because \code{3 \^{} 1 = 2}, but Intel i++ generates a design that returns 0 instead. Perhaps the assignment to 3 in the first for-loop is being overlooked.}\label{fig:eval:intel:mismatch}
\end{figure}
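As a quick sanity check of the caption's expected result (an illustrative snippet, not one of the generated test-cases), the following C program performs the same XOR on a volatile value and returns 2 when compiled and executed with GCC:
\begin{minted}{c}
// Illustrative only: XOR-ing 3 with 1 flips the lowest bit, so main
// returns 2, the reference value GCC produces for the design above.
static volatile int x = 3;
int main() { return x ^ 1; }
\end{minted}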
-\begin{figure}[t]
+\begin{figure}
\begin{minted}{c}
static int b = 0x10000;
static volatile short a = 0;
@@ -181,7 +178,7 @@ int main() {
\definecolor{ribbon4}{HTML}{fb8072}
\definecolor{ribbon5}{HTML}{80b1d3}
\definecolor{ribbon6}{HTML}{fdb462}
-\begin{figure}[t]
+\begin{figure}
\centering
\begin{tikzpicture}[xscale=1.25]
\draw[white, fill=ribbon1] (-1.0,4.1) -- (0.0,4.1) to [out=0,in=180] (2.0,4.1) to [out=0,in=180] (4.0,4.1) -- (6.0,4.1) -- %(7.55,3.325) --
@@ -222,10 +219,7 @@ Besides studying the reliability of different HLS tools, we also studied the rel
Test-cases that pass and fail in the same tools are grouped together into a ribbon.
For instance, the topmost ribbon represents the 31 test-cases that fail in all three versions of Vivado HLS. Other ribbons can be seen weaving in and out; these indicate that bugs were fixed or reintroduced in the various versions. We see that Vivado HLS v2018.3 had the most test-case failures (62).
Interestingly, the blue ribbon shows that there are test-cases that fail in v2018.3, pass in v2019.1, and then fail again in v2019.2.
-As in our Euler diagram, the absolute numbers here do not necessary correspond to the number of actual bugs, though we can observe that there must be at least six unique bugs in Vivado HLS, given that each ribbon corresponds to at least one unique bug.
-
-
-
+As in our Euler diagram, the numbers do not necessarily correspond to the number of actual bugs, though we can observe that there must be at least six unique bugs in Vivado HLS, given that each ribbon corresponds to at least one unique bug.
%\YH{Contradicts value of 3 in Table~\ref{tab:unique_bugs}, maybe I can change that to 6?} \JW{I'd leave it as-is personally; we have already put a `$\ge$' symbol in the table, so I think it's fine.}
%In addition to that, it can then be seen that Vivado HLS v2018.3 must have at least 4 individual bugs, of which two were fixed and two others stayed in Vivado HLS v2019.1. However, with the release of v2019.1, new bugs were introduced as well. % Finally, for version 2019.2 of Vivado HLS, there seems to be a bug that was reintroduced which was also present in Vivado 2018.3, in addition to a new bug. In general it seems like each release of Vivado HLS will have new bugs present, however, will also contain many previous bug fixes. However, it cannot be guaranteed that a bug that was previously fixed will remain fixed in future versions as well.
diff --git a/intro.tex b/intro.tex
index 7fa4905..4432fc1 100644
--- a/intro.tex
+++ b/intro.tex
@@ -59,7 +59,7 @@ Ultimately, we believe that any errors in an HLS tool are worth identifying beca
\subsection{Our approach and results}
Our approach to fuzzing HLS tools comprises three steps.
-First, we use Csmith~\cite{yang11_findin_under_bugs_c_compil} to generate thousands of valid C programs within the subset of the C language that is supported by all the HLS tools we test. We also augment each program with a random selection of HLS-specific directives. Second, we give these programs to four widely used HLS tools: Xilinx Vivado HLS~\cite{xilinx20_vivad_high_synth}, LegUp HLS~\cite{canis13_legup}, the Intel HLS Compiler, also known as i++~\cite{intel20_sdk_openc_applic}, and finally Bambu~\cite{pilato13_bambu}. Third, if we find a program that causes an HLS tool to crash, or to generate hardware that produces a different result from GCC, we reduce it to a minimal example with the help of the \creduce{} tool~\cite{creduce}.
+First, we use Csmith~\cite{yang11_findin_under_bugs_c_compil} to generate thousands of valid C programs within the subset of the C language that is supported by all the HLS tools we test. We also augment each program with a random selection of HLS-specific directives. Second, we give these programs to four widely used HLS tools: Xilinx Vivado HLS~\cite{xilinx20_vivad_high_synth}, LegUp HLS~\cite{canis13_legup}, the Intel HLS Compiler, also known as i++~\cite{intel20_sdk_openc_applic}, and finally Bambu~\cite{pilato13_bambu}. Third, if we find a program that causes an HLS tool to crash or to generate hardware that produces a different result from GCC, we reduce it to a minimal example with the help of \creduce{}~\cite{creduce}.
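As an illustration of the directive-augmentation step, the hypothetical sketch below shows a loop annotated for the Vivado HLS flow; the actual programs are Csmith-generated, the directive selection is randomised, and the other tools take their directives in tool-specific form.
\begin{minted}{c}
// Hypothetical example of an HLS directive added to a generated program;
// LegUp, i++ and Bambu accept equivalent directives via their own syntax.
static int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
int main() {
  int sum = 0;
  for (int i = 0; i < 8; i++) {
#pragma HLS unroll factor=2
    sum += data[i];
  }
  return sum;
}
\end{minted}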
Our testing campaign revealed that all four tools could be made to generate an incorrect design. In total, \totaltestcases{} test-cases were run through each tool, of which \totaltestcasefailures{} failed in at least one of the tools. Test-case reduction was then performed on some of these failing test-cases to obtain at least \numuniquebugs{} unique failing test-cases.
diff --git a/main.tex b/main.tex
index a3bb473..ff9e1e2 100644
--- a/main.tex
+++ b/main.tex
@@ -10,7 +10,7 @@
\usepackage{graphicx}
\usepackage{siunitx}
\usepackage{minted}
-\setminted{baselinestretch=1, numbersep=5pt, xleftmargin=9pt, linenos, fontsize=\small}
+\setminted{baselinestretch=1, numbersep=5pt, xleftmargin=9pt, linenos, fontsize=\footnotesize}
\usepackage{amsthm}
\usepackage{pgfplots}
\usepackage{tikz}
diff --git a/method.tex b/method.tex
index 4cc52be..398a97d 100644
--- a/method.tex
+++ b/method.tex
@@ -82,9 +82,9 @@ Csmith exposes several parameters through which the user can adjust how often va
%\paragraph{Significant probability changes}
%Table~\ref{tab:properties} lists the main changes that we put in place to ensure that HLS tools are able to synthesise all of our generated programs.
Our overarching aim is to make the programs tricky for the tools to handle correctly (to maximise our chance of exposing bugs), while keeping the synthesis and simulation times low (to maximise the rate at which tests can be run).
-For instance, we increase the probability of generating \code{if} statements so as to increase the number of control paths, but we reduce the probability of generating \code{for} loops and array operations since they generally increase run times but not hardware complexity.
+For instance, we increase the probability of generating \code{if} statements to increase the number of control paths, but we reduce the probability of generating \code{for} loops and array operations since they generally increase run times but not hardware complexity.
We disable various features that are not supported by HLS tools, such as assignments inside expressions, pointers, and union types.
-We avoid floating-point numbers since they typically involve external libraries or hard IPs on FPGAs, which make it hard to reduce bugs to a minimal form.
+We avoid floating-point numbers since these often involve external libraries or IPs on FPGAs, which make it hard to reduce bugs to a minimal form.
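To make these restrictions concrete, the fragment below (illustrative only, not drawn from the generated test suite) shows the kinds of constructs that are excluded:
\begin{minted}{c}
// Illustrative constructs that are disabled or avoided during generation:
union u { int i; short s; };       // union types: disabled
float scale = 1.5f;                // floating point: avoided
int deref(int *p) { return *p; }   // pointers: disabled
int embedded(int a) {
  int b;
  return (b = a + 1) * 2;          // assignment inside an expression: disabled
}
int main() {
  int x = 3;
  return embedded(deref(&x));      // returns 8
}
\end{minted}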
%Relatedly, we reduce the probability of generating \code{break}, \code{goto}, \code{continue} and \code{return} statements, because with fewer \code{for} loops being generated, these statements tend to lead to uninteresting programs that simply exit prematurely.
%\paragraph{Important restrictions}
@@ -179,9 +179,9 @@ Finally, we generate a synthesisable testbench that executes the main function o
%Figure~\ref{fig:method:toolflow} shows the three stages of testing, depicted as the testing environment in the dashed area.
For each HLS tool in turn, we compile the C program to RTL and then simulate the RTL.
We also compile the C program using GCC and execute it.
-Although each HLS tool has its own built-in C compiler that could be used to obtain the reference output, we prefer to obtain the reference output ourselves in order to minimise our reliance on the tool being tested.
+Although each HLS tool has its own built-in C compiler that could be used to obtain the reference output, we prefer to obtain the reference output ourselves in order to rely less on the tool being tested.
-To ensure that our testing scales to a large number of large programs, we enforce several time-outs: we set a 5-minute time-out for C execution and a 2-hour time-out for C-to-RTL synthesis and RTL simulation. We do not count time-outs as bugs. %, but we do record them.
+To ensure that our testing scales to large programs, we enforce several time-outs: we set a 5-minute time-out for C execution and a 2-hour time-out for C-to-RTL synthesis and RTL simulation. Time-outs are not counted as bugs.
%% JW: This paragraph is not really needed because we repeat the sentiment in the next subsection anyway.
%There two types of bugs that we can encounter in this testing setup: programs that cause the HLS tool to crash during compilation (e.g. an unhandled assertion violation or a segmentation fault), and programs where the software execution and the RTL simulation do not return the same value. Programs that cause either type of error are given to the reduction stage, which aims to minimise the programs and (hopefully) identify the root cause(s).