authorJohn Wickerson <j.wickerson@imperial.ac.uk>2021-03-26 21:16:22 +0000
committeroverleaf <overleaf@localhost>2021-03-28 14:43:30 +0000
commit546590b262efafc33133059f1184268dd25d89fe (patch)
treef379f428c06d1fafe2e0932d03cc500d8a40c5de
parentb16f3242909e90bdb547ab1500d46f5e38527981 (diff)
Update on Overleaf.
-rw-r--r-- conclusion.tex   5
-rw-r--r-- eval.tex       270
-rw-r--r-- intro.tex       23
-rw-r--r-- main.tex         6
-rw-r--r-- method.tex      15
-rw-r--r-- related.tex      6
6 files changed, 187 insertions(+), 138 deletions(-)
diff --git a/conclusion.tex b/conclusion.tex
index fd28bcb..d081b09 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -1,9 +1,8 @@
\section{Conclusion}
-We have shown how an existing fuzzing tool can be modified so that its output is suitable for HLS, and then used it in a campaign to test the reliability of three modern HLS tools. In total, we found at least \numuniquebugs{} unique bugs across all the tools, including crashes and miscompilation bugs.
-
+We have shown how an existing fuzzing tool can be modified so that its output is suitable for HLS, and then used it in a campaign to test the reliability of three modern HLS tools. In total, we found at least \numuniquebugs{} unique bugs across all the tools, including both crashes and miscompilations.
Further work could be done on supporting more HLS tools, especially ones that claim to prove that their output is correct before terminating, such as Catapult-C~\cite{mentor20_catap_high_level_synth}. % This could give an indication of how effective these proofs are, and how often they are actually able to complete their equivalence proofs during compilation in a feasible timescale.
-Conventional compilers have become quite resilient to fuzzing over the last decade, so recent work on fuzzing compilers has had to employ increasingly imaginative techniques to keep finding new bugs~\cite{karine+20}. In comparison, we have found that HLS tools -- at least, as they currently stand -- can be made to exhibit bugs even using the relatively basic fuzzing techniques that we employed in this project.
+Conventional compilers have become quite resilient to fuzzing over the last decade, so recent work on fuzzing compilers has had to employ increasingly imaginative techniques to keep finding new bugs~\cite{karine+20}. In contrast, we have found that HLS tools -- at least, as they currently stand -- can be made to exhibit bugs even using the relatively basic fuzzing techniques that we employed in this project.
As HLS is becoming increasingly relied upon, it is important to make sure that HLS tools are also reliable. We hope that this work further motivates the need for rigorous engineering of HLS tools, whether that is by validating that each output the tool produces is correct or by proving the HLS tool itself correct once and for all.
diff --git a/eval.tex b/eval.tex
index e17d955..2bf08a1 100644
--- a/eval.tex
+++ b/eval.tex
@@ -1,11 +1,10 @@
\section{Evaluation}\label{sec:evaluation}
-We generate \totaltestcases{} test-cases and provide them to four HLS tools: Vivado HLS, LegUp HLS, Intel i++ and Bambu.
+We generate \totaltestcases{} test-cases and provide them to four HLS tools: Vivado HLS, LegUp HLS, Intel i++, and Bambu.
We use the same test-cases across all tools for fair comparison (except the HLS directives, which have tool-specific syntax).
We were able to test three different versions of Vivado HLS (v2018.3, v2019.1 and v2019.2).
We tested one version each of Intel i++ (v18.1), LegUp (v4.0), and Bambu (v0.9.7).
-LegUp 7.5 is GUI-based and therefore we could not script our tests.
-However, we were able to manually reproduce all the bugs found in LegUp 4.0 in LegUp 7.5.
+LegUp 7.5 is GUI-based, so we could not script our tests; however, we were able to manually reproduce all the bugs found in LegUp 4.0 in LegUp 7.5.
% Three different tools were tested, including three different versions of Vivado HLS. We were only able to test one version of LegUp HLS (version 4.0), because although LegUp 7.5 is available, it is GUI-based and not amenable to scripting. However, bugs we found in LegUp 4.0 were reproduced manually in LegUp 7.5.
% LegUp and Vivado HLS were run under Linux, while the Intel HLS Compiler was run under Windows.
@@ -19,14 +18,40 @@ However, we were able to manually reproduce all the bugs found in LegUp 4.0 in L
\definecolor{timeout}{HTML}{ef4c4c}
\begin{figure}
\centering
+% \begin{tikzpicture}[scale=0.61]
+% \draw (-7.2,7.0) rectangle (7.2,0.7);
+% \fill[vivado,fill opacity=0.5] (0.5,4.4) ellipse (3.75 and 1.5);
+% \fill[intel,fill opacity=0.5] (-4,4.8) ellipse (2.5 and 1.3);
+% \fill[bambu,fill opacity=0.5] (2.5,3) ellipse (3.75 and 1.5);
+% \fill[legup,fill opacity=0.5] (-2.5,3) ellipse (3.75 and 1.5);
+% \draw[white, thick] (0.5,4.4) ellipse (3.75 and 1.5);
+% \draw[white, thick] (-4,4.8) ellipse (2.5 and 1.3);
+% \draw[white, thick] (2.5,3) ellipse (3.75 and 1.5);
+% \draw[white, thick] (-2.5,3) ellipse (3.75 and 1.5);
+% \node[align=center, anchor=south west] at (-0.5,6) {\textcolor{vivado}{\bf Xilinx Vivado HLS v2019.1}};
+% \node[anchor=south west] at (-6.4,6) {\textcolor{intel}{\bf Intel i++ 18.1}};
+% \node at (-3,1.1) {\textcolor{legup}{\bf LegUp 4.0}};
+% \node at (3,1.1) {\textcolor{bambu}{\bf Bambu PandA 0.9.7}};
+
+% \node at (-3.5,2.5) {\small 159 (\textcolor{timeout}{4})};
+% \node at (-5,5) {\small 26 (\textcolor{timeout}{540})};
+% \node at (-4,3.9) {\small 1 (\textcolor{timeout}{1})};
+% \node at (-2.3,4.8) {\small 0 (\textcolor{timeout}{5})};
+% \node at (-1.5,3.8) {\small 4 (\textcolor{timeout}{0})};
+% \node at (0,2.5) {\small 3 (\textcolor{timeout}{2})};
+% \node at (3.5,2.5) {\small 906 (\textcolor{timeout}{14})};
+% \node at (2.5,3.8) {\small 9 (\textcolor{timeout}{0})};
+% \node at (0,5) {\small 70 (\textcolor{timeout}{20})};
+% \node at (-6,1.4) {4936};
+% \end{tikzpicture}
\begin{tikzpicture}[scale=0.61]
\draw (-7.2,7.0) rectangle (7.2,0.7);
- \fill[vivado,fill opacity=0.5] (0.5,4.4) ellipse (3.75 and 1.5);
- \fill[intel,fill opacity=0.5] (-4,4.8) ellipse (2.5 and 1.3);
+ \fill[vivado,fill opacity=0.5] (0.9,4.4) ellipse (3.3 and 1.5);
+ \fill[intel,fill opacity=0.5] (-4.5,4.8) ellipse (2.0 and 1.3);
\fill[bambu,fill opacity=0.5] (2.5,3) ellipse (3.75 and 1.5);
\fill[legup,fill opacity=0.5] (-2.5,3) ellipse (3.75 and 1.5);
- \draw[white, thick] (0.5,4.4) ellipse (3.75 and 1.5);
- \draw[white, thick] (-4,4.8) ellipse (2.5 and 1.3);
+ \draw[white, thick] (0.9,4.4) ellipse (3.3 and 1.5);
+ \draw[white, thick] (-4.5,4.8) ellipse (2.0 and 1.3);
\draw[white, thick] (2.5,3) ellipse (3.75 and 1.5);
\draw[white, thick] (-2.5,3) ellipse (3.75 and 1.5);
\node[align=center, anchor=south west] at (-0.5,6) {\textcolor{vivado}{\bf Xilinx Vivado HLS v2019.1}};
@@ -34,112 +59,42 @@ However, we were able to manually reproduce all the bugs found in LegUp 4.0 in L
\node at (-3,1.1) {\textcolor{legup}{\bf LegUp 4.0}};
\node at (3,1.1) {\textcolor{bambu}{\bf Bambu PandA 0.9.7}};
- \node at (-3.5,2.5) {\small 159 (\textcolor{timeout}{4})};
- \node at (-5,5) {\small 26 (\textcolor{timeout}{540})};
- \node at (-4,3.9) {\small 1 (\textcolor{timeout}{1})};
- \node at (-2.3,4.8) {\small 0 (\textcolor{timeout}{5})};
- \node at (-1.5,3.8) {\small 4 (\textcolor{timeout}{0})};
- \node at (0,2.5) {\small 3 (\textcolor{timeout}{2})};
- \node at (3.5,2.5) {\small 906 (\textcolor{timeout}{14})};
- \node at (2.5,3.8) {\small 9 (\textcolor{timeout}{0})};
- \node at (0,5) {\small 70 (\textcolor{timeout}{20})};
- \node at (-6,1.4) {4936};
+ \node at (-3.5,2.5) {\small 159};
+ \node at (-5,5) {\small 26};
+ \node at (-4,3.9) {\small 1};
+ \node at (-1.5,3.8) {\small 4};
+ \node at (0,2.5) {\small 3};
+ \node at (3.5,2.5) {\small 906};
+ \node at (2.5,3.8) {\small 9};
+ \node at (0,5) {\small 70};
+ \node at (-6,1.4) {5522};
\end{tikzpicture}
-\caption{The number of failures per tool out of \totaltestcases{} test-cases. Overlapping regions mean that the same test-cases failed in multiple tools. The numbers in parentheses report how many test-cases timed out.}\label{fig:existing_tools}
+\caption{The number of failures per tool out of \totaltestcases{} test-cases. Overlapping regions mean that the same test-cases failed in multiple tools. \JW{Todo: update the Bambu numbers.} %The numbers in parentheses report how many test-cases timed out.
+}\label{fig:existing_tools}
\end{figure}
-Figure~\ref{fig:existing_tools} shows a Venn diagram of our results.
+Figure~\ref{fig:existing_tools} shows an Euler diagram of our results.
We see that 918 (13.7\%), 167 (2.5\%), 83 (1.2\%) and 26 (0.4\%) test-cases fail in Bambu, LegUp, Vivado HLS and Intel i++ respectively.
+\JW{Somewhere around here mention that Bambu originally had M failures, but after a single bugfix, it went down to N failures. Maybe mention that we would have extended the same courtesy to the other tools had they released fixed versions of their tools promptly?}
Despite i++ having the lowest failure rate, it has the highest time-out rate (540 test-cases), because of its remarkably long compilation time.
% We remark that although the Intel HLS Compiler had the smallest number of confirmed test-case failures, it had the most time-outs (which could be masking additional failures)
Note that the absolute numbers here do not necessarily correspond to the number of bugs in the tools, because a single bug in a language feature that appears frequently in our test suite could cause many programs to crash or fail.
-Hence, we reduce many of the failing test-cases in an effort to identify unique bugs; these are summarised in the table below.\footnote{Link to detailed bug reports available from PC Chair.}
+Moreover, we are reluctant to draw conclusions about the relative reliability of each tool by comparing the number of test-case failures, because these numbers are so sensitive to the parameters of the randomly generated test suite we used. In other words, we can confirm the \emph{presence} of bugs, but cannot deduce the \emph{number} of them (nor their importance).
-\begin{table}[h]
-\centering
-\begin{tabular}{lr}\toprule
- \textbf{Tool} & \textbf{Unique Bugs}\\
- \midrule
- Xilinx Vivado HLS v2019.1 & $\ge 2$\\
- LegUp HLS & $\ge 3$\\
- Intel i++ & $\ge 1$\\
- Bambu HLS & $\ge 2$\\
- \bottomrule
- \end{tabular}
-\end{table}
-
-We write `$\ge$' above to emphasise that all the bug counts are lower bounds -- we did not have time to go through the rather arduous test-case reduction process for every failure.
-
-\subsection{Results across versions of an HLS tool}
-
-\definecolor{ribbon1}{HTML}{8dd3c7}
-\definecolor{ribbon2}{HTML}{b3de69}
-\definecolor{ribbon3}{HTML}{bebada}
-\definecolor{ribbon4}{HTML}{fb8072}
-\definecolor{ribbon5}{HTML}{80b1d3}
-\definecolor{ribbon6}{HTML}{fdb462}
-\begin{figure}
- \centering
- \begin{tikzpicture}[xscale=1.25]
- \draw[white, fill=ribbon1] (-1.0,4.1) -- (0.0,4.1000000000000005) to [out=0,in=180] (2.0,4.1000000000000005) to [out=0,in=180] (4.0,4.1000000000000005) -- (6.0,4.1000000000000005) -- %(7.55,3.325) --
- (6.0,2.5500000000000003) -- (4.0,2.5500000000000003) to [out=180,in=0] (2.0,2.5500000000000003) to [out=180,in=0] (0.0,2.5500000000000003) -- (-1.0,2.55) -- cycle;
- \draw[white, fill=ribbon2] (-1.0,2.55) -- (0.0,2.5500000000000003) to [out=0,in=180] (2.0,1.8) to [out=0,in=180] (4.0,1.55) -- (6.0,1.55) -- %(7.3,0.9) --
- (6.0,0.25) -- (4.0,0.25) to [out=180,in=0] (2.0,0.5) to [out=180,in=0] (0.0,1.25) -- (-1.0,1.25) -- cycle;
- \draw[white, fill=ribbon3] (-1.0,1.25) -- (0.0,1.25) to [out=0,in=180] (2.0,2.5500000000000003) to [out=0,in=180] (4.0,0.25) -- (6.0,0.25) -- %(6.05,0.225) --
- (6.0,0.2) -- (4.0,0.2) to [out=180,in=0] (2.0,2.5) to [out=180,in=0] (0.0,1.2000000000000002) -- (-1.0,1.2) -- cycle;
- \draw[white, fill=ribbon4] (-1.0,0.5) -- (0.0,0.5) to [out=0,in=180] (2.0,2.5) to [out=0,in=180] (4.0,0.2) -- (6.0,0.2) -- %(6.2,0.1) --
- (6.0,0.0) -- (4.0,0.0) to [out=180,in=0] (2.0,2.3000000000000003) to [out=180,in=0] (0.0,0.30000000000000004) -- (-1.0,0.3) -- cycle;
- \draw[white, fill=ribbon5] (-1.0,1.2) -- (0.0,1.2000000000000002) to [out=0,in=180] (2.0,0.5) to [out=0,in=180] (4.0,2.5500000000000003) -- (6.0,2.5500000000000003) -- %(6.2,2.45) --
- (6.0,2.35) -- (4.0,2.35) to [out=180,in=0] (2.0,0.30000000000000004) to [out=180,in=0] (0.0,1.0) -- (-1.0,1.0) -- cycle;
- \draw[white, fill=ribbon6] (-1.0,0.3) -- (0.0,0.30000000000000004) to [out=0,in=180] (2.0,0.30000000000000004) to [out=0,in=180] (4.0,2.35) -- (6.0,2.35) -- %(6.3,2.2) --
- (6.0,2.0500000000000003) -- (4.0,2.0500000000000003) to [out=180,in=0] (2.0,0.0) to [out=180,in=0] (0.0,0.0) -- (-1.0,0.0) -- cycle;
+We have reduced several of the failing test-cases in an effort to identify particular bugs, and our findings are summarised in Table~\ref{tab:bugsummary}. We emphasise that the bug counts here are lower bounds -- we did not have time to go through the arduous test-case reduction process for every failure.
+Figures~\ref{fig:eval:legup:crash}, \ref{fig:eval:intel:mismatch}, and~\ref{fig:eval:bambu:mismatch} present three of the bugs we found. As in Example~\ref{ex:vivado_miscomp}, each bug was first reduced automatically using \creduce{}, and then further reduced manually to achieve the minimal test-case.
- \draw[white, fill=black] (-0.4,4.1) rectangle (0.0,1.0);
- \draw[white, fill=black] (1.8,4.1) rectangle (2.2,2.3);
- \draw[white, fill=black] (3.8,4.1) rectangle (4.2,2.05);
-
- \node at (-0.2,4.5) {v2018.3};
- \node at (2,4.5) {v2019.1};
- \node at (4,4.5) {v2019.2};
- %\node at (2,5) {Vivado HLS};
-
- \node at (5.5,3.325) {31};
- \node at (5.5,0.9) {26};
- \node at (5.5,2.2) {6};
-
- \node[white] at (-0.2,1.2) {62};
- \node[white] at (2,2.5) {36};
- \node[white] at (4,2.25) {41};
- \end{tikzpicture}
- \caption{A Sankey diagram that tracks \vivadotestcases{} test-cases through three different versions of Vivado HLS. The ribbons collect the test-cases that pass and fail together. The black bars are labelled with the total number of test-case failures per version. The 3573 test-cases that pass in all three versions are not depicted.
- }\label{fig:sankey_diagram}
-\end{figure}
-
-Besides comparing the reliability of different HLS tools, we also investigated the reliability of Vivado HLS over time. Figure~\ref{fig:sankey_diagram} shows the results of giving \vivadotestcases{} test-cases to Vivado HLS v2018.3, v2019.1 and v2019.2.
-Test-cases that pass and fail in the same tools are grouped together into a ribbon.
-For instance, the topmost ribbon represents the 31 test-cases that fail in all three versions of Vivado HLS. Other ribbons can be seen weaving in and out; these indicate that bugs were fixed or reintroduced in the various versions. We see that Vivado HLS v2018.3 had the most test-case failures (62).
-Interestingly, as an indicator of reliability of HLS tools, the blue ribbon shows that there are test-cases that fail in v2018.3, pass in v2019.1 but then fail again in v2019.2.
-As in our Venn diagram, the absolute numbers here do not necessary correspond to the number of actual bugs, but we can deduce that there must be at least six unique bugs in Vivado HLS, given that each ribbon corresponds to at least one unique bug.
-
-
-
-
-%\YH{Contradicts value of 3 in Table~\ref{tab:unique_bugs}, maybe I can change that to 6?} \JW{I'd leave it as-is personally; we have already put a `$\ge$' symbol in the table, so I think it's fine.}
-%In addition to that, it can then be seen that Vivado HLS v2018.3 must have at least 4 individual bugs, of which two were fixed and two others stayed in Vivado HLS v2019.1. However, with the release of v2019.1, new bugs were introduced as well. % Finally, for version 2019.2 of Vivado HLS, there seems to be a bug that was reintroduced which was also present in Vivado 2018.3, in addition to a new bug. In general it seems like each release of Vivado HLS will have new bugs present, however, will also contain many previous bug fixes. However, it cannot be guaranteed that a bug that was previously fixed will remain fixed in future versions as well.
-
-\subsection{Some specific bugs found}
-
-We now describe three more of the bugs we found: one crash bug in LegUp, and a miscompilation in Intel and Bambu respectively. As in Example~\ref{ex:vivado_miscomp}, each bug was first reduced automatically using \creduce{}, and then reduced further manually to achieve the minimal test-case.
-
-\begin{example}[A crash bug in LegUp]
-The program shown below leads to an internal compiler error (an unhandled assertion in this case) in LegUp 4.0 and 7.5.
+\begin{figure}[t]
\begin{minted}{c}
int a[2][2][1] = {{{0},{1}},{{0},{0}}};
int main() { a[0][1][0] = 1; }
\end{minted}
+\caption{This program leads to an internal compiler error (an unhandled assertion in this case) in LegUp 4.0 and 7.5. It initialises a 3D array with zeroes and then assigns to one element. The bug only appears when function inlining is disabled (\texttt{NO\_INLINE}).}
+\label{fig:eval:legup:crash}
+\end{figure}
%An assertion error counts as a crash of the tool, as it means that an unexpected state was reached by this input.
%This shows that there is a bug in one of the compilation passes in LegUp, however, due to the assertion the bug is caught in the tool before it produces an incorrect design.
-It initialises a 3D array with zeroes, and then assigns to one element. The bug only appears when function inlining is disabled (\texttt{NO\_INLINE}). % The code initialises the array with zeroes except for \texttt{a[0][1][0]}, which is set to one. Then the main function assigns one to that same location. This code on its own should not actually produce a result and should just terminate by returning 0, which is also what the design that LegUp generates does when the \texttt{NO\_INLINE} flag is turned off.
+% The code initialises the array with zeroes except for \texttt{a[0][1][0]}, which is set to one. Then the main function assigns one to that same location. This code on its own should not actually produce a result and should just terminate by returning 0, which is also what the design that LegUp generates does when the \texttt{NO\_INLINE} flag is turned off.
%The following code also produces an assertion error in LegUp, which is a different one this time. This bug was not discovered during the main test runs of 10 thousand test cases, but beforehand, which meant that we disabled unions from being generated. However, this bug also requires the \texttt{volatile} keyword which seems to be the reason for quite a few mismatches in LegUp and Vivado.
%
@@ -151,9 +106,9 @@ It initialises a 3D array with zeroes, and then assigns to one element. The bug
%int main() { return un.a; }
%\end{minted}
-\end{example}
+%\end{example}
-\begin{figure}
+\begin{figure}[t]
\begin{minted}{c}
static volatile int a[9][1][7];
int main() {
@@ -168,10 +123,10 @@ int main() {
return tmp;
}
\end{minted}
-\caption{Miscompilation bug in Intel i++. It should return 2 because \code{3 \^{} 1 = 2}, however, Intel i++ returns 0 instead.}\label{fig:eval:intel:mismatch}
+\caption{This program miscompiles in Intel i++. It should return 2 because \code{3 \^{} 1 = 2}, but Intel i++ generates a design that returns 0 instead. Perhaps the assignment to 3 in the first for-loop is being overlooked.}\label{fig:eval:intel:mismatch}
\end{figure}
-\begin{figure}
+\begin{figure}[t]
\begin{minted}{c}
static int b = 0x10000;
static volatile short a = 0;
@@ -182,7 +137,7 @@ int main() {
return b;
}
\end{minted}
-\caption{Miscompilation bug in Bambu. As the value of \texttt{b} is shifted to the right by 8, the output should be \texttt{0x100}. However, Bambu outputs 0.}\label{fig:eval:bambu:mismatch}
+\caption{This program miscompiles in Bambu. As the value of \texttt{b} is shifted to the right by 8, the output should be \texttt{0x100}, but Bambu generates a design that returns 0. The increment operation on \texttt{a} appears unrelated, but is necessary to trigger the bug.}\label{fig:eval:bambu:mismatch}
\end{figure}
%\begin{example}[A miscompilation bug in Vivado HLS]
@@ -196,13 +151,110 @@ int main() {
%\end{example}
-\begin{example}[A miscompilation bug in Intel i++]
-Figure~\ref{fig:eval:intel:mismatch} shows a miscompilation bug that was found in Intel i++. Intel i++ does not seem to notice the assignment to 3 in the previous for loop, or tries to perform some optimisations that seem to analyse the array incorrectly and therefore results in a wrong value being returned.
-\end{example}
+%\begin{example}[A miscompilation bug in Intel i++]
+%Figure~\ref{fig:eval:intel:mismatch} shows a miscompilation bug found in Intel i++. Intel i++ does not seem to notice the assignment to 3 in the previous for loop, or tries to perform some optimisations that seem to analyse the array incorrectly and therefore results in a wrong value being returned.
+%\end{example}
+
+%\begin{example}[A miscompilation bug in Bambu]
+%Figure~\ref{fig:eval:bambu:mismatch} shows a miscompilation bug in Bambu, where the result of the value in \texttt{b} is affected by the increment operation on \texttt{a}.
+%\end{example}
+
+%We have reduced several of the failing test-cases in an effort to identify particular bugs; these are summarised in the table below.\footnote{Link to detailed bug reports available from PC Chair.} \JW{One reviewer complained about this table not having a caption.} \JW{How about we extend this table so it has one bug per row? See my attempt in Table~\ref{tab:bugsummary}.}
+
+%\begin{table}[h]
+%\centering
+%\begin{tabular}{lr}\toprule
+% \textbf{Tool} & \textbf{Unique Bugs}\\
+% \midrule
+% Xilinx Vivado HLS v2019.1 & $\ge 2$\\
+% LegUp HLS & $\ge 3$\\
+% Intel i++ & $\ge 1$\\
+% Bambu HLS & $\ge 2$\\
+% \bottomrule
+% \end{tabular}
+%\end{table}
+
+\begin{table}[t]
+\centering
+\caption{A summary of the bugs we found.}
+\label{tab:bugsummary}
+\begin{tabular}{llll}\toprule
+ \textbf{Tool} & \textbf{Bug type} & \textbf{Details} & \textbf{Status} \\
+ \midrule
+ Vivado HLS & miscompile & Fig.~\ref{fig:vivado_bug1} & reported, confirmed \\
+ Vivado HLS & miscompile & webpage & reported \\
+ LegUp HLS & crash & Fig.~\ref{fig:eval:legup:crash} & reported, confirmed \\
+ LegUp HLS & crash & webpage & \JW{status?} \\
+ LegUp HLS & miscompile & webpage & reported, confirmed \\
+ Intel i++ & miscompile & Fig.~\ref{fig:eval:intel:mismatch} & reported \\
+ Bambu HLS & miscompile & Fig.~\ref{fig:eval:bambu:mismatch} & reported, confirmed, fixed \\
+ Bambu HLS & miscompile & webpage & reported, confirmed \\
+ \bottomrule
+ \end{tabular}
+\end{table}
+
+%We write `$\ge$' above to emphasise that all the bug counts are lower bounds -- we did not have time to go through the rather arduous test-case reduction process for every failure.
+
+\subsection{Results across versions of an HLS tool}
+
+\definecolor{ribbon1}{HTML}{8dd3c7}
+\definecolor{ribbon2}{HTML}{b3de69}
+\definecolor{ribbon3}{HTML}{bebada}
+\definecolor{ribbon4}{HTML}{fb8072}
+\definecolor{ribbon5}{HTML}{80b1d3}
+\definecolor{ribbon6}{HTML}{fdb462}
+\begin{figure}[t]
+ \centering
+ \begin{tikzpicture}[xscale=1.25]
+ \draw[white, fill=ribbon1] (-1.0,4.1) -- (0.0,4.1) to [out=0,in=180] (2.0,4.1) to [out=0,in=180] (4.0,4.1) -- (6.0,4.1) -- %(7.55,3.325) --
+ (6.0,2.55) -- (4.0,2.55) to [out=180,in=0] (2.0,2.55) to [out=180,in=0] (0.0,2.55) -- (-1.0,2.55) -- cycle;
+ \draw[white, fill=ribbon2] (-1.0,2.55) -- (0.0,2.55) to [out=0,in=180] (1.8,1.8) -- (2.2,1.8) to [out=0,in=180] (4.0,1.55) -- (6.0,1.55) -- %(7.3,0.9) --
+ (6.0,0.25) -- (4.0,0.25) to [out=180,in=0] (2.2,0.5) -- (1.8,0.5) to [out=180,in=0] (0.0,1.25) -- (-1.0,1.25) -- cycle;
+ \draw[white, fill=ribbon3] (-1.0,1.25) -- (0.0,1.25) to [out=0,in=180] (1.8,2.55) -- (2.2,2.55) to [out=0,in=180] (4.0,0.25) -- (6.0,0.25) -- %(6.05,0.225) --
+ (6.0,0.2) -- (4.0,0.2) to [out=180,in=0] (2.2,2.5) -- (1.8,2.5) to [out=180,in=0] (0.0,1.2) -- (-1.0,1.2) -- cycle;
+ \draw[white, fill=ribbon4] (-1.0,0.5) -- (0.0,0.5) to [out=0,in=180] (1.8,2.5) -- (2.2,2.5)to [out=0,in=180] (4.0,0.2) -- (6.0,0.2) -- %(6.2,0.1) --
+ (6.0,0.0) -- (4.0,0.0) to [out=180,in=0] (2.2,2.3) -- (1.8,2.3) to [out=180,in=0] (0.0,0.3) -- (-1.0,0.3) -- cycle;
+ \draw[white, fill=ribbon5] (-1.0,1.2) -- (0.0,1.2) to [out=0,in=180] (1.8,0.5) -- (2.2,0.5) to [out=0,in=180] (4.0,2.55) -- (6.0,2.55) -- %(6.2,2.45) --
+ (6.0,2.35) -- (4.0,2.35) to [out=180,in=0] (2.2,0.3) -- (1.8,0.3) to [out=180,in=0] (0.0,1.0) -- (-1.0,1.0) -- cycle;
+ \draw[white, fill=ribbon6] (-1.0,0.3) -- (0.0,0.3) to [out=0,in=180] (1.8,0.3) -- (2.2,0.3) to [out=0,in=180] (4.0,2.35) -- (6.0,2.35) -- %(6.3,2.2) --
+ (6.0,2.05) -- (4.0,2.05) to [out=180,in=0] (2.2,0.0) -- (1.8,0.0) to [out=180,in=0] (0.0,0.0) -- (-1.0,0.0) -- cycle;
+
+ \draw[white, fill=black] (-0.4,4.1) rectangle (0.0,1.0);
+ \draw[white, fill=black] (1.8,4.1) rectangle (2.2,2.3);
+ \draw[white, fill=black] (4.0,4.1) rectangle (4.4,2.05);
+
+ \node at (-0.2,4.5) {v2018.3};
+ \node at (2,4.5) {v2019.1};
+ \node at (4.2,4.5) {v2019.2};
+ %\node at (2,5) {Vivado HLS};
+
+ \node at (5.5,3.325) {31};
+ \node at (5.5,0.9) {26};
+ \node at (5.5,2.2) {6};
+
+ \node[white] at (-0.2,1.2) {62};
+ \node[white] at (2,2.5) {36};
+ \node[white] at (4.2,2.25) {41};
+ \end{tikzpicture}
+ \caption{A Sankey diagram that tracks \vivadotestcases{} test-cases through three different versions of Vivado HLS. The ribbons collect the test-cases that pass and fail together. The black bars are labelled with the total number of test-case failures per version. The 3573 test-cases that pass in all three versions are not depicted.
+ }\label{fig:sankey_diagram}
+\end{figure}
+
+Besides comparing the reliability of different HLS tools, we also studied how the reliability of Vivado HLS has changed over time. Figure~\ref{fig:sankey_diagram} shows the results of giving \vivadotestcases{} test-cases to Vivado HLS v2018.3, v2019.1 and v2019.2.
+Test-cases that pass and fail in the same tools are grouped together into a ribbon.
+For instance, the topmost ribbon represents the 31 test-cases that fail in all three versions of Vivado HLS. Other ribbons can be seen weaving in and out; these indicate that bugs were fixed or reintroduced in the various versions. We see that Vivado HLS v2018.3 had the most test-case failures (62).
+Interestingly, the blue ribbon shows that there are test-cases that fail in v2018.3, pass in v2019.1, and then fail again in v2019.2.
+As in our Euler diagram, the absolute numbers here do not necessarily correspond to the number of actual bugs, though we can observe that there must be at least six unique bugs in Vivado HLS, given that each ribbon corresponds to at least one unique bug.
+
+
+
+
+%\YH{Contradicts value of 3 in Table~\ref{tab:unique_bugs}, maybe I can change that to 6?} \JW{I'd leave it as-is personally; we have already put a `$\ge$' symbol in the table, so I think it's fine.}
+%In addition to that, it can then be seen that Vivado HLS v2018.3 must have at least 4 individual bugs, of which two were fixed and two others stayed in Vivado HLS v2019.1. However, with the release of v2019.1, new bugs were introduced as well. % Finally, for version 2019.2 of Vivado HLS, there seems to be a bug that was reintroduced which was also present in Vivado 2018.3, in addition to a new bug. In general it seems like each release of Vivado HLS will have new bugs present, however, will also contain many previous bug fixes. However, it cannot be guaranteed that a bug that was previously fixed will remain fixed in future versions as well.
+
+%\subsection{Some specific bugs found}
+
-\begin{example}[A miscompilation bug in Bambu]
-Figure~\ref{fig:eval:bambu:mismatch} shows a miscompilation bug in Bambu, where the result of the value in \texttt{b} is affected by the increment operation on \texttt{a}.
-\end{example}
%%% Local Variables:
%%% mode: latex
diff --git a/intro.tex b/intro.tex
index 29f09d1..7634475 100644
--- a/intro.tex
+++ b/intro.tex
@@ -1,8 +1,9 @@
\section{Introduction}
-High-level synthesis (HLS), which refers to the automatic translation of software into hardware, is becoming an increasingly important part of the computing landscape.
-It promises hardware engineers an increase in productivity by raising the abstraction level of their designs, and it promises software engineers the ability to produce application-specific hardware accelerators without having to understand hardware description languages (HDL) such as Verilog and VHDL.
-HLS is being used in an ever greater range of domains, including such high-assurance settings as financial services~\cite{hls_fintech}, control systems~\cite{hls_controller}, and real-time object detection~\cite{hls_objdetect}. As such, HLS tools are increasingly relied upon, even though ``high-level synthesis research and development is inherently prone to introducing bugs or regressions in the final circuit functionality''~\cite[Section 3.4.6]{canis15_legup}. In this paper, we investigate whether they are trustworthy and give an empirical evaluation of their reliability.
+High-level synthesis (HLS), which refers to the automatic translation of software into hardware, is becoming an increasingly important part of the computing landscape, even in such high-assurance settings as financial services~\cite{hls_fintech}, control systems~\cite{hls_controller}, and real-time object detection~\cite{hls_objdetect}.
+The appeal of HLS is twofold: it promises hardware engineers an increase in productivity by raising the abstraction level of their designs, and it promises software engineers the ability to produce application-specific hardware accelerators without having to understand Verilog and VHDL.
+
+As such, we are increasingly reliant on HLS tools. But are these tools reliable? Questions have been raised about the reliability of HLS before; for example, Andrew Canis, co-creator of the LegUp HLS tool, wrote that ``high-level synthesis research and development is inherently prone to introducing bugs or regressions in the final circuit functionality''~\cite[Section 3.4.6]{canis15_legup}. In this paper, we investigate whether there is substance to this concern by conducting an empirical evaluation of the reliability of several widely used HLS tools.
The approach we take is \emph{fuzzing}.
%To test the trustworthiness of HLS tools, we need a robust way of generating programs that both have good coverage and also explores various corner cases.
@@ -30,10 +31,8 @@ In this paper, we bring fuzzing to the HLS context.
\begin{example}[A miscompilation bug in Vivado HLS]
\label{ex:vivado_miscomp}
-Figure~\ref{fig:vivado_bug1} shows a program that produces the wrong result during RTL simulation in Xilinx Vivado HLS v2018.3, v2019.1 and v2019.2.\footnote{This program, like all the others in this paper, includes a \code{main} function, which means that it compiles straightforwardly with GCC. To compile it with an HLS tool, we rename \code{main} to \code{result}, synthesise that function, and then add a new \code{main} function as a testbench that calls \code{result}.} The bug was initially revealed by a randomly generated program of around 113 lines, which we were able to reduce to the minimal example shown in the figure. We reported this issue to Xilinx, who confirmed it to be a bug.\footnote{Link to Xilinx bug report redacted for review.}
-% \footnote{https://bit.ly/3mzfzgA}
-The program repeatedly shifts a large integer value \code{x} right by the values stored in array \code{arr}.
-Vivado HLS returns \code{0x006535FF}, but the result returned by GCC (and subsequently confirmed manually to be the correct one) is \code{0x046535FF}.
+Figure~\ref{fig:vivado_bug1} shows a program that produces the wrong result during RTL simulation in Xilinx Vivado HLS v2018.3, v2019.1 and v2019.2.\footnote{This program, like all the others in this paper, includes a \code{main} function, which means that it compiles straightforwardly with GCC. To compile it with an HLS tool, we rename \code{main} to \code{result}, synthesise that function, and then add a new \code{main} function as a testbench that calls \code{result}.} The program repeatedly shifts a large integer value \code{x} right by the values stored in array \code{arr}.
+Vivado HLS returns \code{0x006535FF}, but the result returned by GCC (and subsequently confirmed manually to be the correct one) is \code{0x046535FF}. The bug was initially revealed by a randomly generated program of around 113 lines, which we were able to reduce to the minimal example shown in the figure. We reported this issue to Xilinx, who confirmed it to be a bug.\footnote{\url{https://bit.ly/3mzfzgA}}
\end{example}
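As a hypothetical sketch of the renaming described in the footnote (the function body below is a placeholder, not the program in the figure), the transformation has this general shape:

\begin{minted}{c}
// Illustrative only: the generated program's main is renamed to
// result and synthesised; a fresh main acts as the testbench.
int result(void) {
  /* ... randomly generated Csmith code ... */
  return 0;
}
int main(void) { return result(); }
\end{minted}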
\begin{figure}[t]
@@ -53,18 +52,18 @@ int main() {
The example above demonstrates the effectiveness of fuzzing. It seems unlikely that a human-written test-suite would discover this particular bug, given that it requires several components all to coincide before the bug is revealed!
-Yet this example also begs the question: do bugs found by fuzzers really \emph{matter}, given that they are usually found by combining language features in ways that are vanishingly unlikely to happen `in the real world'~\cite{marcozzi+19}. This question is especially pertinent for our particular context of HLS tools, which are well-known to have restrictions on the language features that they handle. Nevertheless, although the \emph{test-cases} we generated do not resemble the programs that humans write, the \emph{bugs} that we exposed using those test-cases are real, and \emph{could also be exposed by realistic programs}.
+Yet this example also raises the question: do bugs found by fuzzers really \emph{matter}, given that they are usually found by combining language features in ways that are vanishingly unlikely to happen `in the real world'~\cite{marcozzi+19}? This question is especially pertinent for our particular context of HLS tools, which are well-known to have restrictions on the language features they support. Nevertheless, although the \emph{test-cases} we generated do not resemble the programs that humans write, the \emph{bugs} that we exposed using those test-cases are real, and \emph{could also be exposed by realistic programs}.
%Moreover, it is worth noting that HLS tools are not exclusively provided with human-written programs to compile: they are often fed programs that have been automatically generated by another compiler.
-Ultimately, we believe that any errors in an HLS tool are worth identifying because they have the potential to cause problems, either now or in the future. And problems caused by HLS tools going wrong (or indeed any sort of compiler for that matter) are particularly egregious, because it is so difficult for end-users to identify whether the fault lies with their design or the HLS tool.
+Ultimately, we believe that any errors in an HLS tool are worth identifying because they have the potential to cause problems -- either now or in the future. And problems caused by HLS tools going wrong (or indeed any sort of compiler for that matter) are particularly egregious, because it is so difficult for end-users to identify whether the fault lies with their design or the HLS tool.
\subsection{Our approach and results}
Our approach to fuzzing HLS tools comprises three steps.
-First, we use Csmith~\cite{yang11_findin_under_bugs_c_compil} to generate thousands of valid C programs from within the subset of the C language that is supported by all the HLS tools we test. We also augment each program with a random selection of HLS-specific directives. Second, we give these programs to four widely used HLS tools: Xilinx Vivado HLS~\cite{xilinx20_vivad_high_synth}, LegUp HLS~\cite{canis13_legup}, the Intel HLS Compiler, which is also known as i++~\cite{intel20_sdk_openc_applic} and finally Bambu~\cite{pilato13_bambu}. Third, if we find a program that causes an HLS tool to crash, or to generate hardware that produces a different result from GCC, we reduce it to a minimal example with the help of the \creduce{} tool~\cite{creduce}.
+First, we use Csmith~\cite{yang11_findin_under_bugs_c_compil} to generate thousands of valid C programs within the subset of the C language that is supported by all the HLS tools we test. We also augment each program with a random selection of HLS-specific directives. Second, we give these programs to four widely used HLS tools: Xilinx Vivado HLS~\cite{xilinx20_vivad_high_synth}, LegUp HLS~\cite{canis13_legup}, the Intel HLS Compiler, also known as i++~\cite{intel20_sdk_openc_applic}, and finally Bambu~\cite{pilato13_bambu}. Third, if we find a program that causes an HLS tool to crash, or to generate hardware that produces a different result from GCC, we reduce it to a minimal example with the help of the \creduce{} tool~\cite{creduce}.
-Our testing campaign revealed that all four tools could be made to generate an incorrect design. In total, \totaltestcases{} test cases were run through each tool out of which \totaltestcasefailures{} test cases failed in at least one of the tools. Test case reduction was then performed on some of these failing test cases to obtain at least \numuniquebugs{} unique failing test cases.
+Our testing campaign revealed that all four tools could be made to generate an incorrect design. In total, \totaltestcases{} test cases were run through each tool, of which \totaltestcasefailures{} failed in at least one of the tools. Test case reduction was then performed on some of these failing test cases to obtain at least \numuniquebugs{} unique failing test cases.
-To investigate whether HLS tools are getting more or less reliable over time, we also tested three different versions of Vivado HLS (v2018.3, v2019.1, and v2019.2). We found far fewer failures in versions v2019.1 and v2019.2 compared to v2018.3, but we also identified a few test-cases that only failed in versions v2019.1 and v2019.2, which suggests that some new features may have introduced bugs.
+To investigate whether HLS tools are getting more or less reliable over time, we also tested three different versions of Vivado HLS (v2018.3, v2019.1, and v2019.2). We found far fewer failures in versions v2019.1 and v2019.2 compared to v2018.3, but we also identified a few test cases that only failed in versions v2019.1 and v2019.2; this suggests that some new features may have introduced bugs.
In summary, the overall aim of our paper is to raise awareness about the reliability (or lack thereof) of current HLS tools, and to serve as a call-to-arms for investment in better-engineered tools. We hope that future work on developing more reliable HLS tools will find our empirical study a valuable source of motivation.
diff --git a/main.tex b/main.tex
index 2761aad..1d9a7e4 100644
--- a/main.tex
+++ b/main.tex
@@ -4,7 +4,7 @@
\COMMENTStrue
\newif\ifBLIND
-\BLINDtrue
+\BLINDfalse
\usepackage[english]{babel}
\usepackage{graphicx}
@@ -47,9 +47,9 @@
\author{Blind review}
\else
\author{%
-\IEEEauthorblockN{Zewei Du, Yann Herklotz, Nadesh Ramanathan, and John Wickerson}
+\IEEEauthorblockN{Yann Herklotz, Zewei Du, Nadesh Ramanathan, and John Wickerson}
\IEEEauthorblockA{Imperial College London, UK \\
-Email: \{zewei.du19, yann.herklotz15, n.ramanathan14, j.wickerson\}@imperial.ac.uk}
+Email: \{yann.herklotz15, zewei.du19, n.ramanathan14, j.wickerson\}@imperial.ac.uk}
}
\fi
\maketitle
diff --git a/method.tex b/method.tex
index 8841208..a06b2a0 100644
--- a/method.tex
+++ b/method.tex
@@ -78,13 +78,13 @@ This section describes how test-cases are generated (\S\ref{sec:method:csmith}),
\end{tabular}
\end{table}
-Csmith exposes several parameters through which the user can adjust how often various C constructs appear in the randomly generated programs. Table~\ref{tab:properties} describes how we configured these parameters.
+Csmith exposes several parameters through which the user can adjust how often various C constructs appear in the randomly generated programs. Table~\ref{tab:properties} describes how we configure these parameters.
%\paragraph{Significant probability changes}
%Table~\ref{tab:properties} lists the main changes that we put in place to ensure that HLS tools are able to synthesise all of our generated programs.
-Our overarching aim is to make the programs tricky for the tools to handle correctly (in order to maximise our chances of exposing bugs), while keeping the synthesis and simulation times low (in order to maximise the rate at which tests can be run).
+Our overarching aim is to make the programs tricky for the tools to handle correctly (to maximise our chance of exposing bugs), while keeping the synthesis and simulation times low (to maximise the rate at which tests can be run).
For instance, we increase the probability of generating \code{if} statements so as to increase the number of control paths, but we reduce the probability of generating \code{for} loops and array operations since they generally increase run times but not hardware complexity.
We disable various features that are not supported by HLS tools, such as assignments inside expressions, pointers, and union types.
-We avoid floating-point numbers since they typically involve external libraries or use of hard IPs on FPGAs, which make it hard to reduce bugs to a minimal form.
+We avoid floating-point numbers since they typically involve external libraries or hard IPs on FPGAs, which make it hard to reduce bugs to a minimal form.
%Relatedly, we reduce the probability of generating \code{break}, \code{goto}, \code{continue} and \code{return} statements, because with fewer \code{for} loops being generated, these statements tend to lead to uninteresting programs that simply exit prematurely.
%\paragraph{Important restrictions}
@@ -122,7 +122,7 @@ We avoid floating-point numbers since they typically involve external libraries
To prepare the programs generated by Csmith for HLS testing, we modify them in two ways. First, we inject random HLS directives, which instruct the HLS tool to perform certain optimisations, including: loop pipelining, loop unrolling, loop flattening, loop merging, expression balancing, function pipelining, function-level loop merging, function inlining, array mapping, array partitioning, and array reshaping. Some directives can be applied via a separate configuration file (.tcl), some require us to add labels to the C program (e.g. to identify loops), and some require placing pragmas at particular locations in the C program.
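To make the directive-injection step concrete, here is a hypothetical sketch of a labelled loop annotated with a Vivado HLS-style unroll pragma; the label, unroll factor, and choice of directive are ours for illustration rather than taken from an actual generated test-case, and other tools receive equivalent directives through their own pragma syntax or a separate .tcl file.

\begin{minted}{c}
// Hypothetical sketch: a labelled loop with an injected
// Vivado HLS-style unroll directive (illustrative values).
int acc = 0;
sum_loop:
for (int i = 0; i < 8; i++) {
#pragma HLS unroll factor=2
  acc += i;
}
\end{minted}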
-The second modification we make has to do with the top-level function. Each program generated by Csmith ends its execution by printing a hash of all its variables' values, in the hope that any miscompilations will be exposed through this hash value. We found that Csmith's built-in hash function led to infeasibly long synthesis times, so we replaced it with our own simple XOR-based one.
+The second modification has to do with the top-level function. Each program generated by Csmith ends its execution by printing a hash of all its variables' values, in the hope that any miscompilations will be exposed through this hash value. We found that Csmith's built-in hash function led to infeasibly long synthesis times, so we replace it with our own simple XOR-based one.
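As a rough sketch of what we mean by a simple XOR-based hash (this is our assumption of the general shape, not the exact code used), each tracked variable is folded into a single running checksum rather than being fed through Csmith's CRC routine:

\begin{minted}{c}
// Minimal sketch of an XOR-based checksum (illustrative only):
// each variable's value is XORed into one running word, which is
// printed at the end of main in place of Csmith's CRC hash.
static unsigned int checksum = 0;

static void hash(unsigned int v) {
  checksum ^= v;
}
\end{minted}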
Finally, we generate a synthesisable testbench that executes the main function of the original C program, and a tool-specific script that instructs the HLS tool to create a design project and then build and simulate the design.
@@ -179,10 +179,9 @@ Finally, we generate a synthesisable testbench that executes the main function o
%Figure~\ref{fig:method:toolflow} shows the three stages of testing, depicted as the testing environment in the dashed area.
For each HLS tool in turn, we compile the C program to RTL and then simulate the RTL.
We also compile the C program using GCC and execute it.
-Although each HLS tool has its own built-in C compiler that could be used to obtain the reference output, we prefer to obtain the reference output ourselves in order to minimise our reliance on the tool that is being tested.
+Although each HLS tool has its own built-in C compiler that could be used to obtain the reference output, we prefer to obtain the reference output ourselves in order to minimise our reliance on the tool being tested.
-To ensure that our testing scales to a large number of large programs, we also enforce several time-outs: we set a 5-minute time-out for C execution and a 2-hour time-out for C-to-RTL synthesis and RTL simulation.
-We do not count time-outs as bugs, but we do record them.
+To ensure that our testing scales to a large number of large programs, we enforce several time-outs: we set a 5-minute time-out for C execution and a 2-hour time-out for C-to-RTL synthesis and RTL simulation. We do not count time-outs as bugs. %, but we do record them.
%% JW: This paragraph is not really needed because we repeat the sentiment in the next subsection anyway.
%There two types of bugs that we can encounter in this testing setup: programs that cause the HLS tool to crash during compilation (e.g. an unhandled assertion violation or a segmentation fault), and programs where the software execution and the RTL simulation do not return the same value. Programs that cause either type of error are given to the reduction stage, which aims to minimise the programs and (hopefully) identify the root cause(s).
@@ -213,7 +212,7 @@ Once we discover a program that crashes the HLS tool or whose C/RTL simulations
%\creduce{} is an existing reducer for C and C++ and runs the reduction steps in parallel to converge as quickly as possible. It is effective because it reduces the input while preserving semantic validity and avoiding undefined behaviour.
%It has various reduction strategies, such as delta debugging passes and function inlining, that help it converge rapidly to a test-case that is small enough to understand and step through.
-We also check at each stage of the reduction process that the reduced program remains within the subset of the language that is supported by the HLS tools; without this check, \creduce{} only zeroed in on programs that were outside of this subset and hence did not represent real bugs.
+We also check at each stage of the reduction process that the reduced program remains within the subset of the language that is supported by the HLS tools; without this check, we found that \creduce{} kept zeroing in on programs that were outside this subset and hence did not represent real bugs.
%However, the downside of using \creduce{} with HLS tools is that we are not in control of which lines and features are prioritised for removal.
%As a consequence, we can easily end up with \creduce{} producing programs that are not synthesisable, despite being valid C.
diff --git a/related.tex b/related.tex
index b2fd442..b2edec2 100644
--- a/related.tex
+++ b/related.tex
@@ -1,14 +1,14 @@
\section{Related Work}
-The only other work of which we are aware on fuzzing HLS tools is that by Lidbury et al. \cite{lidbury15_many_core_compil_fuzzin}, who tested several OpenCL compilers, including an HLS compiler from Altera (now Intel). They were only able to subject that compiler to superficial testing because so many of the test-cases they generated led to it crashing. In comparison to our work: where Lidbury et al. generated target-independent OpenCL programs that could be used to test HLS tools and conventional compilers alike, we specifically generate programs that are tailored for HLS (e.g. with HLS-specific pragmas and only including supported constructs) with the aim of testing the HLS tools more deeply. Another difference is that where we test using sequential C programs, they test using highly concurrent OpenCL programs, and thus have to go to great lengths to ensure that any discrepancies observed between compilers cannot be attributed to the inherent nondeterminism of concurrency.
+The only other work of which we are aware on fuzzing HLS tools is that by Lidbury et al. \cite{lidbury15_many_core_compil_fuzzin}, who tested several OpenCL compilers, including an HLS compiler from Altera (now Intel). They were only able to subject that compiler to superficial testing because so many of the test-cases \JW{Need to audit `test case' vs `test-case' throughout. I don't mind which we use; I mildly prefer `test-case' because it's less ambiguous when parsing.} they generated led to it crashing. In comparison to our work: where Lidbury et al. generated target-independent OpenCL programs that could be used to test HLS tools and conventional compilers alike, we specifically generate programs that are tailored for HLS (e.g. with HLS-specific pragmas and only including supported constructs) with the aim of testing the HLS tools more deeply. Another difference is that where we test using sequential C programs, they test using highly concurrent OpenCL programs, and thus have to go to great lengths to ensure that any discrepancies observed between compilers cannot be attributed to the inherent nondeterminism of concurrency.
-Other stages of the FPGA toolchain have been subjected to fuzzing. Herklotz et al.~\cite{verismith} tested several FPGA synthesis tools using randomly generated Verilog programs. Where they concentrated on the RTL-to-netlist stage of hardware design, we focus our attention on the earlier C-to-RTL stage.
+Other stages of the FPGA toolchain have been subjected to fuzzing. In previous work~\cite{verismith}, we tested several FPGA synthesis tools using randomly generated Verilog programs. Where that work concentrated on the RTL-to-netlist stage of hardware design, this work focuses on the C-to-RTL stage.
Several authors have taken steps toward more rigorously engineered HLS tools that may be more resilient to testing campaigns such as ours.
The Handel-C compiler by Perna and Woodcock~\cite{perna12_mechan_wire_wise_verif_handel_c_synth} has been mechanically proven correct, at least in part, using the HOL theorem prover; however, the tool does not support C as input directly, so is not amenable to fuzzing.
Ramanathan et al.~\cite{ramanathan+17} proved their implementation of C atomic operations in LegUp correct up to a bound using model checking; however, our testing campaign is not applicable to their implementation because we do not generate concurrent C programs.
In the SPARK HLS tool~\cite{gupta03_spark}, some compiler passes, such as scheduling, are mechanically validated during compilation~\cite{chouksey20_verif_sched_condit_behav_high_level_synth}; unfortunately, this tool is no longer available.
-Finally, the Catapult C HLS tool~\cite{mentor20_catap_high_level_synth} is designed only to produce an output netlist if it can mechanically prove it equivalent to the input program; it should therefore never produce wrong RTL. In future work, we intend to test Catapult C alongside Vivado HLS, LegUp, and Intel i++.
+Finally, the Catapult C HLS tool~\cite{mentor20_catap_high_level_synth} is designed only to produce an output netlist if it can mechanically prove it equivalent to the input program; it should therefore never produce wrong RTL. In future work, we intend to test Catapult C alongside Vivado HLS, LegUp, Intel i++, and Bambu.
%more prevalent these were prioritised.
% JW: We're not really sure that LegUp is more prevalent than Catapult HLS. Indeed, it probably isn't!
%\JW{Obvious reader question at this point: why not test that claim by giving our Csmith test cases to Catapult C too? Can we address that here? No worries if not; but shall we try and do that after the deadline anyway?}\YH{Yes, definitely, it would be great to get an idea of how Catapult C performs, and I currently have it installed already. I have added a small sentence for that now, but let me know if I should mention this in the conclusion instead though. }