authorn.ramanathan14 <n.ramanathan14@imperial.ac.uk>2020-09-15 08:40:52 +0000
committeroverleaf <overleaf@localhost>2020-09-15 08:40:55 +0000
commitebaf38ee9fa9ed6a74230c93fa2d15df44521c9a (patch)
tree9b661c26ec60e0710668e108e879abbfd1ecb3bd
parent66b5b9ca3de555980d93864fb7bac5e8ea0fcb1c (diff)
Update on Overleaf.
-rw-r--r--conference.bib19
-rw-r--r--eval.tex203
-rw-r--r--eval_rewrite.tex181
-rw-r--r--intro.tex5
-rw-r--r--main.tex7
-rw-r--r--method.tex (renamed from method-new.tex)48
-rw-r--r--motiv-ex.tex0
-rw-r--r--old/old-eval.tex113
-rw-r--r--old/old-method.tex (renamed from method-orig.tex)0
-rw-r--r--old/old-testing-system-new.tex (renamed from testing-system-new.tex)0
-rw-r--r--old/old-testing-system.tex (renamed from testing-system.tex)0
-rw-r--r--related.tex14
12 files changed, 320 insertions, 270 deletions
diff --git a/conference.bib b/conference.bib
index c8d683e..3e766f8 100644
--- a/conference.bib
+++ b/conference.bib
@@ -216,4 +216,23 @@ with {LegUp} High-Level Synthesis},
timestamp = {Thu, 16 Apr 2020 13:51:46 +0200},
biburl = {https://dblp.org/rec/journals/pacmpl/MarcozziTDC19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+
+@inproceedings{ramanathan+17,
+ author = {Nadesh Ramanathan and
+ Shane T. Fleming and
+ John Wickerson and
+ George A. Constantinides},
+ editor = {Jonathan W. Greene and
+ Jason Helge Anderson},
+ title = {Hardware Synthesis of Weakly Consistent {C} Concurrency},
+ booktitle = {Proceedings of the 2017 {ACM/SIGDA} International Symposium on Field-Programmable
+ Gate Arrays, {FPGA} 2017, Monterey, CA, USA, February 22-24, 2017},
+ pages = {169--178},
+ publisher = {{ACM}},
+ year = {2017},
+ url = {http://dl.acm.org/citation.cfm?id=3021733},
+ timestamp = {Tue, 06 Nov 2018 16:58:22 +0100},
+ biburl = {https://dblp.org/rec/conf/fpga/RamanathanFWC17.bib},
+ bibsource = {dblp computer science bibliography, https://dblp.org}
} \ No newline at end of file
diff --git a/eval.tex b/eval.tex
index 6330647..1708519 100644
--- a/eval.tex
+++ b/eval.tex
@@ -1,110 +1,205 @@
-\section{Evaluation}
+\section{Evaluation}\label{sec:evaluation}
\definecolor{vivado}{HTML}{7fc97f}
\definecolor{intel}{HTML}{beaed4}
\definecolor{legup}{HTML}{fdc086}
\begin{figure}
- \resizebox{0.5\textwidth}{!}{%
+ \resizebox{0.47\textwidth}{!}{%
\begin{tikzpicture}
- \begin{scope}
- \draw[fill=vivado,fill opacity=0.5] (-4.4,4.4) ellipse (3.75 and 2.75);
- \draw[fill=intel,fill opacity=0.5] (-10.2,4.4) ellipse (3.75 and 2.75);
- \draw[fill=legup,fill opacity=0.5] (-7.3,2) ellipse (3.75 and 2.75);
- \node at (-10.2,6.3) {\Large\textsf{\textbf{Vivado}}};
- \node at (-4.4,6.3) {\Large\textsf{\textbf{Intel HLS}}};
- \node at (-7.3,0) {\Large\textsf{\textbf{Legup}}};
- \end{scope}
-
+ \draw (-14.5,7.65) rectangle (0,-1);
+ \fill[vivado,fill opacity=0.5] (-4.4,4.4) ellipse (3.75 and 2.75);
+ \fill[intel,fill opacity=0.5] (-10.2,4.4) ellipse (3.75 and 2.75);
+ \fill[legup,fill opacity=0.5] (-7.3,2) ellipse (3.75 and 2.75);
+ \draw[white] (-4.4,4.4) ellipse (3.75 and 2.75); % making the
+ \draw[white] (-10.2,4.4) ellipse (3.75 and 2.75); % outlines
+ \draw[white] (-7.3,2) ellipse (3.75 and 2.75); % fully opaque
+ \node[align=center] at (-10.2,6.3) {\Large\textsf{\textbf{Xilinx Vivado HLS}} \\ \textsf{\textbf{(all versions)}}};
+ \node at (-4.4,6.3) {\Large\textsf{\textbf{Intel HLS Compiler}}};
+ \node at (-7.3,0) {\Large\textsf{\textbf{LegUp}}};
+
+ \node at (-5.5,3) {\Huge 1 (\textcolor{red}{1})};
+ \node at (-9.1,3) {\Huge 4 (\textcolor{red}{0})};
+ \node at (-3,5) {\Huge 26 (\textcolor{red}{540})};
+ \node at (-11.6,5) {\Huge 79 (\textcolor{red}{20})};
+ \node at (-7.3,1) {\Huge 162 (\textcolor{red}{6})};
+ \node at (-7.3,5.2) {\Huge 0 (\textcolor{red}{5})};
+ \node at (-7.3,3.8) {\Huge 0 (\textcolor{red}{0})};
+ \node at (-13.6,-0.5) {\Huge 5856};
\end{tikzpicture}
}
-\caption{Venn diagram showing the existing tools and their current features. \textbf{Implementation} refers to usable HLS tools, whereas \textbf{Proof} refers to papers that demonstrate proofs of an algorithm without necessarily linking that proof with the algorithm that implements it.}\label{fig:existing_tools}
+\caption{A Venn diagram showing the number of failures in each tool out of the 6700 test cases that were run. Overlapping regions count the test cases that failed in all of the overlapping tools. The numbers in parentheses represent the number of test cases that timed out.}\label{fig:existing_tools}
\end{figure}
\begin{table}
\centering
\begin{tabular}{lr}\toprule
- \textbf{Tool} & \textbf{Bugs found}\\
+ \textbf{Tool} & \textbf{Unique Bugs}\\
\midrule
- Vivado 2018.3 & 2\\
- Vivado 2019.1 & 2\\
- Vivado 2019.2 & 2\\
- Legup 4.0 & 4\\
- Intel HLS & 1\\
+ Xilinx Vivado HLS (all versions) & $\ge 2$\\
+ LegUp HLS & $\ge 3$\\
+ Intel HLS Compiler & $\ge 1$\\
\bottomrule
\end{tabular}
\caption{Unique bugs found in each tool.}
\label{tab:unique_bugs}
\end{table}
-During the implementation stage, the testing system successfully detected bugs among all three tools. The test cases used were not the same for each tool, because, as being mentioned, each tool has its own supported syntax. Thus, to ensure the validness of test cases during the implementation stage, test cases were all generated based on the supported syntax of the targeted tools.
-
-After the implementation of testing system is proven to be working, a set of 10,000 test cases are generated and fed into HLS tools for constructing the bigger picture regarding the quality of HLS tools. The 10,000 test cases were kept constant to ensure the comparison is fair between tools. Unfortunately, 10,000 test runs were still in progress by the time of writing this thesis, as runtime for each test case can range from 5 minutes up to several hours. Data showing in \ref{result table for 10k test cases} is collected when writing this report.
+We generate 6700 test cases and provide them to three HLS tools: Vivado HLS, LegUp HLS and the Intel HLS Compiler.
+We use the same test cases across all tools to ensure a fair comparison.
+We were able to test three different versions of Vivado HLS (v2018.3, v2019.1 and v2019.2).
+We were only able to test one version of LegUp: 4.0.
+At the time of writing, LegUp 7.5 is still GUI-based, and therefore we could not script our tests.
+However, we were able to reproduce the bugs found in LegUp 4.0 in LegUp 7.5 manually.
+Finally, we tested one version of the Intel HLS Compiler (vXXXX.X).
+
+% Three different tools were tested, including three different versions of Vivado HLS. We were only able to test one version of LegUp HLS (version 4.0), because although LegUp 7.5 is available, it is GUI-based and not amenable to scripting. However, bugs we found in LegUp 4.0 were reproduced manually in LegUp 7.5.
+% LegUp and Vivado HLS were run under Linux, while the Intel HLS Compiler was run under Windows.
+
+\subsection{Results across different HLS tools}
+
+Figure~\ref{fig:existing_tools} shows a Venn diagram of our results.
+We see that 167 (2.5\%), 83 (1.2\%) and 27 (0.4\%) test cases fail in LegUp, Vivado HLS and the Intel HLS Compiler, respectively.
+Despite having the lowest failure rate, the Intel HLS Compiler has the highest time-out rate, with 540 programs timing out because of its long compilation times.
+% We remark that although the Intel HLS Compiler had the smallest number of confirmed test-case failures, it had the most time-outs (which could be masking additional failures)
+Note that the absolute numbers here do not necessarily correspond to the number of bugs found.
+Multiple programs could crash or fail due to the same bug.
+Hence, we reduce many of the failing test-cases to identify unique bugs, as summarised in Table~\ref{tab:unique_bugs}.
+We write `$\ge$' in the table to indicate that all the bug counts are lower bounds -- we did not have time to go through the test-case reduction process for every failure.
+
+\subsection{Results across versions of an HLS tool}
+
+Besides comparing the reliability of different HLS tools, we also investigated the reliability of Vivado HLS over time. Figure~\ref{fig:sankey_diagram} shows the results of giving 3645 test cases to Vivado HLS 2018.3, 2019.1 and 2019.2.
+Test cases that pass and fail in the same versions are grouped together into a ribbon.
+For instance, the topmost ribbon represents the 31 test cases that fail in all three versions of Vivado HLS. Other ribbons can be seen weaving in and out; these indicate that bugs were fixed or reintroduced in the various versions. The diagram shows that Vivado HLS 2018.3 has the most failing test cases of the three versions, with 62 failures in total. %Interestingly, Vivado HLS 2019.1 and 2019.2 have a different number of failing test cases, meaning feature improvements that introduced bugs as well as bug fixes between those minor versions.
+Interestingly, the blue ribbon shows that there are test cases that fail in v2018.3, pass in v2019.1, but then fail again in v2019.2 -- an indication that the reliability of HLS tools does not improve monotonically across releases.
+
+\definecolor{ribbon1}{HTML}{8dd3c7}
+\definecolor{ribbon2}{HTML}{b3de69}
+\definecolor{ribbon3}{HTML}{bebada}
+\definecolor{ribbon4}{HTML}{fb8072}
+\definecolor{ribbon5}{HTML}{80b1d3}
+\definecolor{ribbon6}{HTML}{fdb462}
+\begin{figure}
+ \centering
+ \begin{tikzpicture}
+ \draw[white, fill=ribbon1] (-1.0,4.1) -- (0.0,4.1000000000000005) to [out=0,in=180] (2.0,4.1000000000000005) to [out=0,in=180] (4.0,4.1000000000000005) -- (6.0,4.1000000000000005) -- %(7.55,3.325) --
+ (6.0,2.5500000000000003) -- (4.0,2.5500000000000003) to [out=180,in=0] (2.0,2.5500000000000003) to [out=180,in=0] (0.0,2.5500000000000003) -- (-1.0,2.55) -- cycle;
+ \draw[white, fill=ribbon2] (-1.0,2.55) -- (0.0,2.5500000000000003) to [out=0,in=180] (2.0,1.8) to [out=0,in=180] (4.0,1.55) -- (6.0,1.55) -- %(7.3,0.9) --
+ (6.0,0.25) -- (4.0,0.25) to [out=180,in=0] (2.0,0.5) to [out=180,in=0] (0.0,1.25) -- (-1.0,1.25) -- cycle;
+ \draw[white, fill=ribbon3] (-1.0,1.25) -- (0.0,1.25) to [out=0,in=180] (2.0,2.5500000000000003) to [out=0,in=180] (4.0,0.25) -- (6.0,0.25) -- %(6.05,0.225) --
+ (6.0,0.2) -- (4.0,0.2) to [out=180,in=0] (2.0,2.5) to [out=180,in=0] (0.0,1.2000000000000002) -- (-1.0,1.2) -- cycle;
+ \draw[white, fill=ribbon4] (-1.0,0.5) -- (0.0,0.5) to [out=0,in=180] (2.0,2.5) to [out=0,in=180] (4.0,0.2) -- (6.0,0.2) -- %(6.2,0.1) --
+ (6.0,0.0) -- (4.0,0.0) to [out=180,in=0] (2.0,2.3000000000000003) to [out=180,in=0] (0.0,0.30000000000000004) -- (-1.0,0.3) -- cycle;
+ \draw[white, fill=ribbon5] (-1.0,1.2) -- (0.0,1.2000000000000002) to [out=0,in=180] (2.0,0.5) to [out=0,in=180] (4.0,2.5500000000000003) -- (6.0,2.5500000000000003) -- %(6.2,2.45) --
+ (6.0,2.35) -- (4.0,2.35) to [out=180,in=0] (2.0,0.30000000000000004) to [out=180,in=0] (0.0,1.0) -- (-1.0,1.0) -- cycle;
+ \draw[white, fill=ribbon6] (-1.0,0.3) -- (0.0,0.30000000000000004) to [out=0,in=180] (2.0,0.30000000000000004) to [out=0,in=180] (4.0,2.35) -- (6.0,2.35) -- %(6.3,2.2) --
+ (6.0,2.0500000000000003) -- (4.0,2.0500000000000003) to [out=180,in=0] (2.0,0.0) to [out=180,in=0] (0.0,0.0) -- (-1.0,0.0) -- cycle;
+
+ \draw[white, fill=black] (-0.4,4.1) rectangle (0.0,1.0);
+ \draw[white, fill=black] (1.8,4.1) rectangle (2.2,2.3);
+ \draw[white, fill=black] (3.8,4.1) rectangle (4.2,2.05);
+
+ \node at (-0.2,4.5) {2018.3};
+ \node at (2,4.5) {2019.1};
+ \node at (4,4.5) {2019.2};
+ %\node at (2,5) {Vivado HLS};
+
+ \node at (5.5,3.325) {31};
+ \node at (5.5,0.9) {26};
+ \node at (5.5,2.2) {6};
+
+ \node[white] at (-0.2,1.2) {62};
+ \node[white] at (2,2.5) {36};
+ \node[white] at (4,2.25) {41};
+ \end{tikzpicture}
+ \caption{A Sankey diagram that tracks 3645 test cases through three different versions of Vivado HLS. The ribbons collect the test cases that pass and fail together. The 3573 test cases that pass in all three versions are not depicted.
+ }\label{fig:sankey_diagram}
+\end{figure}
+% \NR{Why are there missing numbers in the ribbons?}
-Three versions of Vivado HLS, including version 2019.2, 2019.1, and 2018.3, were tested. As being mentioned before, Vivado HLS version 2019.2 and 2019.1 won’t process logical AND operator with constant, which will warn about changing to bitwise AND operator (\verb|&|) and then error out. Thus, as shown in \ref{result table}, the column named as “Invalid test cases” indicates test cases that have logical AND operator (\verb|&&|) with constants. It is also being said that version 2018.3 managed to cope with the logical AND operator, so the “invalid test cases” section for version 2018.3 only count the test cases that has GCC and Vivado HLS’s C simulation result unmatched.
+As in our Venn diagram, the absolute numbers in Figure~\ref{fig:sankey_diagram} do not necessarily correspond to the number of bugs. However, we can deduce from this diagram that there must be at least six unique bugs in Vivado HLS, given that each ribbon must contain at least one unique bug. \YH{Contradicts value of 3 in Table~\ref{tab:unique_bugs}, maybe I can change that to 6?} \JW{I'd leave it as-is personally; we have already put a `$\ge$' symbol in the table, so I think it's fine.}
+Furthermore, it can be seen that Vivado HLS v2018.3 must have at least four unique bugs, of which two were fixed and two remained in Vivado HLS v2019.1. However, the release of v2019.1 also introduced new bugs. % Finally, for version 2019.2 of Vivado HLS, there seems to be a bug that was reintroduced which was also present in Vivado 2018.3, in addition to a new bug. In general it seems like each release of Vivado HLS will have new bugs present, however, will also contain many previous bug fixes. However, it cannot be guaranteed that a bug that was previously fixed will remain fixed in future versions as well.
-Moving on to the LegUp HLS version 4.0, the C/RTL unmatched result section is composed of assertion errors and C and RTL results unmatched. The assertion error happens when translating from C to RTL, which results in no Verilog file being produced. Therefore, this condition is also considered as inconsistent results since the LegUp HLS failed to translate an accurate RTL description for the inputted valid C code. Although, proportionally, the total number of C and RTL mismatch detected in LegUp HLS is much higher compared to which of Vivado HLS, two points must be emphasized. Firstly, LegUp HLS version 4.0 was implemented long time ago in 2015 and published as open sources for academic and research uses. By the time of writing this thesis, version 8.0 was just released for commercial uses. As mentioned before, version 7.5 was installed and run as GUI. Although it was not being fuzzing-tested under 10,000 test cases due to difficulties encountered when running it through the command line, it was used as a reference for comparing differences between versions released. By trying out several failing tests on version 7.5, some of the failing tests passed successfully without causing the same problem. Then we can confirm that some of the embedded issues have been solved already. Thus, lesser discrepancy in results when running through the newer versions should be expected. Secondly, Vivado HLS version 2019.2 errors out plenty of test cases due to the pragma error. Only a subset of test cases was synthesized to RTL and being simulated. Reducing in overall valid test cases can result in a lower amount of unmatched test cases. Thus, the results for LegUp HLS should not be treated “equally” with which for Vivado HLS. Those two points needed to be taken into consideration when analyzing the results obtained from LegUp HLS.
+\subsection{Some specific bugs found}
-Finally, as for Intel HLS, three points need to be taken into account when analyzing the result for Intel HLS. Firstly, the hashing function used for Intel HLS is much simpler comparing to which is used for both Vivado HLS and LegUp HLS. One possible result of using simple hashing function is that a considerable amount of bug can go undetected, which can lower the total number of cases that have C and RTL results unmatched. Secondly, as Intel HLS is running under the Windows environment, it runs relatively slower comparing to other tools that run under the Linux system. Based on the result, a considerable amount of test runs was timed out, which ultimately decreases the total number of valid test runs. Therefore, similar to what has been mentioned about the reduced overall valid test cases for Vivado HLS, when analyzing the Intel HLS, this should also be taken into consideration. Lastly, as Intel HLS uses the I++ compiler instead of GCC, differences can exist. Although theoretically, the C++ compiler does have stricter requirements, it should be compatible with compiling C programs. And as the test programs were generated through Csmith in C language, words like “bool” or “class”, which are only supported in C++ but not in C, do not exist. Also, Csimth was forbidden to generate operations like malloc, which might trigger incompatibility between C and C++. Thus, although compatibility can pose a potential problem, it should not have a huge impact. Besides those three points, similar to Vivado HLS, Intel HLS will alert about using logical AND operation with constants, but it does not error out immediately. So the “invalid test cases” section is set to not applicable.
+This section describes some of the bugs that were found in the various tools that were tested. We describe two bugs in LegUp and one in Vivado HLS; in each case, the bug was first reduced automatically using \creduce{}, and then reduced further manually to achieve the minimal test case. Although we did find test-case failures in the Intel HLS Compiler, the very long compilation times for that tool meant that we did not have time to reduce any of the failures down to an example that is minimal enough to present here.
-We generate $10$ thousand test programs to test on three different HLS tools, including three versions of Vivado HLS, one version of LegUp HLS and one version of Intel HLS.
-We provide the same set of programs to all HLS tools and show that different tools give rise to unique bugs.
+\subsubsection{LegUp assertion error}
+The code shown in Figure~\ref{fig:eval:legup:assert} leads to an assertion error in LegUp 4.0 and 7.5 even though it should compile without any errors.
+An assertion error counts as a crash of the tool, as it means that an unexpected state was reached by this input.
+This shows that there is a bug in one of LegUp's compilation passes; however, thanks to the assertion, the bug is caught before the tool can produce an incorrect design.
-\begin{figure}\centering
+\begin{figure}
\begin{minted}{c}
int a[2][2][1] = {{{0},{1}},{{0},{0}}};
int main() { a[0][1][0] = 1; }
\end{minted}
-\caption{This test cases crashes LegUp 4.0 with an assertion error.}\label{fig:legup_crash1}
+\caption{This program causes an assertion failure in LegUp HLS when \texttt{NO\_INLINE} is set.}\label{fig:eval:legup:assert}
\end{figure}
-\begin{figure}\centering
-\begin{minted}{c}
-union U1 { int a; };
+The bug relates to the initialisation of, and assignment to, a three-dimensional array, for which the code above is the minimal example. In addition, it requires the \texttt{NO\_INLINE} flag to be set, which disables function inlining. The code initialises the array with zeroes except for \texttt{a[0][1][0]}, which is set to one. The main function then assigns one to that same location. On its own, this code should not produce a result and should simply terminate by returning 0, which is indeed what the design that LegUp generates does when the \texttt{NO\_INLINE} flag is turned off.
-volatile union U1 un = {0};
+%The following code also produces an assertion error in LegUp, which is a different one this time. This bug was not discovered during the main test runs of 10 thousand test cases, but beforehand, which meant that we disabled unions from being generated. However, this bug also requires the \texttt{volatile} keyword which seems to be the reason for quite a few mismatches in LegUp and Vivado.
+%
+%\begin{minted}{c}
+%union U1 { int a; };
+%
+%volatile union U1 un = {0};
+%
+%int main() { return un.a; }
+%\end{minted}
-int main() { return un.a; }
-\end{minted}
-\caption{This crashes in Legup 4.0.}\label{fig:legup_crash2}
-\end{figure}
+\subsubsection{LegUp miscompilation}
-\begin{figure}\centering
+The test case in Figure~\ref{fig:eval:legup:wrong} produces incorrect Verilog in LegUp 4.0, meaning that the result of RTL simulation differs from that of the C execution.
+
+\begin{figure}
\begin{minted}{c}
volatile int a = 0;
int b = 1;
-int main() {
+int main() {
int d = 1;
if (d + a) b || 1;
else b = 0;
return b;
}
\end{minted}
-\caption{This crashes in Legup 4.0.}\label{fig:legup_crash2}
+\caption{An output mismatch: LegUp HLS returns 0 but the correct result is 1.}\label{fig:eval:legup:wrong}
\end{figure}
-\NR{
-What we need for this section:
-\begin{itemize}
- \item Venn diagrams to show the overall results. Onlyy missing information from Venn diagram is the unique bugs per tool, which we can provide with a smaller table. This table can actually lead nicely to the bug portfolio.
- \item Sankey diagram for different versions of Vivado HLS.
- \item A portfolio of bugs on various tools, together with intuitive explanation for why we see these bugs. We can also mention filing of bug reports, where relevant.
- \item Is the area and latency of buggy hardware of interest to us?
-\end{itemize}
-}
+In the code above, \texttt{b} has value 1 when run in GCC, but has value 0 when run with LegUp 4.0. If the \texttt{volatile} keyword is removed from \texttt{a}, then the Verilog produces the correct result. As \texttt{a} and \texttt{d} both hold constant values, the \code{if} statement should always go into the \texttt{true} branch, meaning \texttt{b} should never be set to 0. The \texttt{true} branch of the \code{if} statement only executes an expression whose result is not assigned to any variable, so the initial state of all variables should not change. However, LegUp HLS generates a design which enters the \texttt{else} branch instead and assigns 0 to \texttt{b}. The cause of this bug seems to be the \texttt{volatile} keyword and the analysis that is performed to simplify the \code{if} statement.
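+
+To make the expected behaviour concrete, the following hand-simplified version (our own sketch, not produced by either tool) shows what a correct simplification of the program must preserve:
+
+\begin{minted}{c}
+volatile int a = 0;  // never written, so every run-time read yields 0
+int b = 1;
+
+int main() {
+  // In the original program, d + a == 1 + 0 is nonzero, so the true
+  // branch runs; its expression b || 1 is evaluated and discarded.
+  // b is never assigned, so the program is equivalent to:
+  return b;  // correct result: 1
+}
+\end{minted}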
+
+\subsubsection{Vivado HLS miscompilation}
+
+Figure~\ref{fig:eval:vivado:mismatch} shows code that produces the wrong value in all Vivado HLS versions: it returns \texttt{0x0} whereas it should return \texttt{0xF}. This test case is much longer than the other reduced test cases and could not be made any smaller, as everything in the code seems to be necessary to trigger the bug.
-\subsection{Venn diagram of results}
+The array \texttt{a} is initialised to all zeroes, as are the other global variables \texttt{g} and \texttt{c}, so as not to introduce any undefined behaviour. However, \texttt{g} is also given the \texttt{volatile} keyword, which ensures that the variable is not optimised away. The function \texttt{d} accumulates each value \texttt{b} that it is passed into a hash stored in \texttt{c}. Each \texttt{b} is eight bits wide, so function \texttt{e} calls \texttt{d} seven times for some of the bits in the 64-bit value of \texttt{f} that it is passed. Finally, in the main function, the array is partially initialised with a \code{for} loop, after which \texttt{e} is called twice: once on the volatile variable \texttt{g} and once on a constant. Interestingly, the second call, with the constant, is also necessary to trigger the bug.
-\subsection{Bug portfolio}
-\NR{In fact, the structure of this section relates to the Venn diagram.}
+\begin{figure}
+\begin{minted}{c}
+volatile unsigned int g = 0;
+int a[256] = {0};
+int c = 0;
-\subsubsection{Vivado HLS}
+void d(char b) { c = (c & 4095) ^ a[(c ^ b) & 15]; }
-\subsubsection{LegUp HLS}
+void e(long f) {
+ d(f); d(f >> 8); d(f >> 16); d(f >> 24);
+ d(f >> 32); d(f >> 40); d(f >> 48);
+}
+
+int main() {
+ for (int i = 0; i < 56; i++) a[i] = i;
+ e(g); e(-2L);
+ return c;
+}
+\end{minted}
+\caption{An output mismatch: GCC returns \texttt{0xF}, whereas Vivado HLS returns \texttt{0x0}.}\label{fig:eval:vivado:mismatch}
+\end{figure}
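+
+To see why \texttt{0xF} is the expected result, the execution of Figure~\ref{fig:eval:vivado:mismatch} can be traced by hand (this trace is our own annotation, not part of the generated test case):
+
+\begin{minted}{c}
+// e(g): g reads as 0, so each of the seven calls d(0) computes
+//   c = (c & 4095) ^ a[(c ^ 0) & 15] = 0 ^ a[0] = 0, leaving c == 0.
+// e(-2L): the first call d(-2) computes
+//   c = (0 & 4095) ^ a[(0 ^ -2) & 15] = a[14] = 14;
+// the second call d(-1) computes
+//   c = (14 & 4095) ^ a[(14 ^ -1) & 15] = 14 ^ a[1] = 15;
+// each remaining call d(-1) computes
+//   c = (15 & 4095) ^ a[(15 ^ -1) & 15] = 15 ^ a[0] = 15.
+// Hence main should return 15, i.e. 0xF.
+\end{minted}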
-\subsubsection{Intel HLS}
%%% Local Variables:
diff --git a/eval_rewrite.tex b/eval_rewrite.tex
deleted file mode 100644
index 1cedfe2..0000000
--- a/eval_rewrite.tex
+++ /dev/null
@@ -1,181 +0,0 @@
-\section{Evaluation}\label{sec:evaluation}
-
-\definecolor{vivado}{HTML}{7fc97f}
-\definecolor{intel}{HTML}{beaed4}
-\definecolor{legup}{HTML}{fdc086}
-\begin{figure}
- \resizebox{0.47\textwidth}{!}{%
- \begin{tikzpicture}
- \draw (-14.5,7.65) rectangle (0,-1);
- \fill[vivado,fill opacity=0.5] (-4.4,4.4) ellipse (3.75 and 2.75);
- \fill[intel,fill opacity=0.5] (-10.2,4.4) ellipse (3.75 and 2.75);
- \fill[legup,fill opacity=0.5] (-7.3,2) ellipse (3.75 and 2.75);
- \draw[black] (-4.4,4.4) ellipse (3.75 and 2.75); % making the
- \draw[black] (-10.2,4.4) ellipse (3.75 and 2.75); % outlines
- \draw[black] (-7.3,2) ellipse (3.75 and 2.75); % fully opaque
- \node at (-10.2,6.3) {\Large\textsf{\textbf{Vivado HLS}}};
- \node at (-4.4,6.3) {\Large\textsf{\textbf{Intel HLS}}};
- \node at (-7.3,0) {\Large\textsf{\textbf{LegUp}}};
-
- \node at (-5.5,3) {\Huge 1 (\textcolor{red}{1})};
- \node at (-9.1,3) {\Huge 4 (\textcolor{red}{0})};
- \node at (-3,5) {\Huge 26 (\textcolor{red}{540})};
- \node at (-11.6,5) {\Huge 79 (\textcolor{red}{20})};
- \node at (-7.3,1) {\Huge 162 (\textcolor{red}{6})};
- \node at (-7.3,5.2) {\Huge 0 (\textcolor{red}{5})};
- \node at (-7.3,3.8) {\Huge 0 (\textcolor{red}{0})};
- \node at (-13.6,-0.5) {\Huge 5856};
- \end{tikzpicture}
- }
-\caption{A Venn diagram showing the number of failures in each tool out of 6700 test cases that were run. Overlapping regions mean that the test cases failed in all those tools. The numbers in parentheses represent the number of test cases that timed out.}\label{fig:existing_tools}
-\end{figure}
-
-\begin{table}
- \centering
- \begin{tabular}{lr}\toprule
- \textbf{Tool} & \textbf{Unique Bugs}\\
- \midrule
- Vivado HLS & $\ge 2$\\
- LegUp HLS & $\ge 3$\\
- Intel HLS & $\ge 1$\\
- \bottomrule
- \end{tabular}
- \caption{Unique bugs found in each tool.}
- \label{tab:unique_bugs}
-\end{table}
-
-To evaluate the different HLS tools 6,700 test cases were generated and fed into each tool, keeping the test cases constant so that the comparison between the tools was fair. Three different tools were tested, including three different versions of Vivado HLS, which are shown in Table~\ref{tab:unique_bugs}. Bugs were found in all tools that were tested, and in total, \ref{??} unique bugs were found and reported to the tool vendors.
-We were only able to test one version of LegUp HLS (version 4.0). LegUp 7.5 is GUI-based and not suitable for scripting; however, bugs we found in LegUp 4.0 were reproduced manually in LegUp 7.5.
-
-\subsection{Bugs found}
-
-This section describes some of the bugs that were found in the various tools that were tested, going over different types of bugs in the process. Each of the following bugs was first reduced automatically using C-Reduce, and then reduced further manually to achieve the minimal test case.
-
-\subsubsection{LegUp Assertion Error}
-
-The piece of code shown in Figure~\ref{fig:eval:legup:assert} produces an assertion error in LegUp 4.0 and 7.5 even though it should compile without any errors. The assertion error states that This assertion error is equivalent to an unexpected crash of the tool as it means that an unexpected state was reached by this input. This shows that there is a bug in one of the compilation passes in LegUp, however, due to the assertion the bug is caught in the tool before it produces an incorrect design.
-
-\begin{figure}
-\begin{minted}{c}
-int a[2][2][1] = {{{0},{1}},{{0},{0}}};
-
-int main() { a[0][1][0] = 1; }
-\end{minted}
-\caption{An assertion bug in LegUp HLS when setting \texttt{NO\_INLINE} to prevent inlining.}\label{fig:eval:legup:assert}
-\end{figure}
-
-The buggy test case has to do with initialisation and assignment to a three dimensional array, for which the above piece of code is the minimal example. However, in addition to that it requires the \texttt{NO\_INLINE} constant to be set, which disallows inlining of functions. The code initialises the array with zeros except for \texttt{a[0][1][0]}, which is set to one. Then the main function assigns one to that same location, which causes LegUp to crash with an assertion error. This code on its own should not actually produce a result and should just terminate by returning 0, which is also what the design does that LegUp generates when the \texttt{NO\_INLINE} flag is turned off.
-
-%The following code also produces an assertion error in LegUp, which is a different one this time. This bug was not discovered during the main test runs of 10 thousand test cases, but beforehand, which meant that we disabled unions from being generated. However, this bug also requires the \texttt{volatile} keyword which seems to be the reason for quite a few mismatches in LegUp and Vivado.
-%
-%\begin{minted}{c}
-%union U1 { int a; };
-%
-%volatile union U1 un = {0};
-%
-%int main() { return un.a; }
-%\end{minted}
-
-\subsubsection{LegUp Miscompilation}
-
-The test case in Figure~\ref{fig:eval:legup:wrong} produces an incorrect netlist in LegUp 4.0, meaning the result of simulating the design and running the C code directly is different.
-
-\begin{figure}
-\begin{minted}{c}
-volatile int a = 0;
-int b = 1;
-
-int main() {
- int d = 1;
- if (d + a) b || 1;
- else b = 0;
- return b;
-}
-\end{minted}
-\caption{An output mismatch where GCC returns 1 and LegUp HLS returns 0.}\label{fig:eval:legup:wrong}
-\end{figure}
-
-In the code above, \texttt{b} has value 1 when run in GCC, but has value 0 when run with LegUp 4.0. If the \texttt{volatile} keyword is removed from \texttt{a}, then the netlist contains the correct result. As \texttt{a} and \texttt{d} are constants, the if-statement should always produce go into the \texttt{true} branch, meaning \texttt{b} should never be set to 0. The \texttt{true} branch of the if-statement only executes an expression which is not assigned to any variable, meaning the initial state of all variables should not have changed. However, LegUp HLS generates a design which enters the \texttt{else} branch instead and assigns \texttt{b} to be 0. The cause of this bug seems to be the \texttt{volatile} keyword and the analysis that is performed to simplify the if-statement.
-
-\subsubsection{Vivado Miscompilation}
-
-Figure~\ref{fig:eval:vivado:mismatch} shows code that does not output the right value when compiled with all Vivado versions and GCC, as it returns \texttt{0x0} with Vivado whereas it should be returning \texttt{0xF}. This test case is much longer compared to the other test cases that were reduced and could not be made any smaller, as everything in the code is necessary to trigger the bug.
-
-The array \texttt{a} is initialised to all zeros, as well as the other global variables \texttt{g} and \texttt{c}, so as to not introduce any undefined behaviour. However, \texttt{g} is also assigned the \texttt{volatile} keyword, which ensures that the variable is not optimised away. The function \texttt{d} then accumulates the values \texttt{b} that it is passed into a hash that is stored in \texttt{c}. Each \texttt{b} is eight bits wide, so function \texttt{e} calls the function 7 times for some of the bits in the 64 bit value of \texttt{f} that it is passed. Finally, in the main function, the array is initialised partially with a for loop, after which the \texttt{e} function is called twice, once on the volatile function but also on a constant. Interestingly enough, the second function call with the constant is also necessary to trigger the bug.
-
-\begin{figure}
-\begin{minted}{c}
-volatile unsigned int g = 0;
-int a[256] = {0};
-int c = 0;
-
-void d(char b) { c = (c & 4095) ^ a[(c ^ b) & 15]; }
-
-void e(long f) {
- d(f); d(f >> 8); d(f >> 16); d(f >> 24);
- d(f >> 32); d(f >> 40); d(f >> 48);
-}
-
-int main() {
- for (int i = 0; i < 56; i++) a[i] = i;
- e(g); e(-2L);
- return c;
-}
-\end{minted}
-\caption{An output mismatch where GCC returns \texttt{0xF}, whereas Vivado HLS return \texttt{0x0}.}\label{fig:eval:vivado:mismatch}
-\end{figure}
-
-\subsection{Bugs in Vivado HLS versions}
-
-In addition to the explanation to bugs given in Section~\ref{sec:evaluation}, bugs found in various versions of Vivado are also analysed, which are shown in Figure~\ref{fig:sankey_diagram}. The figure depicts failing test cases in 3645 test cases that were passed to Vivado 2018.3, 2019.1 and 2019.2. All test cases that fail in the same tools are grouped together into a ribbon, showing when a bug is present in one of the tools.
-
-There is a group of failing test cases that is constant between all versions of Vivado HLS, meaning these are bugs that were not fixed between the versions. Other ribbons can be seen weaving in and out of failing for a version, meaning these bugs were fixed or reintroduced in those versions. The diagram demonstrates that Vivado HLS 2018.3 contains the most failing test cases compared to the other versions, having 62 test cases fail in total. Interestingly, Vivado HLS 2019.1 and 2019.2 have a different number of failing test cases, meaning feature improvements that introduced bugs as well as bug fixes between those minor versions.
-
-\definecolor{ribbon1}{HTML}{8dd3c7}
-\definecolor{ribbon2}{HTML}{b3de69}
-\definecolor{ribbon3}{HTML}{bebada}
-\definecolor{ribbon4}{HTML}{fb8072}
-\definecolor{ribbon5}{HTML}{80b1d3}
-\definecolor{ribbon6}{HTML}{fdb462}
-\begin{figure}
- \centering
- \begin{tikzpicture}
- \draw[white, fill=ribbon1] (-1.0,4.1) -- (0.0,4.1000000000000005) to [out=0,in=180] (2.0,4.1000000000000005) to [out=0,in=180] (4.0,4.1000000000000005) -- (6.0,4.1000000000000005) -- %(7.55,3.325) --
- (6.0,2.5500000000000003) -- (4.0,2.5500000000000003) to [out=180,in=0] (2.0,2.5500000000000003) to [out=180,in=0] (0.0,2.5500000000000003) -- (-1.0,2.55) -- cycle;
- \draw[white, fill=ribbon2] (-1.0,2.55) -- (0.0,2.5500000000000003) to [out=0,in=180] (2.0,1.8) to [out=0,in=180] (4.0,1.55) -- (6.0,1.55) -- %(7.3,0.9) --
- (6.0,0.25) -- (4.0,0.25) to [out=180,in=0] (2.0,0.5) to [out=180,in=0] (0.0,1.25) -- (-1.0,1.25) -- cycle;
- \draw[white, fill=ribbon3] (-1.0,1.25) -- (0.0,1.25) to [out=0,in=180] (2.0,2.5500000000000003) to [out=0,in=180] (4.0,0.25) -- (6.0,0.25) -- %(6.05,0.225) --
- (6.0,0.2) -- (4.0,0.2) to [out=180,in=0] (2.0,2.5) to [out=180,in=0] (0.0,1.2000000000000002) -- (-1.0,1.2) -- cycle;
- \draw[white, fill=ribbon4] (-1.0,0.5) -- (0.0,0.5) to [out=0,in=180] (2.0,2.5) to [out=0,in=180] (4.0,0.2) -- (6.0,0.2) -- %(6.2,0.1) --
- (6.0,0.0) -- (4.0,0.0) to [out=180,in=0] (2.0,2.3000000000000003) to [out=180,in=0] (0.0,0.30000000000000004) -- (-1.0,0.3) -- cycle;
- \draw[white, fill=ribbon5] (-1.0,1.2) -- (0.0,1.2000000000000002) to [out=0,in=180] (2.0,0.5) to [out=0,in=180] (4.0,2.5500000000000003) -- (6.0,2.5500000000000003) -- %(6.2,2.45) --
- (6.0,2.35) -- (4.0,2.35) to [out=180,in=0] (2.0,0.30000000000000004) to [out=180,in=0] (0.0,1.0) -- (-1.0,1.0) -- cycle;
- \draw[white, fill=ribbon6] (-1.0,0.3) -- (0.0,0.30000000000000004) to [out=0,in=180] (2.0,0.30000000000000004) to [out=0,in=180] (4.0,2.35) -- (6.0,2.35) -- %(6.3,2.2) --
- (6.0,2.0500000000000003) -- (4.0,2.0500000000000003) to [out=180,in=0] (2.0,0.0) to [out=180,in=0] (0.0,0.0) -- (-1.0,0.0) -- cycle;
-
- \draw[white, fill=black] (-0.4,4.1) rectangle (0.0,1.0);
- \draw[white, fill=black] (1.8,4.1) rectangle (2.2,2.3);
- \draw[white, fill=black] (3.8,4.1) rectangle (4.2,2.05);
-
- \node at (-0.2,4.5) {2018.3};
- \node at (2,4.5) {2019.1};
- \node at (4,4.5) {2019.2};
- %\node at (2,5) {Vivado HLS};
-
- \node at (5.5,3.325) {31};
- \node at (5.5,0.9) {26};
- \node at (5.5,2.2) {6};
-
- \node[white] at (-0.2,1.2) {62};
- \node[white] at (2,2.5) {36};
- \node[white] at (4,2.25) {41};
- \end{tikzpicture}
- \caption{A Sankey diagram that tracks 3645 test cases through three different versions of Vivado HLS. The ribbons collect the test cases that pass and fail together. The 3573 test cases that pass in all three versions are not depicted. }\label{fig:sankey_diagram}
-\end{figure}
-
-From this diagram it can also be observed that there must be at least six individual bugs that were found by the fuzzer in Vivado HLS, as a ribbon must contain at least one unique bug.\YH{Contradicts value of 3 in Table~\ref{tab:unique_bugs}, maybe I can change that to 6?} In addition to that, it can then be seen that Vivado HLS 2018.3 must have at least 4 individual bugs, of which two were fixed and two others stayed in Vivado HLS 2019.1. However, with the release of 2019.1, new bugs were introduced as well. Finally, for version 2019.2 of Vivado HLS, there seems to be a bug that was reintroduced which was also present in Vivado 2018.3, in addition to a new bug. In general it seems like each release of Vivado HLS will have new bugs present, however, will also contain many previous bug fixes. However, it cannot be guaranteed that a bug that was previously fixed will remain fixed in future versions as well.
-
-%%% Local Variables:
-%%% mode: latex
-%%% TeX-master: "main"
-%%% End:
diff --git a/intro.tex b/intro.tex
index ae8da19..44a79e6 100644
--- a/intro.tex
+++ b/intro.tex
@@ -10,7 +10,8 @@ unsigned int b = 0x1194D7FF;
int a[6] = {1, 1, 1, 1, 1, 1};
int main() {
- for (int c = 0; c < 2; c++) b = b >> a[c];
+ for (int c = 0; c < 2; c++)
+ b = b >> a[c];
return b;
}
\end{minted}
@@ -61,7 +62,7 @@ This paper reports on our campaign to test HLS tools by fuzzing.
\begin{itemize}
\item We use Csmith~\cite{yang11_findin_under_bugs_c_compil} to generate ten thousand valid C programs from within the subset of the C language that is supported by all the HLS tools we test. We augment each program with a random selection of HLS-specific directives.
- \item We give these programs to three widely used HLS tools: Xilinx Vivado HLS~\cite{xilinx20_vivad_high_synth}, LegUp HLS~\cite{canis13_legup} and Intel HLS~\cite{intel20_sdk_openc_applic}. When we find a program that causes an HLS tool to crash, or to generate hardware that produces a different result from GCC, we reduce it to a minimal example with the help of the C-reduce tool~\cite{creduce}.
+ \item We give these programs to three widely used HLS tools: Xilinx Vivado HLS~\cite{xilinx20_vivad_high_synth}, LegUp HLS~\cite{canis13_legup} and the Intel HLS Compiler~\cite{intel20_sdk_openc_applic}. When we find a program that causes an HLS tool to crash, or to generate hardware that produces a different result from GCC, we reduce it to a minimal example with the help of the \creduce{} tool~\cite{creduce}.
\item Our testing campaign revealed that all three tools could be made to crash while compiling or to generate wrong RTL. In total, we found \ref{XX} bugs across the three tools, all of which have been reported to the respective developers, and \ref{XX} of which have been confirmed at the time of writing.
diff --git a/main.tex b/main.tex
index 83e5166..4ec8a73 100644
--- a/main.tex
+++ b/main.tex
@@ -33,6 +33,7 @@
\newcommand\ZD[1]{\Comment{blue!50!black}{NR}{#1}}
\newcommand{\code}[1]{\texttt{#1}}
+\newcommand\creduce{C-Reduce}
%%
%% The majority of ACM publications use numbered citations and
@@ -107,7 +108,7 @@
%% other information printed in the page headers. This command allows
%% the author to define a more concise list
%% of authors' names for this purpose.
-\renewcommand{\shortauthors}{Du, Z. and Herklotz, Y. and Ramanathan, N. and Wickerson, J.}
+\renewcommand{\shortauthors}{Du, Herklotz, Ramanathan, and Wickerson}
%%
%% The abstract is a short summary of the work to be presented in the
@@ -162,11 +163,11 @@ We have subjected three widely used HLS tools -- LegUp, Xilinx Vivado HLS, and t
\input{related}
-\input{method-new}
+\input{method}
% \input{testing-system-new}
-\input{eval_rewrite}
+\input{eval}
\input{conclusion}
diff --git a/method-new.tex b/method.tex
index ef8c62c..d27f22b 100644
--- a/method-new.tex
+++ b/method.tex
@@ -5,7 +5,7 @@
This section describes how we conducted our testing campaign, the overall flow of which is shown in Figure~\ref{fig:method:toolflow}.
\input{tool-figure}
\begin{itemize}
-\item In~\S\ref{sec:method:csmith}, we describe how we configure CSmith to generate HLS-friendly random programs for our testing campaign.
+\item In~\S\ref{sec:method:csmith}, we describe how we configure Csmith to generate HLS-friendly random programs for our testing campaign.
\item In~\S\ref{sec:method:annotate}, we discuss how we augment those random programs with directives and the necessary configuration files for HLS compilation.
\item In~\S\ref{sec:method:testing}, we discuss how we set up compilation and co-simulation checking for the three HLS tools under test.
\item Finally, in~\S\ref{sec:method:reduce}, we discuss how we reduce problematic programs in order to obtain minimal examples of bugs.
@@ -90,10 +90,10 @@ First, we ensure that all mathematical expressions are safe and unsigned, to ens
We also disallow assignments being embedded within expressions, since HLS generally does not support them.
We eliminate any floating-point numbers since they typically involve external libraries or use of hard IPs on FPGAs, which in turn make it hard to reduce bugs to their minimal form.
We also disable the generation of pointers for HLS testing, since pointer support in HLS tools is either absent or immature~\cite{xilinx20_vivad_high_synth}.\YH{I've looked at the documentation and even pointer to pointer is supported, but maybe not pointer to pointer to pointer. I think there was some other pointer assignment that didn't quite work, but I don't remember now. Immature might be a good description though.}
-We also disable void functions, since we are not supporting pointers.\YH{Figure \ref{fig:eval:vivado:mismatch} actually has void functions...}
-We disable the generation of unions as these were not well supported by some of the tools such as LegUp 4.0.
+We also disable void functions, since we are not supporting pointers.\YH{Figure \ref{fig:eval:vivado:mismatch} actually has void functions...} \JW{Hm, true. Perhaps these were introduced during reduction.}
+We disable the generation of unions as these were not supported by some of the tools such as LegUp 4.0.
-To differentiate between if a feature should be disabled or reported as a bug, the tool in question is taken into account. Unfortunately there is no standard subset of C that should be supported by HLS tools, and every tool chooses a slightly different subset. It is therefore important to choose the right subset so that the most bugs are found in each tool, but we are not generating code that the tool does not support. Therefore, to decide if a feature should be disable it should fail gracefully in one of the tools, stating that the feature is not supported or explaining what the issue is. If the HLS tool fails in a different way though, such as generating a wrong design or crashing during synthesis, the feature is kept in the generated test cases.
+To decide whether a problematic feature should be disabled or reported as a bug, the tool in question is taken into account. Unfortunately, there is no standard subset of C that is supported by HLS tools; every tool chooses a slightly different subset. It is therefore important to choose the right subset in order to maximise the number of real bugs found in each tool, while avoiding generating code that the tools do not support. We therefore disable a feature if it fails gracefully (i.e.\ with an error message stating the issue) in one of the tools. If, however, the HLS tool fails in a different way, such as generating a wrong design or crashing during synthesis, the feature is kept in our test suite.
Use a volatile pointer for any pointer that is accessed multiple times within a single transaction (one execution of the C function). If you do not use a volatile pointer, everything except the first read and last write is optimized out to adhere to the C standard.
@@ -188,35 +188,27 @@ We do not count time-outs as bugs, but we record them.
Once we discover a program that crashes the HLS tool or whose C/RTL simulations do not match, we further scrutinise the program to identify the root cause(s) of the undesirable behaviour.
As the programs generated by Csmith can be fairly large, we must systematically reduce these programs to identify the source of a bug.
-\YH{Edit this section some more}
+%\YH{Edit this section some more}
Reduction is performed by iteratively removing some part of the original program and then providing the reduced program to the HLS tool for re-synthesis and co-simulation.
The goal is to find the smallest program that still triggers the bug.
We apply two consecutive methods of reduction in this work.
-We first perform a custom reduction in which we iteratively remove lines in strict order from the top-level function.
-This method of reduction first reduces the pragmas and labels that were added before synthesis of the C program, after which is proceeds by iteratively commenting out lines in the C program one after another until any further reduction would eliminate the bug.
+We first perform a custom reduction in which we iteratively remove the HLS directives that we added before synthesis of the C program.
% \NR{We can add one or two more sentences summarising how we reduce the programs. Zewei is probably the best person to add these sentences.}\YH{Added some more lines, we can ask Zewei if she is OK with that.}
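+
+As a hypothetical illustration (the function below is our own, not one of our generated test cases), a single step of this custom reduction might strip an unrolling directive and re-run synthesis and co-simulation to check whether the failure persists:
+
+\begin{minted}{c}
+// Hypothetical reduction candidate: the loop carries an HLS directive.
+int sum_array(int arr[8]) {
+  int sum = 0;
+  for (int i = 0; i < 8; i++) {
+#pragma HLS unroll factor=2
+    sum += arr[i];
+  }
+  return sum;
+}
+// A reduction step deletes the pragma line and re-runs synthesis and
+// co-simulation; if the C/RTL mismatch persists, the directive was not
+// needed to trigger the bug, and the smaller program is kept.
+\end{minted}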
-
-Although, our custom reduction gives us the freedom and control of how to reduce buggy programs, it is arduous and requires a lot of manual effort.
-We therefore also integrated Creduce~\cite{creduce} into the workflow to automatically reduce the C programs that fail synthesis or where the outputs mismatch.
-This greatly speeds up the reduction of failing test cases as Creduce can reduce the input in a semantically correct why while avoiding undefined behaviour.
-In addition to that, it can run in parallel and has various different reduction strategies that help it converge to a smaller test case faster, such as delta debugging passes or even inlining of function bodies, which our custom reduction pass does not do.
-This means that the final product that Creduce obtains is often small enough to understand and step through.
-
-However, the downside of using Creduce with HLS tools is that we are not in control of which lines and features are prioritised for removal.
-As a consequence, we can easily end up with Creduce producing smaller programs that are not HLS-friendly and are unsynthesisable, despite being valid C.
-This is because the semantics of C for HLS tools might be a bit different to the semantics for C of a standard C compiler that Creduce was built for.
-Even though it does not normally introduce undefined behaviour, it can introduce undefined behaviour for C programs designed for HLS tools, because there are more restrictions for that C language than there normally is.
-An example of this could be the reduction of a function call, where the reducer realises that a mismatch is still observed when the arguments to the function call are removed, and that the function pointer is assigned to the integer instead.
-This is however often undefined behaviour in HLS tools, as a function pointer does not have a concrete meaning in hardware, were functions are not associated with a memory location as there are no instructions.
-Once undefined behaviour is introduced at any point in the reduction, the test cases will often reduce to that undefined behaviour as it does create a mismatch between the HLS tool and the C compiler, but does not actually represent a bug.
-To prevent this, the script which guides Creduce towards reducing it to the correct bug must try and avoid the introduction of these undefined behaviours as much as possible.
-The following measures were taken to prevent the insertion of undefined behaviour:
-
-\begin{itemize}
- \item Add \texttt{-fsanitize=undefined} to GCC options to discover and error out on undefined behaviour at runtime of the executable.
- \item Turn on all warnings in GCC and error out on any warning, while ignoring common warnings that are known to be harmless.
-\end{itemize}
+%Although, our custom reduction gives us the freedom and control of how to reduce buggy programs, it is arduous and requires a lot of manual effort.
+We then use the \creduce{} tool~\cite{creduce} to automatically reduce the remaining C program.
+\creduce{} is effective because it reduces the input while preserving semantic validity and avoiding undefined behaviour.
+It has various reduction strategies, such as delta debugging passes and function inlining, that help it converge rapidly to a test case that is small enough to understand and step through.
+
+However, the downside of using \creduce{} with HLS tools is that we are not in control of which lines and features are prioritised for removal.
+As a consequence, we can easily end up with \creduce{} producing programs that are not synthesisable, despite being valid C.
+%This is because the semantics of C for HLS tools might be a bit different to the semantics for C of a standard C compiler that \creduce{} was built for.
+Even though \creduce{} does not normally introduce undefined behaviour, it can introduce behaviour that is unsupported in the HLS tools.
+An example is the reduction of a function call, where the reducer realises that a mismatch is still observed when the function call's arguments are removed and a constant is assigned to the function pointer instead.
+This is, however, often unsupported in HLS tools, since a function pointer has no concrete interpretation in hardware: in the absence of instructions, functions are not associated with a particular memory location.
+Once unhandled behaviour is introduced at any point in the reduction, the test cases will often zero in on that unhandled behaviour, even though it does not actually represent an interesting bug.
+To prevent this, we use a script to guide \creduce{} away from introducing these unhandled behaviours as much as possible.
+This script involves adding \texttt{-fsanitize=undefined} to the GCC options in order to abort when undefined behaviour is detected at runtime, and erroring out whenever any warning is encountered while running the HLS tool (except common warnings that are known to be harmless).
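+
+For instance (our own illustrative example, not a generated test case), a reduction step could turn a harmless computation into a signed overflow, which is undefined behaviour in C; compiling with \texttt{-fsanitize=undefined} makes the sanitiser report the violation at run time, so the script rejects such a variant rather than mistaking its C/RTL mismatch for an HLS bug:
+
+\begin{minted}{c}
+int main() {
+  int x = 2147483647;  // INT_MAX
+  // Signed integer overflow is undefined behaviour; when compiled
+  // with gcc -fsanitize=undefined, this addition is instrumented and
+  // reported at run time, so the reduction script discards the
+  // program instead of keeping it as a "failing" test case.
+  return x + 1;
+}
+\end{minted}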
%%% Local Variables:
%%% mode: latex
diff --git a/motiv-ex.tex b/motiv-ex.tex
deleted file mode 100644
index e69de29..0000000
--- a/motiv-ex.tex
+++ /dev/null
diff --git a/old/old-eval.tex b/old/old-eval.tex
new file mode 100644
index 0000000..6330647
--- /dev/null
+++ b/old/old-eval.tex
@@ -0,0 +1,113 @@
+\section{Evaluation}
+
+\definecolor{vivado}{HTML}{7fc97f}
+\definecolor{intel}{HTML}{beaed4}
+\definecolor{legup}{HTML}{fdc086}
+\begin{figure}
+ \resizebox{0.5\textwidth}{!}{%
+ \begin{tikzpicture}
+ \begin{scope}
+ \draw[fill=vivado,fill opacity=0.5] (-4.4,4.4) ellipse (3.75 and 2.75);
+ \draw[fill=intel,fill opacity=0.5] (-10.2,4.4) ellipse (3.75 and 2.75);
+ \draw[fill=legup,fill opacity=0.5] (-7.3,2) ellipse (3.75 and 2.75);
+ \node at (-10.2,6.3) {\Large\textsf{\textbf{Vivado}}};
+ \node at (-4.4,6.3) {\Large\textsf{\textbf{Intel HLS}}};
+ \node at (-7.3,0) {\Large\textsf{\textbf{Legup}}};
+ \end{scope}
+
+ \end{tikzpicture}
+ }
+\caption{Venn diagram showing the existing tools and their current features. \textbf{Implementation} refers to usable HLS tools, whereas \textbf{Proof} refers to papers that demonstrate proofs of an algorithm without necessarily linking that proof with the algorithm that implements it.}\label{fig:existing_tools}
+\end{figure}
+
+\begin{table}
+ \centering
+ \begin{tabular}{lr}\toprule
+ \textbf{Tool} & \textbf{Bugs found}\\
+ \midrule
+ Vivado 2018.3 & 2\\
+ Vivado 2019.1 & 2\\
+ Vivado 2019.2 & 2\\
+ Legup 4.0 & 4\\
+ Intel HLS & 1\\
+ \bottomrule
+ \end{tabular}
+ \caption{Unique bugs found in each tool.}
+ \label{tab:unique_bugs}
+\end{table}
+
+During the implementation stage, the testing system successfully detected bugs among all three tools. The test cases used were not the same for each tool, because, as being mentioned, each tool has its own supported syntax. Thus, to ensure the validness of test cases during the implementation stage, test cases were all generated based on the supported syntax of the targeted tools.
+
+After the implementation of testing system is proven to be working, a set of 10,000 test cases are generated and fed into HLS tools for constructing the bigger picture regarding the quality of HLS tools. The 10,000 test cases were kept constant to ensure the comparison is fair between tools. Unfortunately, 10,000 test runs were still in progress by the time of writing this thesis, as runtime for each test case can range from 5 minutes up to several hours. Data showing in \ref{result table for 10k test cases} is collected when writing this report.
+
+Three versions of Vivado HLS, including version 2019.2, 2019.1, and 2018.3, were tested. As being mentioned before, Vivado HLS version 2019.2 and 2019.1 won’t process logical AND operator with constant, which will warn about changing to bitwise AND operator (\verb|&|) and then error out. Thus, as shown in \ref{result table}, the column named as “Invalid test cases” indicates test cases that have logical AND operator (\verb|&&|) with constants. It is also being said that version 2018.3 managed to cope with the logical AND operator, so the “invalid test cases” section for version 2018.3 only count the test cases that has GCC and Vivado HLS’s C simulation result unmatched.
+
+Moving on to the LegUp HLS version 4.0, the C/RTL unmatched result section is composed of assertion errors and C and RTL results unmatched. The assertion error happens when translating from C to RTL, which results in no Verilog file being produced. Therefore, this condition is also considered as inconsistent results since the LegUp HLS failed to translate an accurate RTL description for the inputted valid C code. Although, proportionally, the total number of C and RTL mismatch detected in LegUp HLS is much higher compared to which of Vivado HLS, two points must be emphasized. Firstly, LegUp HLS version 4.0 was implemented long time ago in 2015 and published as open sources for academic and research uses. By the time of writing this thesis, version 8.0 was just released for commercial uses. As mentioned before, version 7.5 was installed and run as GUI. Although it was not being fuzzing-tested under 10,000 test cases due to difficulties encountered when running it through the command line, it was used as a reference for comparing differences between versions released. By trying out several failing tests on version 7.5, some of the failing tests passed successfully without causing the same problem. Then we can confirm that some of the embedded issues have been solved already. Thus, lesser discrepancy in results when running through the newer versions should be expected. Secondly, Vivado HLS version 2019.2 errors out plenty of test cases due to the pragma error. Only a subset of test cases was synthesized to RTL and being simulated. Reducing in overall valid test cases can result in a lower amount of unmatched test cases. Thus, the results for LegUp HLS should not be treated “equally” with which for Vivado HLS. Those two points needed to be taken into consideration when analyzing the results obtained from LegUp HLS.
+
+Finally, as for Intel HLS, three points need to be taken into account when analyzing the result for Intel HLS. Firstly, the hashing function used for Intel HLS is much simpler comparing to which is used for both Vivado HLS and LegUp HLS. One possible result of using simple hashing function is that a considerable amount of bug can go undetected, which can lower the total number of cases that have C and RTL results unmatched. Secondly, as Intel HLS is running under the Windows environment, it runs relatively slower comparing to other tools that run under the Linux system. Based on the result, a considerable amount of test runs was timed out, which ultimately decreases the total number of valid test runs. Therefore, similar to what has been mentioned about the reduced overall valid test cases for Vivado HLS, when analyzing the Intel HLS, this should also be taken into consideration. Lastly, as Intel HLS uses the I++ compiler instead of GCC, differences can exist. Although theoretically, the C++ compiler does have stricter requirements, it should be compatible with compiling C programs. And as the test programs were generated through Csmith in C language, words like “bool” or “class”, which are only supported in C++ but not in C, do not exist. Also, Csimth was forbidden to generate operations like malloc, which might trigger incompatibility between C and C++. Thus, although compatibility can pose a potential problem, it should not have a huge impact. Besides those three points, similar to Vivado HLS, Intel HLS will alert about using logical AND operation with constants, but it does not error out immediately. So the “invalid test cases” section is set to not applicable.
+
+We generate $10$ thousand test programs to test on three different HLS tools, including three versions of Vivado HLS, one version of LegUp HLS and one version of Intel HLS.
+We provide the same set of programs to all HLS tools and show that different tools give rise to unique bugs.
+
+
+\begin{figure}\centering
+\begin{minted}{c}
+int a[2][2][1] = {{{0},{1}},{{0},{0}}};
+
+int main() { a[0][1][0] = 1; }
+\end{minted}
+\caption{This test cases crashes LegUp 4.0 with an assertion error.}\label{fig:legup_crash1}
+\end{figure}
+
+\begin{figure}\centering
+\begin{minted}{c}
+union U1 { int a; };
+
+volatile union U1 un = {0};
+
+int main() { return un.a; }
+\end{minted}
+\caption{This crashes in Legup 4.0.}\label{fig:legup_crash2}
+\end{figure}
+
+\begin{figure}\centering
+\begin{minted}{c}
+volatile int a = 0;
+int b = 1;
+
+int main() {
+ int d = 1;
+ if (d + a) b || 1;
+ else b = 0;
+ return b;
+}
+\end{minted}
+\caption{This crashes in Legup 4.0.}\label{fig:legup_crash2}
+\end{figure}
+
+\NR{
+What we need for this section:
+\begin{itemize}
+ \item Venn diagrams to show the overall results. Onlyy missing information from Venn diagram is the unique bugs per tool, which we can provide with a smaller table. This table can actually lead nicely to the bug portfolio.
+ \item Sankey diagram for different versions of Vivado HLS.
+ \item A portfolio of bugs on various tools, together with intuitive explanation for why we see these bugs. We can also mention filing of bug reports, where relevant.
+ \item Is the area and latency of buggy hardware of interest to us?
+\end{itemize}
+}
+
+\subsection{Venn diagram of results}
+
+\subsection{Bug portfolio}
+\NR{In fact, the structure of this section relates to the Venn diagram.}
+
+\subsubsection{Vivado HLS}
+
+\subsubsection{LegUp HLS}
+
+\subsubsection{Intel HLS}
+
+
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: "main"
+%%% End:
diff --git a/method-orig.tex b/old/old-method.tex
index e9d5643..e9d5643 100644
--- a/method-orig.tex
+++ b/old/old-method.tex
diff --git a/testing-system-new.tex b/old/old-testing-system-new.tex
index a44a942..a44a942 100644
--- a/testing-system-new.tex
+++ b/old/old-testing-system-new.tex
diff --git a/testing-system.tex b/old/old-testing-system.tex
index 1627bdb..1627bdb 100644
--- a/testing-system.tex
+++ b/old/old-testing-system.tex
diff --git a/related.tex b/related.tex
index cbff444..a9d2f58 100644
--- a/related.tex
+++ b/related.tex
@@ -4,10 +4,20 @@ The only other work of which we are aware on fuzzing HLS tools is that by Lidbur
Other stages of the FPGA toolchain have been subjected to fuzzing. Herklotz et al.~\cite{verismith} tested several FPGA synthesis tools using randomly generated Verilog programs. Where they concentrated on the RTL-to-netlist stage of hardware design, we focus our attention on the earlier C-to-RTL stage.
-Several authors have taken steps toward more rigorously engineered HLS tools that may be more resilient to testing campaigns such as ours. One such example might be the Handel-C compiler by Perna and Woodcock~\cite{perna12_mechan_wire_wise_verif_handel_c_synth} which has been mechanically proven correct, at least in part, using the HOL theorem prover.
+Several authors have taken steps toward more rigorously engineered HLS tools that may be more resilient to testing campaigns such as ours.
+\begin{itemize}
+
+\item The Handel-C compiler by Perna and Woodcock~\cite{perna12_mechan_wire_wise_verif_handel_c_synth} has been mechanically proven correct, at least in part, using the HOL theorem prover.
%However, the Handel-C language relies upon the user to perform the parallelisation that is necessary for good hardware performance, and would require many more checks to avoid data races.
However, the tool does not support C as input directly, so is not amenable to fuzzing.
-Another example is the SPARK HLS tool~\cite{gupta03_spark}, in which some compiler passes, such as scheduling, are mechanically validated during compilation~\cite{chouksey20_verif_sched_condit_behav_high_level_synth}. Unfortunately, this tool is not yet readily available to test properly. Finally, the Catapult C HLS tool~\cite{mentor20_catap_high_level_synth} is designed only to produce an output netlist if it can mechanically prove it equivalent to the input program. It should therefore never produce wrong RTL. In future work, we intend to test Catapult C alongside Vivado HLS, LegUp, and Intel HLS
+
+\item Ramanathan et al.~\cite{ramanathan+17} proved their implementation of C atomic operations in LegUp correct up to a bound using model checking. However, our testing campaign is not applicable to their implementation because we do not generate concurrent C programs.
+
+\item In the SPARK HLS tool~\cite{gupta03_spark}, some compiler passes, such as scheduling, are mechanically validated during compilation~\cite{chouksey20_verif_sched_condit_behav_high_level_synth}. Unfortunately, this tool is not yet readily available to test properly.
+
+\item Finally, the Catapult C HLS tool~\cite{mentor20_catap_high_level_synth} is designed only to produce an output netlist if it can mechanically prove it equivalent to the input program. It should therefore never produce wrong RTL. In future work, we intend to test Catapult C alongside Vivado HLS, LegUp, and the Intel HLS Compiler.
+
+\end{itemize}
%more prevalent these were prioritised.
% JW: We're not really sure that LegUp is more prevalent than Catapult HLS. Indeed, it probably isn't!
%\JW{Obvious reader question at this point: why not test that claim by giving our Csmith test cases to Catapult C too? Can we address that here? No worries if not; but shall we try and do that after the deadline anyway?}\YH{Yes, definitely, it would be great to get an idea of how Catapult C performs, and I currently have it installed already. I have added a small sentence for that now, but let me know if I should mention this in the conclusion instead though. }