author    Yann Herklotz <git@yannherklotz.com>  2021-03-30 19:39:57 +0100
committer Yann Herklotz <git@yannherklotz.com>  2021-03-30 19:39:57 +0100
commit    adc0afcec6fe025f85fbfdfdfc5ef522fa760d98 (patch)
tree      22243be49716dc6c98b5e6b38a67c6c862ef9f1c /eval.tex
parent    0f40e13fab830957ac055e076055280cdb82efff (diff)
Update text
Diffstat (limited to 'eval.tex')
-rw-r--r--  eval.tex  |  24 +++++++++---------------
1 file changed, 9 insertions(+), 15 deletions(-)
diff --git a/eval.tex b/eval.tex
index f1543e4..312ed25 100644
--- a/eval.tex
+++ b/eval.tex
@@ -3,8 +3,7 @@
We generate \totaltestcases{} test-cases and provide them to four HLS tools: Vivado HLS, LegUp HLS, Intel i++, and Bambu.
We use the same test-cases across all tools for fair comparison (except the HLS directives, which have tool-specific syntax).
We were able to test three different versions of Vivado HLS (v2018.3, v2019.1 and v2019.2).
-We tested one version of Intel i++ (version 18.1), LegUp (4.0) and Bambu (v0.9.7).
-LegUp 7.5 is GUI-based so we could not script our tests; however, we were able to manually reproduce all the bugs found in LegUp 4.0 in LegUp 7.5.
+We tested one version each of Intel i++ (included in Quartus Lite 18.1), LegUp (4.0) and Bambu (v0.9.7). Any reduced LegUp test-cases were also checked against LegUp 9.2 before being reported.
% Three different tools were tested, including three different versions of Vivado HLS. We were only able to test one version of LegUp HLS (version 4.0), because although LegUp 7.5 is available, it is GUI-based and not amenable to scripting. However, bugs we found in LegUp 4.0 were reproduced manually in LegUp 7.5.
% LegUp and Vivado HLS were run under Linux, while the Intel HLS Compiler was run under Windows.
@@ -52,22 +51,20 @@ LegUp 7.5 is GUI-based so we could not script our tests; however, we were able t
\end{figure}
Figure~\ref{fig:existing_tools} shows an Euler diagram of our results.
-We see that 918 (13.7\%), 167 (2.5\%), 83 (1.2\%) and 26 (0.4\%) test-cases fail in Bambu, LegUp, Vivado HLS and Intel i++ respectively.
-\JW{Somewhere around here mention that Bambu originally had M failures, but after a single bugfix, it went down to N failures. Maybe mention that we would have extended the same courtesy to the other tools had they released fixed versions of their tools promptly?}
-Despite i++ having the lowest failure rate, it has the highest time-out rate (540 test-cases), because of its remarkably long compilation time.
-% We remark that although the Intel HLS Compiler had the smallest number of confirmed test-case failures, it had the most time-outs (which could be masking additional failures)
+We see that 918 (13.7\%), 167 (2.5\%), 83 (1.2\%) and 26 (0.4\%) test-cases fail in Bambu, LegUp, Vivado HLS and Intel i++ respectively. However, one of the bugs in Bambu was fixed while we were testing the tool, so we also tested the development branch of Bambu (0.9.7-dev) with that fix and found only 17 (0.25\%) remaining failing test-cases.
+Despite i++ having the lowest failure rate, it has the highest time-out rate (540 test-cases) because of its remarkably long compilation time, whereas each of the other tools has fewer than 20 test-cases that time out.
Note that the absolute numbers here do not necessarily correspond to the number of bugs in the tools, because a single bug in a language feature that appears frequently in our test suite could cause many programs to crash or fail.
Moreover, we are reluctant to draw conclusions about the relative reliability of each tool by comparing the number of test-case failures, because these numbers are so sensitive to the parameters of the randomly generated test suite we used. In other words, we can confirm the \emph{presence} of bugs, but cannot deduce the \emph{number} of them (nor their importance).
We have reduced several of the failing test-cases in an effort to identify particular bugs, and our findings are summarised in Table~\ref{tab:bugsummary}. We emphasise that the bug counts here are lower bounds -- we did not have time to go through the arduous test-case reduction process for every failure.
Figures~\ref{fig:eval:legup:crash}, \ref{fig:eval:intel:mismatch}, and~\ref{fig:eval:bambu:mismatch} present three of the bugs we found. As in Example~\ref{ex:vivado_miscomp}, each bug was first reduced automatically using \creduce{}, and then further reduced manually to achieve the minimal test-case.
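To give a flavour of this process, \creduce{} repeatedly shrinks a candidate file and keeps a change only if a user-supplied `interestingness' oracle still observes the bug. The sketch below shows such an oracle written in C (any executable is accepted, though a shell script is more usual); the \texttt{hls\_compile} command and the matched error string are hypothetical placeholders rather than the exact setup we used.
\begin{minted}{c}
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* C-Reduce oracle: exit 0 iff the shrunken candidate still triggers
   the bug. "hls_compile" and the matched string are placeholders. */
int main(void) {
    if (system("hls_compile candidate.c > log.txt 2>&1") == -1)
        return 1;
    FILE *f = fopen("log.txt", "r");
    if (!f)
        return 1;
    char line[4096];
    int interesting = 0;
    while (fgets(line, sizeof line, f)) {
        if (strstr(line, "Assertion")) { // the original failure's signature
            interesting = 1;
            break;
        }
    }
    fclose(f);
    return interesting ? 0 : 1; // 0 tells C-Reduce to keep this reduction
}
\end{minted}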
-\begin{figure}[t]
+\begin{figure}
\begin{minted}{c}
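// All elements are 0 except a[0][1][0], which is initialised to 1; main
// then stores 1 to that same element, and LegUp 4.0 hits the assertion.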
int a[2][2][1] = {{{0},{1}},{{0},{0}}};
int main() { a[0][1][0] = 1; }
\end{minted}
-\caption{This program leads to an internal compiler error (an unhandled assertion in this case) in LegUp 4.0 and 7.5. It initialises a 3D array with zeroes and then assigns to one element. The bug only appears when function inlining is disabled (\texttt{NO\_INLINE}).}
+\caption{This program leads to an internal compiler error (an unhandled assertion in this case) in LegUp 4.0. It initialises a 3D array (all but one element zero) and then assigns to one element. The bug only appears when function inlining is disabled (\texttt{NO\_INLINE}).}
\label{fig:eval:legup:crash}
\end{figure}
%An assertion error counts as a crash of the tool, as it means that an unexpected state was reached by this input.
@@ -86,7 +83,7 @@ int main() { a[0][1][0] = 1; }
%\end{example}
-\begin{figure}[t]
+\begin{figure}
\begin{minted}{c}
static volatile int a[9][1][7];
int main() {
@@ -104,7 +101,7 @@ int main() {
\caption{This program miscompiles in Intel i++. It should return 2 because \code{3 \^{} 1 = 2}, but Intel i++ generates a design that returns 0 instead. Perhaps the assignment to 3 in the first for-loop is being overlooked.}\label{fig:eval:intel:mismatch}
\end{figure}
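As a quick sanity check of the arithmetic in this caption, the expected return value can be reproduced with a stand-alone program (a hypothetical harness, not part of the reduced test-case):
\begin{minted}{c}
#include <stdio.h>

int main(void) {
    int x = 3;              // the value stored by the first for-loop
    printf("%d\n", x ^ 1);  // prints 2, since 0b11 ^ 0b01 == 0b10
    return 0;
}
\end{minted}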
-\begin{figure}[t]
+\begin{figure}
\begin{minted}{c}
static int b = 0x10000;
static volatile short a = 0;
@@ -181,7 +178,7 @@ int main() {
\definecolor{ribbon4}{HTML}{fb8072}
\definecolor{ribbon5}{HTML}{80b1d3}
\definecolor{ribbon6}{HTML}{fdb462}
-\begin{figure}[t]
+\begin{figure}
\centering
\begin{tikzpicture}[xscale=1.25]
\draw[white, fill=ribbon1] (-1.0,4.1) -- (0.0,4.1) to [out=0,in=180] (2.0,4.1) to [out=0,in=180] (4.0,4.1) -- (6.0,4.1) -- %(7.55,3.325) --
@@ -222,10 +219,7 @@ Besides studying the reliability of different HLS tools, we also studied the rel
Test-cases that pass and fail in the same tools are grouped together into a ribbon.
For instance, the topmost ribbon represents the 31 test-cases that fail in all three versions of Vivado HLS. Other ribbons can be seen weaving in and out; these indicate that bugs were fixed or reintroduced in the various versions. We see that Vivado HLS v2018.3 had the most test-case failures (62).
Interestingly, the blue ribbon shows that there are test-cases that fail in v2018.3, pass in v2019.1, and then fail again in v2019.2.
-As in our Euler diagram, the absolute numbers here do not necessary correspond to the number of actual bugs, though we can observe that there must be at least six unique bugs in Vivado HLS, given that each ribbon corresponds to at least one unique bug.
-
-
-
+As in our Euler diagram, the numbers do not necessarily correspond to the number of actual bugs, though we can observe that there must be at least six unique bugs in Vivado HLS, given that each ribbon corresponds to at least one unique bug.
%\YH{Contradicts value of 3 in Table~\ref{tab:unique_bugs}, maybe I can change that to 6?} \JW{I'd leave it as-is personally; we have already put a `$\ge$' symbol in the table, so I think it's fine.}
%In addition to that, it can then be seen that Vivado HLS v2018.3 must have at least 4 individual bugs, of which two were fixed and two others stayed in Vivado HLS v2019.1. However, with the release of v2019.1, new bugs were introduced as well. % Finally, for version 2019.2 of Vivado HLS, there seems to be a bug that was reintroduced which was also present in Vivado 2018.3, in addition to a new bug. In general it seems like each release of Vivado HLS will have new bugs present, however, will also contain many previous bug fixes. However, it cannot be guaranteed that a bug that was previously fixed will remain fixed in future versions as well.