author    Yann Herklotz <ymh15@ic.ac.uk>    2021-04-04 20:12:20 +0000
committer overleaf <overleaf@localhost>     2021-04-04 20:18:08 +0000
commit    62a127dfb009b8ffe94ac348ecafb7f596406cbd (patch)
tree      7dbee2f45b6baa1edc4054d32610ff2b1fad6b5b /eval.tex
parent    adc0afcec6fe025f85fbfdfdfc5ef522fa760d98 (diff)
Update on Overleaf.
Diffstat (limited to 'eval.tex')
-rw-r--r--  eval.tex  27
1 file changed, 16 insertions(+), 11 deletions(-)
diff --git a/eval.tex b/eval.tex
index 312ed25..88ab6a1 100644
--- a/eval.tex
+++ b/eval.tex
@@ -51,14 +51,16 @@ We tested one version of Intel i++ (included in Quartus Lite 18.1), LegUp (4.0)
\end{figure}
Figure~\ref{fig:existing_tools} shows an Euler diagram of our results.
-We see that 918 (13.7\%), 167 (2.5\%), 83 (1.2\%) and 26 (0.4\%) test-cases fail in Bambu, LegUp, Vivado HLS and Intel i++ respectively. However, one of the bugs in Bambu was fixed as we were testing the tool, so we therefore tested the development branch of Bambu (0.9.7-dev) with that bug fix, and only found 17 (0.25\%) remaining failing test-cases.
-Despite i++ having the lowest failure rate, it has the highest time-out rate (540 test-cases), because of its remarkably long compilation time, whereas the other tools each only had under 20 test-cases timeout.
-Note that the absolute numbers here do not necessarily correspond to the number of bugs in the tools, because a single bug in a language feature that appears frequently in our test suite could cause many programs to crash or fail.
-Moreover, we are reluctant to draw conclusions about the relative reliability of each tool by comparing the number of test-case failures, because these numbers are so sensitive to the parameters of the randomly generated test suite we used. In other words, we can confirm the \emph{presence} of bugs, but cannot deduce the \emph{number} of them (nor their importance).
+We see that 918 (13.7\%), 167 (2.5\%), 83 (1.2\%) and 26 (0.4\%) test-cases fail in Bambu, LegUp, Vivado HLS and Intel i++, respectively. One of the bugs we reported to the Bambu developers was fixed during our testing campaign, so we also tested the development branch of Bambu (0.9.7-dev) with the bug fix, and found only 17 (0.25\%) failing test-cases.
+Although i++ has a low failure rate, it has the highest time-out rate (540 test-cases) due to its remarkably long compilation time. No other tool had more than 20 time-outs.
+Note that the absolute numbers here do not necessarily correspond to the number of bugs in the tools, because a single bug in a language feature that appears frequently in our test suite could cause many failures.
+Moreover, we are reluctant to draw conclusions about the relative reliability of each tool by comparing the number of failures, because these numbers are so sensitive to the parameters of the randomly generated test suite we used. In other words, we can confirm the \emph{presence} of bugs, but cannot deduce the \emph{number} of them (nor their importance).
We have reduced several of the failing test-cases in an effort to identify particular bugs, and our findings are summarised in Table~\ref{tab:bugsummary}. We emphasise that the bug counts here are lower bounds -- we did not have time to go through the arduous test-case reduction process for every failure.
Figures~\ref{fig:eval:legup:crash}, \ref{fig:eval:intel:mismatch}, and~\ref{fig:eval:bambu:mismatch} present three of the bugs we found. As in Example~\ref{ex:vivado_miscomp}, each bug was first reduced automatically using \creduce{}, and then further reduced manually to achieve the minimal test-case.
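For illustration, the sketch below shows the rough shape such a test-case takes; the computation itself is invented here (the real reduced programs appear in the figures and in the online bug reports), and we assume, for the sake of the example, that the mismatch is observed through the program's return value, which can then be compared exactly between native execution and HLS co-simulation.
\begin{minted}{c}
// Illustrative sketch only; not one of the reported bugs.
int dut(int x) {
  int acc = 0;
  for (int i = 0; i < 4; i++)
    acc += (x >> i) & 1;       // count the low-order bits of x
  return acc;
}

int main() {
  int sum = 0;
  for (int i = 0; i < 8; i++)
    sum += dut(i);
  // The return value acts as the oracle: it must agree between the
  // software run and the synthesised design's co-simulation.
  return sum & 0xff;
}
\end{minted}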
+% \AD{Could spell out why it's so arduous -- involves testing an enormous number of programs and each one takes ages.} \JW{I'd be inclined to leave this as-is, actually.}
+
\begin{figure}
\begin{minted}{c}
int a[2][2][1] = {{{0},{1}},{{0},{0}}};
@@ -157,15 +159,16 @@ int main() {
\textbf{Tool} & \textbf{Bug type} & \textbf{Details} & \textbf{Status} \\
\midrule
Vivado HLS & miscompile & Fig.~\ref{fig:vivado_bug1} & reported, confirmed \\
- Vivado HLS & miscompile & webpage & reported \\
+ Vivado HLS & miscompile & online* & reported \\
LegUp HLS & crash & Fig.~\ref{fig:eval:legup:crash} & reported \\
- LegUp HLS & crash & webpage & reported \\
- LegUp HLS & miscompile & webpage & reported, confirmed \\
+ LegUp HLS & crash & online* & reported \\
+ LegUp HLS & miscompile & online* & reported \\
Intel i++ & miscompile & Fig.~\ref{fig:eval:intel:mismatch} & reported \\
Bambu HLS & miscompile & Fig.~\ref{fig:eval:bambu:mismatch} & reported, confirmed, fixed \\
- Bambu HLS & miscompile & webpage & reported, confirmed \\
+ Bambu HLS & miscompile & online* & reported, confirmed \\
\bottomrule
- \end{tabular}
+ \end{tabular} \\
+ \vphantom{\large A}*See \url{https://ymherklotz.github.io/fuzzing-hls/} for detailed bug reports
\end{table}
%We write `$\ge$' above to emphasise that all the bug counts are lower bounds -- we did not have time to go through the rather arduous test-case reduction process for every failure.
@@ -218,8 +221,10 @@ int main() {
Besides studying the reliability of different HLS tools, we also studied the reliability of Vivado HLS over time. Figure~\ref{fig:sankey_diagram} shows the results of giving \vivadotestcases{} test-cases to Vivado HLS v2018.3, v2019.1 and v2019.2.
Test-cases that pass and fail in the same tools are grouped together into a ribbon.
For instance, the topmost ribbon represents the 31 test-cases that fail in all three versions of Vivado HLS. Other ribbons can be seen weaving in and out; these indicate that bugs were fixed or reintroduced in the various versions. We see that Vivado HLS v2018.3 had the most test-case failures (62).
-Interestingly, the blue ribbon shows that there are test-cases that fail in v2018.3, pass in v2019.1, and then fail again in v2019.2.
-As in our Euler diagram, the numbers do not necessary correspond to the number of actual bugs, though we can observe that there must be at least six unique bugs in Vivado HLS, given that each ribbon corresponds to at least one unique bug.
+Interestingly, the blue ribbon shows that there are test-cases that fail in v2018.3, pass in v2019.1, and then fail again in v2019.2!
+As in our Euler diagram, the numbers do not necessarily correspond to the number of actual bugs, though we can observe that there must be at least six unique bugs in Vivado HLS, given that each ribbon corresponds to at least one unique bug.
+
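One way to picture the grouping behind the ribbons is to give each test-case a three-bit pass/fail signature, one bit per Vivado HLS version: test-cases sharing a signature form a ribbon, and every non-zero signature must be explained by at least one distinct bug. The sketch below is purely illustrative (it is not part of our testing infrastructure, and the example data is invented); it simply counts the distinct failing signatures.
\begin{minted}{c}
#include <stdio.h>
#include <stdbool.h>

#define N_VERSIONS 3   /* v2018.3, v2019.1, v2019.2 */

/* Bit i of the signature is set if the test-case fails in version i. */
static unsigned signature(const bool fails[N_VERSIONS]) {
  unsigned sig = 0;
  for (int i = 0; i < N_VERSIONS; i++)
    if (fails[i])
      sig |= 1u << i;
  return sig;
}

int main(void) {
  /* Invented example data: four test-cases and their per-version results. */
  bool results[4][N_VERSIONS] = {
    {true,  true,  true },   /* fails in every version         */
    {true,  false, true },   /* fixed in v2019.1, then regresses */
    {false, false, true },   /* new failure in v2019.2         */
    {false, false, false},   /* always passes                  */
  };

  bool seen[1u << N_VERSIONS] = {false};
  int failing_ribbons = 0;
  for (int t = 0; t < 4; t++) {
    unsigned sig = signature(results[t]);
    if (sig != 0 && !seen[sig]) {  /* non-zero signature: at least one failure */
      seen[sig] = true;
      failing_ribbons++;           /* each such ribbon needs >= 1 distinct bug */
    }
  }
  printf("distinct failing signatures: %d\n", failing_ribbons);
  return 0;
}
\end{minted}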
+%\AD{This reminds me of the correcting commits metric from Junjie Chen et al.'s empirical study on compiler testing. Could be worth making the connection. }
%\YH{Contradicts value of 3 in Table~\ref{tab:unique_bugs}, maybe I can change that to 6?} \JW{I'd leave it as-is personally; we have already put a `$\ge$' symbol in the table, so I think it's fine.}
%In addition to that, it can then be seen that Vivado HLS v2018.3 must have at least 4 individual bugs, of which two were fixed and two others stayed in Vivado HLS v2019.1. However, with the release of v2019.1, new bugs were introduced as well. % Finally, for version 2019.2 of Vivado HLS, there seems to be a bug that was reintroduced which was also present in Vivado 2018.3, in addition to a new bug. In general it seems like each release of Vivado HLS will have new bugs present, however, will also contain many previous bug fixes. However, it cannot be guaranteed that a bug that was previously fixed will remain fixed in future versions as well.