author    Yann Herklotz <git@yannherklotz.com>  2021-04-05 16:22:55 +0100
committer Yann Herklotz <git@yannherklotz.com>  2021-04-05 16:23:04 +0100
commit    3c232b2279b5c71ef8aa5a88987d5f2ac9d88016
tree      4f6cd8668b8022f28286b553a264ded653e289ac
parent    6a7ebeb13cd708e5b9e2a09400255ccbd3f0242a
Make the paper fit again
Diffstat (limited to 'eval.tex')
-rw-r--r--  eval.tex | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/eval.tex b/eval.tex
index a1b9b28..0be2416 100644
--- a/eval.tex
+++ b/eval.tex
@@ -3,7 +3,7 @@
We generate \totaltestcases{} test-cases and provide them to four HLS tools: Vivado HLS, LegUp HLS, Intel i++, and Bambu.
We use the same test-cases across all tools for fair comparison (except the HLS directives, which have tool-specific syntax).
We were able to test three different versions of Vivado HLS (v2018.3, v2019.1 and v2019.2).
-We tested one version of Intel i++ (included in Quartus Lite 18.1), LegUp (4.0) and Bambu (v0.9.7). We tested any reduced LegUp test-cases on LegUp 9.2 before reporting them.
+We tested one version of Intel i++ (included in Quartus Lite 18.1), one version of LegUp (4.0), and two versions of Bambu (v0.9.7 and v0.9.7-dev).
% Three different tools were tested, including three different versions of Vivado HLS. We were only able to test one version of LegUp HLS (version 4.0), because although LegUp 7.5 is available, it is GUI-based and not amenable to scripting. However, bugs we found in LegUp 4.0 were reproduced manually in LegUp 7.5.
% LegUp and Vivado HLS were run under Linux, while the Intel HLS Compiler was run under Windows.
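In outline, each test-case is synthesised by every tool and the simulated result is compared against native execution of the same C program. The sketch below illustrates this flow (a minimal illustration only; the wrapper scripts, file names, and timeout value are hypothetical and stand in for our actual infrastructure):
\begin{minted}{python}
import subprocess

# Each tool is driven through a hypothetical wrapper script that synthesises
# the given C file, simulates the resulting RTL, and returns the value
# computed by main() as its exit code.
TOOL_WRAPPERS = {
    "vivado_hls": ["./run_vivado.sh"],
    "legup":      ["./run_legup.sh"],
    "intel_ipp":  ["./run_intel.sh"],
    "bambu":      ["./run_bambu.sh"],
}

def reference_result(testcase):
    # Execute the test-case natively; the checksum returned by main()
    # serves as the oracle against which every tool is compared.
    subprocess.run(["gcc", testcase, "-o", "ref.exe"], check=True)
    return subprocess.run(["./ref.exe"]).returncode

def classify(testcase, timeout=7200):
    expected = reference_result(testcase)
    verdicts = {}
    for tool, cmd in TOOL_WRAPPERS.items():
        try:
            run = subprocess.run(cmd + [testcase], timeout=timeout)
        except subprocess.TimeoutExpired:
            verdicts[tool] = "timeout"   # e.g. i++'s long compilations
            continue
        verdicts[tool] = "pass" if run.returncode == expected else "fail"
    return verdicts
\end{minted}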
@@ -18,7 +18,7 @@ We tested one version of Intel i++ (included in Quartus Lite 18.1), LegUp (4.0)
\definecolor{timeout}{HTML}{ef4c4c}
\begin{figure}
\centering
- \begin{tikzpicture}[scale=0.61]
+ \begin{tikzpicture}[scale=0.61,yscale=0.9]
\draw (-7.2,7.0) rectangle (7.2,0.7);
\fill[vivado,fill opacity=0.5] (0.9,4.4) ellipse (3.3 and 1.5);
\fill[intel,fill opacity=0.5] (-4.5,4.8) ellipse (2.0 and 1.3);
@@ -51,7 +51,7 @@ We tested one version of Intel i++ (included in Quartus Lite 18.1), LegUp (4.0)
\end{figure}
Figure~\ref{fig:existing_tools} shows an Euler diagram of our results.
-We see that 918 (13.7\%), 167 (2.5\%), 83 (1.2\%) and 26 (0.4\%) test-cases fail in Bambu, LegUp, Vivado HLS and Intel i++ respectively. One of the bugs we reported to the Bambu developers was fixed during our testing campaign, so we also tested the development branch of Bambu (0.9.7-dev) with the bug fix, and found only 17 (0.25\%) failing test-cases.
+We see that 918 (13.7\%), 167 (2.5\%), 83 (1.2\%) and 26 (0.4\%) test-cases fail in Bambu, LegUp, Vivado HLS and Intel i++ respectively. The bugs we reported to the Bambu developers were fixed during our testing campaign, so we also tested the development branch of Bambu (0.9.7-dev) with the bug fixes, and found that only 17 (0.25\%) failing test-cases remained.
Although i++ has a low failure rate, it has the highest time-out rate (540 test-cases) due to its remarkably long compilation time. No other tool had more than 20 time-outs.
Note that the absolute numbers here do not necessarily correspond to the number of bugs in the tools, because a single bug in a language feature that appears frequently in our test suite could cause many failures.
Moreover, we are reluctant to draw conclusions about the relative reliability of each tool by comparing the number of failures, because these numbers are so sensitive to the parameters of the randomly generated test suite we used. In other words, we can confirm the \emph{presence} of bugs, but cannot deduce the \emph{number} of them (nor their importance).
@@ -107,7 +107,6 @@ int main() {
\begin{minted}{c}
static int b = 0x10000;
static volatile short a = 0;
-
int main() {
a++;
b = (b >> 8) & 0x100;
@@ -165,7 +164,7 @@ int main() {
LegUp HLS & miscompile & online* & reported \\
Intel i++ & miscompile & Fig.~\ref{fig:eval:intel:mismatch} & reported \\
Bambu HLS & miscompile & Fig.~\ref{fig:eval:bambu:mismatch} & reported, confirmed, fixed \\
- Bambu HLS & miscompile & online* & reported, confirmed \\
+ Bambu HLS & miscompile & online* & reported, confirmed, fixed \\
\bottomrule
\end{tabular} \\
\vphantom{\large A}*See \url{https://ymherklotz.github.io/fuzzing-hls/} for detailed bug reports
@@ -183,7 +182,7 @@ int main() {
\definecolor{ribbon6}{HTML}{fdb462}
\begin{figure}
\centering
- \begin{tikzpicture}[xscale=1.25]
+ \begin{tikzpicture}[xscale=1.25,yscale=0.85]
\draw[white, fill=ribbon1] (-1.0,4.1) -- (0.0,4.1) to [out=0,in=180] (2.0,4.1) to [out=0,in=180] (4.0,4.1) -- (6.0,4.1) -- %(7.55,3.325) --
(6.0,2.55) -- (4.0,2.55) to [out=180,in=0] (2.0,2.55) to [out=180,in=0] (0.0,2.55) -- (-1.0,2.55) -- cycle;
\draw[white, fill=ribbon2] (-1.0,2.55) -- (0.0,2.55) to [out=0,in=180] (1.8,1.8) -- (2.2,1.8) to [out=0,in=180] (4.0,1.55) -- (6.0,1.55) -- %(7.3,0.9) --
@@ -220,9 +219,9 @@ int main() {
Besides studying the reliability of different HLS tools, we also studied the reliability of Vivado HLS over time. Figure~\ref{fig:sankey_diagram} shows the results of giving \vivadotestcases{} test-cases to Vivado HLS v2018.3, v2019.1 and v2019.2.
Test-cases that pass and fail in the same tools are grouped together into a ribbon.
-For instance, the topmost ribbon represents the 31 test-cases that fail in all three versions of Vivado HLS. Other ribbons can be seen weaving in and out; these indicate that bugs were fixed or reintroduced in the various versions. We see that Vivado HLS v2018.3 had the most test-case failures (62).
+For instance, the topmost ribbon represents the 31 test-cases that fail in all three versions of Vivado HLS. Other ribbons can be seen weaving in and out; these indicate that bugs were fixed or reintroduced in the various versions.
Interestingly, the blue ribbon shows that there are test-cases that fail in v2018.3, pass in v2019.1, and then fail again in v2019.2!
-As in our Euler diagram, the numbers do not necessarily correspond to the number of actual bugs, though we can observe that there must be at least six unique bugs in Vivado HLS, given that each ribbon corresponds to at least one unique bug.
+As in our Euler diagram, the numbers do not necessarily correspond to the number of actual bugs, though we can observe that there must be at least six unique bugs in Vivado HLS, given that each ribbon corresponds to at least one unique bug. This method of identifying unique bugs is similar to the ``correcting commits'' metric introduced by Chen et al.~\cite{chen16_empir_compar_compil_testin_techn}.
%\AD{This reminds me of the correcting commits metric from Junjie Chen et al.'s empirical study on compiler testing. https://xiongyingfei.github.io/papers/ICSE16.pdf. Could be worth making the connection. }
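The lower bound of six unique bugs follows from grouping test-cases by their pass/fail signature across the three versions: each distinct failing signature (ribbon) requires at least one distinct bug to explain it. A minimal sketch of this grouping, assuming the per-version outcomes are available as a dictionary (the data layout is hypothetical):
\begin{minted}{python}
from collections import Counter

# results maps each test-case to its outcome in each Vivado HLS version, e.g.
# results["test0042"] = {"v2018.3": True, "v2019.1": False, "v2019.2": True},
# where True means the test-case passed (hypothetical data layout).
def ribbon_counts(results):
    signatures = Counter()
    for outcomes in results.values():
        sig = tuple(outcomes[v] for v in ("v2018.3", "v2019.1", "v2019.2"))
        if not all(sig):  # keep only test-cases that fail in some version
            signatures[sig] += 1
    # Each distinct failing signature corresponds to a ribbon in the Sankey
    # diagram and implies at least one unique bug, so len(signatures) is a
    # lower bound on the number of bugs.
    return signatures
\end{minted}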