author    Yann Herklotz <git@yannherklotz.com>  2021-04-13 17:08:31 +0100
committer Yann Herklotz <git@yannherklotz.com>  2021-04-13 17:08:45 +0100
commit    12cf73bfe39d00fea73b17b831e09d419df0e121 (patch)
tree      c6b263879550f89d8e21da99fd8cbec66b3c4e94
parent    b6e4d6c2a90d89d3ab56110e3f621fd92408eae7 (diff)
More changes
-rw-r--r--  conclusion.tex        2
-rw-r--r--  evaluation.tex       14
-rw-r--r--  introduction.tex      4
-rw-r--r--  main.tex              1
-rw-r--r--  references.bib       10
-rw-r--r--  related.tex           4
-rw-r--r--  results/results.org   3
7 files changed, 22 insertions(+), 16 deletions(-)
diff --git a/conclusion.tex b/conclusion.tex
index 755dedc..9a4bcfa 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -5,7 +5,7 @@ We have presented \vericert{}, the first mechanically verified HLS tool for tran
%In this article, we present \vericert{}, the first fully mechanised proof of correctness for HLS in Coq, translating C into Verilog.
We built \vericert{} by extending \compcert{} with a new hardware-specific intermediate language and a Verilog back end, and we verified it with respect to a semantics for Verilog due to \citet{loow19_proof_trans_veril_devel_hol}.
%The behaviour of any C program compiled via \vericert{} is guaranteed to be translated correctly to Verilog.
-We evaluated \vericert{} against the existing \legup{} HLS tool on the Polybench/C benchmark suite.
+We evaluated \vericert{} against the existing \legup{} HLS tool on the \polybench{} benchmark suite.
%Our Polybench hardware is guaranteed to preserve the C behaviour, whilst \legup{} cannot provide any such guarantee about its generated hardware.
Currently, our hardware is \slowdownOrig$\times$ slower and \areaIncr$\times$ larger than \legup{}'s, though it is only \slowdownDiv$\times$ slower if inefficient divisions are removed.
diff --git a/evaluation.tex b/evaluation.tex
index 9681210..e00062d 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -13,11 +13,11 @@ Our evaluation is designed to answer the following three research questions.
\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 5.1 because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}.
-\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the PolyBench/C benchmark suite (version 4.2.1)~\cite{polybench}, which consists of a collection of 30 numerical kernels. PolyBench/C is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
+\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the \polybench{} benchmark suite (version 4.2.1)~\cite{polybench}, which consists of a collection of 30 numerical kernels. \polybench{} is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, requiring floats, which we do not support.
% Interestingly, we were also unable to evaluate \texttt{cholesky} on \legup{}, since it produced an error during its HLS compilation.
%In summary, we evaluate 27 programs from the latest Polybench suite.
-We configured Polybench's parameters so that only integer types are used, since we do not support floats. We use Polybench's smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
+We configured \polybench{}'s parameters so that only integer types are used, since we do not support floats. We use \polybench{}'s smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
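A minimal sketch of that configuration, assuming PolyBench's conventional preprocessor switches (the macro names DATA_TYPE_IS_INT and MINI_DATASET and the example kernel header are assumptions, not details taken from the paper):

/* Illustrative sketch only: the macro names below follow PolyBench's usual
 * conventions and are not quoted from the paper. Defining them before a
 * kernel's header makes every array element an int and selects the smallest
 * problem size, so the data fits in the FPGA's on-chip memory. The same
 * effect is normally achieved with -D flags on the compiler command line. */
#define DATA_TYPE_IS_INT
#define MINI_DATASET
#include "gemm.h"   /* hypothetical example kernel header */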
\vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including \legup{}, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is overridden with our own implementation that uses repeated divisions and multiplications by 2. Where this change makes an appreciable difference to the performance results, we give the results for both benchmark sets.
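For concreteness, a minimal sketch of such a shift-based replacement in C follows; the function name divu_iter and its exact structure are illustrative assumptions, not \vericert{}'s actual benchmark override.

#include <stdint.h>

/* Illustrative sketch only (not Vericert's actual code): a restoring
 * division that computes quotient and remainder using shifts, comparisons
 * and subtractions (repeated multiplications and divisions by 2) instead
 * of the single-cycle Verilog '/' and '%' operators. */
static uint32_t divu_iter(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);   /* shift in the next dividend bit */
        if (r >= d) {                     /* divisor fits: subtract, set bit i */
            r -= d;
            q |= 1u << i;
        }
    }
    if (rem) *rem = r;                    /* final remainder equals n % d */
    return q;                             /* quotient equals n / d */
}

For nonzero d, divu_iter(a, b, &r) agrees with a / b and a % b; each of its 32 iterations needs only a shift, a comparison and a subtraction, which map to much shorter combinational paths than a full single-cycle divider.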
@@ -73,7 +73,7 @@ We configured Polybench's parameters so that only integer types are used, since
\legend{\vericert{},\legup{} w/o opt/chain,\legup{} w/o opt};
\end{groupplot}
\end{tikzpicture}
- \caption{Polybench with division enabled.}
+ \caption{\polybench{} with division enabled.}
\end{figure}
\pgfplotstableread[col sep=comma]{results/rel-time-nodiv.csv}{\nodivtimingtable}
@@ -121,7 +121,7 @@ We configured Polybench's parameters so that only integer types are used, since
\legend{\vericert{},\legup{} w/o opt/chain,\legup{} w/o opt};
\end{groupplot}
\end{tikzpicture}
- \caption{Polybench with division replaced by iterative division algorithm.}
+ \caption{\polybench{} with division replaced by iterative division algorithm.}
\end{figure}
%
@@ -275,7 +275,7 @@ It is notable that even without \vericert{} performing many optimisations, a few
%We are very encouraged by these data points.
As we improve \vericert{} by incorporating further optimisations, this gap should reduce whilst preserving our correctness guarantees.
-Cycle count is one factor in calculating execution times; the other is the clock frequency, which determines the duration of each of these cycles. Figure~\ref{fig:comparison_time} compares the execution times of \vericert{} and \legup{}. Across the original Polybench/C benchmarks, we see that \vericert{} designs are about \slowdownOrig$\times$ slower than \legup{} designs. This dramatic discrepancy in performance can be largely attributed to \vericert's na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 21MHz, while \legup{} managed about 247MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency becomes about 112MHz. This is better, but still substantially below \legup{}, which uses various additional optimisations and Intel-specific IP blocks. Across the modified Polybench/C benchmarks, we see that \vericert{} designs are about \slowdownDiv$\times$ slower than \legup{} designs.
+Cycle count is one factor in calculating execution times; the other is the clock frequency, which determines the duration of each of these cycles. Figure~\ref{fig:comparison_time} compares the execution times of \vericert{} and \legup{}. Across the original \polybench{} benchmarks, we see that \vericert{} designs are about \slowdownOrig$\times$ slower than \legup{} designs. This dramatic discrepancy in performance can be largely attributed to \vericert{}'s na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 21~MHz, while \legup{} managed about 247~MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency becomes about 112~MHz. This is better, but still substantially below \legup{}, which uses various additional optimisations and Intel-specific IP blocks. Across the modified \polybench{} benchmarks, we see that \vericert{} designs are about \slowdownDiv$\times$ slower than \legup{} designs.
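As a rough aid to interpreting these figures, the execution time of a design splits into its cycle count and its clock period, so the slowdown factors quoted above combine both effects; a back-of-the-envelope sketch (the split over per-benchmark averages is only approximate, and the ratios below are derived from the frequencies quoted in this paragraph, not measured separately):

\[
  t_{\mathrm{exec}} = \frac{n_{\mathrm{cycles}}}{f_{\mathrm{clk}}},
  \qquad
  \frac{t_{\mathrm{Vericert}}}{t_{\mathrm{LegUp}}}
    = \frac{n_{\mathrm{Vericert}}}{n_{\mathrm{LegUp}}}
      \cdot \frac{f_{\mathrm{LegUp}}}{f_{\mathrm{Vericert}}} .
\]
% The clock-frequency ratio alone accounts for roughly $247/21 \approx 12\times$
% on the original benchmarks and roughly $247/112 \approx 2.2\times$ with the
% C-based division, so the remainder of the \slowdownOrig$\times$ and
% \slowdownDiv$\times$ gaps is due to higher cycle counts.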
\subsection{RQ2: How area-efficient is \vericert{}-generated hardware?}
@@ -333,7 +333,7 @@ Cycle count is one factor in calculating execution times; the other is the clock
%\end{subfigure}
%\end{figure}
-Figure~\ref{fig:comparison_area} compares the resource utilisation of the Polybench programs generated by \vericert{} and \legup{}.
+Figure~\ref{fig:comparison_area} compares the resource utilisation of the \polybench{} programs generated by \vericert{} and \legup{}.
On average, we see that \vericert{} produces hardware that is about $21\times$ larger than \legup{}. \vericert{} designs fill up to 30\% of a (large) FPGA chip, while \legup{} uses no more than 1\% of the chip.
The main reason for this is that RAM is not inferred automatically for the Verilog that is generated by \vericert{}; instead, large arrays of registers are synthesised.
Synthesis tools such as Quartus generally require array accesses to be in a specific form in order for RAM inference to activate.
@@ -346,7 +346,7 @@ Enabling RAM inference is part of our future plans.
\subsection{RQ3: How long does \vericert{} take to produce hardware?}
-Figure~\ref{fig:comparison_comptime} compares the compilation times of \vericert{} and of \legup{}, with each data point corresponding to one of the PolyBench/C benchmarks. On average, \vericert{} compilation is about $47\times$ faster than \legup{} compilation. \vericert{} is much faster because it omits many of the time-consuming HLS optimisations performed by \legup{}, such as scheduling and memory analysis. This comparison also demonstrates that our fully verified approach does not add substantial overheads in compilation time, since we do not invoke verification for every compilation instance, unlike the approaches based on translation validation that we mentioned in Section~\ref{sec:intro}.
+Figure~\ref{fig:comparison_comptime} compares the compilation times of \vericert{} and of \legup{}, with each data point corresponding to one of the \polybench{} benchmarks. On average, \vericert{} compilation is about $47\times$ faster than \legup{} compilation. \vericert{} is much faster because it omits many of the time-consuming HLS optimisations performed by \legup{}, such as scheduling and memory analysis. This comparison also demonstrates that our fully verified approach does not add substantial overheads in compilation time, since we do not invoke verification for every compilation instance, unlike the approaches based on translation validation that we mentioned in Section~\ref{sec:intro}.
%\NR{Do we want to finish the section off with some highlights or a summary?}
diff --git a/introduction.tex b/introduction.tex
index 9e88564..d752d6c 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -21,7 +21,7 @@ Indeed, there are reasons to doubt that HLS tools actually \emph{do} always pres
%Other reasons are more specific:
For instance, Vivado HLS has been shown to apply pipelining optimisations incorrectly\footnote{\url{https://bit.ly/vivado-hls-pipeline-bug}} or to silently generate wrong code should the programmer stray outside the fragment of C that it supports.\footnote{\url{https://bit.ly/vivado-hls-pointer-bug}}
Meanwhile, \citet{lidbury15_many_core_compil_fuzzin} had to abandon their attempt to fuzz-test Altera's (now Intel's) OpenCL compiler since it ``either crashed or emitted an internal compiler error'' on so many of their test inputs.
-And more recently, \citet{du_fuzzin_high_level_synth_tools} fuzz-tested three commercial HLS tools using Csmith~\cite{yang11_findin_under_bugs_c_compil}, and despite restricting the generated programs to the C fragment explicitly supported by all the tools, they still found that on average 2.5\% of test cases generated a design that did not match the behaviour of the input.
+And more recently, \citet{du21_fuzzin_high_level_synth_tools} fuzz-tested three commercial HLS tools using Csmith~\cite{yang11_findin_under_bugs_c_compil}, and despite restricting the generated programs to the C fragment explicitly supported by all the tools, they still found that on average 2.5\% of test cases generated a design that did not match the behaviour of the input.
\paragraph{Existing workarounds}
@@ -43,7 +43,7 @@ The contributions of this paper are as follows:
\item We present \vericert{}, the first mechanically verified HLS tool that compiles C to Verilog. In Section~\ref{sec:design}, we describe the design of \vericert{}.
\item We state the correctness theorem of \vericert{} with respect to an existing semantics for Verilog due to \citet{loow19_proof_trans_veril_devel_hol}. In Section~\ref{sec:verilog}, we describe how we lightly extended this semantics to make it suitable as an HLS target.
\item In Section~\ref{sec:proof}, we describe how we proved this theorem. The proof follows standard \compcert{} techniques -- forward simulations, intermediate specifications, and determinism results -- but we encountered several challenges peculiar to our hardware-oriented setting. These include handling discrepancies between byte- and word-addressable memories, different handling of unsigned comparisons between C and Verilog, and correctly mapping CompCert's memory model onto a finite Verilog array.
- \item In Section~\ref{sec:evaluation}, we evaluate \vericert{} on the Polybench/C benchmark suite~\cite{polybench}, and compare the performance of our generated hardware against an existing, unverified HLS tool called \legup{}~\cite{canis11_legup}. We show that \vericert{} generates hardware that is \slowdownOrig$\times$ slower (\slowdownDiv$\times$ slower in the absence of division) and \areaIncr$\times$ larger than that generated by \legup{}. We intend to bridge this performance gap in the future by introducing (and verifying) HLS optimisations of our own, such as scheduling and memory analysis.
+ \item In Section~\ref{sec:evaluation}, we evaluate \vericert{} on the \polybench{} benchmark suite~\cite{polybench}, and compare the performance of our generated hardware against an existing, unverified HLS tool called \legup{}~\cite{canis11_legup}. We show that \vericert{} generates hardware that is \slowdownOrig$\times$ slower (\slowdownDiv$\times$ slower in the absence of division) and \areaIncr$\times$ larger than that generated by \legup{}. We intend to bridge this performance gap in the future by introducing (and verifying) HLS optimisations of our own, such as scheduling and memory analysis.
\end{itemize}
\vericert{} is fully open source and available online.
diff --git a/main.tex b/main.tex
index 1fc5dbc..9f7f06f 100644
--- a/main.tex
+++ b/main.tex
@@ -93,6 +93,7 @@
\newcommand{\slowdownOrig}{56}
\newcommand{\slowdownDiv}{10}
\newcommand{\areaIncr}{21}
+\newcommand{\polybench}{PolyBench/C}
\newtheorem{theorem}{Theorem}
diff --git a/references.bib b/references.bib
index e22a73a..0e7d48f 100644
--- a/references.bib
+++ b/references.bib
@@ -838,11 +838,13 @@ year = {2020},
year = 2020,
}
-@unpublished{du_fuzzin_high_level_synth_tools,
- author = {Du, Zewei and Herklotz, Yann and Ramanathan, Nadesh and Wickerson, John},
- note = {Unpublished},
+@inproceedings{du21_fuzzin_high_level_synth_tools,
+ author = {Herklotz, Yann and Du, Zewei and Ramanathan, Nadesh and Wickerson, John},
+ note = {(to appear)},
+ year = 2021,
url = {https://yannherklotz.com/docs/drafts/fuzzing_hls.pdf},
- title = {{Fuzzing High-Level Synthesis Tools}},
+ title = {An Empirical Study of the Reliability of High-Level Synthesis Tools},
+ booktitle = {29th IEEE International Symposium on Field-Programmable Custom Computing Machines},
}
@inproceedings{chisel,
diff --git a/related.tex b/related.tex
index 3fed547..c6b8910 100644
--- a/related.tex
+++ b/related.tex
@@ -22,8 +22,8 @@
\draw[myoutline] (0,0.3) ellipse (3.7 and 1.9);
\draw[myoutline] (0,0) ellipse (3 and 1.0);
- \node[align=center] at (0,2.8) {Standard HLS \\ tools~\cite{canis11_legup,xilinx20_vivad_high_synth,intel20_sdk_openc_applic,nigam20_predic_accel_desig_time_sensit_affin_types}};
- \node[align=center] at (0,1.5) {Translation validation \\ approaches~\cite{mentor20_catap_high_level_synth,kundu08_valid_high_level_synth,clarke03_behav_c_veril}};
+ \node[align=center] at (0,2.8) {Standard HLS tools \\ \footnotesize\cite{canis11_legup,intel20_sdk_openc_applic} \\ \footnotesize\cite{nigam20_predic_accel_desig_time_sensit_affin_types,xilinx20_vivad_high_synth}};
+ \node[align=center] at (0,1.5) {Translation validation approaches \\ \footnotesize\cite{mentor20_catap_high_level_synth,kundu08_valid_high_level_synth,clarke03_behav_c_veril}};
\node at (0,0.5) {\bf \vericert{}};
\node[align=left] at (-1.8,0.4) {Koika~\cite{bourgeat20_essen_blues}};
\node[align=left] at (-1.8,0.0) {L\"o\"ow et al.~\cite{loow19_verif_compil_verif_proces}};
diff --git a/results/results.org b/results/results.org
index 2ab84c3..06599fc 100644
--- a/results/results.org
+++ b/results/results.org
@@ -7,6 +7,7 @@
|----------------+---------------+-------+------+------+---------+-------+------+-------+-------+---------|
| 2mm | 2 | 1764 | 4952 | 3541 | 4 | 0 | 24 | 0.152 | 4.59 | 467612 |
| 3mm | 9 | 1940 | 5046 | 3834 | 8 | 0 | 21 | 0.156 | 4.56 | 604582 |
+| adi | | | | | | | | | | |
| atax | 8 | 599 | 1603 | 1254 | 4 | 0 | 6 | 0.128 | 4.539 | 101618 |
| bicg | 0 | 930 | 2628 | 2012 | 4 | 0 | 9 | 0.138 | 4.692 | 130790 |
| cholesky | 0 | 1898 | 5322 | 3192 | 8 | 0 | 12 | 0.141 | 4.877 | 2575070 |
@@ -40,8 +41,10 @@
| 2mm | 8 | 2341 | 1861 | 6602 | 4 | 0 | 24 | 0.247 | 82.379 | 404478 |
| 3mm | 8 | 2848 | 1963 | 8783 | 8 | 0 | 21 | 0.274 | 96.669 | 536114 |
| atax | 0 | 1039 | 873 | 2981 | 4 | 0 | 6 | 0.184 | 84.842 | 58424 |
+| adi | | | | | | | | | | |
| bicg | 1 | 1755 | 1017 | 5269 | 4 | 0 | 9 | 0.223 | 84.845 | 53916 |
| cholesky | 7 | 1554 | 2011 | 4064 | 8 | 0 | 12 | 0.190 | 98.132 | 2535686 |
+| covariance | | | | | | | | | | 287498 |
| doitgen | 0 | 1436 | 1372 | 3736 | 4 | 0 | 6 | 0.189 | 76.734 | 351988 |
| durbin | 1 | 719 | 1047 | 1840 | 1 | 0 | 12 | 0.167 | 96.403 | 22974 |
| fdtd-2d | 3 | 1294 | 2049 | 3672 | 4 | 0 | 9 | 0.177 | 80.81 | 901430 |