summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorJohn Wickerson <j.wickerson@imperial.ac.uk>2020-11-20 22:29:38 +0000
committeroverleaf <overleaf@localhost>2020-11-20 22:29:40 +0000
commitdfafa2b12f83c9c4dbf7997339f99630baf7a147 (patch)
tree15df52e82861d5032968300eab035e520f333445
parent768548852781106b1ac8965f183035c57e73bc9d (diff)
downloadoopsla21_fvhls-dfafa2b12f83c9c4dbf7997339f99630baf7a147.tar.gz
oopsla21_fvhls-dfafa2b12f83c9c4dbf7997339f99630baf7a147.zip
Update on Overleaf.
-rw-r--r--algorithm.tex2
-rw-r--r--evaluation.tex46
-rw-r--r--references.bib3
-rw-r--r--results/poly.csv56
4 files changed, 54 insertions, 53 deletions
diff --git a/algorithm.tex b/algorithm.tex
index f61368d..302494f 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -4,7 +4,7 @@
This section covers \JP{describes} the main architecture of the HLS tool, and the way in which the Verilog back end \JP{back-end? backend?} was added to \compcert{}. This section will also cover an example of converting a simple C program into hardware, expressed in the Verilog language.
\paragraph{Choice of source language}
-First of all, the choice of C as the input language of \vericert{} is simply because it is what most major HLS tools use~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because C is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}. \JP{I think this could be reworded: ``C was chosen as the source language as it remains the most common source language amongst production-quality HLS tools~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because it is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}, lending practicality. The availability of \compcert{} [cite], also provides a solid basis for formally verified C compilation.''}
+First of all, the choice of C as the input language of \vericert{} is simply because it is what most major HLS tools use~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because C is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}. \JP{I think this could be reworded: ``C was chosen as the source language as it remains the most common source language amongst production-quality HLS tools~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because it is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}, lending a degree of practicality. The availability of \compcert{} [cite] also provides a solid basis for formally verified C compilation.''}
%Since a lot of existing code for HLS is written in C, supporting C as an input language, rather than a custom domain-specific language, means that \vericert{} is more practical.
%An alternative was to support LLVM IR as an input language, however, to get a full work flow from a higher level language to hardware, a front end for that language to LLVM IR would also have to be verified. \JW{Maybe save LLVM for the `Choice of implementation language'?}
We considered Bluespec~\cite{nikhil04_blues_system_veril}, but decided that although it ``can be classed as a high-level language''~\cite{greaves_note}, it is too hardware-oriented to be used for traditional HLS.
diff --git a/evaluation.tex b/evaluation.tex
index 09bbc70..667d48f 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -19,14 +19,9 @@ We were able to use 27 of the 30 programs; three had to be discarded (\texttt{co
We configured Polybench's parameters so that only integer types are used, since we do not support floats or doubles currently. We use Polybench's smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
-Furthermore, using the C divide or modulo operators results in directly translate to built-in Verilog divide and modulo operators.
-Unfortunately, the built-in operators are designed as single-cycle operation, causing large penalties in clock frequency.
-To work around this issue, we use a C implementation of the divide and modulo operations, which is indirectly compiles them as multi-cycle operations on the FPGA.
-%In addition, we initial the input arrays and check the output arrays of all programs entirely on-chip.
+\vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including LegUp, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is overridden with our own implementation that uses repeated division and multiplications by 2. Where this change makes an appreciable difference to the performance results, we give the results for both benchmark sets.
-% For completeness, we use the full set of 24 benchmarks. We set the benchmark parameters so that all datatypes are integers (since \vericert{} only supports integers) and all datasets are `small' (to fit into the small on-chip memories). A current limitation of \vericert{}, as discussed in Section~\ref{?}, is that it does not support addition and subtraction operations involving integer literals not divisible by 4. To work around this, we lightly modified each benchmark program so that literals other than multiples of 4 are stored into variables before being added or subtracted. \JW{Any other notable changes to the benchmarks?}
-
-\paragraph{Synthesis setup.} The generated Verilog, by both \vericert{} and \legup{}, is provided to Intel Quartus v16.0~\cite{quartus}, that synthesises the Verilog into a netlist and place-and-routes this netlist onto an Arria 10 FPGA device, consists of approximately 430000 LUTs.
+\paragraph{Synthesis setup} The Verilog that is generated by \vericert{} or \legup{} is provided to Intel Quartus v16.0~\cite{quartus}, which synthesises it to a netlist, before placing-and-routing this netlist onto an Intel Arria 10 FPGA device that contains approximately 430000 LUTs.
\subsection{RQ1: How fast is \vericert{}-generated hardware?}
@@ -69,17 +64,21 @@ To work around this issue, we use a C implementation of the divide and modulo op
xlabel={\legup{} execution time (ms)},
ylabel={\vericert{} execution time (ms)},
xmin=10,
- xmax=100000,
- ymax=100000,
+ xmax=1000000,
+ ymax=1000000,
ymin=10,
%log ticks with fixed point,
]
-\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3]
+\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3,color=red]
table [x expr={\thisrow{legupcycles}/\thisrow{legupfreqMHz}}, y expr={\thisrow{vericertcycles}/\thisrow{vericertfreqMHz}}, col sep=comma]
{results/poly.csv};
+
+\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3, color=blue]
+ table [x expr={\thisrow{legupcycles}/\thisrow{legupfreqMHz}}, y expr={\thisrow{vericertcycles}/\thisrow{vericertoldfreqMHz}}, col sep=comma]
+ {results/poly.csv};
-\addplot[dotted, domain=10:100000]{x};
+\addplot[dotted, domain=10:1000000]{x};
%\addplot[dashed, domain=10:10000]{9.02*x + 442};
\end{axis}
@@ -88,20 +87,21 @@ To work around this issue, we use a C implementation of the divide and modulo op
\label{fig:comparison_time}
\end{figure}
-Firstly, before comparing any performance metrics, it is worth highlighting that any Verilog produced by \vericert{} is guaranteed to be \emph{correct}, whilst no such guarantee can be provided by \legup{}.
-This guarantee in itself provides a significant leap in terms of HLS reliability, compared to any other HLS tools available.
+%Firstly, before comparing any performance metrics, it is worth highlighting that any Verilog produced by \vericert{} is guaranteed to be \emph{correct}, whilst no such guarantee can be provided by \legup{}.
+%This guarantee in itself provides a significant leap in terms of HLS reliability, compared to any other HLS tools available.
Figure~\ref{fig:comparison_cycles} compares the cycle counts of our 27 programs executed by \vericert{} and \legup{} respectively.
-In most cases, we see that the data points are above the diagonal, demonstrating that the \legup{}-generated hardware is faster than \vericert{}-generated Verilog.
-This performance gap is mostly due to \legup{} optimisations such as scheduling and memory analysis, which are designed to exploit parallelism from input programs.
-On average, \legup{} designs are $4.5\times$ faster than \vericert{} designs on Polybench programs.
-This gap does not represent the performance cost that comes with formally proving a HLS tool.
-Instead, it is simply a gap between an unoptimised \vericert{} versus an optimised \legup{}.
-In fact, even without any optimisations, a few data points are close to diagonal and even below diagonal, which means \vericert{} is competitive to \legup{}.
-We are very encouraged by these data points.
-As we optimise \vericert{} to incorporate other HLS optimisations in a formally-proved manner, this gap can reduce whilst preserving our correctness guarantees.
-
-Cycle count is one factor in calculating execution latency. The other factor is the generated clock frequency, which tells us how fast a cycle executes. Figure~\ref{fig:comparison_time} compared the execution times of \vericert{} and \legup{}. On average, \vericert{} are $10\times$ slower than \legup{}. As mentioned earlier, we modify the Polybench programs to utilise C-based divides and modulos. We had to avoid using the built-in Verilog divides and modulos, since Quartus interprets them as single-cycle operations. This interpretation affects clock frequency drastically. On average, when using the built-in Verilog approach, \vericert{}'s clock frequency was 21MHz, compared to \legup{}'s clock frequency of 247MHz. By moving to the C-based approach, our average clock frequency is now 112MHz. Hence, we reduce the frequency gap from approximately $12\times$ to $2\times$. This gap still exists because \legup{} uses various optimisation tactics and Intel-specific IPs, which requires further engineering effort and testing from our side.
+In most cases, we see that the data points are above the diagonal, which demonstrates that the \legup{}-generated hardware is faster than \vericert{}-generated Verilog.
+
+On average, \legup{} designs are $4.5\times$ faster than \vericert{} designs.
+This performance gap is mostly due to \legup{} optimisations such as scheduling and memory analysis, which are designed to extract parallelism from input programs.
+%This gap does not represent the performance cost that comes with formally proving a HLS tool.
+%Instead, it is simply a gap between an unoptimised \vericert{} versus an optimised \legup{}.
+It is notable that even without \vericert{} performing many optimisations, a few data points are close to the diagonal and even below it.
+%We are very encouraged by these data points.
+As we improve \vericert{} by incorporating further optimisations, this gap should reduce whilst preserving our correctness guarantees.
+
+Cycle count is one factor in calculating execution latency. The other factor is the clock frequency, which tells us how fast a cycle executes. Figure~\ref{fig:comparison_time} compares the execution times of \vericert{} and \legup{}. On average, \vericert{} designs are about $10\times$ slower than \legup{}. As mentioned earlier, we modify the Polybench programs to utilise C-based divides and modulos. We had to avoid using the built-in Verilog divides and modulos, since Quartus interprets them as single-cycle operations. This interpretation affects clock frequency drastically. On average, when using the built-in Verilog approach, \vericert{}'s clock frequency was 21MHz, compared to \legup{}'s clock frequency of 247MHz. By moving to the C-based approach, our average clock frequency is now 112MHz. Hence, we reduce the frequency gap from approximately $12\times$ to $2\times$. This gap still exists because \legup{} uses various optimisation tactics and Intel-specific IPs, which requires further engineering effort and testing from our side.
% The difference in cycle counts shows the degree of parallelism that \legup{}'s scheduling and memory system can offer. However, their Verilog generation is not guaranteed to be correct. Although the runtime LegUp outputs are tested to be correct for these programs, this does not provide any confidence on the correctness of Verilog generation of any other programs. Our Coq proof mechanisation guarantees that generated Verilog is correct for any input program that uses the subset of \compcert{} instructions that we have proven to be correct.
diff --git a/references.bib b/references.bib
index 95014da..7303a48 100644
--- a/references.bib
+++ b/references.bib
@@ -832,7 +832,8 @@ year = {2020},
@unpublished{du_fuzzin_high_level_synth_tools,
author = {Du, Zewei and Herklotz, Yann and Ramanathan, Nadesh and Wickerson, John},
- note = {unpublished},
+ note = {Unpublished},
+ url = {https://yannherklotz.com/docs/drafts/fuzzing_hls.pdf}
title = {{Fuzzing High-Level Synthesis Tools}},
}
diff --git a/results/poly.csv b/results/poly.csv
index 3e979cf..f02c58f 100644
--- a/results/poly.csv
+++ b/results/poly.csv
@@ -1,28 +1,28 @@
-benchmark,legupcycles,legupfreqMHz,legupluts,legupregs,leguprams,legupdsps,legupcomptime,vericertcycles,vericertfreqMHz,vericertluts,vericertregs,vericertrams,vericertdsps,vericertcomptime
-durbin,15106,188.61,2509,4008,0,8,4.77,19852,199.2,3027,6168,0,8,0.077
-lu,482766,244.62,3116,4593,0,10,4.72,2634766,92.97,54806,106037,0,6,0.097
-ludcmp,470843,249.69,3605,5397,0,15,4.87,2401354,83.52,57145,111408,0,10,0.132
-trisolv,35382,213.9,2412,3749,0,3,4.73,33550,112.59,28101,55109,0,2,0.086
-2mm,60088,226.5,1114,1936,0,7,4.80,427098,116.9,32600,63212,0,18,0.136
-3mm,204195,188.96,4210,4801,0,43,4.88,539430,97.13,45073,89719,0,18,0.147
-atas,126288,193.24,3026,3719,0,10,4.80,92000,130.41,28603,56646,0,6,0.069
-bicg,11907,303.4,308,537,0,6,4.78,121134,122.74,30091,58799,0,8,0.097
-mvt,16806,384.47,372,597,0,4,4.79,139194,119.93,30753,60450,0,6,0.130
-doitgen,57199,252.14,909,1402,0,2,5.05,350302,126.18,19898,38461,0,4,0.105
-symm,64903,284.41,2155,3170,0,10,4.63,248930,113.92,27693,54514,0,14,0.115
-syrk,57395,278.01,598,976,0,2,4.73,309018,87.15,76889,57816,0,8,0.094
-syr2k,125705,240.85,3149,3679,0,6,4.85,478040,78.6,120116,82032,0,12,0.116
-trmm,41432,281.61,610,990,0,4,4.71,147528,105.61,64031,40502,0,4,0.089
-gemm,83676,192.68,1029,1544,0,35,4.79,360772,121.61,31853,62126,0,16,0.104
-gemver,28087,303.49,1854,2380,0,8,4.68,175326,107.27,32615,64118,0,14,0.099
-gesummv,6634,310.46,298,504,0,4,4.77,111094,79.97,113876,72908,0,10,0.101
-covariance,109992,245.16,2098,3096,0,5,4.77,297450,110.57,28729,56660,0,4,0.083
-fdtd-2d,214153,262.61,2736,3801,0,2,4.73,831912,108.23,31333,61421,0,6,0.116
-heat-3d,41059,115.54,3132,2910,0,60,4.92,555930,110.42,33915,67273,0,9,0.181
-jacobi-1d,6914,386.25,1355,1885,0,0,4.72,16606,277.93,1636,3305,0,0,0.071
-jacobi-2d,84609,240.79,2347,3185,0,2,4.81,357100,113.53,30393,59782,0,4,0.079
-seidel-2d,345294,232.4,2128,3337,0,2,4.68,875758,127.99,26948,53133,0,2,0.091
-floyd-warshall,1238764,235.52,1869,2367,0,2,4.71,4762766,109.4,59859,118101,0,2,0.094
-nussinov,216467,273.07,1078,1431,0,2,4.79,837958,90.73,60663,119303,0,0,0.080
-cholesky,148443,273.6,1528,3419,0,15,4.89,2347850,90.88,53992,106180,0,8,0.100
-adi,90644,256.28,1111,2391,0,0,5.32,1713702,110.44,31207,60639,0,36,0.160 \ No newline at end of file
+benchmark,legupcycles,legupfreqMHz,legupluts,legupregs,leguprams,legupdsps,legupcomptime,vericertcycles,vericertfreqMHz,vericertluts,vericertregs,vericertrams,vericertdsps,vericertcomptime,vericertoldfreqMHz
+durbin,15106,188.61,2509,4008,0,8,4.77,19852,199.2,3027,6168,0,8,0.077,18.23
+lu,482766,244.62,3116,4593,0,10,4.72,2634766,92.97,54806,106037,0,6,0.097,19.21
+ludcmp,470843,249.69,3605,5397,0,15,4.87,2401354,83.52,57145,111408,0,10,0.132,18.78
+trisolv,35382,213.9,2412,3749,0,3,4.73,33550,112.59,28101,55109,0,2,0.086,19.1
+2mm,60088,226.5,1114,1936,0,7,4.80,427098,116.9,32600,63212,0,18,0.136,20.21
+3mm,204195,188.96,4210,4801,0,43,4.88,539430,97.13,45073,89719,0,18,0.147,19.74
+atas,126288,193.24,3026,3719,0,10,4.80,92000,130.41,28603,56646,0,6,0.069,21.75
+bicg,11907,303.4,308,537,0,6,4.78,121134,122.74,30091,58799,0,8,0.097,20.91
+mvt,16806,384.47,372,597,0,4,4.79,139194,119.93,30753,60450,0,6,0.130,20.71
+doitgen,57199,252.14,909,1402,0,2,5.05,350302,126.18,19898,38461,0,4,0.105,20.14
+symm,64903,284.41,2155,3170,0,10,4.63,248930,113.92,27693,54514,0,14,0.115,20.94
+syrk,57395,278.01,598,976,0,2,4.73,309018,87.15,76889,57816,0,8,0.094,20.81
+syr2k,125705,240.85,3149,3679,0,6,4.85,478040,78.6,120116,82032,0,12,0.116,20.82
+trmm,41432,281.61,610,990,0,4,4.71,147528,105.61,64031,40502,0,4,0.089,20.3
+gemm,83676,192.68,1029,1544,0,35,4.79,360772,121.61,31853,62126,0,16,0.104,18.64
+gemver,28087,303.49,1854,2380,0,8,4.68,175326,107.27,32615,64118,0,14,0.099,18.9
+gesummv,6634,310.46,298,504,0,4,4.77,111094,79.97,113876,72908,0,10,0.101,20.67
+covariance,109992,245.16,2098,3096,0,5,4.77,297450,110.57,28729,56660,0,4,0.083,18.73
+fdtd-2d,214153,262.61,2736,3801,0,2,4.73,831912,108.23,31333,61421,0,6,0.116,20.27
+heat-3d,41059,115.54,3132,2910,0,60,4.92,555930,110.42,33915,67273,0,9,0.181,20.38
+jacobi-1d,6914,386.25,1355,1885,0,0,4.72,16606,277.93,1636,3305,0,0,0.071,21.07
+jacobi-2d,84609,240.79,2347,3185,0,2,4.81,357100,113.53,30393,59782,0,4,0.079,20.75
+seidel-2d,345294,232.4,2128,3337,0,2,4.68,875758,127.99,26948,53133,0,2,0.091,18.8
+floyd-warshall,1238764,235.52,1869,2367,0,2,4.71,4762766,109.4,59859,118101,0,2,0.094,17.2
+nussinov,216467,273.07,1078,1431,0,2,4.79,837958,90.73,60663,119303,0,0,0.080,20.38
+cholesky,148443,273.6,1528,3419,0,15,4.89,2347850,90.88,53992,106180,0,8,0.100,21.07
+adi,90644,256.28,1111,2391,0,0,5.32,1713702,110.44,31207,60639,0,36,0.160,20.75 \ No newline at end of file