\section{Evaluation}\label{sec:evaluation}

Our evaluation is designed to answer the following four research questions.
\begin{description}
\item[RQ1] How fast is the hardware generated by \vericert{}?
\item[RQ2] How area-efficient is the hardware generated by \vericert{}?
\item[RQ3] How quickly does \vericert{} translate the C into Verilog?
\item[RQ4] How effective is the correctness theorem in \vericert{}?
\end{description}

\subsection{Experimental Setup}
\label{sec:evaluation:setup}

\newcommand\legupnoopt{\legup{} no-opt}
\newcommand\legupnooptchain{\legup{} no-opt no-chaining}

\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 4.0 because it is open-source and hence easily accessible, yet still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}. We also compare against \legup{} at different optimisation levels, in an effort to understand which optimisations have the biggest impact on the performance discrepancies between \legup{} and \vericert{}. The baseline \legup{} configuration has all the default automatic optimisations turned on. First, we turn off only the LLVM optimisations in \legup{}, eliminating the optimisations that are common to standard software compilers; we refer to this configuration as `\legupnoopt{}'. Second, we turn off both the LLVM optimisations and operation chaining; we refer to this configuration as `\legupnooptchain{}'. Operation chaining~\cite{paulin89_sched_bindin_algor_high_level_synth,venkataramani07_operat} is an HLS-specific optimisation that combines data-dependent operations into a single clock cycle, and therefore dramatically reduces the number of cycles, without necessarily decreasing the clock speed.
\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the \polybench{} benchmark suite (version 4.2.1)~\cite{polybench}, which is a collection of 30 numerical kernels. \polybench{} is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures. We were able to use 27 of the 30 programs; three had to be discarded (\texttt{cor\-re\-la\-tion},~\texttt{gram\-schmi\-dt} and~\texttt{de\-riche}) because they involve square roots, which require floating-point arithmetic that we do not support. We configured \polybench{}'s parameters so that only integer types are used, and we use \polybench{}'s smallest dataset for each program to ensure that the data can reside within the on-chip memories of the FPGA, avoiding any need for off-chip memory accesses. We have not modified the benchmarks to make them run through \legup{} optimally, e.g.\ by adding pragmas that trigger more advanced optimisations.

\vericert{} implements division and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, which causes substantial penalties in clock frequency. Other HLS tools, including \legup{}, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}.
Supporting pipelined operators such as division and modulo requires scheduling instructions so that they can execute in parallel with other operations; such scheduling is the main optimisation that remains to be added to \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is replaced with our own iterative implementation, which uses only repeated divisions and multiplications by 2. Fig.~\ref{fig:polybench-div} shows the results of comparing \vericert{} with optimised \legup{} 4.0 on the \polybench{} benchmarks with the divisions left intact; Fig.~\ref{fig:polybench-nodiv} shows the same comparison with the division/modulo operations replaced by the iterative algorithm.

\paragraph{Synthesis setup.} The Verilog that is generated by \vericert{} or \legup{} is provided to Xilinx Vivado v2017.1~\cite{xilinx_vivad_desig_suite}, which synthesises it to a netlist, before placing-and-routing this netlist onto a Xilinx XC7Z020 FPGA device that contains approximately 85000 LUTs.
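To illustrate the kind of routine substituted for \texttt{/} and \texttt{\%} in the alternative benchmarks, the following is a generic shift-and-subtract (restoring) division written in C, built only from shifts (divisions and multiplications by 2), comparisons and subtractions. The function name and exact formulation are our own sketch, not the precise code used in the benchmarks:

```c
#include <stdint.h>

/* Generic shift-and-subtract (restoring) division: builds the quotient
 * bit by bit using only shifts, comparisons and subtractions, so no
 * single-cycle divider circuit is needed in the generated hardware.
 * Illustrative sketch only, not the exact benchmark replacement. */
uint32_t iter_divu(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1); /* bring down the next bit of n */
        if (r >= d) {                  /* divisor fits: subtract, set quotient bit */
            r -= d;
            q |= 1u << i;
        }
    }
    *rem = r; /* the remainder doubles as the modulo result */
    return q;
}
```

A call such as \texttt{iter\_divu(a, b, \&r)} then stands in for \texttt{a / b}, with \texttt{r} holding \texttt{a \% b}.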
\subsection{RQ1: How Fast is \vericert{}-Generated Hardware?} \pgfplotstableread[col sep=comma]{results/rel-time-div.csv}{\divtimingtable} \pgfplotstableread[col sep=comma]{results/rel-size-div.csv}{\divslicetable} \definecolor{vericertcol}{HTML}{66C2A5} \definecolor{legupnooptcol}{HTML}{FC8D62} \definecolor{legupnooptnochaincol}{HTML}{8DA0CB} \newcommand\backgroundbar[2][5]{\draw[draw=none, fill=black!#1] (axis cs:#2*2+0.5,0.1) rectangle (axis cs:1+#2*2+0.5,300);} \begin{figure}\centering \begin{tikzpicture} \begin{groupplot}[ group style={ group name=my plots, group size=1 by 2, xlabels at=edge bottom, xticklabels at=edge bottom, vertical sep=5pt, }, ymode=log, ybar=0.4pt, width=1\textwidth, height=0.4\textwidth, /pgf/bar width=3pt, legend pos=south east, log ticks with fixed point, xticklabels from table={\divtimingtable}{benchmark}, legend style={nodes={scale=0.7, transform shape}}, x tick label style={rotate=90,anchor=east,font=\footnotesize}, legend columns=-1, xtick=data, enlarge x limits={abs=0.5}, ylabel style={font=\footnotesize}, xtick style={draw=none}, ] \nextgroupplot[ymin=0.8,ymax=300,ylabel={Execution time relative to \legup{}}] \pgfplotsinvokeforeach{0,...,12}{% \backgroundbar{#1}} \backgroundbar[10]{13} \addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col sep=comma] from \divtimingtable; \addplot+[legupnooptcol] table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \divtimingtable; \addplot+[legupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \divtimingtable; \draw (axis cs:-1,1) -- (axis cs:28,1); % JW: redraw axis border which has been partially covered by the grey bars \draw (axis cs:-0.5,0.8) rectangle (axis cs:27.5,300); \nextgroupplot[ymin=0.3,ymax=10,ylabel={Area relative to \legup{}}] \pgfplotsinvokeforeach{0,...,12}{% \backgroundbar{#1}} \backgroundbar[10]{13} \addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col sep=comma] from \divslicetable; \addplot+[legupnooptcol] table 
[x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \divslicetable;
\addplot+[legupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \divslicetable;
\draw (axis cs:-1,1) -- (axis cs:28,1);
% JW: redraw axis border which has been partially covered by the grey bars
\draw (axis cs:-0.5,0.3) rectangle (axis cs:27.5,10);
\legend{\vericert{},\legupnooptchain{},\legupnoopt{}};
\end{groupplot}
\end{tikzpicture}
\caption{Performance of \vericert{} compared to \legup{}, with division and modulo operations enabled. The top graph compares the execution times and the bottom graph compares the area of the generated designs. In both cases, the performance of \vericert{}, \legup{} without LLVM optimisations and without operation chaining, and \legup{} without LLVM optimisations is compared against default \legup{}.}\label{fig:polybench-div}
\end{figure}
\pgfplotstableread[col sep=comma]{results/rel-time-nodiv.csv}{\nodivtimingtable}
\pgfplotstableread[col sep=comma]{results/rel-size-nodiv.csv}{\nodivslicetable}
\begin{figure}\centering
\begin{tikzpicture}
\begin{groupplot}[
group style={
group name=my plots,
group size=1 by 2,
xlabels at=edge bottom,
xticklabels at=edge bottom,
vertical sep=5pt,
},
ymode=log,
ybar=0.4pt,
ytick={0.5,1,2,4,8},
width=1\textwidth,
height=0.4\textwidth,
/pgf/bar width=3pt,
legend pos=south east,
log ticks with fixed point,
xticklabels from table={\nodivtimingtable}{benchmark},
legend style={nodes={scale=0.7, transform shape}},
x tick label style={rotate=90,anchor=east,font=\footnotesize},
legend columns=-1,
xtick=data,
enlarge x limits={abs=0.5},
ylabel style={font=\footnotesize},
ymin=0.3,
xtick style={draw=none},
]
\nextgroupplot[ymin=0.3,ymax=10,ylabel={Execution time relative to \legup{}}]
\pgfplotsinvokeforeach{0,...,12}{%
\backgroundbar{#1}}
\backgroundbar[10]{13}
\addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col
sep=comma] from \nodivtimingtable; \addplot+[legupnooptcol] table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \nodivtimingtable; \addplot+[legupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \nodivtimingtable; \draw (axis cs:-1,1) -- (axis cs:28,1); \draw (axis cs:-0.5,0.3) rectangle (axis cs:27.5,10); \nextgroupplot[ymin=0.3,ymax=4,ylabel={Area relative to \legup{}}] \pgfplotsinvokeforeach{0,...,12}{% \backgroundbar{#1}} \backgroundbar[10]{13} \addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col sep=comma] from \nodivslicetable; \addplot+[legupnooptcol] table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \nodivslicetable; \addplot+[legupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \nodivslicetable; \draw (axis cs:-1,1) -- (axis cs:28,1); \draw (axis cs:-0.5,0.3) rectangle (axis cs:27.5,4); \legend{\vericert{},\legupnooptchain{},\legupnoopt{}}; \end{groupplot} \end{tikzpicture} \caption{Performance of \vericert{} compared to \legup{}, with division and modulo operations replaced by an iterative algorithm in software. The top graph compares the execution times and the bottom graph compares the area of the generated designs. In both cases, the performance of \vericert{}, \legup{} without LLVM optimisations and without operation chaining, and \legup{} without LLVM optimisations is compared against default \legup{}.}\label{fig:polybench-nodiv} \end{figure} Firstly, before comparing any performance metrics, it is worth highlighting that any Verilog produced by \vericert{} is guaranteed to be \emph{correct}, whilst no such guarantee can be provided by \legup{}. This guarantee in itself provides a significant leap in terms of HLS reliability, compared to any other HLS tools available. 
The top graphs of Fig.~\ref{fig:polybench-div} and Fig.~\ref{fig:polybench-nodiv} compare the execution time of the 27 programs executed by \vericert{} and the different optimisation levels of \legup{}. Each graph uses optimised \legup{} as the baseline. When division/modulo operations are present, \legup{} designs execute around 27$\times$ faster than \vericert{} designs. However, when division/modulo operations are replaced by the iterative algorithm, \legup{} designs are only 2$\times$ faster than \vericert{} designs. The benchmarks with division/modulo replaced show that \vericert{} actually achieves the same execution speed as \legup{} without LLVM optimisations and without operation chaining, which is encouraging and shows that the hardware generation is taking the right steps. The execution time is calculated by dividing the number of clock cycles needed to complete the execution by the maximum clock frequency at which the FPGA can run the design. We can therefore analyse each factor separately. First, looking at the difference in clock cycles, \vericert{} produces designs that need around 4.5$\times$ as many clock cycles as \legup{} designs in both cases, whether division/modulo operations are enabled or replaced. This gap can be explained in part by LLVM optimisations, which seem to account for a 2$\times$ decrease in clock cycles, and by operation chaining, which decreases the clock cycles by another 2$\times$. The rest of the speed-up is mostly due to \legup{} optimisations such as scheduling and memory analysis, which are designed to extract parallelism from input programs. This gap does not represent the performance cost of formally proving an HLS tool; it is simply the gap between an unoptimised \vericert{} and an optimised \legup{}. As we add further optimisations to \vericert{}, this gap should narrow whilst preserving the correctness guarantees.
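In symbols, the decomposition of execution time into the two factors analysed here is the standard one (our notation, not taken from the tools):

```latex
\[
  t_{\mathrm{exec}} = \frac{n_{\mathrm{cycles}}}{f_{\max}},
  \qquad
  \frac{t_1}{t_2} = \frac{n_1}{n_2} \cdot \frac{f_2}{f_1},
\]
```

where $n_i$ and $f_i$ are the cycle count and maximum clock frequency of design $i$. For instance, in the division-free case, a ${\sim}4.5\times$ cycle-count deficit combined with a ${\sim}2\times$ clock-frequency advantage for \vericert{} yields an overall slowdown of roughly $2\times$, matching the observed results.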
Secondly, looking at the maximum clock frequency that each design can achieve, \legup{} designs achieve 8.2$\times$ the maximum clock frequency of \vericert{} designs when division/modulo operations are present. This contrasts sharply with the case where no division/modulo operations are present, in which \vericert{} generates designs that achieve around 2$\times$ the clock frequency of \legup{} designs. The dramatic discrepancy in the former case can be largely attributed to \vericert{}'s na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 13MHz, while \legup{} managed about 111MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency rises to about 220MHz. The resulting advantage over \legup{} can be explained by the fact that \legup{} uses a memory controller to manage multiple RAMs through one interface, which is not needed in \vericert{}, as a single RAM is used for all the memory. A few individual benchmarks in Fig.~\ref{fig:polybench-nodiv} are worth examining. For the \texttt{trmm} benchmark, \vericert{} produces hardware that executes with the same cycle count as \legup{}'s, but at twice the clock frequency, thereby producing a design that executes twice as fast as \legup{}'s. Another interesting benchmark is \texttt{doitgen}, where \vericert{} is comparable to \legup{} without LLVM optimisations; here, the LLVM optimisations seem to have a large effect on the cycle count.
\subsection{RQ2: How Area-Efficient is \vericert{}-Generated Hardware?}

The bottom graphs in Fig.~\ref{fig:polybench-div} and Fig.~\ref{fig:polybench-nodiv} compare the resource utilisation of the \polybench{} programs generated by \vericert{} and \legup{} at the various optimisation levels. Looking at the median, when division/modulo operations are enabled, we see that \vericert{} produces hardware that is about the same size as optimised \legup{}, whereas the unoptimised versions of \legup{} actually produce slightly smaller hardware. This is because optimisations often increase the size of the hardware in order to make it faster. In Fig.~\ref{fig:polybench-div} especially, there are a few benchmarks where the \legup{} design is much smaller than the \vericert{} one. This is mostly explained by resource sharing in \legup{}: division/modulo operations require large circuits, so it is usual to instantiate only one such circuit per design. As \vericert{} uses the na\"ive implementation of division/modulo, multiple division circuits are present in the design, which inflates its size. Looking at Fig.~\ref{fig:polybench-nodiv}, one can see that without division, \vericert{} designs are almost always around the same size as \legup{} designs, never more than 2$\times$ larger, and sometimes even smaller. The similarity in area also shows that the RAM is correctly being inferred by the synthesis tool, rather than being implemented using registers.

\subsection{RQ3: How Quickly does \vericert{} Translate the C into Verilog?}

\legup{} takes around 10$\times$ as long as \vericert{} to perform the translation from C into Verilog, showing at least that verification does not have a substantial effect on the run-time of the HLS tool. However, this comparison is not particularly meaningful, as much of the extra time \legup{} spends goes on computationally heavy optimisations such as loop pipelining and scheduling.
\subsection{RQ4: How Effective is the Correctness Theorem in \vericert{}?}
\definecolor{fuzzred}{HTML}{f8514f}
\definecolor{fuzzyellow}{HTML}{fee4bf}
\definecolor{fuzzgreen}{HTML}{b2df8a}
\begin{figure}
\centering
\begin{tikzpicture}[xscale=0.11]
\draw[-latex] (13,0.5) to (13,0.25);
\draw[-latex] (55,0.5) to (55,0.25);
\draw[-latex] (99.85,0.5) to (99.85,0.25);
\draw[fuzzgreen, line width=5mm] (0,0) to (26.0,0);
\draw[fuzzyellow, line width=5mm] (26.0,0) to (99.7,0);
\draw[fuzzred, line width=5mm] (99.7,0) to (100,0);
\node[anchor=south] at (13,0.5) {40379 passes (26.00\%)};
\node[anchor=south] at (55,0.5) {114849 compile-time errors (73.97\%)};
\node[anchor=south] at (100,0.5) {39 run-time errors (0.03\%)};
\end{tikzpicture}
\caption{Results of fuzzing \vericert{} using 155267 random C programs generated by Csmith.}\label{tab:fuzzing}
\end{figure}
\begin{quotation}
{\it ``Beware of bugs in the above code; I have only proved it correct, not tried it.''}
\par\hfill -- D. E. Knuth (1977)
\end{quotation}
To gain further confidence that the Verilog designs generated by \vericert{} are actually correct, and that the correctness theorem is indeed effective, we fuzzed \vericert{} using Csmith~\cite{yang11_findin_under_bugs_c_compil}. \citeauthor{yang11_findin_under_bugs_c_compil} previously used Csmith in an extensive fuzzing campaign on CompCert and found a handful of bugs in the unverified parts of that compiler, so it is natural to explore whether it can find bugs in \vericert{} too. \citet{herklotz21_empir_study_reliab_high_level_synth_tools} have recently used Csmith to fuzz other HLS tools including \legup{}, so we configured Csmith in a similar way.
In addition to the features turned off by \citeauthor{herklotz21_empir_study_reliab_high_level_synth_tools}, we turned off the generation of global variables and non-32-bit operations. The generated designs were tested by simulating them and comparing the final output value to that obtained by compiling and running the test-cases with GCC 10.3.0. The results of the fuzzing run are shown in Fig.~\ref{tab:fuzzing}. Out of the 155267 test-cases generated by Csmith, 26\% passed, meaning they compiled without error and produced the same final value as GCC. Most of the test-cases, 73.97\%, failed at compile time. The most common reasons for this were unsigned comparisons between integers (\vericert{} requires them to be signed) and the presence of 8-bit operations (which \vericert{} does not support, and which we could not turn off due to a limitation in Csmith). Because the test-cases generated by Csmith could not be tailored exactly to the C fragment that \vericert{} supports, such a high compile-time failure rate is expected. Finally, and most interestingly, there were a total of 39 run-time failures, which the correctness theorem should rule out. However, all 39 of these failures were due to a bug in the pretty-printing of the final Verilog code, where a logical negation (\texttt{!}) was accidentally used instead of a bitwise negation (\verb|~|). Once this bug was fixed, all test-cases passed.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% TeX-command-extra-options: "-shell-escape"
%%% End:
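The mismatch behind those failures is easy to demonstrate in C, whose \texttt{!} and \verb|~| operators mirror the Verilog ones that were confused: for any non-zero operand, logical negation yields 0, whereas bitwise negation flips every bit. A minimal illustration (our own example, not code from \vericert{}):

```c
#include <stdint.h>

/* Logical negation (!) collapses a value to 0 or 1; bitwise negation
 * (~) inverts each bit.  Confusing the two, as in the pretty-printing
 * bug, changes the computed value for almost every operand. */
uint8_t logical_not(uint8_t x) { return (uint8_t)(!x); }
uint8_t bitwise_not(uint8_t x) { return (uint8_t)(~x); }
```

For example, on the operand \texttt{0x0F}, the two functions return \texttt{0} and \texttt{0xF0} respectively; they agree only when the result is truncated to a single bit.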