author    Yann Herklotz <git@yannherklotz.com>  2021-04-16 13:56:32 +0100
committer Yann Herklotz <git@yannherklotz.com>  2021-04-16 13:57:11 +0100
commit    d36bd7f187ddf0db1745c82400da996c97ed9a03 (patch)
tree      0419b7a7725231788af65b475bf19e7ee2e7a7aa /evaluation.tex
parent    20ddd80e5eb18e261d6f228d8e9103a9090b7a39 (diff)
Remove compilation speed and fix results
Diffstat (limited to 'evaluation.tex')
 evaluation.tex | 31
 1 file changed, 16 insertions(+), 15 deletions(-)
diff --git a/evaluation.tex b/evaluation.tex
index 5612eff..6cdb546 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -1,17 +1,16 @@
\section{Evaluation}
\label{sec:evaluation}
-Our evaluation is designed to answer the following three research questions.
+Our evaluation is designed to answer the following two research questions.
\begin{description}
\item[RQ1] How fast is the hardware generated by \vericert{}?
\item[RQ2] How area-efficient is the hardware generated by \vericert{}?
-\item[RQ3] How long does \vericert{} take to produce hardware?
\end{description}
\subsection{Experimental Setup}
\label{sec:evaluation:setup}
-\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 4.0, because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}. We also compare against \legup{} with different optimisation levels. First, we only turn off the LLVM optimisations in \legup{}, to eliminate all the optimisations that are common to standard software compilers. Secondly, we also compare against \legup{} with operation chaining turned off, in addition to the \legup{} optimisations.
+\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 4.0, because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}. We also compare against \legup{} at different optimisation levels. First, we turn off only the LLVM optimisations in \legup{}, to eliminate the optimisations that are common to standard software compilers. Secondly, we compare against \legup{} with both the LLVM optimisations and operation chaining turned off; operation chaining is an HLS-specific optimisation that combines data-dependent operations into a single clock cycle, and therefore dramatically reduces the number of cycles, without necessarily increasing the clock speed.
\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the \polybench{} benchmark suite (version 4.2.1)~\cite{polybench}, which consists of a collection of 30 numerical kernels. \polybench{} is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, requiring floats, which we do not support.
@@ -145,22 +144,24 @@ We configured \polybench{}'s parameters so that only integer types are used, sin
\caption{\polybench{} with division/modulo operations replaced by an iterative algorithm. The top graph shows the execution time of \vericert{}, \legup{} without LLVM optimisations and without operation chaining and \legup{} without front end LLVM optimisations relative to optimised \legup{}. The bottom graph shows the area relative to \legup{}.}\label{fig:polybench-nodiv}
\end{figure}
-Figure~\ref{fig:comparison_area} compares the resource utilisation of the \polybench{} programs generated by \vericert{} and \legup{}.
-On average, we see that \vericert{} produces hardware that is about $21\times$ larger than \legup{}. \vericert{} designs are filling up to 30\% of a (large) FPGA chip, while \legup{} uses no more than 1\% of the chip.
-The main reason for this is that RAM is not inferred automatically for the Verilog that is generated by \vericert{}; instead, large arrays of registers are synthesised.
-Synthesis tools such as Quartus generally require array accesses to be in a specific form in order for RAM inference to activate.
-\legup{}'s Verilog generation is tailored to enable RAM inference by Quartus, while \vericert{} generates more generic array accesses. This may make \vericert{} more portable across different FPGA synthesis tools and vendors.
-%For a fair comparison, we chose Quartus for these experiments because LegUp supports Quartus efficiently.
-% Consequently, on average, \legup{} designs use $XX$ RAMs whereas \vericert{} use none.
-Enabling RAM inference is part of our future plans.
+Firstly, before comparing any performance metrics, it is worth highlighting that any Verilog produced by \vericert{} is guaranteed to be \emph{correct}, whilst no such guarantee can be provided by \legup{}.
+This guarantee in itself represents a significant leap in HLS reliability compared to the other HLS tools available.
-% We see that \vericert{} designs use between 1\% and 30\% of the available logic on the FPGA, averaging at around 10\%, whereas LegUp designs all use less than 1\% of the FPGA, averaging at around 0.45\%. The main reason for this is mainly because RAM is not inferred automatically for the Verilog that is generated by \vericert{}. Other synthesis tools can infer the RAM correctly for \vericert{} output, so this issue could be solved by either using a different synthesis tool and targeting a different FPGA, or by generating the correct template which allows Quartus to identify the RAM automatically.
+The top graphs of Figure~\ref{fig:polybench-div} and Figure~\ref{fig:polybench-nodiv} compare the execution times of the 27 programs when compiled by \vericert{} and by the different optimisation levels of \legup{}. Each graph uses optimised \legup{} as the baseline. When division/modulo operations are present, \legup{} designs execute around 27$\times$ faster than \vericert{} designs; when division/modulo operations are replaced by the iterative algorithm, \legup{} designs are only 2$\times$ faster. Encouragingly, the benchmarks with division/modulo replaced also show that \vericert{} achieves the same execution speed as \legup{} with LLVM optimisations and operation chaining disabled, which suggests that the basic hardware generation is following the right steps. The execution time is obtained by dividing the number of clock cycles needed to complete the execution by the maximum clock frequency that the FPGA can run at with the given design, so we can analyse each of these two factors separately.
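The relationship just described can be written explicitly (the symbol names here are ours):

```latex
% Execution time of a design: cycle count divided by the maximum
% achievable clock frequency (equivalently, cycles times clock period).
\[
  t_{\text{exec}} = \frac{n_{\text{cycles}}}{f_{\text{max}}}
\]
```

Since the two factors are independent, a design can lose on cycle count yet win on overall execution time through a higher clock frequency, as happens for some of the benchmarks below.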
-\subsection{RQ3: How long does \vericert{} take to produce hardware?}
+First, looking at the difference in clock cycles, \vericert{} produces designs that need around 4.5$\times$ as many clock cycles as \legup{} designs, both when division/modulo operations are present and when they are replaced. This performance gap can be explained in part by the LLVM optimisations, which seem to account for a 2$\times$ decrease in clock cycles, and by operation chaining, which decreases the clock cycles by another 2$\times$. The rest of the speed-up is mostly due to \legup{} optimisations such as scheduling and memory analysis, which are designed to extract parallelism from input programs.
+This gap does not represent the performance cost that comes with formally proving an HLS tool.
+Instead, it is simply a gap between an unoptimised \vericert{} versus an optimised \legup{}.
+As we improve \vericert{} by incorporating further optimisations, this gap should reduce whilst preserving the correctness guarantees.
-Figure~\ref{fig:comparison_comptime} compares the compilation times of \vericert{} and of \legup{}, with each data point corresponding to one of the \polybench{} benchmarks. On average, \vericert{} compilation is about $47\times$ faster than \legup{} compilation. \vericert{} is much faster because it omits many of the time-consuming HLS optimisations performed by \legup{}, such as scheduling and memory analysis. This comparison also demonstrates that our fully verified approach does not add substantial overheads in compilation time, since we do not invoke verification for every compilation instance, unlike the approaches based on translation validation that we mentioned in Section~\ref{sec:intro}.
+Secondly, looking at the maximum clock frequency that each design can achieve, when division/modulo operations are present, \vericert{} designs reach a maximum clock frequency around 8.2$\times$ lower than that of \legup{} designs. This is in stark contrast to the case where division/modulo operations are replaced, where \vericert{} generates designs that actually achieve 2$\times$ the frequency of \legup{} designs. The dramatic discrepancy in the former case can be largely attributed to \vericert{}'s na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}: \vericert{} achieved an average clock frequency of just 13MHz, while \legup{} managed about 111MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency rises to about 220MHz. The fact that \vericert{} then overtakes \legup{} can perhaps be explained by \legup{}'s scheduling packing too many instructions into a single clock cycle, or by \legup{} using a more involved RAM template so that the hardware produces a dual-port RAM, which can perform two reads or writes per clock cycle.
-%\NR{Do we want to finish the section off with some highlights or a summary?}
+Figure~\ref{fig:polybench-nodiv} also contains some interesting individual cases. For the \texttt{trmm} benchmark, \vericert{} produces hardware that executes in the same number of cycles as \legup{}, while achieving twice the clock frequency, thereby producing a design that executes twice as fast as \legup{}. Another interesting benchmark is \texttt{doitgen}, where \vericert{} is comparable to \legup{} without LLVM optimisations; here, the LLVM optimisations seem to have a large effect on the cycle count.
+
+\subsection{RQ2: How area-efficient is \vericert{}-generated hardware?}
+
+The bottom graphs in both Figure~\ref{fig:polybench-div} and Figure~\ref{fig:polybench-nodiv} compare the resource utilisation of the \polybench{} programs generated by \vericert{} and \legup{} at various optimisation levels.
+By looking at the median, when division/modulo operations are present, we see that \vericert{} produces hardware that is about the same size as optimised \legup{}, whereas the unoptimised versions of \legup{} actually produce slightly smaller hardware. This is because optimisations can often increase the size of the hardware to make it faster. In Figure~\ref{fig:polybench-div} especially, there are a few benchmarks where the \legup{} design is much smaller than the one produced by \vericert{}. This can mostly be explained by resource sharing in \legup{}: division/modulo operations require large circuits, so it is usual to instantiate only one such circuit per design and share it among all division/modulo operations. As \vericert{} uses a na\"ive implementation of division/modulo, multiple such circuits are present in the design, which inflates its size. Looking at Figure~\ref{fig:polybench-nodiv}, one can see that without division, \vericert{} designs are almost always around the same size as \legup{} designs, never more than 2$\times$ larger, and sometimes even smaller. The similarity in area also shows that the memory is correctly being inferred as a RAM by the synthesis tool, rather than being implemented with registers.
%%% Local Variables:
%%% mode: latex