Update on Overleaf.

author: Yann Herklotz <ymh15@ic.ac.uk> 2020-11-21 00:03:19 +0000
committer: overleaf <overleaf@localhost> 2020-11-21 00:04:19 +0000
commit: 15aab23c1e89e9369b631eb3cefc7fc708d6da36 (patch)
tree: 1fa2ba7cf2b0af2f369918959326555368fd0b1f /evaluation.tex
parent: 5f228d3512b2e3e39305e6f53bc873c76439e96f (diff)
download: oopsla21_fvhls-15aab23c1e89e9369b631eb3cefc7fc708d6da36.tar.gz
oopsla21_fvhls-15aab23c1e89e9369b631eb3cefc7fc708d6da36.zip
1 files changed, 34 insertions, 30 deletions
diff --git a/evaluation.tex b/evaluation.tex
index 1fb7162..ef4883e 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -9,15 +9,15 @@ Our evaluation is designed to answer the following three research questions.
 \end{description}
 
 \subsection{Experimental Setup}
+\label{sec:evaluation:setup}
 
 \paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 5.1 because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}.
 
 \paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the PolyBench/C benchmark suite (version 4.2.1)~\cite{polybench}, which consists of a collection of 30 numerical kernels. PolyBench/C is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
-We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, which require floats, which we do not support. 
+We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, requiring floats, which we do not support. 
 % Interestingly, we were also unable to evaluate \texttt{cholesky} on \legup{}, since it produce an error during its HLS compilation. 
 %In summary, we evaluate 27 programs from the latest Polybench suite. 
-
-We configured Polybench's parameters so that only integer types are used, since we do not support floats or doubles currently. We use Polybench's smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
+We configured Polybench's parameters so that only integer types are used, since we do not support floats. We use Polybench's smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
 
 \vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including LegUp, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is overridden with our own implementation that uses repeated division and multiplications by 2. Where this change makes an appreciable difference to the performance results, we give the results for both benchmark sets.
 
@@ -26,6 +26,7 @@ We configured Polybench's parameters so that only integer types are used, since
 \subsection{RQ1: How fast is \vericert{}-generated hardware?}
 
 \begin{figure}
+\definecolor{cyclecountcol}{HTML}{1b9e77}
 \begin{tikzpicture}
 \begin{axis}[
     xmode=log,
@@ -41,7 +42,7 @@ We configured Polybench's parameters so that only integer types are used, since
     %log ticks with fixed point,
   ]
   
-\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3]
+\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.6,cyclecountcol]
   table [x=legupcycles, y=vericertcycles, col sep=comma]
   {results/poly.csv};
 
@@ -50,11 +51,13 @@ We configured Polybench's parameters so that only integer types are used, since
   
 \end{axis}
 \end{tikzpicture}
-\caption{A comparison of the cycle count of hardware designs generated by \vericert{} and by \legup{}, where the diagonal represents $y=x$.}
+\caption{A comparison of the cycle count of hardware designs generated by \vericert{} and by \legup{}.}
 \label{fig:comparison_cycles}
 \end{figure}
 
 \begin{figure}
+\definecolor{polycol}{HTML}{e6ab02}
+\definecolor{polywocol}{HTML}{7570b3}
 \begin{tikzpicture}
 \begin{axis}[
     xmode=log,
@@ -67,23 +70,29 @@ We configured Polybench's parameters so that only integer types are used, since
     xmax=1000000,
     ymax=1000000,
     ymin=10,
+    legend pos=south east,
     %log ticks with fixed point,
   ]
+ 
   
-\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3,color=red]
+\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.8, polycol]
+  table [x expr={\thisrow{legupcycles}/\thisrow{legupfreqMHz}}, y expr={\thisrow{vericertcycles}/\thisrow{vericertoldfreqMHz}}, col sep=comma]
+  {results/poly.csv};
+ 
+\addlegendentry{PolyBench}
+
+\addplot[draw=none, mark=o, fill opacity=0, polywocol]
   table [x expr={\thisrow{legupcycles}/\thisrow{legupfreqMHz}}, y expr={\thisrow{vericertcycles}/\thisrow{vericertfreqMHz}}, col sep=comma]
   {results/poly.csv};
   
-\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3, color=blue]
-  table [x expr={\thisrow{legupcycles}/\thisrow{legupfreqMHz}}, y expr={\thisrow{vericertcycles}/\thisrow{vericertoldfreqMHz}}, col sep=comma]
-  {results/poly.csv};
+\addlegendentry{PolyBench w/o division}
 
 \addplot[dotted, domain=10:1000000]{x};
 %\addplot[dashed, domain=10:10000]{9.02*x + 442};
   
 \end{axis}
 \end{tikzpicture}
-\caption{A comparison of the execution time of hardware designs generated by \vericert{} and by \legup{}, where the diagonal represents $y=x$.}
+\caption{A comparison of the execution time of hardware designs generated by \vericert{} and by \legup{}.}
 \label{fig:comparison_time}
 \end{figure}
 
@@ -101,15 +110,12 @@ It is notable that even without \vericert{} performing many optimisations, a few
 %We are very encouraged by these data points. 
 As we improve \vericert{} by incorporating further  optimisations, this gap should reduce whilst preserving our correctness guarantees.
 
-Cycle count is one factor in calculating execution times; the other is the clock frequency, which determines the duration of each of these cycles. Figure~\ref{fig:comparison_time} compares the execution times of \vericert{} and \legup{}. Across the original Polybench/C benchmarks, we see that \vericert{} designs are about $60\times$ slower than \legup{} designs. This dramatic discrepancy in performance can be largely attributed to the division
-
-As mentioned earlier, we modify the Polybench programs to utilise C-based divides and modulos. We had to avoid using the built-in Verilog divides and modulos, since Quartus interprets them as single-cycle operations. This interpretation affects clock frequency drastically. On average, when using the built-in Verilog approach, \vericert{}'s clock frequency was 21MHz, compared to \legup{}'s clock frequency of 247MHz. By moving to the C-based approach, our average clock frequency is now 112MHz. Hence, we reduce the frequency gap from approximately $12\times$ to $2\times$. This gap still exists because \legup{} uses various optimisation tactics and Intel-specific IPs, which requires further engineering effort and testing from our side. 
-
-% The difference in cycle counts shows the degree of  parallelism that \legup{}'s scheduling and memory system can offer. However, their Verilog generation is not guaranteed to be correct. Although the runtime LegUp outputs are tested to be correct for these programs, this does not provide any confidence on the correctness of Verilog generation of any other programs. Our Coq proof mechanisation guarantees that generated Verilog is correct for any input program that uses the subset of \compcert{} instructions that we have proven to be correct.
+Cycle count is one factor in calculating execution times; the other is the clock frequency, which determines the duration of each of these cycles. Figure~\ref{fig:comparison_time} compares the execution times of \vericert{} and \legup{}. Across the original Polybench/C benchmarks, we see that \vericert{} designs are about \slowdownOrig$\times$ slower than \legup{} designs. This dramatic discrepancy in performance can be largely attributed to \vericert's na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 21MHz, while \legup{} managed about 247MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency becomes about 112MHz. This is better, but still substantially below \legup{}, which uses various additional optimisations and Intel-specific IP blocks. Across the modified Polybench/C benchmarks, we see that \vericert{} designs are about \slowdownDiv$\times$ slower than \legup{} designs.
 
 \subsection{RQ2: How area-efficient is \vericert{}-generated hardware?}
 
 \begin{figure}
+\definecolor{resourceutilcol}{HTML}{e7298a}
 \begin{tikzpicture}
 \begin{axis}[
     height=80mm,
@@ -120,7 +126,7 @@ As mentioned earlier, we modify the Polybench programs to utilise C-based divide
     xmax=1, ymax=30,
   ]
   
-\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3]
+\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.6,resourceutilcol]
   table [x expr=(\thisrow{legupluts}/427200*100), y expr=(\thisrow{vericertluts}/427200*100), col sep=comma]
   {results/poly.csv};
   
@@ -128,27 +134,25 @@ As mentioned earlier, we modify the Polybench programs to utilise C-based divide
   
 \end{axis}
 \end{tikzpicture}
-\caption{A comparison of the resource utilisation of designs generated by \vericert{} and by \legup{}}
+\caption{A comparison of the resource utilisation of designs generated by \vericert{} and by \legup{}.}
 \label{fig:comparison_area}
 \end{figure}
 
-Figure~\ref{fig:comparison_area} compares the resource utilisation of Polybench programs generated by \vericert{} and \legup{}.
-On average, \vericert{} produces hardware that is $21\times$ larger than \legup{} for the same Polybench programs. 
-\vericert{} designs are also filling up to 30\% of a large FPGA chip, which indicates that unoptimised Verilog generation can be costly. 
-The key reason for this behaviour, is that absence of RAM inference for the \vericert{} designs. 
-RAM are dedicated hardware blocks on the FPGA that are customly designed to allow for array synthesis.
-Synthesis tools like Quartus generally require array accesses to be in a certain template or form to enable inference of RAM use.
-\legup{}'s Verilog generation is tailored to enable RAM inference by Quartus.
-\vericert{}'s array accesses in Verilog are more generic, allowing us to target different FPGA synthesis tools and vendors, but underfits for Quartus requirements.
-For a fair comparison, we chose Quartus for these experiments because LegUp supports Quartus efficiently. 
+Figure~\ref{fig:comparison_area} compares the resource utilisation of the Polybench programs generated by \vericert{} and \legup{}.
+On average, we see that \vericert{} produces hardware that is about $21\times$ larger than \legup{}. \vericert{} designs are filling up to 30\% of a (large) FPGA chip, while $\legup$ uses no more than 1\% of the chip.
+The main reason for this is mainly because RAM is not inferred automatically for the Verilog that is generated by \vericert{}; instead, large arrays of registers are synthesised instead.
+Synthesis tools such as Quartus generally require array accesses to be in a specific form in order for RAM inference to activate.
+\legup{}'s Verilog generation is tailored to enable RAM inference by Quartus, while \vericert{} generates more generic array accesses. This may make \vericert{} more portable across different FPGA synthesis tools and vendors.
+%For a fair comparison, we chose Quartus for these experiments because LegUp supports Quartus efficiently. 
 % Consequently, on average, \legup{} designs use $XX$ RAMs whereas \vericert{} use none. 
-Improving RAM inference is part of our future plans. 
+Enabling RAM inference is part of our future plans. 
 
 % We see that \vericert{} designs use between 1\% and 30\% of the available logic on the FPGA, averaging at around 10\%, whereas LegUp designs all use less than 1\% of the FPGA, averaging at around 0.45\%. The main reason for this is mainly because RAM is not inferred automatically for the Verilog that is generated by \vericert{}.  Other synthesis tools can infer the RAM correctly for \vericert{} output, so this issue could be solved by either using a different synthesis tool and targeting a different FPGA, or by generating the correct template which allows Quartus to identify the RAM automatically.
 
 \subsection{RQ3: How long does \vericert{} take to produce hardware?}
 
 \begin{figure}
+\definecolor{compiltimecol}{HTML}{66a61e}
 \begin{tikzpicture}
 \begin{axis}[
     height=80mm,
@@ -164,7 +168,7 @@ Improving RAM inference is part of our future plans.
     ymax=0.20,
   ]
   
-\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3] 
+\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.6,compiltimecol] 
   table [x=legupcomptime, y=vericertcomptime, col sep=comma]
   {results/poly.csv};
 
@@ -176,9 +180,9 @@ Improving RAM inference is part of our future plans.
 \label{fig:comparison_comptime}
 \end{figure}
 
-Figure~\ref{fig:comparison_comptime} compares the compilation times of \vericert{} and by \legup{}, with each data point corresponding to one of the PolyBench/C benchmarks. On average, \vericert{} compilation is about $47\times$ faster than \legup{} compilation. \vericert{} is much faster because it omits many complext time-consuming HLS optimisations performed by \legup{}, such as scheduling and memory analysis. This comparison also shows our approach do not add any significant overheads in compilation time, since we do not invoke verification for every compilation instance.
+Figure~\ref{fig:comparison_comptime} compares the compilation times of \vericert{} and of \legup{}, with each data point corresponding to one of the PolyBench/C benchmarks. On average, \vericert{} compilation is about $47\times$ faster than \legup{} compilation. \vericert{} is much faster because it omits many of the time-consuming HLS optimisations performed by \legup{}, such as scheduling and memory analysis. This comparison also demonstrates that our fully verified approach does not add substantial overheads in compilation time, since we do not invoke verification for every compilation instance, unlike the approaches based on translation validation that we mentioned in Section~\ref{sec:intro}.
 
-\NR{Do we want to finish the section off with some highlights or a summary?}
+%\NR{Do we want to finish the section off with some highlights or a summary?}
 
 %%% Local Variables:
 %%% mode: latex
author	Yann Herklotz <ymh15@ic.ac.uk>	2020-11-21 00:03:19 +0000
committer	overleaf <overleaf@localhost>	2020-11-21 00:04:19 +0000
commit	15aab23c1e89e9369b631eb3cefc7fc708d6da36 (patch)
tree	1fa2ba7cf2b0af2f369918959326555368fd0b1f /evaluation.tex
parent	5f228d3512b2e3e39305e6f53bc873c76439e96f (diff)
download	oopsla21_fvhls-15aab23c1e89e9369b631eb3cefc7fc708d6da36.tar.gz oopsla21_fvhls-15aab23c1e89e9369b631eb3cefc7fc708d6da36.zip