author Yann Herklotz <git@yannherklotz.com> 2021-04-16 12:17:36 +0100
committer Yann Herklotz <git@yannherklotz.com> 2021-04-16 12:17:45 +0100
commit 65686f3793749e60011d504b031028e74969d5f3
tree 62f8b33b19cd7015d101c1d5c476dab94f49b17f /evaluation.tex
parent 197e78a9f93f2344096322cf31863ae99993f9e2
Fix up evaluation
Diffstat (limited to 'evaluation.tex')
-rw-r--r-- evaluation.tex 219
1 files changed, 5 insertions, 214 deletions
diff --git a/evaluation.tex b/evaluation.tex
index ac67ec7..5612eff 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -11,7 +11,7 @@ Our evaluation is designed to answer the following three research questions.
\subsection{Experimental Setup}
\label{sec:evaluation:setup}
-\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 5.1 \JW{4.0 now, right?} because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}.
+\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 4.0, because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}. We also compare against \legup{} with different optimisation levels. First, we only turn off the LLVM optimisations in \legup{}, to eliminate all the optimisations that are common to standard software compilers. Secondly, we also compare against \legup{} with operation chaining turned off, in addition to the \legup{} optimisations.
\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the \polybench{} benchmark suite (version 4.2.1)~\cite{polybench}, which consists of a collection of 30 numerical kernels. \polybench{} is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, requiring floats, which we do not support.
@@ -19,9 +19,9 @@ We were able to use 27 of the 30 programs; three had to be discarded (\texttt{co
%In summary, we evaluate 27 programs from the latest Polybench suite.
We configured \polybench{}'s parameters so that only integer types are used, since we do not support floats. We use \polybench{}'s smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
-\vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including LegUp, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is overridden with our own implementation that uses repeated division and multiplications by 2. Where this change makes an appreciable difference to the performance results, we give the results for both benchmark sets.
+\vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including LegUp, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. Implementing pipelined operators such as the divide and modulus operator can be solved by scheduling the instructions so that these can execute in parallel, which is the main optimisation that needs to be added to \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is replaced with our own implementation that uses repeated division and multiplications by 2. Figure~\ref{fig:polybench-div} shows the results of comparing Vericert with optimised LegUp 4.0 on the \polybench{} benchmarks, where divisions have been left intact. Figure~\ref{fig:polybench-nodiv} performs the comparison where the division/modulo operations have been replaced by the iterative algorithm.
-\paragraph{Synthesis setup} The Verilog that is generated by \vericert{} or \legup{} is provided to Intel Quartus v16.0~\cite{quartus}, which synthesises it to a netlist, before placing-and-routing this netlist onto an Intel Arria 10 FPGA device that contains approximately 430000 LUTs.
+\paragraph{Synthesis setup} The Verilog that is generated by \vericert{} or \legup{} is provided to Xilinx Vivado v2017.1~\cite{xilinx_vivad_desig_suite}, which synthesises it to a netlist, before placing-and-routing this netlist onto a Xilinx XC7Z020 FPGA device that contains approximately 85000 LUTs.
\subsection{RQ1: How fast is \vericert{}-generated hardware?}
@@ -85,7 +85,7 @@ We configured \polybench{}'s parameters so that only integer types are used, sin
\legend{\vericert{},\legup{} w/o opt+chain,\legup{} w/o opt};
\end{groupplot}
\end{tikzpicture}
- \caption{\polybench{} with division enabled. \JW{More descriptive caption needed (and next figure too)}}
+ \caption{\polybench{} with division/modulo operations enabled. The top graph shows the execution time of \vericert{}, \legup{} without LLVM optimisations and without operation chaining and \legup{} without front end LLVM optimisations relative to optimised \legup{}. The bottom graph shows the area relative to \legup{}.}\label{fig:polybench-div}
\end{figure}
\pgfplotstableread[col sep=comma]{results/rel-time-nodiv.csv}{\nodivtimingtable}
@@ -142,218 +142,9 @@ We configured \polybench{}'s parameters so that only integer types are used, sin
\legend{\vericert{},\legup{} w/o opt+chain,\legup{} w/o opt};
\end{groupplot}
\end{tikzpicture}
- \caption{\polybench{} with division replaced by iterative division algorithm.}
+ \caption{\polybench{} with division/modulo operations replaced by an iterative algorithm. The top graph shows the execution time of \vericert{}, \legup{} without LLVM optimisations and without operation chaining and \legup{} without front end LLVM optimisations relative to optimised \legup{}. The bottom graph shows the area relative to \legup{}.}\label{fig:polybench-nodiv}
\end{figure}
-%
-%\pgfplotstableread[col sep=comma]{results/exec-time.csv}{\nodivtimingtable}
-%\begin{figure}\centering
-% \begin{tikzpicture}
-% \begin{semilogyaxis}[
-% ybar=0pt,
-% width=1\textwidth,
-% height=0.5\textwidth,
-% bar width=3pt,
-% ymin=0.1,
-% ymax=3,
-% log ticks with fixed point,
-% legend pos=south east,
-% xlabel={Polybench Benchmarks},
-% xticklabels from table={\nodivtimingtable}{benchmark},
-% ylabel={\vericert{} / \legup{} execution time ratio},
-% legend style={nodes={scale=0.7, transform shape}},
-% x tick label style={rotate=60,anchor=east,font=\footnotesize},
-% xtick=data,
-% enlarge x limits={abs=0.5},
-% ]
-%
-% \addplot+ table [x expr=\coordindex,y=v no nc,col sep=comma] from \nodivtimingtable;
-% \addlegendentry{LegUp w/o opt w/o chain};
-% \addplot+ table [x expr=\coordindex,y=v no,col sep=comma] from \nodivtimingtable;
-% \addlegendentry{LegUp w/o opt};
-% \addplot+ table [x expr=\coordindex,y=v op,col sep=comma] from \nodivtimingtable;
-% \addlegendentry{LegUp};
-%
-% \end{semilogyaxis}
-% \end{tikzpicture}
-%\end{figure}
-%
-%\pgfplotstableread[col sep=comma]{results/slice-nodiv.csv}{\nodivslicetable}
-%\begin{figure}\centering
-% \begin{tikzpicture}
-% \begin{semilogyaxis}[
-% ybar=0pt,
-% width=1\textwidth,
-% height=0.5\textwidth,
-% bar width=3pt,
-% ymin=0.1,
-% ymax=3,
-% log ticks with fixed point,
-% legend pos=south east,
-% xlabel={Polybench Benchmarks},
-% xticklabels from table={\nodivslicetable}{benchmark},
-% ylabel={\vericert{} / \legup{} execution time ratio},
-% legend style={nodes={scale=0.7, transform shape}},
-% x tick label style={rotate=60,anchor=east,font=\footnotesize},
-% xtick=data,
-% enlarge x limits={abs=0.5},
-% legend columns=-1,
-% ]
-%
-% \addplot+ table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \nodivslicetable;
-% \addlegendentry{LegUp w/o opt w/o chain};
-% \addplot+ table [x expr=\coordindex,y=legup noopt,col sep=comma] from \nodivslicetable;
-% \addlegendentry{LegUp w/o opt};
-% \addplot+ table [x expr=\coordindex,y=legup,col sep=comma] from \nodivslicetable;
-% \addlegendentry{LegUp};
-%
-% \end{semilogyaxis}
-% \end{tikzpicture}
-%\end{figure}
-
-%\begin{figure}\centering
-%\begin{subfigure}[t]{0.48\textwidth}
-%\definecolor{cyclecountcol}{HTML}{1b9e77}
-%\begin{tikzpicture}
-%\begin{axis}[
-% xmode=log,
-% ymode=log,
-% height=1\textwidth,
-% width=1\textwidth,
-% xlabel={\legup{} cycle count},
-% ylabel={\vericert{} cycle count},
-% xmin=1000,
-% xmax=10000000,
-% ymax=10000000,
-% ymin=1000,
-% %log ticks with fixed point,
-% ]
-%
-%\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.6,cyclecountcol]
-% table [x=legupcycles, y=vericertcycles, col sep=comma]
-% {results/poly.csv};
-%
-%\addplot[dotted, domain=1000:10000000]{x};
-%%\addplot[dashed, domain=10:10000]{9.02*x};
-%
-%\end{axis}
-%\end{tikzpicture}
-%\caption{A comparison of the cycle count of hardware designs generated by \vericert{} and by \legup{}.}
-%\label{fig:comparison_cycles}
-%\end{subfigure}\hfill%
-%\begin{subfigure}[t]{0.48\textwidth}
-%\definecolor{polycol}{HTML}{e6ab02}
-%\definecolor{polywocol}{HTML}{7570b3}
-%\begin{tikzpicture}
-%\begin{axis}[
-% xmode=log,
-% ymode=log,
-% height=1\textwidth,
-% width=1\textwidth,
-% xlabel={\legup{} execution time (ms)},
-% ylabel={\vericert{} execution time (ms)},
-% xmin=10,
-% xmax=1000000,
-% ymax=1000000,
-% ymin=10,
-% legend pos=south east,
-% %log ticks with fixed point,
-% ]
-%
-%\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.8, polycol]
-% table [x expr={\thisrow{legupcycles}/\thisrow{legupfreqMHz}}, y expr={\thisrow{vericertcycles}/\thisrow{vericertoldfreqMHz}}, col sep=comma]
-% {results/poly.csv};
-%
-%\addlegendentry{PolyBench}
-%
-%\addplot[draw=none, mark=o, fill opacity=0, polywocol]
-% table [x expr={\thisrow{legupcycles}/\thisrow{legupfreqMHz}}, y expr={\thisrow{vericertcycles}/\thisrow{vericertfreqMHz}}, col sep=comma]
-% {results/poly.csv};
-%
-%\addlegendentry{PolyBench w/o division}
-%
-%\addplot[dotted, domain=10:1000000]{x};
-%%\addplot[dashed, domain=10:10000]{9.02*x + 442};
-%
-%\end{axis}
-%\end{tikzpicture}
-%\caption{A comparison of the execution time of hardware designs generated by \vericert{} and by \legup{}.}
-%\label{fig:comparison_time}
-%\end{subfigure}
-%\end{figure}
-
-%Firstly, before comparing any performance metrics, it is worth highlighting that any Verilog produced by \vericert{} is guaranteed to be \emph{correct}, whilst no such guarantee can be provided by \legup{}.
-%This guarantee in itself provides a significant leap in terms of HLS reliability, compared to any other HLS tools available.
-
-Figure~\ref{fig:comparison_cycles} compares the cycle counts of our 27 programs executed by \vericert{} and \legup{} respectively.
-In most cases, we see that the data points are above the diagonal, which demonstrates that the \legup{}-generated hardware is faster than \vericert{}-generated Verilog.
-
-On average, \legup{} designs are $4.5\times$ faster than \vericert{} designs.
-This performance gap is mostly due to \legup{} optimisations such as scheduling and memory analysis, which are designed to extract parallelism from input programs.
-%This gap does not represent the performance cost that comes with formally proving a HLS tool.
-%Instead, it is simply a gap between an unoptimised \vericert{} versus an optimised \legup{}.
-It is notable that even without \vericert{} performing many optimisations, a few data points are close to the diagonal and even below it.
-%We are very encouraged by these data points.
-As we improve \vericert{} by incorporating further optimisations, this gap should reduce whilst preserving our correctness guarantees.
-
-Cycle count is one factor in calculating execution times; the other is the clock frequency, which determines the duration of each of these cycles. Figure~\ref{fig:comparison_time} compares the execution times of \vericert{} and \legup{}. Across the original \polybench{} benchmarks, we see that \vericert{} designs are about \slowdownOrig$\times$ slower than \legup{} designs. This dramatic discrepancy in performance can be largely attributed to \vericert's na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 21MHz, while \legup{} managed about 247MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency becomes about 112MHz. This is better, but still substantially below \legup{}, which uses various additional optimisations and Intel-specific IP blocks. Across the modified \polybench{} benchmarks, we see that \vericert{} designs are about \slowdownDiv$\times$ slower than \legup{} designs.
-
-\subsection{RQ2: How area-efficient is \vericert{}-generated hardware?}
-
-%\begin{figure}
-%\begin{subfigure}[t]{0.48\textwidth}
-%\definecolor{resourceutilcol}{HTML}{e7298a}
-%\begin{tikzpicture}
-%\begin{axis}[
-% height=1\textwidth,
-% width=1\textwidth,
-% xlabel={\legup{} resource utilisation (\%)},
-% ylabel={\vericert{} resource utilisation (\%)},
-% xmin=0, ymin=0,
-% xmax=1, ymax=30,
-% ]
-%
-%\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.6,resourceutilcol]
-% table [x expr=(\thisrow{legupluts}/427200*100), y expr=(\thisrow{vericertluts}/427200*100), col sep=comma]
-% {results/poly.csv};
-%
-%% \addplot[dashed, domain=0:1]{x};
-%
-%\end{axis}
-%\end{tikzpicture}
-%\caption{A comparison of the resource utilisation of designs generated by \vericert{} and by \legup{}.}
-%\label{fig:comparison_area}
-%\end{subfigure}\hfill%
-%\begin{subfigure}[t]{0.48\textwidth}
-%\definecolor{compiltimecol}{HTML}{66a61e}
-%\begin{tikzpicture}
-%\begin{axis}[
-% height=1\textwidth,
-% width=1\textwidth,
-% xlabel={\legup{} compilation time (s)},
-% ylabel={\vericert{} compilation time (s)},
-% yticklabel style={
-% /pgf/number format/fixed,
-% /pgf/number format/precision=2},
-% xmin=4.6,
-% xmax=5.1,
-% ymin=0.06,
-% ymax=0.20,
-% ]
-%
-%\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.6,compiltimecol]
-% table [x=legupcomptime, y=vericertcomptime, col sep=comma]
-% {results/poly.csv};
-%
-% %\addplot[dashed, domain=4.5:5.1]{0.1273*x-0.5048};
-%
-%\end{axis}
-%\end{tikzpicture}
-%\caption{A comparison of compilation time for \vericert{} and for \legup{}}
-%\label{fig:comparison_comptime}
-%\end{subfigure}
-%\end{figure}
-
Figure~\ref{fig:comparison_area} compares the resource utilisation of the \polybench{} programs generated by \vericert{} and \legup{}.
On average, we see that \vericert{} produces hardware that is about $21\times$ larger than \legup{}. \vericert{} designs are filling up to 30\% of a (large) FPGA chip, while \legup{} uses no more than 1\% of the chip.
The main reason for this is that RAM is not inferred automatically for the Verilog that is generated by \vericert{}; instead, large arrays of registers are synthesised.
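
The setup paragraph in this commit says that each division/modulo operation in the alternative benchmark set is replaced with an implementation that uses repeated division and multiplications by 2, but the diff does not show that implementation. The following is a minimal sketch of what such a routine could look like, assuming a standard shift-and-subtract scheme on unsigned 32-bit values. The names `udivmod` and `udivmod_t` are hypothetical and do not come from the benchmark sources, and sign handling for C's signed `int` is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type (not from the benchmark sources):
   quotient and remainder of an unsigned division. */
typedef struct {
    uint32_t quot;
    uint32_t rem;
} udivmod_t;

/* Sketch of an iterative division/modulo routine that uses only
   comparison, subtraction, and multiplication/division by 2
   (written as shifts), so no '/' or '%' operator is needed. */
udivmod_t udivmod(uint32_t n, uint32_t d)
{
    assert(d != 0); /* division by zero is not handled in this sketch */

    udivmod_t r = { 0, n };
    uint32_t bit = 1;

    /* Repeatedly multiply the divisor by 2 until it is no longer
       smaller than the dividend. The top-bit check stops the
       doubling before the divisor can overflow. */
    while (d < r.rem && !(d & 0x80000000u)) {
        d <<= 1;
        bit <<= 1;
    }

    /* Repeatedly divide the scaled divisor by 2, subtracting it
       from the running remainder whenever it fits and recording
       the matching quotient bit. */
    while (bit != 0) {
        if (r.rem >= d) {
            r.rem -= d;
            r.quot |= bit;
        }
        d >>= 1;
        bit >>= 1;
    }
    return r;
}
```

Under these assumptions, a benchmark expression such as `a / b` would become `udivmod(a, b).quot` and `a % b` would become `udivmod(a, b).rem`. Each loop iteration then maps to simple shift, compare, and subtract hardware spread over several clock cycles, rather than a single-cycle Verilog divider.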