summaryrefslogtreecommitdiffstats
path: root/evaluation.tex
diff options
context:
space:
mode:
authorJohn Wickerson <j.wickerson@imperial.ac.uk>2021-04-16 18:41:52 +0000
committeroverleaf <overleaf@localhost>2021-04-16 19:26:39 +0000
commit7c43b3885f3b9248170974bc4c2b8a85457c6b40 (patch)
treeb6ddedfbcfd54d16b1af710fa037c5d47c12a700 /evaluation.tex
parent35cd78c44cdc6e2682b64a9781974da01cac8a26 (diff)
downloadoopsla21_fvhls-7c43b3885f3b9248170974bc4c2b8a85457c6b40.tar.gz
oopsla21_fvhls-7c43b3885f3b9248170974bc4c2b8a85457c6b40.zip
Update on Overleaf.
Diffstat (limited to 'evaluation.tex')
-rw-r--r--evaluation.tex6
1 files changed, 3 insertions, 3 deletions
diff --git a/evaluation.tex b/evaluation.tex
index 6cdb546..c682336 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -12,7 +12,7 @@ Our evaluation is designed to answer the following two research questions.
\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 4.0, because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}. We also compare against \legup{} with different optimisation levels. First, we only turn off the LLVM optimisations in \legup{}, to eliminate all the optimisations that are common to standard software compilers. Secondly, we also compare against \legup{} with LLVM optimisations and operation chaining turned off, which is an HLS specific optimisation that combines data-dependent operations into one clock cycle, and therefore dramatically reduces the number of cycles, without necessarily increasing the clock speed.
-\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the \polybench{} benchmark suite (version 4.2.1)~\cite{polybench}, which consists of a collection of 30 numerical kernels. \polybench{} is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
+\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the \polybench{} benchmark suite (version 4.2.1)~\cite{polybench}, which is a collection of 30 numerical kernels. \polybench{} is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, requiring floats, which we do not support.
% Interestingly, we were also unable to evaluate \texttt{cholesky} on \legup{}, since it produce an error during its HLS compilation.
%In summary, we evaluate 27 programs from the latest Polybench suite.
@@ -154,9 +154,9 @@ This gap does not represent the performance cost that comes with formally provin
Instead, it is simply a gap between an unoptimised \vericert{} versus an optimised \legup{}.
As we improve \vericert{} by incorporating further optimisations, this gap should reduce whilst preserving the correctness guarantees.
-Secondly, looking at the maximum clock frequency that each design can achieve, \vericert{} designs can only achieve 8.2$\times$ the maximum clock frequency of \legup{} when division/modulo operations are present. This is in great contrast to the maximum clock frequency that \vericert{} can achieve when no divide/modulus operations are present, where \vericert{} generates designs that are actually 2$\times$ better than the frequency achieved by \legup{} designs. The dramatic discrepancy in performance for the former case can be largely attributed to \vericert{}'s na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 13MHz, while \legup{} managed about 111MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency becomes about 220MHz. This improvement in frequency can maybe be explained by scheduling trying to pack too many instructions into a cycle, or by the fact that \legup{} uses a more involved RAM template so that the hardware produces a dual-port RAM, which can perform two reads and writes per clock cycle.
+Secondly, looking at the maximum clock frequency that each design can achieve, \vericert{} designs can only achieve 8.2$\times$ the maximum clock frequency of \legup{} \JW{That sounds wrong? Shouldn't it be less than legup's fmax?} when division/modulo operations are present. This is in great contrast to the maximum clock frequency that \vericert{} can achieve when no divide/modulus \JW{modulo?} operations are present, where \vericert{} generates designs that are actually 2$\times$ better than the frequency achieved by \legup{} designs. The dramatic discrepancy in performance for the former case can be largely attributed to \vericert{}'s na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 13MHz, while \legup{} managed about 111MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency becomes about 220MHz. This improvement in frequency can maybe be explained by scheduling trying to pack too many instructions into a cycle, or by the fact that \legup{} uses a more involved RAM template so that the hardware produces a dual-port RAM, which can perform two reads and writes per clock cycle.
-Looking at a few benchmarks in particular in Figure~\ref{fig:polybench-nodiv} for some interesting cases. For the trmm benchmark, \vericert{} produces hardware that executes with the same cycle count as \legup{}, and manages to create hardware that achieves twice the frequency compared to \legup{}, thereby actually producing a design that executes twice as fast as \legup{}. Another interesting benchmark is doitgen, where \vericert{} is comparable to \legup{} without LLVM optimisations, however, LLVM optimisations seem to have the a large affect on the cycle count.
+Looking at a few benchmarks in particular in Figure~\ref{fig:polybench-nodiv} for some interesting cases. For the trmm benchmark, \vericert{} produces hardware that executes with the same cycle count as \legup{}, and manages to create hardware that achieves twice the frequency compared to \legup{}, thereby actually producing a design that executes twice as fast as \legup{}. Another interesting benchmark is \JW{tt formatting for benchmark program names?} doitgen, where \vericert{} is comparable to \legup{} without LLVM optimisations, however, LLVM optimisations seem to have a large effect on the cycle count.
\subsection{RQ2: How area-efficient is \vericert{}-generated hardware?}