summaryrefslogtreecommitdiffstats
path: root/evaluation.tex
diff options
context:
space:
mode:
authorYann Herklotz <git@yannherklotz.com>2021-04-13 17:08:31 +0100
committerYann Herklotz <git@yannherklotz.com>2021-04-13 17:08:45 +0100
commit12cf73bfe39d00fea73b17b831e09d419df0e121 (patch)
treec6b263879550f89d8e21da99fd8cbec66b3c4e94 /evaluation.tex
parentb6e4d6c2a90d89d3ab56110e3f621fd92408eae7 (diff)
downloadoopsla21_fvhls-12cf73bfe39d00fea73b17b831e09d419df0e121.tar.gz
oopsla21_fvhls-12cf73bfe39d00fea73b17b831e09d419df0e121.zip
More changes
Diffstat (limited to 'evaluation.tex')
-rw-r--r--evaluation.tex14
1 files changed, 7 insertions, 7 deletions
diff --git a/evaluation.tex b/evaluation.tex
index 9681210..e00062d 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -13,11 +13,11 @@ Our evaluation is designed to answer the following three research questions.
\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 5.1 because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}.
-\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the PolyBench/C benchmark suite (version 4.2.1)~\cite{polybench}, which consists of a collection of 30 numerical kernels. PolyBench/C is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
+\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the \polybench{} benchmark suite (version 4.2.1)~\cite{polybench}, which consists of a collection of 30 numerical kernels. \polybench{} is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, requiring floats, which we do not support.
% Interestingly, we were also unable to evaluate \texttt{cholesky} on \legup{}, since it produce an error during its HLS compilation.
%In summary, we evaluate 27 programs from the latest Polybench suite.
-We configured Polybench's parameters so that only integer types are used, since we do not support floats. We use Polybench's smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
+We configured \polybench{}'s parameters so that only integer types are used, since we do not support floats. We use \polybench{}'s smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
\vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including LegUp, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is overridden with our own implementation that uses repeated division and multiplications by 2. Where this change makes an appreciable difference to the performance results, we give the results for both benchmark sets.
@@ -73,7 +73,7 @@ We configured Polybench's parameters so that only integer types are used, since
\legend{\vericert{},\legup{} w/o opt/chain,\legup{} w/o opt};
\end{groupplot}
\end{tikzpicture}
- \caption{Polybench with division enabled.}
+ \caption{\polybench{} with division enabled.}
\end{figure}
\pgfplotstableread[col sep=comma]{results/rel-time-nodiv.csv}{\nodivtimingtable}
@@ -121,7 +121,7 @@ We configured Polybench's parameters so that only integer types are used, since
\legend{\vericert{},\legup{} w/o opt/chain,\legup{} w/o opt};
\end{groupplot}
\end{tikzpicture}
- \caption{Polybench with division replaced by iterative division algorithm.}
+ \caption{\polybench{} with division replaced by iterative division algorithm.}
\end{figure}
%
@@ -275,7 +275,7 @@ It is notable that even without \vericert{} performing many optimisations, a few
%We are very encouraged by these data points.
As we improve \vericert{} by incorporating further optimisations, this gap should reduce whilst preserving our correctness guarantees.
-Cycle count is one factor in calculating execution times; the other is the clock frequency, which determines the duration of each of these cycles. Figure~\ref{fig:comparison_time} compares the execution times of \vericert{} and \legup{}. Across the original Polybench/C benchmarks, we see that \vericert{} designs are about \slowdownOrig$\times$ slower than \legup{} designs. This dramatic discrepancy in performance can be largely attributed to \vericert's na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 21MHz, while \legup{} managed about 247MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency becomes about 112MHz. This is better, but still substantially below \legup{}, which uses various additional optimisations and Intel-specific IP blocks. Across the modified Polybench/C benchmarks, we see that \vericert{} designs are about \slowdownDiv$\times$ slower than \legup{} designs.
+Cycle count is one factor in calculating execution times; the other is the clock frequency, which determines the duration of each of these cycles. Figure~\ref{fig:comparison_time} compares the execution times of \vericert{} and \legup{}. Across the original \polybench{} benchmarks, we see that \vericert{} designs are about \slowdownOrig$\times$ slower than \legup{} designs. This dramatic discrepancy in performance can be largely attributed to \vericert's na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 21MHz, while \legup{} managed about 247MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency becomes about 112MHz. This is better, but still substantially below \legup{}, which uses various additional optimisations and Intel-specific IP blocks. Across the modified \polybench{} benchmarks, we see that \vericert{} designs are about \slowdownDiv$\times$ slower than \legup{} designs.
\subsection{RQ2: How area-efficient is \vericert{}-generated hardware?}
@@ -333,7 +333,7 @@ Cycle count is one factor in calculating execution times; the other is the clock
%\end{subfigure}
%\end{figure}
-Figure~\ref{fig:comparison_area} compares the resource utilisation of the Polybench programs generated by \vericert{} and \legup{}.
+Figure~\ref{fig:comparison_area} compares the resource utilisation of the \polybench{} programs generated by \vericert{} and \legup{}.
On average, we see that \vericert{} produces hardware that is about $21\times$ larger than \legup{}. \vericert{} designs are filling up to 30\% of a (large) FPGA chip, while \legup{} uses no more than 1\% of the chip.
The main reason for this is that RAM is not inferred automatically for the Verilog that is generated by \vericert{}; instead, large arrays of registers are synthesised.
Synthesis tools such as Quartus generally require array accesses to be in a specific form in order for RAM inference to activate.
@@ -346,7 +346,7 @@ Enabling RAM inference is part of our future plans.
\subsection{RQ3: How long does \vericert{} take to produce hardware?}
-Figure~\ref{fig:comparison_comptime} compares the compilation times of \vericert{} and of \legup{}, with each data point corresponding to one of the PolyBench/C benchmarks. On average, \vericert{} compilation is about $47\times$ faster than \legup{} compilation. \vericert{} is much faster because it omits many of the time-consuming HLS optimisations performed by \legup{}, such as scheduling and memory analysis. This comparison also demonstrates that our fully verified approach does not add substantial overheads in compilation time, since we do not invoke verification for every compilation instance, unlike the approaches based on translation validation that we mentioned in Section~\ref{sec:intro}.
+Figure~\ref{fig:comparison_comptime} compares the compilation times of \vericert{} and of \legup{}, with each data point corresponding to one of the \polybench{} benchmarks. On average, \vericert{} compilation is about $47\times$ faster than \legup{} compilation. \vericert{} is much faster because it omits many of the time-consuming HLS optimisations performed by \legup{}, such as scheduling and memory analysis. This comparison also demonstrates that our fully verified approach does not add substantial overheads in compilation time, since we do not invoke verification for every compilation instance, unlike the approaches based on translation validation that we mentioned in Section~\ref{sec:intro}.
%\NR{Do we want to finish the section off with some highlights or a summary?}