Add some small fixes to the paper

author: Yann Herklotz <git@yannherklotz.com> 2021-07-21 22:54:28 +0200
committer: Yann Herklotz <git@yannherklotz.com> 2021-07-21 22:54:28 +0200
commit: 5ea6be2ade46c7150d33e9fb0c32046be74abb43 (patch)
tree: 5e52ddb0219924eae26edaf26cf4e175bc7010dc /evaluation.tex
parent: ce72ae0d5a441fb4269aa4a4fe1f788333b65e6a (diff)
download: oopsla21_fvhls-5ea6be2ade46c7150d33e9fb0c32046be74abb43.tar.gz oopsla21_fvhls-5ea6be2ade46c7150d33e9fb0c32046be74abb43.zip
1 files changed, 2 insertions, 2 deletions
diff --git a/evaluation.tex b/evaluation.tex
index 6868917..4ac85b4 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -11,7 +11,7 @@ Our evaluation is designed to answer the following three research questions.
 \subsection{Experimental Setup}
 \label{sec:evaluation:setup}
 
-\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 4.0, because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}.  We also compare against \legup{} with different optimisation levels in an effort to understand which optimisations have the biggest impact on the performance discrepancies between \legup{} and \vericert{}.  First, we only turn off the LLVM optimisations in \legup{}, to eliminate all the optimisations that are common to standard software compilers, referred to as \legup{} w/o opt.  Secondly, we also compare against \legup{} with LLVM optimisations and operation chaining turned off, referred to as \legup{} w/o opt+chain. Operation chaining is an HLS-specific optimisation that combines data-dependent operations into one clock cycle, and therefore dramatically reduces the number of cycles, without necessarily decreasing the clock speed.
+\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 4.0, because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}.  We also compare against \legup{} with different optimisation levels in an effort to understand which optimisations have the biggest impact on the performance discrepancies between \legup{} and \vericert{}.  The baseline \legup{} version has all the default automatic optimisations turned on.  The benchmarks are also not manually optimised to run through \legup{} optimally, such as adding pragmas and other manual indications to add further more advanced optimisations.  \vericert{} is also compared with other optimisation levels of \legup{}.  First, we only turn off the LLVM optimisations in \legup{}, to eliminate all the optimisations that are common to standard software compilers, referred to as \legup{} w/o opt.  Secondly, we also compare against \legup{} with LLVM optimisations and operation chaining turned off, referred to as \legup{} w/o opt+chain. Operation chaining is an HLS-specific optimisation that combines data-dependent operations into one clock cycle, and therefore dramatically reduces the number of cycles, without necessarily decreasing the clock speed.
 
 \paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the \polybench{} benchmark suite (version 4.2.1)~\cite{polybench}, which is a collection of 30 numerical kernels. \polybench{} is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
 We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, requiring floats, which we do not support. 
@@ -148,7 +148,7 @@ We configured \polybench{}'s parameters so that only integer types are used.  We
 Firstly, before comparing any performance metrics, it is worth highlighting that any Verilog produced by \vericert{} is guaranteed to be \emph{correct}, whilst no such guarantee can be provided by \legup{}.
 This guarantee in itself provides a significant leap in terms of HLS reliability, compared to any other HLS tools available.
 
-The top graphs of Figure~\ref{fig:polybench-div} and Figure~\ref{fig:polybench-nodiv} compare the cycle counts of the 27 programs executed by \vericert{} and the different optimisation levels of \legup{}.  Each graph uses optimised \legup{} as the baseline.  When division/modulo operations are present \legup{} designs execute around 27$\times$ faster than \vericert{} designs.  However, when division/modulo operations are replaced by the iterative algorithm, \legup{} designs are only 2$\times$ faster than \vericert{} designs.  However, the benchmarks with division/modulo replaced show that \vericert{} actually achieves the same execution speed as \legup{} without LLVM optimisations and without operation chaining, which is encouraging, and shows that the hardware generation is following the right steps.  The execution time is calculated by multiplying the maximum frequency that the FPGA can run at with this design, by the number of clock cycles that are needed to complete the execution.  We can therefore analyse each separately.
+The top graphs of Figure~\ref{fig:polybench-div} and Figure~\ref{fig:polybench-nodiv} compare the execution time of the 27 programs executed by \vericert{} and the different optimisation levels of \legup{}.  Each graph uses optimised \legup{} as the baseline.  When division/modulo operations are present \legup{} designs execute around 27$\times$ faster than \vericert{} designs.  However, when division/modulo operations are replaced by the iterative algorithm, \legup{} designs are only 2$\times$ faster than \vericert{} designs.  However, the benchmarks with division/modulo replaced show that \vericert{} actually achieves the same execution speed as \legup{} without LLVM optimisations and without operation chaining, which is encouraging, and shows that the hardware generation is following the right steps.  The execution time is calculated by multiplying the maximum frequency that the FPGA can run at with this design, by the number of clock cycles that are needed to complete the execution.  We can therefore analyse each separately.
 
 First, looking at the difference in clock cycles, \vericert{} produces designs that have around 4.5$\times$ as many clock cycles as \legup{} designs in both cases, when division/modulo operations are enabled as well as when they are replaced.  This performance gap can be explained in part by LLVM optimisations, which seem to account for a 2$\times$ decrease in clock cycles, as well as operation chaining, which decreases the clock cycles by another 2$\times$.  The rest of the speed-up is mostly due to \legup{} optimisations such as scheduling and memory analysis, which are designed to extract parallelism from input programs.
 This gap does not represent the performance cost that comes with formally proving a HLS tool.
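
Note on the execution-time sentence in the second hunk: as worded, it multiplies the maximum frequency by the cycle count, but dimensionally the time is the cycle count divided by the frequency (equivalently, multiplied by the clock period). The fragment below is a minimal sketch of that relation and of how the quoted ratios decompose. The symbols $t$, $c$, $f$ and the subscripts $V$ (\vericert{}) and $L$ (\legup{}) are introduced here for illustration only and do not appear in the paper. It assumes the amsmath package is loaded, and the derived frequency ratios are only what the rounded figures in the text imply, not measured values.

```latex
% Hedged sketch: t = execution time, c = clock cycles, f = maximum frequency.
% These symbols are assumptions made for this note, not the paper's notation.
\begin{align*}
  t &= \frac{c}{f}, \\
  \frac{t_{V}}{t_{L}} &= \frac{c_{V}}{c_{L}} \cdot \frac{f_{L}}{f_{V}}.
\end{align*}
% Using the approximate ratios quoted in the hunk (c_V / c_L \approx 4.5):
%   division present:  27 \approx 4.5 \cdot (f_L / f_V)  =>  f_L / f_V \approx 6
%   division replaced:  2 \approx 4.5 \cdot (f_L / f_V)  =>  f_L / f_V \approx 0.44
% Cycle gap: 2 (LLVM opts) \cdot 2 (chaining) = 4, leaving roughly 4.5 / 4 \approx 1.1
% for the remaining \legup{} optimisations.
```

Writing the relation as a product of a cycle ratio and a frequency ratio matches the text's remark that the two factors can be analysed separately.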