Figure -> Fig.

author: Yann Herklotz <git@yannherklotz.com> 2021-09-09 21:09:05 +0100
committer: Yann Herklotz <git@yannherklotz.com> 2021-09-09 21:09:05 +0100
commit: 55528c1f1a11988897d46993756d6aa5873095af (patch)
tree: 3f77f427e8aba3712ac9a145a28471c9ddf2ead7 /evaluation.tex
parent: 301291487f61c924ec16ec73c632b19ac5395a6d (diff)
download: oopsla21_fvhls-55528c1f1a11988897d46993756d6aa5873095af.tar.gz
oopsla21_fvhls-55528c1f1a11988897d46993756d6aa5873095af.zip
1 files changed, 6 insertions, 6 deletions
diff --git a/evaluation.tex b/evaluation.tex
index f5deff8..2a98aa7 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -25,7 +25,7 @@ We were able to use 27 of the 30 programs; three had to be discarded (\texttt{co
 %In summary, we evaluate 27 programs from the latest Polybench suite. 
 We configured \polybench{}'s parameters so that only integer types are used.  We use \polybench{}'s smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses. We have not modified the benchmarks to make them run through \legup{} optimally, e.g. by adding pragmas that trigger more advanced optimisations.
 
-\vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including LegUp, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. Implementing pipelined operators such as the divide and modulus operator can be solved by scheduling the instructions so that these can execute in parallel, which is the main optimisation that needs to be added to \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is replaced with our own implementation that uses repeated division and multiplications by 2.  Figure~\ref{fig:polybench-div} shows the results of comparing \vericert{} with optimised LegUp 4.0 on the \polybench{} benchmarks, where divisions have been left intact.  Figure~\ref{fig:polybench-nodiv} performs the comparison where the division/modulo operations have been replaced by the iterative algorithm.
+\vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including LegUp, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. Implementing pipelined operators such as the divide and modulus operator can be solved by scheduling the instructions so that these can execute in parallel, which is the main optimisation that needs to be added to \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is replaced with our own implementation that uses repeated division and multiplications by 2.  Fig.~\ref{fig:polybench-div} shows the results of comparing \vericert{} with optimised LegUp 4.0 on the \polybench{} benchmarks, where divisions have been left intact.  Fig.~\ref{fig:polybench-nodiv} performs the comparison where the division/modulo operations have been replaced by the iterative algorithm.
 
 \paragraph{Synthesis setup} The Verilog that is generated by \vericert{} or \legup{} is provided to Xilinx Vivado v2017.1~\cite{xilinx_vivad_desig_suite}, which synthesises it to a netlist, before placing-and-routing this netlist onto a Xilinx XC7Z020 FPGA device that contains approximately 85000 LUTs.
 
@@ -155,7 +155,7 @@ We configured \polybench{}'s parameters so that only integer types are used.  We
 Firstly, before comparing any performance metrics, it is worth highlighting that any Verilog produced by \vericert{} is guaranteed to be \emph{correct}, whilst no such guarantee can be provided by \legup{}.
 This guarantee in itself provides a significant leap in terms of HLS reliability, compared to any other HLS tools available.
 
-The top graphs of Figure~\ref{fig:polybench-div} and Figure~\ref{fig:polybench-nodiv} compare the execution time of the 27 programs executed by \vericert{} and the different optimisation levels of \legup{}.  Each graph uses optimised \legup{} as the baseline.  When division/modulo operations are present \legup{} designs execute around 27$\times$ faster than \vericert{} designs.  However, when division/modulo operations are replaced by the iterative algorithm, \legup{} designs are only 2$\times$ faster than \vericert{} designs.  However, the benchmarks with division/modulo replaced show that \vericert{} actually achieves the same execution speed as \legup{} without LLVM optimisations and without operation chaining, which is encouraging, and shows that the hardware generation is following the right steps.  The execution time is calculated by multiplying the maximum frequency that the FPGA can run at with this design, by the number of clock cycles that are needed to complete the execution.  We can therefore analyse each separately.
+The top graphs of Fig.~\ref{fig:polybench-div} and Fig.~\ref{fig:polybench-nodiv} compare the execution time of the 27 programs executed by \vericert{} and the different optimisation levels of \legup{}.  Each graph uses optimised \legup{} as the baseline.  When division/modulo operations are present \legup{} designs execute around 27$\times$ faster than \vericert{} designs.  However, when division/modulo operations are replaced by the iterative algorithm, \legup{} designs are only 2$\times$ faster than \vericert{} designs.  However, the benchmarks with division/modulo replaced show that \vericert{} actually achieves the same execution speed as \legup{} without LLVM optimisations and without operation chaining, which is encouraging, and shows that the hardware generation is following the right steps.  The execution time is calculated by multiplying the maximum frequency that the FPGA can run at with this design, by the number of clock cycles that are needed to complete the execution.  We can therefore analyse each separately.
 
 First, looking at the difference in clock cycles, \vericert{} produces designs that have around 4.5$\times$ as many clock cycles as \legup{} designs in both cases, when division/modulo operations are enabled as well as when they are replaced.  This performance gap can be explained in part by LLVM optimisations, which seem to account for a 2$\times$ decrease in clock cycles, as well as operation chaining, which decreases the clock cycles by another 2$\times$.  The rest of the speed-up is mostly due to \legup{} optimisations such as scheduling and memory analysis, which are designed to extract parallelism from input programs.
 This gap does not represent the performance cost that comes with formally proving a HLS tool.
@@ -164,12 +164,12 @@ As we improve \vericert{} by incorporating further optimisations, this gap shoul
 
 Secondly, looking at the maximum clock frequency that each design can achieve, \legup{} designs achieve 8.2$\times$ the maximum clock frequency of \vericert{} when division/modulo operations are present.  This is in great contrast to the maximum clock frequency that \vericert{} can achieve when no divide/modulo operations are present, where \vericert{} generates designs that are actually 2$\times$ better than the frequency achieved by \legup{} designs.  The dramatic discrepancy in performance for the former case can be largely attributed to \vericert{}'s na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 13MHz, while \legup{} managed about 111MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency becomes about 220MHz.  This improvement in frequency can be explained by the fact that \legup{} uses a memory controller to manage multiple RAMs using one interface, which is not needed in \vericert{} as a single RAM is used for the memory.
 
-Looking at a few benchmarks in particular in Figure~\ref{fig:polybench-nodiv} for some interesting cases.  For the \texttt{trmm} benchmark, \vericert{} produces hardware that executes with the same cycle count as \legup{}, and manages to create hardware that achieves twice the frequency compared to \legup{}, thereby actually producing a design that executes twice as fast as \legup{}.  Another interesting benchmark is \texttt{doitgen}, where \vericert{} is comparable to \legup{} without LLVM optimisations, however, LLVM optimisations seem to have a large effect on the cycle count.
+Looking at a few benchmarks in particular in Fig.~\ref{fig:polybench-nodiv} for some interesting cases.  For the \texttt{trmm} benchmark, \vericert{} produces hardware that executes with the same cycle count as \legup{}, and manages to create hardware that achieves twice the frequency compared to \legup{}, thereby actually producing a design that executes twice as fast as \legup{}.  Another interesting benchmark is \texttt{doitgen}, where \vericert{} is comparable to \legup{} without LLVM optimisations, however, LLVM optimisations seem to have a large effect on the cycle count.
 
 \subsection{RQ2: How Area-Efficient is \vericert{}-Generated Hardware?}
 
-The bottom graphs in both Figure~\ref{fig:polybench-div} and Figure~\ref{fig:polybench-nodiv} compare the resource utilisation of the \polybench{} programs generated by \vericert{} and \legup{} at various optimisation levels.
-By looking at the median, when division/modulo operations are enabled, we see that \vericert{} produces hardware that is about the same size as optimised \legup{}, whereas the unoptimised versions of \legup{} actually produce slightly smaller hardware.  This is because optimisations can often increase the size of the hardware to make it faster.  Especially in Figure~\ref{fig:polybench-div}, there are a few benchmarks where the size of the \legup{} design is much smaller than that produced by \vericert{}.  This can mostly be explained because of resource sharing in LegUp.  Division/modulo operations need large circuits, and it is therefore usual to only have one circuit per design.  As \vericert{} uses the na\"ive implementation of division/modulo, there will be multiple circuits present in the design, which blows up the size.  Looking at Figure~\ref{fig:polybench-nodiv}, one can see that without division, the size of \vericert{} designs are almost always around the same size as \legup{} designs, never being more than 2$\times$ larger, and sometimes even being smaller.  The similarity in area also shows that RAM is correctly being inferred by the synthesis tool, and is therefore not implemented as registers.
+The bottom graphs in both Fig.~\ref{fig:polybench-div} and Fig.~\ref{fig:polybench-nodiv} compare the resource utilisation of the \polybench{} programs generated by \vericert{} and \legup{} at various optimisation levels.
+By looking at the median, when division/modulo operations are enabled, we see that \vericert{} produces hardware that is about the same size as optimised \legup{}, whereas the unoptimised versions of \legup{} actually produce slightly smaller hardware.  This is because optimisations can often increase the size of the hardware to make it faster.  Especially in Fig.~\ref{fig:polybench-div}, there are a few benchmarks where the size of the \legup{} design is much smaller than that produced by \vericert{}.  This can mostly be explained because of resource sharing in LegUp.  Division/modulo operations need large circuits, and it is therefore usual to only have one circuit per design.  As \vericert{} uses the na\"ive implementation of division/modulo, there will be multiple circuits present in the design, which blows up the size.  Looking at Fig.~\ref{fig:polybench-nodiv}, one can see that without division, the size of \vericert{} designs are almost always around the same size as \legup{} designs, never being more than 2$\times$ larger, and sometimes even being smaller.  The similarity in area also shows that RAM is correctly being inferred by the synthesis tool, and is therefore not implemented as registers.
 
 \subsection{RQ3: How Quickly does \vericert{} Translate the C into Verilog?}
 
@@ -207,7 +207,7 @@ By looking at the median, when division/modulo operations are enabled, we see th
 
 To gain further confidence that the Verilog designs generated by \vericert{} are actually correct, and that the correctness theorem is indeed effective, we fuzzed \vericert{} using Csmith~\cite{yang11_findin_under_bugs_c_compil}. \citeauthor{yang11_findin_under_bugs_c_compil} previously used Csmith in an extensive fuzzing campaign on CompCert and found a handful of bugs in the unverified parts of that compiler, so it is natural to explore whether it can find bugs in \vericert{} too. \citet{herklotz21_empir_study_reliab_high_level_synth_tools} have recently used Csmith to fuzz other HLS tools including \legup{}, so we configured Csmith in a similar way. In addition to the features turned off by \citeauthor{du21_fuzzin_high_level_synth_tools}, we turned off the generation of global variables and non-32-bit operations. The generated designs were tested by simulating them and comparing the output value to the results of compiling the test-cases with GCC 10.3.0.
 
-The results of the fuzzing run are shown in Figure~\ref{tab:fuzzing}.  Out of 155267 test-cases generated by Csmith, 26\% of them passed, meaning they compiled without error and resulted in the same final value as GCC. Most of the test-cases, 73.97\%, failed at compile time.  The most common reasons for this were unsigned comparisons between integers (\vericert{} requires them to be signed), and the presence of 8-bit operations (which \vericert{} does not support, and which we could not turn off due to a limitation in Csmith). % The latter programs are present because of a slight limitation with how Csmith can be configured, as it does not allow 8-bit operations to be turned off completely, and will get stuck when it generates the C code.  
+The results of the fuzzing run are shown in Fig.~\ref{tab:fuzzing}.  Out of 155267 test-cases generated by Csmith, 26\% of them passed, meaning they compiled without error and resulted in the same final value as GCC. Most of the test-cases, 73.97\%, failed at compile time.  The most common reasons for this were unsigned comparisons between integers (\vericert{} requires them to be signed), and the presence of 8-bit operations (which \vericert{} does not support, and which we could not turn off due to a limitation in Csmith). % The latter programs are present because of a slight limitation with how Csmith can be configured, as it does not allow 8-bit operations to be turned off completely, and will get stuck when it generates the C code.
 Because the test-cases generated by Csmith could not be tailored exactly to the C fragment that \vericert{} supports, such a high compile-time failure rate is not unexpected. Finally, and most interestingly, there were a total of 39 run-time failures, which the correctness theorem should be proving impossible.  However, all 39 of these failures are due to a bug in the pretty-printing of the final Verilog code, where a logical negation (\texttt{!}) was accidentally used instead of a bitwise negation (\verb|~|).  Once this bug was fixed, all test-cases passed.
 
 %%% Local Variables:
author	Yann Herklotz <git@yannherklotz.com>	2021-09-09 21:09:05 +0100
committer	Yann Herklotz <git@yannherklotz.com>	2021-09-09 21:09:05 +0100
commit	55528c1f1a11988897d46993756d6aa5873095af (patch)
tree	3f77f427e8aba3712ac9a145a28471c9ddf2ead7 /evaluation.tex
parent	301291487f61c924ec16ec73c632b19ac5395a6d (diff)
download	oopsla21_fvhls-55528c1f1a11988897d46993756d6aa5873095af.tar.gz oopsla21_fvhls-55528c1f1a11988897d46993756d6aa5873095af.zip