authorYann Herklotz <git@yannherklotz.com>2021-08-08 18:48:21 +0200
committerYann Herklotz <git@yannherklotz.com>2021-08-08 18:48:21 +0200
commitaeb8620c1f530d5a43302ea4333fa6abdc951a25 (patch)
treef703e76afd17f5a2ba90a3ff712356cf55410d82
parent8f7485fa0209cc5857c64c700feee56640d73893 (diff)
downloadoopsla21_fvhls-aeb8620c1f530d5a43302ea4333fa6abdc951a25.tar.gz
oopsla21_fvhls-aeb8620c1f530d5a43302ea4333fa6abdc951a25.zip
Fix some more comments
-rw-r--r--algorithm.tex6
-rw-r--r--evaluation.tex13
-rw-r--r--main.tex4
-rw-r--r--proof.tex12
-rw-r--r--references.bib28
5 files changed, 50 insertions, 13 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 728ea86..222849d 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -22,7 +22,7 @@ Verilog~\cite{06_ieee_stand_veril_hardw_descr_languag} is an HDL that can be syn
\paragraph{Choice of implementation language}
We chose Coq as the implementation language because of its mature support for code extraction; that is, its ability to generate OCaml programs directly from the definitions used in the theorems.
We note that other authors have had some success reasoning about the HLS process using other theorem provers such as Isabelle~\cite{ellis08}.
-\compcert{}~\cite{leroy09_formal_verif_realis_compil} was chosen as the front end because it has a mature \JW{We used `mature' a couple of sentences ago. Maybe change this second one to `well established'?} framework for simulation proofs about intermediate languages, and it already provides a validated C parser~\cite{jourdan12_valid_lr_parser}.
+\compcert{}~\cite{leroy09_formal_verif_realis_compil} was chosen as the front end because it has a well established framework for simulation proofs about intermediate languages, and it already provides a validated C parser~\cite{jourdan12_valid_lr_parser}.
The Vellvm framework~\cite{zhao12_formal_llvm_inter_repres_verif_progr_trans} was also considered because several existing HLS tools are already LLVM-based, but additional work would be required to support a high-level language like C as input.
The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\cite{kiwi}, and LLHD~\cite{schuiki20_llhd} has been recently proposed as an intermediate language for hardware design, but neither are suitable for us because they lack formal semantics.
@@ -68,7 +68,7 @@ The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\c
\end{figure}
\paragraph{Architecture of \vericert{}}
-The main work flow of \vericert{} is given in Figure~\ref{fig:rtlbranch}, which shows those parts of the translation that are performed in \compcert{}, and those that have been added.\NR{What is the extra edge labelled RAM insertion in Fig.~\ref{fig:rtlbranch}? You might want to add a sentence about it in this section.} \JW{good point. another option would be to remove that edge if we dont want to talk about RAM inference right now.}
+The main workflow of \vericert{} is given in Figure~\ref{fig:rtlbranch}, which shows those parts of the translation that are performed in \compcert{}, and those that have been added. This includes translations to two new intermediate languages added in \vericert{}, HTL and Verilog, as well as an additional optimisation pass labelled as ``RAM insertion''.
\def\numcompcertlanguages{ten}
@@ -396,7 +396,7 @@ Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is al
\caption{Timing diagrams showing the execution of loads and stores over multiple clock cycles.}\label{fig:ram_load_store}
\end{figure}
-\JW{The following paragraph could probably be cut, as the same explanation is already in the Figure 4 caption, and replaced with something like ``Figure~\ref{fig:ram_load_store} gives an example of how the RAM interface behaves when values are loaded and stored.''}\YH{Ah ok, yes sure, I just had it there to explain the figure in case some readers are unfamiliar with timing diagrams, but it's true that' it's already in the caption.} Figure~\ref{fig:ram_load} shows an example of how the waveforms in the RAM in Figure~\ref{fig:accumulator_v} behave when a value is loaded. To initiate a load, the data-path enable signal \texttt{u\_en} flag is toggled, the address \texttt{addr} is set and the write enable \texttt{wr\_en} is set to low. This all happens at the positive edge of the clock, at time slice 1. Then, on the next negative edge of the clock, at time slice 2, the \texttt{u\_en} is now different from the RAM enable \texttt{en}, so the RAM is enabled. A load is then performed by assigning the \texttt{d\_out} register to the value stored at the address in the RAM and the \texttt{en} is set to the same value as \texttt{u\_en} to disable the RAM again. Finally, on the next positive edge of the clock, the value in \texttt{d\_out} is assigned to the destination register \texttt{r}. An example of a store is shown in Figure~\ref{fig:ram_store}. The \texttt{d\_in} register is assigned the value to be stored. The store is then performed on the negative edge of the clock and is therefore complete by the next positive edge.
+Figure~\ref{fig:ram_load_store} gives an example of how the RAM interface behaves when values are loaded and stored.
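The toggled-enable protocol that Figure~\ref{fig:ram_load_store} depicts can also be sketched in software. The following Python model is purely illustrative: the names \texttt{u\_en}, \texttt{en}, \texttt{wr\_en} and \texttt{d\_out} follow the text, but the class and helper function are hypothetical and not part of \vericert{}'s implementation.

```python
# Illustrative model of the toggled-enable RAM protocol described above.
# The RAM acts on the negative clock edge whenever the user-side enable
# u_en differs from its own enable en, then re-synchronises en to u_en
# so that it disables itself until the next toggle.

class ToggledRam:
    def __init__(self, size):
        self.mem = [0] * size
        self.en = 0       # RAM-side enable
        self.d_out = 0    # data output register

    def negedge(self, u_en, wr_en, addr, d_in):
        if u_en != self.en:
            if wr_en:
                self.mem[addr] = d_in   # store
            else:
                self.d_out = self.mem[addr]  # load
            self.en = u_en              # disable RAM again

def load(ram, u_en, addr):
    u_en = 1 - u_en                 # posedge 1: toggle u_en, set addr
    ram.negedge(u_en, 0, addr, 0)   # negedge:   RAM performs the read
    return u_en, ram.d_out          # posedge 2: d_out -> destination reg

ram = ToggledRam(8)
ram.mem[3] = 42
u_en, value = load(ram, 0, 3)
```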
\subsubsection{Implementing the \texttt{Oshrximm} instruction}\label{sec:algorithm:optimisation:oshrximm}
diff --git a/evaluation.tex b/evaluation.tex
index a80b0fc..e1719e3 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -14,10 +14,12 @@ Our evaluation is designed to answer the following three research questions.
\newcommand\legupnooptchain{\legup{} no-opt no-chaining}
\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 4.0, because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}. We also compare against \legup{} with different optimisation levels in an effort to understand which optimisations have the biggest impact on the performance discrepancies between \legup{} and \vericert{}. The baseline \legup{} version has all the default automatic optimisations turned on. % \vericert{} is also compared with other optimisation levels of \legup{}. %JW: removed because we said that a couple of sentences ago.
-First, we only turn off the LLVM optimisations in \legup{}, to eliminate all the optimisations that are common to standard software compilers, referred to as `\legupnoopt{}'. Secondly, we also compare against \legup{} with LLVM optimisations and operation chaining turned off, referred to as `\legupnooptchain{}'. Operation chaining \JW{Should we cite https://ieeexplore.ieee.org/document/4397305 here? Do you think that's the right reference for op-chaining?}\NR{Interesting paper, but I am not sure if it is the seminal paper for chaining because of the year (2007).} is an HLS-specific optimisation that combines data-dependent operations into one clock cycle, and therefore dramatically reduces the number of cycles, without necessarily decreasing the clock speed.
+First, we only turn off the LLVM optimisations in \legup{}, to eliminate all the optimisations that are common to standard software compilers, referred to as `\legupnoopt{}'. Secondly, we also compare against \legup{} with LLVM optimisations and operation chaining turned off, referred to as `\legupnooptchain{}'. Operation chaining~\cite{paulin89_sched_bindin_algor_high_level_synth,venkataramani07_operat} is an HLS-specific optimisation that combines data-dependent operations into one clock cycle, and therefore dramatically reduces the number of cycles, without necessarily decreasing the clock speed.
+
+% \JW{Should we cite https://ieeexplore.ieee.org/document/4397305 here? Do you think that's the right reference for op-chaining?}\NR{Interesting paper, but I am not sure if it is the seminal paper for chaining because of the year (2007).}
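To give a rough intuition for why operation chaining reduces cycle counts, the following Python sketch greedily packs a linear chain of data-dependent operations into clock cycles. The scheduler and the delay values are hypothetical illustrations, not \legup{}'s actual algorithm.

```python
# Hypothetical sketch of operation chaining: pack a chain of dependent
# operations into one clock cycle as long as their combined
# combinational delay still fits within the clock period.

def schedule(delays, clock_period, chaining=True):
    """Return the number of cycles needed for a linear dependence chain."""
    cycles, used = 0, None
    for d in delays:
        if chaining and used is not None and used + d <= clock_period:
            used += d        # chain into the current cycle
        else:
            cycles += 1      # start a new cycle for this operation
            used = d
    return cycles

chain = [2.0, 3.0, 2.5, 1.5]   # illustrative delays (ns) of dependent ops
no_chain = schedule(chain, clock_period=10.0, chaining=False)
with_chain = schedule(chain, clock_period=10.0, chaining=True)
```

With chaining disabled every operation occupies its own cycle; with chaining enabled the whole chain fits in one 10\,ns period, without the clock speed having to decrease.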
\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the \polybench{} benchmark suite (version 4.2.1)~\cite{polybench}, which is a collection of 30 numerical kernels. \polybench{} is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
-We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, requiring floats, which we do not support.
+We were able to use 27 of the 30 programs; three had to be discarded (\texttt{cor\-re\-la\-tion},~\texttt{gram\-schmi\-dt} and~\texttt{de\-riche}) because they involve square roots, requiring floats, which we do not support.
% Interestingly, we were also unable to evaluate \texttt{cholesky} on \legup{}, since it produce an error during its HLS compilation.
%In summary, we evaluate 27 programs from the latest Polybench suite.
We configured \polybench{}'s parameters so that only integer types are used. We use \polybench{}'s smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses. We have not modified the benchmarks to make them run through \legup{} optimally, e.g. by adding pragmas that trigger more advanced optimisations.
@@ -47,7 +49,7 @@ We configured \polybench{}'s parameters so that only integer types are used. We
vertical sep=5pt,
},
ymode=log,
- ybar=0pt,
+ ybar=0.4pt,
width=1\textwidth,
height=0.4\textwidth,
/pgf/bar width=3pt,
@@ -88,8 +90,9 @@ We configured \polybench{}'s parameters so that only integer types are used. We
\legend{\vericert{},\legupnooptchain{},\legupnoopt{}};
\end{groupplot}
\end{tikzpicture}
- \caption{Performance of \vericert{} compared to \legup{}, with division and modulo operations enabled. The top graph compares the execution times and the bottom graph compares the area of the generated designs. In both cases, the performance of \vericert{}, \legup{} without LLVM optimisations and without operation chaining, and \legup{} without LLVM optimisations is compared against default \legup{}.\NR{Is it just my eyes or are the bars overlapping per group? Is that intentional?}}\label{fig:polybench-div}
+ \caption{Performance of \vericert{} compared to \legup{}, with division and modulo operations enabled. The top graph compares the execution times and the bottom graph compares the area of the generated designs. In both cases, the performance of \vericert{}, \legup{} without LLVM optimisations and without operation chaining, and \legup{} without LLVM optimisations is compared against default \legup{}.}\label{fig:polybench-div}
\end{figure}
+%\NR{Is it just my eyes or are the bars overlapping per group? Is that intentional?}
\pgfplotstableread[col sep=comma]{results/rel-time-nodiv.csv}{\nodivtimingtable}
\pgfplotstableread[col sep=comma]{results/rel-size-nodiv.csv}{\nodivslicetable}
@@ -104,7 +107,7 @@ We configured \polybench{}'s parameters so that only integer types are used. We
vertical sep=5pt,
},
ymode=log,
- ybar=0pt,
+ ybar=0.4pt,
ytick={0.5,1,2,4,8},
width=1\textwidth,
height=0.4\textwidth,
diff --git a/main.tex b/main.tex
index 8fbb630..0cecd16 100644
--- a/main.tex
+++ b/main.tex
@@ -1,5 +1,5 @@
%% For double-blind review submission, w/o CCS and ACM Reference (max submission space)
-\documentclass[acmsmall,10pt,anonymous,review,pagebackref=true]{acmart}\settopmatter{printfolios=true,printccs=false,printacmref=false}
+\documentclass[acmsmall,10pt,review,pagebackref=true]{acmart}\settopmatter{printfolios=true,printccs=false,printacmref=false}
%\documentclass[pagebackref=true,acmsmall,review]{acmart}\settopmatter{printfolios=true,printccs=false,printacmref=false}
%% For double-blind review submission, w/ CCS and ACM Reference
%\documentclass[acmsmall,review,anonymous]{acmart}\settopmatter{printfolios=true}
@@ -57,7 +57,7 @@
\usemintedstyle{manni}
\newif\ifANONYMOUS
-\ANONYMOUStrue
+\ANONYMOUSfalse
\newif\ifCOMMENTS
\COMMENTStrue
diff --git a/proof.tex b/proof.tex
index e3546ab..7fb1048 100644
--- a/proof.tex
+++ b/proof.tex
@@ -18,7 +18,13 @@ Together, these differences mean that translating 3AC directly to Verilog is inf
\subsection{Formulating the correctness theorem}
-The main correctness theorem is analogous to that stated in \compcert{}~\cite{leroy09_formal_verif_realis_compil}: for all Clight source programs $C$, if the translation to the target Verilog code succeeds, and $C$ has safe observable behaviour $B$ when executed, then the target Verilog code will have the same behaviour $B$. Here, a `safe' execution is one that either converges or diverges, but does not ``go wrong''. If the program does admit some wrong behaviour (like undefined behaviour in C), the correctness theorem does not apply. A behaviour, then, is either a final state (in the case of convergence) or divergence. In \compcert{}, a behaviour is also associated with a trace of I/O events, but since external function calls are not supported in \vericert{}, this trace will always be empty. This correctness theorem is also appropriate for HLS \JW{Perhaps it would be worth first explaining why somebody might think this correctness theorem might \emph{not} be appropriate for HLS. At the moment, it feels like you're giving the answer without saying the question. Is it to do with the fact that hardware tends to run forever?}\YH{Yes definitely, will add that.}, as HLS is often used as a part of a larger hardware design that is connected together using a hardware description language like Verilog. This means that HLS designs are normally triggered multiple times and results are returned each time when the computation terminates, which is the property that the correctness theorem states. Note that the compiler is allowed to fail and not produce any output; the correctness theorem only applies when the translation succeeds.
+The main correctness theorem is analogous to that stated in \compcert{}~\cite{leroy09_formal_verif_realis_compil}: for all Clight source programs $C$, if the translation to the target Verilog code succeeds, and $C$ has safe observable behaviour $B$ when executed, then the target Verilog code will have the same behaviour $B$. Here, a `safe' execution is one that either converges or diverges, but does not ``go wrong''. If the program does admit some wrong behaviour (like undefined behaviour in C), the correctness theorem does not apply. A behaviour, then, is either a final state (in the case of convergence) or divergence. In \compcert{}, a behaviour is also associated with a trace of I/O events, but since external function calls are not supported in \vericert{}, this trace will always be empty. But is this correctness theorem also appropriate for HLS?
+
+It may be argued that hardware inherently runs forever and therefore does not produce a definitive final result. This means that the \compcert{} correctness theorem would likely not help with proving that the hardware is actually working correctly, as the behaviour would always be divergent. However, in practice, an HLS design is not normally the top level of a design; it is the top level that connects to other components and would therefore run forever. Instead, HLS designs are often used within larger hardware designs as smaller components that take an input, execute, and then terminate with an answer. To start the execution of the hardware and to signal to the HLS component that the inputs are ready, the \textit{rst} signal is set and unset. Then, once the result is ready, the \textit{fin} signal is set and the result value is placed in \textit{ret}. These signals are also present in the semantics of execution shown in Figure~\ref{fig:inference_module}. The correctness theorem therefore also refers to these signals, and the proof shows that once the \textit{fin} flag is set, the value in \textit{ret} is correct according to the semantics of Verilog and Clight. Note that the compiler is allowed to fail and not produce any output; the correctness theorem only applies when the translation succeeds.
+
+%This correctness theorem is also appropriate for HLS, as HLS is often used as a part of a larger hardware design that is connected together using a hardware description language like Verilog. This means that HLS designs are normally triggered multiple times and results are returned each time when the computation terminates, which is the property that the correctness theorem states.
+
+%\JW{Perhaps it would be worth first explaining why somebody might think this correctness theorem might \emph{not} be appropriate for HLS. At the moment, it feels like you're giving the answer without saying the question. Is it to do with the fact that hardware tends to run forever?}\YH{Yes definitely, will add that.}
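As a sketch, the shape of the correctness statement mirrors \compcert{}'s backwards simulation; the exact notation used in the Coq development may differ.

```latex
% Sketch of the correctness statement, mirroring CompCert's;
% the notation in the actual Coq development may differ.
\begin{equation*}
  \forall C\; V\; B,\quad
  \mathrm{HLS}(C) = \mathrm{OK}(V) \;\land\; \mathrm{Safe}(C)
  \implies (C \Downarrow B \implies V \Downarrow B)
\end{equation*}
```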
%The following `backwards simulation' theorem describes the correctness theorem, where $\Downarrow$ stands for simulation and execution respectively.
@@ -233,9 +239,9 @@ The final lemma we need is that the Verilog we generate is deterministic. This r
\label{tab:proof_statistics}
\end{table*}
-The lines of code for the implementation and proof of \vericert{} can be found in Table~\ref{tab:proof_statistics}. Overall, it took about 1.5 person-years to build \vericert{} -- about three person-months on implementation and 15 person-months on proofs. The largest proof is the correctness proof for the HTL generation, which required equivalence proofs between all integer operations supported by \compcert{} and those supported in hardware. From the 3069 lines of proof code in the HTL generation, 1189 are for the correctness proof of just the load and store instructions. These were tedious to prove correct because of the substantial difference between the memory models used, and the need to prove properties such as stores outside of the allocated memory being undefined, so that a finite array could be used. In addition to that, since pointers in HTL and Verilog are represented as integers, whereas there is a separate `pointer' value in the \compcert{} semantics, it was painful to reason about them and many new theorems had to be proven about integers and pointers in \vericert{}. In addition to that, \JW{Repeated `In addition to that'. I quite like `Moreover', for a bit of variety.} the second-largest proof of the correct RAM generation includes many proofs about the extensional equality of array operations, such as merging arrays with different assignments. As the negative edge implies two merges take place every clock cycle, the proofs about the equality of the arrays becomes more tedious as well.
+The lines of code for the implementation and proof of \vericert{} can be found in Table~\ref{tab:proof_statistics}. Overall, it took about 1.5 person-years to build \vericert{} -- about three person-months on implementation and 15 person-months on proofs. The largest proof is the correctness proof for the HTL generation, which required equivalence proofs between all integer operations supported by \compcert{} and those supported in hardware. From the 3069 lines of proof code in the HTL generation, 1189 are for the correctness proof of just the load and store instructions. These were tedious to prove correct because of the substantial difference between the memory models used, and the need to prove properties such as stores outside of the allocated memory being undefined, so that a finite array could be used. In addition to that, since pointers in HTL and Verilog are represented as integers, whereas there is a separate `pointer' value in the \compcert{} semantics, it was painful to reason about them and many new theorems had to be proven about integers and pointers in \vericert{}. Moreover, the second-largest proof, of the correct RAM generation, includes many proofs about the extensional equality of array operations, such as merging arrays with different assignments. As the negative edge implies two merges take place every clock cycle, the proofs about the equality of the arrays become more tedious as well.
-Looking at the trusted \JW{computing?} base of \vericert{}, the Verilog semantics are 431 lines of code. This, together with the Clight semantics from \compcert{}, are the only parts of the compiler that need to be trusted. Compared to the 1721 lines of the implementation that are written in Coq, which are the verified parts of the HLS tool, this is larger than the 431 lines of Verilog semantics specification, even if the Clight semantics are added. In addition to that, reading semantics specifications is simpler than trying to understand algorithms, meaning the trusted base has been successfully reduced.
+Looking at the trusted computing base of \vericert{}, the Verilog semantics are 431 lines of code. These, together with the Clight semantics from \compcert{}, are the only parts of the compiler that need to be trusted. By contrast, the 1721 lines of the implementation that are written in Coq, which constitute the verified parts of the HLS tool, are substantially larger than the semantics specification, even if the Clight semantics are added. Moreover, reading semantics specifications is simpler than trying to understand algorithms, meaning the trusted base has been successfully reduced.
%\JW{Can we include a comment about the size of the trusted base, in case we get that reviewer again?}
diff --git a/references.bib b/references.bib
index 1100fef..1da9d3b 100644
--- a/references.bib
+++ b/references.bib
@@ -953,3 +953,31 @@ keywords = {hardware verification, hardware synthesis, compiler verification},
location = {Virtual, Denmark},
series = {CPP 2021}
}
+
+@inproceedings{paulin89_sched_bindin_algor_high_level_synth,
+ author = {Paulin, P. G. and Knight, J. P.},
+ title = {Scheduling and Binding Algorithms for High-Level Synthesis},
+ booktitle = {Proceedings of the 26th ACM/IEEE Design Automation Conference},
+ year = 1989,
+ pages = {1-6},
+ doi = {10.1145/74382.74383},
+ url = {https://doi.org/10.1145/74382.74383},
+ address = {New York, NY, USA},
+ isbn = 0897913108,
+ location = {Las Vegas, Nevada, USA},
+ numpages = 6,
+ publisher = {Association for Computing Machinery},
+ series = {DAC '89},
+}
+
+@inproceedings{venkataramani07_operat,
+ keywords = {operation chaining},
+ author = {Girish Venkataramani and Goldstein, Seth C.},
+ booktitle = {2007 IEEE/ACM International Conference on Computer-Aided Design},
+ title = {Operation chaining asynchronous pipelined circuits},
+ year = {2007},
+ volume = {},
+ number = {},
+ pages = {442-449},
+ doi = {10.1109/ICCAD.2007.4397305}
+}