5 files changed, 19 insertions, 13 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 6303173..c4c91a4 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -1,7 +1,7 @@
 \section{Designing a verified HLS tool}
 \label{sec:design}
 
-This section describes the main architecture of the HLS tool, and the way in which the Verilog back end was added to \compcert{}.  This section will also cover an example of converting a simple C program into hardware, expressed in the Verilog language.
+This section describes the main architecture of the HLS tool, and the way in which the Verilog back end was added to \compcert{}.  This section also covers an example of converting a simple C program into hardware, expressed in the Verilog language.
 
 \paragraph{Choice of source language}
 C was chosen as the source language as it remains the most common source language amongst production-quality HLS tools~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because it is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}, lending a degree of practicality. 
@@ -156,7 +156,7 @@ endmodule
 \end{figure}
 
 \subsection{Translating C to Verilog, by example}
-Figure~\ref{fig:accumulator_c_rtl} illustrates the translation of a simple program that sums the elements of an array.
+Figure~\ref{fig:accumulator_c_rtl} illustrates the translation of a simple program that stores and retrieves values from an array.
 In this section, we describe the stages of the \vericert{} translation, referring to this program as an example.
 
 \subsubsection{Translating C to 3AC}
@@ -293,7 +293,7 @@ Typically, HLS-generated hardware consists of a sea of registers and RAM memorie
 This memory view is very different to the C memory model, so we perform the following translation. 
 Variables that do not have their address taken are kept in registers, which correspond to the registers in 3AC.
 All address-taken variables, arrays, and structs are kept in RAM.
-The stack of the main function becomes an unpacked array of 32-bit integers, which is translated to a RAM when the hardware description is passed through a synthesis tool.  In a first pass, RAM is represented by direct reads and writes to an array, however, in a RAM insertion pass a proper RAM interface is added, and reads and writes are performed through that interface.
+The stack of the main function becomes an unpacked array of 32-bit integers, which is translated to a RAM when the hardware description is passed through a synthesis tool.  In a first pass, \vericert{} represents the RAM by direct reads and writes to an array. This is followed by a RAM insertion pass which inserts a proper RAM interface, and reads and writes are performed through that interface instead.  This interface is shown on lines 9-15 in the Verilog code in Figure~\ref{fig:accumulator_v}.
 Finally, global variables are not translated in \vericert{} at the moment.
 A high-level overview of the architecture can be seen in Figure~\ref{fig:accumulator_diagram}. 
 
@@ -332,7 +332,8 @@ One big difference between C and Verilog is how memory is represented.  Although
 However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}.  If a byte-addressable memory was used in the target hardware, which is closer to \compcert{}'s memory model, then a load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle.  It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware.  Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way.  Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four.  If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
 
 \subsubsection{Implementation of RAM interface}
-The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.  Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.
+The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.
+Interestingly, the syntax for the RAM interface is quite strict, as the synthesis tool will pattern match on it and only work for a predefined set of interfaces.  Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.
 
 Therefore, an extra compiler pass is added from HTL to HTL to extract all the direct accesses to the Verilog array and replace them by signals which access the RAM interface in a separate always-block. The translation is performed by going through all the instructions and replacing each load and store expression in turn.  Stores can simply be replaced by the necessary wires directly, however, because loads using the RAM interface take two clock cycles where a direct load from an array takes only one, this pass inserts an extra state after each load.
 
diff --git a/evaluation.tex b/evaluation.tex
index 5dcd49c..1d800e4 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -5,6 +5,7 @@ Our evaluation is designed to answer the following two research questions.
 \begin{description}
 \item[RQ1] How fast is the hardware generated by \vericert{}?
 \item[RQ2] How area-efficient is the hardware generated by \vericert{}?
+\item[RQ3] How quickly does \vericert{} translate the C into Verilog?
 \end{description}
 
 \subsection{Experimental Setup}
@@ -163,6 +164,10 @@ Looking at a few benchmarks in particular in Figure~\ref{fig:polybench-nodiv} fo
 The bottom graphs in both Figure~\ref{fig:polybench-div} and Figure~\ref{fig:polybench-nodiv} compare the resource utilisation of the \polybench{} programs generated by \vericert{} and \legup{} at various optimisation levels.
 By looking at the median, when division/modulo operations are enabled, we see that \vericert{} produces hardware that is about the same size as optimised \legup{}, whereas the unoptimised versions of \legup{} actually produce slightly smaller hardware.  This is because optimisations can often increase the size of the hardware to make it faster.  Especially in Figure~\ref{fig:polybench-div}, there are a few benchmarks where the size of the \legup{} design is much smaller than that produced by \vericert{}.  This can mostly be explained because of resource sharing in LegUp.  Division/modulo operations need large circuits, and it is therefore usual to only have one circuit per design.  As \vericert{} uses the na\"ive implementation of division/modulo, there will be multiple circuits present in the design, which blows up the size.  Looking at Figure~\ref{fig:polybench-nodiv}, one can see that without division, the size of \vericert{} designs are almost always around the same size as \legup{} designs, never being more than 2$\times$ larger, and sometimes even being smaller.  The similarity in area also shows that RAM is correctly being inferred by the synthesis tool, and is therefore not implemented as registers.
 
+\subsection{RQ3: How quickly does \vericert{} translate the C into Verilog?}
+
+\vericert{} takes around 0.1$\times$ as long as \legup{} to perform the translation from C into Verilog, showing at least that verification does not have a substantial effect on the run-time of the HLS tool.  However, this is a meaningless victory, as a lot of the extra time that \legup{} uses is on performing computationally heavy optimisations such as loop pipelining and scheduling.
+
 %%% Local Variables:
 %%% mode: latex
 %%% TeX-master: "main"
diff --git a/main.tex b/main.tex
index d7583fc..12ee8b8 100644
--- a/main.tex
+++ b/main.tex
@@ -139,7 +139,7 @@
 \email{j.wickerson@imperial.ac.uk}
 
 \begin{abstract}
-  High-level synthesis (HLS), which refers to the automatic compilation of software into hardware, is rapidly gaining popularity.  In a world increasingly reliant on application-specific hardware accelerators, HLS promises hardware designs of comparable performance and energy efficiency to those coded by hand in a hardware description language such as Verilog, while maintaining the convenience and the rich ecosystem of software development.  However, current HLS tools cannot always guarantee that the hardware designs they produce are equivalent to the software they were given, thus undermining any reasoning conducted at the software level.  Worse, there is mounting evidence that existing HLS tools are quite unreliable, sometimes generating wrong hardware or crashing when given valid inputs.
+  High-level synthesis (HLS), which refers to the automatic compilation of software into hardware, is rapidly gaining popularity.  In a world increasingly reliant on application-specific hardware accelerators, HLS promises hardware designs of comparable performance and energy efficiency to those coded by hand in a hardware description language such as Verilog, while maintaining the convenience and the rich ecosystem of software development.  However, current HLS tools cannot always guarantee that the hardware designs they produce are equivalent to the software they were given, thus undermining any reasoning conducted at the software level.  Furthermore, there is mounting evidence that existing HLS tools are quite unreliable, sometimes generating wrong hardware or crashing when given valid inputs.
 
   To address this problem, we present the first HLS tool that is mechanically verified to preserve the behaviour of its input software.  Our tool, called \vericert{}, extends the \compcert{} verified C compiler with a new hardware-oriented intermediate language and a Verilog back end, and has been proven correct in Coq. \vericert{} supports most C constructs, including all integer operations, function calls, local arrays, structs, unions and general control-flow statements.  An evaluation on the PolyBench/C benchmark suite indicates that \vericert{} generates hardware that is around an order of magnitude slower (only around 2$\times$ slower in the absence of division) and about the same size as hardware generated by an existing, optimising (but unverified) HLS tool.
 \end{abstract}
diff --git a/proof.tex b/proof.tex
index 1a04fc7..1668d88 100644
--- a/proof.tex
+++ b/proof.tex
@@ -14,10 +14,9 @@ The main correctness theorem is analogous to that stated in \compcert{}~\cite{le
 \end{theorem}
 
 The theorem is a `backwards simulation' result (from target to source), following the terminology used in the \compcert{} literature.  The theorem does not demand the `if' direction too, because compilers are permitted to resolve any non-determinism present in their source programs.
-
 In practice, Clight programs are all deterministic, as are the Verilog programs in the fragment we consider. This means that we can prove the correctness theorem above by first inverting it to become a forwards simulation result, following standard \compcert{} practice.
 
-The second observation that needs to be made is that to prove this forward simulation, it suffices to prove forward simulations between each intermediate language, as these results can be composed to prove the correctness of the whole HLS tool. 
+Furthermore, to prove the forward simulation, it suffices to prove forward simulations between each intermediate language, as these results can be composed to prove the correctness of the whole HLS tool. 
 The forward simulation from 3AC to HTL is stated in Lemma~\ref{lemma:htl} (Section~\ref{sec:proof:3ac_htl}), the forward simulation for the RAM insertion is shown in Lemma~\ref{lemma:htl_ram} (Section~\ref{sec:proof:ram_insertion}), then the forward simulation between HTL and Verilog is shown in Lemma~\ref{lemma:verilog} (Section~\ref{sec:proof:htl_verilog}) and finally, the proof that Verilog is deterministic is given in Lemma~\ref{lemma:deterministic} (Section~\ref{sec:proof:deterministic}).
 
 \subsection{Forward simulation from 3AC to HTL}\label{sec:proof:3ac_htl}
diff --git a/verilog.tex b/verilog.tex
index 9b19b87..d144f90 100644
--- a/verilog.tex
+++ b/verilog.tex
@@ -2,7 +2,7 @@
 
 \newcommand{\alwaysblock}{always-block}
 
-This section describes the Verilog semantics that was chosen for the target language, including the changes that were made to the semantics to make it a suitable HLS target.  The Verilog standard is quite large~\cite{06_ieee_stand_veril_hardw_descr_languag,05_ieee_stand_veril_regis_trans_level_synth}, but the syntax and semantics can be reduced to a small subset that \vericert{} needs to target.
+This section describes the Verilog semantics that was chosen for the target language, including the changes that were made to the semantics to make it a suitable HLS target.  The Verilog standard is quite large~\cite{06_ieee_stand_veril_hardw_descr_languag,05_ieee_stand_veril_regis_trans_level_synth}, but the syntax and semantics can be reduced to a small subset that \vericert{} needs to target.  This section  also describes how \vericert{}'s representation of memory differs from \compcert{}'s memory model.
 
 The Verilog semantics we use is ported to Coq from a semantics written in HOL4 by \citet{loow19_proof_trans_veril_devel_hol} and used to prove the translation from HOL4 to Verilog~\cite{loow19_verif_compil_verif_proces}. % which was used to create a formal translation from a logic representation encoded in the HOL4~\cite{slind08_brief_overv_hol4} theorem prover into an equivalent Verilog design. 
 This semantics is quite practical as it is restricted to a small subset of Verilog, which can nonetheless be used to model the hardware constructs required for HLS.  The main features that are excluded are continuous assignment and combinational \alwaysblock{}s; these are modelled in other semantics such as that by~\citet{meredith10_veril}. %however, these are not necessarily needed, but require more complicated event queues and execution model.
@@ -119,10 +119,6 @@ Therefore, in addition to the rules shown in Figure~\ref{fig:inference_module},
 
 The Verilog semantics do not define a memory model for Verilog, as this is not needed for a hardware description language.  There is no preexisting architecture that Verilog will produce, it can describe any memory layout that is needed.  Instead of having specific semantics for memory, the semantics only needs to support the language features that can produce these different memory layouts, these being Verilog arrays.  We therefore define semantics for updating Verilog arrays using blocking and nonblocking assignment.  We then have to prove that the C memory model that \compcert{} uses matches with the interpretation of arrays that are used in Verilog.  The \compcert{} memory model is infinite, whereas our representation of arrays in Verilog is inherently finite.  There have already been various efforts to define a finite memory model for all compiler passes in \compcert{}, such as Comp\-Cert\-S~\cite{besson18_compc}, Comp\-Cert\-ELF~\cite{wang20_compc} and Comp\-Cert\-TSO~\cite{sevcik13_compc}, however, we define the translation from \compcert{}'s standard infinite memory model to finitely sized arrays that can be represented in Verilog, leaving the compiler passes intact.
 
-%\JW{It's not completely clear what the relationship is between your work and those works. The use of `only' suggests that you've re-done a subset of work that has already been done -- is that the right impression?}\YH{Hopefully that's more clear.}
-
-This translation is represented in Figure~\ref{fig:memory_model_transl}, where \compcert{} defines a map from blocks to maps from memory address to memory contents.  Instead, our Verilog semantics define two finitely sized arrays of optional values, one for the blocking assignments map $\Gamma_{\rm a}$ and one for the nonblocking assignments map $\Delta_{\rm a}$.  The optional values are present to ensure correct merging of the two association maps at the end of the clock cycle.
-
 \begin{figure}
   \centering
   \begin{tikzpicture}
@@ -132,6 +128,7 @@ This translation is represented in Figure~\ref{fig:memory_model_transl}, where \
     \node[right] at (7,-0.3) {\small \textbf{Verilog Memory Representation}};
     \node[right] (x0) at (0.2,-1.9) {\small 0};
     \node[right] (x1) at (0.2,-2.5) {\small 1};
+    \node[rotate=90] (x2) at (0.43,-3.1) {$\cdots$};
     \foreach \x in {0,...,6}{%
       \node[right] (s\x) at (2.5,-1-\x*0.3) {\small \x};
       \node[right] (t\x) at (4,-1-\x*0.3) {};
@@ -191,9 +188,13 @@ This translation is represented in Figure~\ref{fig:memory_model_transl}, where \
     \draw (7,-4.3) -- (12,-4.3);
     \node at (9.5,-4.7) {\small \texttt{stack[0] <= 0xDEADBEEF;}};
   \end{tikzpicture}
-  \caption{Change in the memory model during the translation of 3AC to HTL.  This is immediately after the assignment to the array.}\label{fig:memory_model_transl}
+  \caption{Change in the memory model during the translation of 3AC to HTL.  The state of the memories in each case is right after the execution of the store to memory.}\label{fig:memory_model_transl}
 \end{figure}
 
+%\JW{It's not completely clear what the relationship is between your work and those works. The use of `only' suggests that you've re-done a subset of work that has already been done -- is that the right impression?}\YH{Hopefully that's more clear.}
+
+This translation is represented in Figure~\ref{fig:memory_model_transl}, where \compcert{} defines a map from blocks to maps from memory address to memory contents.  Each block represents an area in memory, for example, a block can represent a global variable or a stack for a function.  Instead, our Verilog semantics define two finitely sized arrays of optional values, one for the blocking assignments map $\Gamma_{\rm a}$ and one for the nonblocking assignments map $\Delta_{\rm a}$.  The optional values are present to ensure correct merging of the two association maps at the end of the clock cycle.  During our translation we only convert block 0 to a Verilog memory, and ensure that it is the only block that is present.  This means that the block necessarily represents the stack of the main function.  The invariant that then has to hold in the proofs, is that block 0 should be equivalent to the merged representation of the $\Gamma_{rm a}$ and $\Delta_{rm a}$ maps.
+
 %However, in practice, assigning and reading from an array directly in the state machine will not produce a memory in the final hardware design, as the synthesis tool cannot identify the array as having the necessary properties that a RAM needs, even though this is the most natural formulation of memory.  Even though theoretically the memory will only be read from once per clock cycle, the synthesis tool cannot ensure that this is true, and will instead create a register for each memory location.  This increases the size of the circuit dramatically, as the RAM on the FPGA chip will not be reused.  Instead, the synthesis tool expects a specific interface that ensures these properties, and will then transform the interface into a proper RAM during synthesis.  Therefore, a translation has to be performed from the naive use of memory in the state machine, to a proper use of a memory interface.
 
 %\begin{figure}