-rw-r--r--  algorithm.tex     34
-rw-r--r--  evaluation.tex    13
-rw-r--r--  introduction.tex  15
-rw-r--r--  limitations.tex    2
-rw-r--r--  main.tex           6
-rw-r--r--  proof.tex         27
-rw-r--r--  references.bib     1
-rw-r--r--  related.tex        6
-rw-r--r--  verilog.tex       16
9 files changed, 66 insertions, 54 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 53ffaaa..cf2eb50 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -23,7 +23,7 @@ Verilog~\cite{06_ieee_stand_veril_hardw_descr_languag} is an HDL that can be syn
\paragraph{Choice of implementation language}
We chose Coq as the implementation language because of its mature support for code extraction; that is, its ability to generate OCaml programs directly from the definitions used in the theorems.
We note that other authors have had some success reasoning about the HLS process using other theorem provers such as Isabelle~\cite{ellis08}.
-\compcert{}~\cite{leroy09_formal_verif_realis_compil} was chosen as the front end framework, as it is a mature framework for simulation proofs about intermediate languages, and it already provides a validated C parser~\cite{jourdan12_valid_lr_parser}.
+\compcert{}~\cite{leroy09_formal_verif_realis_compil} was chosen as the front end because it has a well-established framework for simulation proofs about intermediate languages, and it already provides a validated C parser~\cite{jourdan12_valid_lr_parser}.
The Vellvm framework~\cite{zhao12_formal_llvm_inter_repres_verif_progr_trans} was also considered because several existing HLS tools are already LLVM-based, but additional work would be required to support a high-level language like C as input.
The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\cite{kiwi}, and LLHD~\cite{schuiki20_llhd} has been recently proposed as an intermediate language for hardware design, but neither are suitable for us because they lack formal semantics.
@@ -47,7 +47,7 @@ The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\c
\node[language] at (6.7,-1.5) (verilog) {Verilog};
\node at (0,1) {\bf\compcert{}};
\node at (0,-1.5) {\bf\vericert{}};
- \node[align=center] at (3.3,-2.4) {\footnotesize RAM\\[-0.5em]\footnotesize insertion};
+ \node[align=center] at (3.5,-2.4) {\footnotesize RAM\\[-0.5em]\footnotesize insertion};
\draw[->,thick] (clight) -- (conta);
\draw[->,thick] (conta) -- (cminor);
\draw[->,thick] (cminor) -- (rtl);
@@ -58,7 +58,7 @@ The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\c
\draw[->,thick] (htl) -- (verilog);
\draw[->,thick] (htl.west) to [out=180,in=150] (4,-2.2) to [out=330,in=270] (htl.south);
\end{tikzpicture}%}
- \caption{\vericert{} as a Verilog back end to \compcert{}}%
+ \caption{\vericert{} as a Verilog back end to \compcert{}. \JW{Did we ought to add CompCert's other back ends to the diagram? X86 etc? Otherwise it might look like we have a very out-of-date view of CompCert.}}%
\label{fig:rtlbranch}
\end{figure}
@@ -72,7 +72,7 @@ The main work flow of \vericert{} is given in Figure~\ref{fig:rtlbranch}, which
We select CompCert's three-address code (3AC)\footnote{This is known as register transfer language (RTL) in the \compcert{} literature. `3AC' is used in this paper instead to avoid confusion with register-transfer level (RTL), which is another name for the final hardware target of the HLS tool.} as the starting point. Branching off \emph{before} this point (at CminorSel or earlier) denies \compcert{} the opportunity to perform optimisations such as constant propagation and dead code elimination, which, despite being designed for software compilers, have been found useful in HLS tools as well~\cite{cong+11}. And if we branch off \emph{after} this point (at LTL or later) then \compcert{} has already performed register allocation to reduce the number of registers and spill some variables to the stack; this transformation is not required in HLS because there are many more registers available, and these should be used instead of RAM whenever possible. %\JP{``\compcert{} performs register allocation during the translation to LTL, with some registers spilled onto the stack: this is unnecessary in HLS since as many registers as are required may be described in the output RTL.''} \JP{Maybe something about FPGAs being register-dense (so rarely a need to worry about the number of flops)?}
3AC is also attractive because it is the closest intermediate language to LLVM IR, which is used by several existing HLS compilers. %\JP{We already ruled out LLVM as a starting point, so this seems like it needs further qualification.}\YH{Well not because it's not a good starting point, but the ecosystem in Coq isn't as good. I think it's still OK here to say that being similar to LLVM IR is an advantage?}
-It has an unlimited number of pseudo-registers, and is represented as a control flow graph (CFG) where each instruction is a node with links to the instructions that can follow it. One difference between LLVM IR and 3AC is that 3AC includes operations that are specific to the chosen target architecture; we chose to target the x86\_32 backend, because it generally produces relatively dense 3AC thanks to the availability of complex addressing modes.% reducing cycle counts in the absence of an effective scheduling approach.
+It has an unlimited number of pseudo-registers, and is represented as a control flow graph (CFG) where each instruction is a node with links to the instructions that can follow it. One difference between LLVM IR and 3AC is that 3AC includes operations that are specific to the chosen target architecture; we chose to target the x86\_32 backend because it generally produces relatively dense 3AC thanks to the availability of complex addressing modes.% reducing cycle counts in the absence of an effective scheduling approach.
\begin{figure}
\centering
@@ -179,11 +179,11 @@ As part of this translation, function inlining is performed on all functions, wh
The next translation is from 3AC to a new hardware translation language (HTL). %, which is one step towards being completely translated to hardware described in Verilog.
This involves going from a CFG representation of the computation to a finite state machine with data-path (FSMD) representation~\cite{hwang99_fsmd}. The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers. %\JP{I've become less comfortable with this term, but it's personal preference so feel free to ignore. I think `generalised finite state machine' (i.e.\ thinking of the entire `data-path' as contributing to the overall state) is more accurate.}\YH{Hmm, yes, I mainly chose FSMD because there is quite a lot of literature around it. I think for now I'll keep it but for the final draft we could maybe change it.}
%This means that the state transitions can be translated into a simple finite state machine (FSM) where each state contains data operations that update the memory and registers.
-Hence, an HTL program consists of two maps from states to Verilog statements: control logic and data-path maps that express state transitions and computations respectively.
+Hence, an HTL program consists of two maps from states to Verilog statements: the \emph{control logic} map, which expresses state transitions, and the \emph{data-path} map, which expresses computations.
Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The right-hand block is the control logic that computes the next state, while the left-hand block updates all the registers and RAM based on the current program state.
The HTL language was mainly introduced to make it easier to prove the translation from 3AC to Verilog, as these languages have very different semantics.
-It serves as an intermediate language with similar semantics to 3AC at the top-level, using maps to represents what to execute at every state, and similar semantics to Verilog at the low-level by already using Verilog statements instead of more abstract instructions.
+It serves as an intermediate language with similar semantics to 3AC at the top level, using maps to represent what to execute at every state, and similar semantics to Verilog at the low level by already using Verilog statements instead of more abstract instructions.
Compared to plain Verilog, HTL is simpler to manipulate and analyse, thereby making it easier to prove optimisations like proper RAM insertion.
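+To give a flavour of the FSMD structure, the two maps correspond to two always-blocks that share a state register. The following hand-written sketch is in the style of Figure~\ref{fig:accumulator_v}, but is not \vericert{}'s exact output:
+\begin{minted}{verilog}
+// Control logic: computes only the next state.
+always @(posedge clk)
+  case (state)
+    8'd8: state <= 8'd3;
+    8'd3: state <= 8'd2;
+  endcase
+// Data-path: updates the registers for the current state.
+always @(posedge clk)
+  case (state)
+    8'd8: x5 <= 32'd3;
+    8'd3: x4 <= x5 + x3;
+  endcase
+\end{minted}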
\begin{figure*}
@@ -302,13 +302,15 @@ A high-level overview of the architecture and of the RAM interface can be seen i
\paragraph{Translating instructions}
-Each 3AC instruction either corresponds to a hardware construct or does not have to be handled by the translation, such as function calls (because of inlining). \JW{Are function calls the only 3AC instruction that we ignore? (And I guess return statements too for the same reason.)}\YH{Actually, return instructions are translated (because you can return from main whenever), so call instructions (Icall, Ibuiltin and Itailcall) are the only functions that are not handled.}
-For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in the Verilog generated from HTL at state 8 in both the control logic and data-path always-blocks, shown in Figure~\ref{fig:accumulator_v}. Simple operator instructions are translated in a similar way. For example, the add instruction is just translated to the built-in add operator, similarly for the multiply operator. All 32-bit instructions can be translated in this way, but some special instructions require extra care. One such is the \texttt{Oshrximm} instruction, which is discussed further in Section~\ref{sec:algorithm:optimisation:oshrximm}. Another is the \texttt{Oshldimm} instruction, which is a left rotate instruction that has no Verilog equivalent and therefore has to be implemented in terms of other operations and proven to be equivalent.
+\JW{Most 3AC instructions correspond to hardware constructs.}
+%Each 3AC instruction either corresponds to a hardware construct or does not have to be handled by the translation, such as function calls (because of inlining). \JW{Are function calls the only 3AC instruction that we ignore? (And I guess return statements too for the same reason.)}\YH{Actually, return instructions are translated (because you can return from main whenever), so call instructions (Icall, Ibuiltin and Itailcall) are the only functions that are not handled.}
+% JW: Thanks; please check proposed new text.
+For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in the Verilog generated from HTL at state 8 in both the control logic and data-path always-blocks, shown in Figure~\ref{fig:accumulator_v}. Simple operator instructions are translated in a similar way; for example, the add instruction is translated to the built-in add operator, and likewise for the multiply operator. All 32-bit instructions can be translated in this way, but some special instructions require extra care. One such instruction is \texttt{Oshrximm}, which is discussed further in Section~\ref{sec:algorithm:optimisation:oshrximm}. Another is \texttt{Oshldimm}, a left rotate instruction that has no Verilog equivalent and therefore has to be implemented in terms of other operations and proven to be equivalent. \JW{The only 32-bit instructions that we do not translate are those related to function calls (\texttt{Icall}, \texttt{Ibuiltin}, and \texttt{Itailcall}), because of inlining.}
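+As an illustrative sketch of the latter kind of expansion (the signal names and the constant \texttt{n} are ours, not \vericert{}'s exact encoding), a left rotation by a constant amount can be built from two shifts and a disjunction:
+\begin{minted}{verilog}
+// Rotate x left by the constant n: Verilog has no rotate operator,
+// so combine a left shift with the wrapped-around right shift.
+wire [31:0] rol = (x << n) | (x >> (32 - n));
+\end{minted}
+It is this kind of reimplementation that must then be proven equivalent to the original instruction's \compcert{} semantics.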
\subsubsection{Translating HTL to Verilog}
Finally, we have to translate the HTL code into proper Verilog. % and prove that it behaves the same as the 3AC according to the Verilog semantics.
-The challenge here is to translate our FSMD representation into a Verilog AST. However, as all the instructions in HTL are already expressed as Verilog statements, only the top level data-path and control logic maps need to be translated to valid Verilog. We also require declarations for all the variables in the program, as well as declarations of the inputs and outputs to the module, so that the module can be used inside a larger hardware design. In addition to translating the maps of Verilog statements, an always-block that will behave like the RAM also has to be created, which is only modelled abstractly at the HTL level.
+The challenge here is to translate our FSMD representation into a Verilog AST. However, as all the instructions in HTL are already expressed as Verilog statements, only the top-level data-path and control logic maps need to be translated to valid Verilog. We also require declarations for all the variables in the program, as well as declarations of the inputs and outputs to the module, so that the module can be used inside a larger hardware design. In addition to translating the maps of Verilog statements, we must also create an always-block that implements the RAM, which is only modelled abstractly at the HTL level.
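+To make these requirements concrete, the generated module has roughly the following shape; the port and register names here are illustrative only:
+\begin{minted}{verilog}
+module main(input clk, input rst, input start,
+            output reg finish, output reg [31:0] return_val);
+  reg [31:0] x5;          // one declaration per 3AC pseudo-register
+  reg [7:0]  state;       // the program counter, now a real register
+  reg [31:0] stack [0:1]; // the function's stack, as a Verilog array
+endmodule
+\end{minted}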
Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated for our example.
Although this translation seems quite straight\-forward, proving that this translation is correct is complex.
@@ -326,11 +328,11 @@ Although we would not claim that \vericert{} is a proper `optimising' HLS compil
\subsubsection{Byte- and word-addressable memories}
One big difference between C and Verilog is how memory is represented. Although Verilog arrays use similar syntax to C arrays, they must be treated quite differently. To make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.
-However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}. If a byte-addressable memory was used in the target hardware, which is closer to \compcert{}'s memory model, then a load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle. It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware. Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way. Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four. If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
+However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}. If a byte-addressable memory, which is closer to \compcert{}'s memory model, were used in the target hardware, then each load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle. It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware. Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way. Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores will be multiples of four. If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
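+Assuming well-aligned accesses, this address translation amounts to a single shift; in a hypothetical sketch (the signal names are ours):
+\begin{minted}{verilog}
+// Convert a byte address into a word address: divide by four.
+wire [31:0] word_addr = byte_addr >> 2;
+\end{minted}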
\subsubsection{Implementation of RAM interface}
The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.
-Interestingly, the syntax for the RAM interface is quite strict, as the synthesis tool will pattern-match on it and only work for a predefined set of interfaces.
+Interestingly, the Verilog syntax for the RAM interface is quite strict, as the synthesis tool pattern-matches on it and only recognises a predefined set of interfaces.
%\JW{I think the following sentence could be cut as we've said this kind of thing a couple of times already.} Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.
@@ -339,9 +341,11 @@ Therefore, an extra compiler pass is added from HTL to HTL to extract all the di
%\JW{I've called that negedge always-block the `RAM driver' in my proposed text above -- that feels like quite a nice a word for it to my mind -- what do you think?}\YH{Yes I quite like it!}
%Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs. If, for example, arrays in Verilog are accessed immediately in the data-path, then the synthesis tool is not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers. This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit. The synthesis tools therefore provide code snippets that they know how to transform into various constructs, including snippets that will generate proper RAMs in the final hardware. This process is called memory inference. The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and logic. We therefore have another pass from HTL to itself which performs the translation from this na\"ive use of arrays to a representation which always allows for memory inference. This pass creates a separate always block to perform the loads and stores to the memory, and adds the necessary data, address and enable signals to communicate with that always-block from other always-blocks. This always-block is shown between lines 10-15 in Figure~\ref{fig:accumulator_v}.
-There are two interesting parts to the inserted RAM interface. Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design which is triggered on the positive edge of the clock. The main advantage is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they only take two clock cycles and one clock cycle instead, greatly improving their performance. \JW{Is this a standard `trick' in hardware design? If so it might be nice to cite it.}\YH{Hmm, not really, because it has the downside of kind of halving your available clock period. However, RAMs normally come in both forms on the FPGA (Page 12, Figure 2, \url{https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug_ram_rom.pdf})} In addition to that, using the negative edge for the clock is supported by many synthesis tools, and does not affect the maximum frequency of the final design.
+There are two interesting parts to the inserted RAM interface. Firstly, the memory updates are triggered on the negative (falling) edge of the clock, out of phase with the rest of the design, which is triggered on the positive (rising) edge of the clock. The advantage of this is that loads and stores take only two clock cycles and one clock cycle respectively, rather than three and two, greatly improving their performance. %\JW{Is this a standard `trick' in hardware design? If so it might be nice to cite it.}\YH{Hmm, not really, because it has the downside of kind of halving your available clock period. However, RAMs normally come in both forms on the FPGA (Page 12, Figure 2, \url{https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug_ram_rom.pdf})}
+% JW: thanks!
+Using the negative edge of the clock is widely supported by synthesis tools, and does not affect the maximum frequency of the final design.
-Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical. In existing hardware designs, enable signals are normally manually controlled and inserted into the appropriate states, by using a check like the following in the RAM:\@ \texttt{en == 1}. This means that the RAM only turns on when the enable signal is set. However, to make the proof simpler and to not have to reason about possible side effects introduced by the RAM being enabled but not used, a RAM which disables itself after every use would be ideal. One method for implementing this would be to insert an extra state after each load or store that disables the RAM accordingly, but this would eliminate the speed advantage of the negative-edge-triggered RAM. Another method would be to determine the next state after each load or store and add logic to disable the RAM in that state, but this could quickly become complicated, especially in the case where the next state contains another memory operation, and hence the disable signal should not be added. The method we ultimately chose was to have the RAM become enabled not when the enable signal is high, but when it toggles its value. This can be arranged by keeping track of the old value of the enable signal in \texttt{en} and comparing it to the current value \texttt{u\_en} set by the data-path. When the values are different, the RAM gets enabled, and then \texttt{en} is set to the value of \texttt{u\_en}. This ensures that the RAM will always be disabled directly after it was used without having to modify any extra states.
+Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical. Enable signals are normally manually controlled and inserted into the appropriate states, by using a check like \texttt{en == 1} in the RAM. This means that the RAM only turns on when the enable signal is set. However, to make the proof simpler and to avoid reasoning about possible side effects introduced by the RAM being enabled but not used, a RAM which disables itself after every use would be ideal. One method for implementing this would be to insert an extra state after each load or store that disables the RAM, but this extra state would eliminate the speed advantage of the negative-edge-triggered RAM. Another method would be to determine the next state after each load or store and disable the RAM in that state, but this could quickly become complicated, especially in the case where the next state also contains a memory operation, and hence the disable signal should not be added. The method we ultimately chose was to have the RAM become enabled not when the enable signal is high, but when it \emph{toggles} its value. This can be arranged by keeping track of the old value of the enable signal in \texttt{en} and comparing it to the current value \texttt{u\_en} set by the data-path. When the values are different, the RAM gets enabled, and then \texttt{en} is set to the value of \texttt{u\_en}. This ensures that the RAM will always be disabled straight after it was used, without having to insert or modify any other states.
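+Putting the two ideas together, the inserted RAM driver has roughly the following shape. This is a simplified sketch of the interface on lines 9--15 of Figure~\ref{fig:accumulator_v}, not a verbatim copy:
+\begin{minted}{verilog}
+always @(negedge clk)
+  if (u_en != en) begin            // fires only when u_en has toggled
+    if (wr_en) mem[addr] <= d_in;  // store
+    else       d_out <= mem[addr]; // load
+    en <= u_en;                    // self-disable until the next toggle
+  end
+\end{minted}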
%We can instead generate a second enable signal that is set by the user, and the original enable signal is then updated by the RAM to be equal to the value that the user set. This means that the RAM should be enabled whenever the two signals are different, and disabled otherwise.
@@ -387,7 +391,7 @@ Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is al
\caption{Timing diagrams showing the execution of loads and stores over multiple clock cycles.}\label{fig:ram_load_store}
\end{figure}
-Figure~\ref{fig:ram_load} shows an example of how the waveforms in the RAM shown in Figure~\ref{fig:accumulator_v} behave when a value is loaded. To initiate a load, the data-path enable signal \texttt{u\_en} flag is toggled, the address \texttt{addr} is set and the write enable \texttt{wr\_en} is set to low. This all happens at the positive edge of the clock, in the time slice 1. Then, on the next negative edge of the clock, at time slice 2, the \texttt{u\_en} is now different to the RAM enable \texttt{en}, so the RAM is enabled. A load is then performed by assigning the \texttt{d\_out} register to the value stored at the address in the RAM and the \texttt{en} is set to the same value as \texttt{u\_en} to disable the RAM again. Finally, on the next positive edge of the clock, the value in \texttt{d\_out} is assigned to the destination register \texttt{r}. An example of a store is shown in Figure~\ref{fig:ram_store}, which instead assigns the \texttt{d\_in} register with the value to be stored. The store is then performed on the negative edge of the clock and is therefore complete by the next positive edge.
+\JW{The following paragraph could probably be cut, as the same explanation is already in the Figure 4 caption, and replaced with something like ``Figure~\ref{fig:ram_load_store} gives an example of how the RAM interface behaves when values are loaded and stored.''} Figure~\ref{fig:ram_load} shows an example of how the waveforms in the RAM in Figure~\ref{fig:accumulator_v} behave when a value is loaded. To initiate a load, the data-path enable signal \texttt{u\_en} is toggled, the address \texttt{addr} is set, and the write enable \texttt{wr\_en} is set low. This all happens at the positive edge of the clock, at time slice 1. Then, on the next negative edge of the clock, at time slice 2, \texttt{u\_en} is now different from the RAM enable \texttt{en}, so the RAM is enabled. A load is then performed by assigning the \texttt{d\_out} register to the value stored at the address in the RAM, and \texttt{en} is set to the same value as \texttt{u\_en} to disable the RAM again. Finally, on the next positive edge of the clock, the value in \texttt{d\_out} is assigned to the destination register \texttt{r}. An example of a store is shown in Figure~\ref{fig:ram_store}. The \texttt{d\_in} register is assigned the value to be stored. The store is then performed on the negative edge of the clock and is therefore complete by the next positive edge.
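+From the data-path's side, initiating a load is then just a handful of parallel assignments in one state, with the result collected in the next state; the following sketch uses arbitrary state numbers:
+\begin{minted}{verilog}
+// State 4: request the read; the RAM answers on the next negedge.
+8'd4: begin u_en <= ~u_en; wr_en <= 1'b0; addr <= 32'd1; end
+// State 5: the loaded value is now available in d_out.
+8'd5: r <= d_out;
+\end{minted}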
\subsubsection{Implementing the \texttt{Oshrximm} instruction}\label{sec:algorithm:optimisation:oshrximm}
@@ -414,7 +418,7 @@ One might hope that the synthesis tool consuming our generated Verilog would con
where $\gg$ stands for a logical right shift. %Once this equivalence about the shifts and division operator is proven correct, it can be used to implement the \texttt{Oshrximm} using the efficient shift version instead of how the \compcert{} semantics described it.
When proving this equivalence, we actually found a bug in our original implementation that was due to the fact that a na\"{i}ve shift rounds towards $-\infty$.
-\compcert{} eventually performs a translation from this representation into assembly code which uses shifts to implement the division, however, the specification of the instruction itself still uses division instead of shifts, meaning the proof of the translation cannot be reused. In \vericert{}, the equivalence of the representation in terms of divisions and shifts is proven over the integers and the specification, thereby making it simpler to prove the correctness of the Verilog implementation in terms of shifts.
+\JW{I don't really understand the following paragraph.} \compcert{} eventually performs a translation from this representation into assembly code which uses shifts to implement the division; however, the specification of the instruction itself still uses division rather than shifts, meaning the proof of the translation cannot be reused. In \vericert{}, the equivalence of the representation in terms of divisions and shifts is proven over the integers and the specification, thereby making it simpler to prove the correctness of the Verilog implementation in terms of shifts.
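+For reference, the shift-based computation that is proven equivalent to the division has the following well-known shape (a sketch only; \vericert{} emits the corresponding Verilog expression rather than this exact code):
+\begin{minted}{verilog}
+// Signed division of x by 2^n, rounding towards zero: bias negative
+// operands by 2^n - 1 before the arithmetic right shift, correcting
+// the naive shift's rounding towards -infinity.
+wire [31:0] bias = x[31] ? ((32'd1 << n) - 32'd1) : 32'd0;
+wire signed [31:0] q = ($signed(x) + $signed(bias)) >>> n;
+\end{minted}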
%The \compcert{} semantics for the \texttt{Oshrximm} instruction expresses its operation exactly as shown in the equation above, even though in hardware the computation that would be performed would be different. In \vericert{}, if the same operation would be implemented using Verilog operators, it is not guaranteed to be optimised correctly by the synthesis tools that convert the Verilog into a circuit. To guarantee an output that does not include divides, we therefore have to express it in Verilog using shifts, and then prove that this representation is equivalent to the divide representation used in the \compcert{} semantics. While conducting the proof, we discovered quite a few bugs in our initial implementation of optimisations, which rounded to $-\infty$ instead of 0.
diff --git a/evaluation.tex b/evaluation.tex
index 82f79d2..c9426f8 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -10,8 +10,11 @@ Our evaluation is designed to answer the following three research questions.
\subsection{Experimental Setup}
\label{sec:evaluation:setup}
+\newcommand\legupnoopt{\legup{} no-opt}
+\newcommand\legupnooptchain{\legup{} no-opt no-chaining}
+
\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 4.0, because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}. We also compare against \legup{} with different optimisation levels in an effort to understand which optimisations have the biggest impact on the performance discrepancies between \legup{} and \vericert{}. The baseline \legup{} version has all the default automatic optimisations turned on. % \vericert{} is also compared with other optimisation levels of \legup{}. %JW: removed because we said that a couple of sentences ago.
-First, we only turn off the LLVM optimisations in \legup{}, to eliminate all the optimisations that are common to standard software compilers, referred to as \legup{} w/o opt. Secondly, we also compare against \legup{} with LLVM optimisations and operation chaining turned off, referred to as \legup{} w/o opt+chain. Operation chaining is an HLS-specific optimisation that combines data-dependent operations into one clock cycle, and therefore dramatically reduces the number of cycles, without necessarily decreasing the clock speed.
+Firstly, we turn off only the LLVM optimisations in \legup{}, to eliminate all the optimisations that are common to standard software compilers; we refer to this configuration as `\legupnoopt{}'. Secondly, we also compare against \legup{} with both LLVM optimisations and operation chaining turned off, referred to as `\legupnooptchain{}'. Operation chaining \JW{Should we cite https://ieeexplore.ieee.org/document/4397305 here? Do you think that's the right reference for op-chaining?} is an HLS-specific optimisation that combines data-dependent operations into one clock cycle, and therefore dramatically reduces the number of cycles, without necessarily decreasing the clock speed.
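+To illustrate what operation chaining does (a hand-written sketch, not \legup{} output), consider two data-dependent additions:
+\begin{minted}{verilog}
+// Without chaining: one operation per state, two clock cycles.
+8'd1: t <= a + b;
+8'd2: r <= t + c;
+// With chaining: both additions packed into a single cycle.
+8'd1: r <= (a + b) + c;
+\end{minted}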
\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the \polybench{} benchmark suite (version 4.2.1)~\cite{polybench}, which is a collection of 30 numerical kernels. \polybench{} is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, requiring floats, which we do not support.
@@ -82,10 +85,10 @@ We configured \polybench{}'s parameters so that only integer types are used. We
% JW: redraw axis border which has been partially covered by the grey bars
\draw (axis cs:-0.5,0.3) rectangle (axis cs:27.5,10);
- \legend{\vericert{},\legup{} w/o opt+chain,\legup{} w/o opt};
+ \legend{\vericert{},\legupnooptchain{},\legupnoopt{}};
\end{groupplot}
\end{tikzpicture}
- \caption{\polybench{} with division/modulo operations enabled. The top graph shows the execution time of \vericert{}, \legup{} without LLVM optimisations and without operation chaining and \legup{} without front end LLVM optimisations relative to optimised \legup{}. The bottom graph shows the area relative to \legup{}.}\label{fig:polybench-div}
+ \caption{Performance of \vericert{} compared to \legup{}, with division and modulo operations enabled. The top graph compares the execution times and the bottom graph compares the area of the generated designs. In both cases, the performance of \vericert{}, \legup{} without LLVM optimisations and without operation chaining, and \legup{} without LLVM optimisations is compared against default \legup{}.}\label{fig:polybench-div}
\end{figure}
\pgfplotstableread[col sep=comma]{results/rel-time-nodiv.csv}{\nodivtimingtable}
@@ -139,10 +142,10 @@ We configured \polybench{}'s parameters so that only integer types are used. We
\draw (axis cs:-1,1) -- (axis cs:28,1);
\draw (axis cs:-0.5,0.3) rectangle (axis cs:27.5,4);
- \legend{\vericert{},\legup{} w/o opt+chain,\legup{} w/o opt};
+ \legend{\vericert{},\legupnooptchain{},\legupnoopt{}};
\end{groupplot}
\end{tikzpicture}
- \caption{\polybench{} with division/modulo operations replaced by an iterative algorithm. The top graph shows the execution time of \vericert{}, \legup{} without LLVM optimisations and without operation chaining and \legup{} without front end LLVM optimisations relative to optimised \legup{}. The bottom graph shows the area relative to \legup{}.}\label{fig:polybench-nodiv}
+ \caption{Performance of \vericert{} compared to \legup{}, with division and modulo operations replaced by an iterative algorithm in software. The top graph compares the execution times and the bottom graph compares the area of the generated designs. In both cases, the performance of \vericert{}, \legup{} without LLVM optimisations and without operation chaining, and \legup{} without LLVM optimisations is compared against default \legup{}.}\label{fig:polybench-nodiv}
\end{figure}
Firstly, before comparing any performance metrics, it is worth highlighting that any Verilog produced by \vericert{} is guaranteed to be \emph{correct}, whilst no such guarantee can be provided by \legup{}.
diff --git a/introduction.tex b/introduction.tex
index 2cdada9..21a7dd8 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -25,11 +25,11 @@ More recently, \citet{du21_fuzzin_high_level_synth_tools} fuzz-tested three comm
\paragraph{Existing workarounds}
-Aware of the reliability shortcomings of HLS tools, hardware designers routinely check the generated hardware for functional correctness. This is commonly done by simulating the generated design against a large test-bench. But unless the test-bench covers all inputs exhaustively, which is often infeasible, there is a risk that bugs remain.
+Aware of the reliability shortcomings of HLS tools, hardware designers routinely check the generated hardware for functional correctness. This is commonly done by simulating the generated design against a large test-bench. But unless the test-bench covers all inputs exhaustively -- which is often infeasible -- there is a risk that bugs remain.
-An alternative is to use \emph{translation validation}~\cite{pnueli98_trans} to prove equivalence between the input program and the output design. Translation validation has been successfully applied to several HLS optimisations~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth,banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops}.
-But it is an expensive task, especially for large designs, and it must be repeated every time the compiler is invoked.
-For example, the translation validation for Catapult C~\cite{mentor20_catap_high_level_synth} may require several rounds of expert `adjustments'~\cite[p.~3]{slec_whitepaper} to the input C program before validation succeeds. And even when it succeeds, translation validation does not provide watertight guarantees unless the validator itself has been mechanically proven correct~\cite{tristan08_formal_verif_trans_valid}, which has not been the case in HLS tools to date.
+One alternative is to use \emph{translation validation}~\cite{pnueli98_trans} to prove equivalence between the input program and the output design. Translation validation has been successfully applied to several HLS optimisations~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth,banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops}.
+Nevertheless, it is an expensive task, especially for large designs, and it must be repeated every time the compiler is invoked.
+For example, the translation validation for Catapult C~\cite{mentor20_catap_high_level_synth} may require several rounds of expert `adjustments'~\cite[p.~3]{slec_whitepaper} to the input C program before validation succeeds. And even when it succeeds, translation validation does not provide watertight guarantees unless the validator itself has been mechanically proven correct~\cite[e.g.][]{tristan08_formal_verif_trans_valid}, which has not been the case in HLS tools to date.
%\NR{There is also use of the word `input' in this paragraph for a different context.} %JW: Yes it was used in two different ways in two consecutive paragraphs. Thanks, fixed now.
%\JW{Having nuanced our discussion of TV above, I feel like the text below belongs more in a `future directions' paragraph at the end of the paper than in an `existing workarounds' section.} Nevertheless translation validation has many benefits in a mechanically verified setting as well to simplify the proofs to depend less on the exact implementation of the optimisation. It has also already been used to prove certain passes in \compcert{} correct. The main issue with the translation validation methods applied in HLS tools normally is that they \NR{\sout{try and}} generalise over all optimisations that are performed and \NR{\sout{try to}} compare the generated hardware directly to the high-level input. \NR{The word input used here again.} However, verification requires optimisations to be proven correct incrementally and separately, making translation validation more viable. By proving specific optimisations with a constraint on the kinds of transformations it can perform, it is possible to write a verified validator that is also believed to be complete and should not fail on valid transformations unless bugs are present.
@@ -37,16 +37,17 @@ For example, the translation validation for Catapult C~\cite{mentor20_catap_high
Our position is that none of the above workarounds are necessary if the HLS tool can simply be trusted to work correctly. %\NR{Perhaps, we can add something like: `... and our efforts are the first step towards building this trust within HLS tools.'.} %JW: I think that would be over-egging the cake.
\paragraph{Our solution}
-We have designed a new HLS tool in the Coq theorem prover and proved that any output it produces always has the same behaviour as its input. Our tool, called \vericert{},\ifANONYMOUS\footnote{Tool name has been changed for blind review.}\fi{} is automatically extracted to an OCaml program from Coq, which ensures that the object of the proof is the same as the implementation of the tool. \vericert{} is built by extending the \compcert{} verified C compiler~\cite{leroy09_formal_verif_realis_compil} with a new hardware-specific intermediate language and a Verilog back end. It supports most C constructs, including integer operations, function calls, local arrays, structs, unions, and general control-flow statements, but currently excludes support for case statements, function pointers, recursive function calls, 32-bit integers, floats, and global variables.
+We have designed a new HLS tool in the Coq theorem prover and proved that any output design it produces always has the same behaviour as its input program. Our tool, called \vericert{},\ifANONYMOUS\footnote{Tool name has been changed for blind review.}\fi{} is automatically extracted to an OCaml program from Coq, which ensures that the object of the proof is the same as the implementation of the tool. \vericert{} is built by extending the \compcert{} verified C compiler~\cite{leroy09_formal_verif_realis_compil} with a new hardware-specific intermediate language and a Verilog back end. It supports most C constructs, including integer operations, function calls (which are all inlined), local arrays, structs, unions, and general control-flow statements, but currently excludes support for case statements, function pointers, recursive function calls, non-32-bit integers, floats, and global variables.
\paragraph{Contributions and Outline}
The contributions of this paper are as follows:
\begin{itemize}
\item We present \vericert{}, the first mechanically verified HLS tool that compiles C to Verilog. In Section~\ref{sec:design}, we describe the design of \vericert{}, including a few optimisations related to memory accesses and division.
- \item We state the correctness theorem of \vericert{} with respect to an existing semantics for Verilog due to \citet{loow19_proof_trans_veril_devel_hol}. In Section~\ref{sec:verilog}, we describe how we extended this semantics to make it suitable as an HLS target. We also describe how the Verilog semantics are integrated into CompCert's model of the semantics, and how CompCert's memory model is translated to Verilog's low-level and finite memory model.
+ \item We state the correctness theorem of \vericert{} with respect to an existing semantics for Verilog due to \citet{loow19_proof_trans_veril_devel_hol}. In Section~\ref{sec:verilog}, we describe how we extended this semantics to make it suitable as an HLS target. We also describe how the Verilog semantics are integrated into CompCert's model of the semantics \JW{I'm not sure what that means}, and how CompCert's infinite memory model is translated to Verilog's low-level, finite memory model. \JW{I think ``Verilog's memory model'' is a little misleading, because the memory model isn't part of Verilog. Can we say ``CompCert's memory model is mapped onto a finite Verilog array''?}
\item In Section~\ref{sec:proof}, we describe how we proved the correctness theorem. The proof follows standard \compcert{} techniques -- forward simulations, intermediate specifications, and determinism results -- but we encountered several challenges peculiar to our hardware-oriented setting. %\NR{`specific' is better than `peculiar'?} %JW: I think this is a nice use of peculiar. Note that it means `distinctive' here, not `weird' -- the third meaning at https://www.dictionary.com/browse/peculiar
- These include handling discrepancies between byte- and word-addressable memories which other \compcert{} back ends do not change, different handling of unsigned comparisons between C and Verilog, correctly mapping \compcert{}'s infinite memory model onto a finite Verilog array and finally correctly rearranging memory reads and writes so that these behave properly as a RAM in hardware.
+ These include handling discrepancies between the byte-addressed memory assumed by the input software and the word-addressed memory that we implement in the output hardware, %which other \compcert{} back ends do not change
different handling of unsigned comparisons between C and Verilog, correctly mapping \compcert{}'s infinite memory model onto a finite Verilog array \JW{Already mentioned that in the bullet above. Should remove one occurrence.}, and carefully implementing memory reads and writes so that these behave properly as a RAM in hardware.
\item In Section~\ref{sec:evaluation}, we evaluate \vericert{} on the \polybench{} benchmark suite~\cite{polybench}, and compare the performance of our generated hardware against an existing, unverified HLS tool called \legup{}~\cite{canis11_legup}. We show that \vericert{} generates hardware that is \slowdownOrig$\times$ slower (\slowdownDiv$\times$ slower in the absence of division) and \areaIncr$\times$ larger than that generated by \legup{}. This performance gap can be largely attributed to \vericert{}'s current lack of support for instruction-level parallelism and the absence of an efficient, pipelined division operator. We intend to close this gap in the future by introducing (and verifying) HLS optimisations of our own, such as scheduling and memory analysis. %\NR{Question rather than comment: Will there be verification issues to add support for hard IPs like division blocks?}\YH{Not really any issues, just many different levels of reliability. You don't have to prove IP correct, but theoretically could.}
\end{itemize}
%\JW{This sentence seems pretty good to me; is it up-to-date with the latest `challenges' you've faced?}
diff --git a/limitations.tex b/limitations.tex
index 1979b36..f64167c 100644
--- a/limitations.tex
+++ b/limitations.tex
@@ -4,7 +4,7 @@ There are various limitations in \vericert{} compared to other HLS tools due to
\paragraph{Lack of instruction-level parallelism}
-The main limitation of \vericert{} is that it does not perform instruction scheduling, meaning that instructions cannot be gathered into the same state and executed in parallel. However, the design of the intermediate languages in \vericert{} take this optimisation into account and are designed to support scheduling in the future. For instance, our HTL language allows arbitrary Verilog to appear in each state of the FSMD, including parallel assignments to registers. Our plan for adding scheduling support involves adding a new intermediate language after 3AC, tentatively called 3ACPar. This would be similar to 3AC but rather than mapping program counters to instructions, it would map program counters to \emph{lists} of instructions that can all be executed in parallel. The translation from 3AC to 3ACPar would be performed by a scheduling tool. Following \cite{tristan08_formal_verif_trans_valid} and \citet{six+20}, we expect to use translation validation to verify that each generated schedule is correct (rather than verifying the scheduling tool itself). The translation from 3ACPar to HTL would not change conceptually, except for the fact that multiple instructions can now be translated into the same state.
+The main limitation of \vericert{} is that it does not perform instruction scheduling, meaning that instructions cannot be gathered into the same state and executed in parallel. However, the intermediate languages in \vericert{} take this optimisation into account and are designed to support scheduling in the future. For instance, our HTL language allows arbitrary Verilog to appear in each state of the FSMD, including parallel assignments to registers, as the sketch below illustrates. Our plan for adding scheduling support involves adding a new intermediate language after 3AC, tentatively called 3ACPar. This would be similar to 3AC but rather than mapping program counters to instructions, it would map program counters to \emph{lists} of instructions that can all be executed in parallel. The translation from 3AC to 3ACPar would be performed by a scheduling tool. Following \citet{tristan08_formal_verif_trans_valid} and \citet{six+20}, we expect to use translation validation to verify that each generated schedule is correct (rather than verifying the scheduling tool itself). The translation from 3ACPar to HTL would not change conceptually, except for the fact that multiple instructions can now be translated into the same state.
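+For example, two independent 3AC instructions scheduled into the same FSMD state would simply become parallel nonblocking assignments within a single clock cycle (a sketch of what 3ACPar would enable):
+\begin{minted}{verilog}
+8'd5: begin      // one FSMD state, executed in a single cycle
+  x1 <= a + b;   // both assignments happen in parallel,
+  x2 <= c << 2;  // as they have no data dependency
+end
+\end{minted}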
%To simplify the proof of the scheduling algorithm, and to minimise the changes necessary for the current translation from 3AC to HTL, a new language must be introduced, called 3ACPar, which would be equivalent to 3AC but instead of consisting of a map from program counter to instruction, it would consist of a map from program counter to list of instructions, which can all be executed in parallel. A new proof for the scheduling algorithm would have to be written for the translation from 3AC to 3ACPar, for which a verified translation validation approach might be appropriate, however, the translation form 3ACPar to HTL would not change conceptually, except for the fact that multiple instructions can now be translated into the same state. This small difference means that most of the proof can be reused without any changes, as the translation of instructions was the main body of the proof and did not change.
diff --git a/main.tex b/main.tex
index b3f9871..8fbb630 100644
--- a/main.tex
+++ b/main.tex
@@ -1,5 +1,5 @@
%% For double-blind review submission, w/o CCS and ACM Reference (max submission space)
-\documentclass[acmsmall,10pt,pagebackref=true]{acmart}\settopmatter{printfolios=true,printccs=false,printacmref=false}
+\documentclass[acmsmall,10pt,anonymous,review,pagebackref=true]{acmart}\settopmatter{printfolios=true,printccs=false,printacmref=false}
%\documentclass[pagebackref=true,acmsmall,review]{acmart}\settopmatter{printfolios=true,printccs=false,printacmref=false}
%% For double-blind review submission, w/ CCS and ACM Reference
%\documentclass[acmsmall,review,anonymous]{acmart}\settopmatter{printfolios=true}
@@ -57,7 +57,7 @@
\usemintedstyle{manni}
\newif\ifANONYMOUS
-\ANONYMOUSfalse
+\ANONYMOUStrue
\newif\ifCOMMENTS
\COMMENTStrue
@@ -142,7 +142,7 @@
\begin{abstract}
High-level synthesis (HLS), which refers to the automatic compilation of software into hardware, is rapidly gaining popularity. In a world increasingly reliant on application-specific hardware accelerators, HLS promises hardware designs of comparable performance and energy efficiency to those coded by hand in a hardware description language such as Verilog, while maintaining the convenience and the rich ecosystem of software development. However, current HLS tools cannot always guarantee that the hardware designs they produce are equivalent to the software they were given, thus undermining any reasoning conducted at the software level. Furthermore, there is mounting evidence that existing HLS tools are quite unreliable, sometimes generating wrong hardware or crashing when given valid inputs.
- To address this problem, we present the first HLS tool that is mechanically verified to preserve the behaviour of its input software. Our tool, called \vericert{}, extends the \compcert{} verified C compiler with a new hardware-oriented intermediate language and a Verilog back end, and has been proven correct in Coq. \vericert{} supports most C constructs, including all integer operations, function calls, local arrays, structs, unions and general control-flow statements. An evaluation on the PolyBench/C benchmark suite indicates that \vericert{} generates hardware that is around an order of magnitude slower (only around 2$\times$ slower in the absence of division) and about the same size as hardware generated by an existing, optimising (but unverified) HLS tool.
+ To address this problem, we present the first HLS tool that is mechanically verified to preserve the behaviour of its input software. Our tool, called \vericert{}, extends the \compcert{} verified C compiler with a new hardware-oriented intermediate language and a Verilog back end, and has been proven correct in Coq. \vericert{} supports most C constructs, including all integer operations, function calls, local arrays, structs, unions, and general control-flow statements. An evaluation on the \polybench{} benchmark suite indicates that \vericert{} generates hardware that is around an order of magnitude slower (only around 2$\times$ slower in the absence of division) and about the same size as hardware generated by an existing, optimising (but unverified) HLS tool.
\end{abstract}
%% 2012 ACM Computing Classification System (CSS) concepts
diff --git a/proof.tex b/proof.tex
index 4566619..b0962de 100644
--- a/proof.tex
+++ b/proof.tex
@@ -2,6 +2,20 @@
Now that the Verilog semantics have been adapted to the CompCert model, we are in a position to formally prove the correctness of our C-to-Verilog compilation. This section describes the main correctness theorem that was proven and the main ideas behind the proof. The full Coq proof is available in auxiliary material.
+\subsection{Main challenges in the proof}
+
+The proof of correctness of the Verilog back end is quite different from the usual proofs performed in CompCert, mainly because of the difference in Verilog semantics compared to the standard CompCert intermediate languages and because of the translation of the memory model.
+
+\begin{itemize}
+\item First, because the memory model in our Verilog semantics is finite and concrete, but the CompCert memory model is more abstract and infinite, albeit with bounds on each allocated block, the equivalence of both these models needs to be proven. Moreover, our memory is word-addressed for efficiency reasons, whereas CompCert's memory is byte-addressed. \JW{This point has been made a couple of times already by now. I think it's ok to say it again briefly here, but I'd be tempted to acknowledge that it's repetitive by prepending it with something like ``As already mentioned in Section blah,'' }
+
+\item Second, the Verilog semantics operates quite differently from the usual intermediate languages in CompCert. All the CompCert intermediate languages use a map from control-flow nodes to instructions. An instruction can therefore be selected using an abstract program pointer. Meanwhile, in the Verilog semantics the whole design is executed at every clock cycle, because hardware is inherently parallel. The program pointer is part of the design as well, not just part of an abstract state. This makes the semantics of Verilog simpler, but comparing it to the semantics of 3AC becomes more challenging, as one has to map the abstract notion of the state to concrete values in registers.
+\end{itemize}
+
+Together, these differences mean that translating 3AC directly to Verilog is infeasible, as the gap between the two semantics is too large. Instead, a new intermediate language needs to be introduced, called HTL, which bridges this semantic gap. HTL still consists of maps, like many of the other CompCert languages; however, each state corresponds to a Verilog statement. \JW{This is good text, but the problem is that it reads like you're introducing HTL here for the first time. In fact, the reader has already encountered HTL in Section 2. So this needs acknowledging.}
+
+\subsection{Formulating the correctness theorem}
+
The main correctness theorem is analogous to that stated in \compcert{}~\cite{leroy09_formal_verif_realis_compil}: for all Clight source programs $C$, if the translation to the target Verilog code succeeds, and $C$ has safe observable behaviour $B$ when executed, then the target Verilog code will have the same behaviour $B$. Here, a `safe' execution is one that either converges or diverges, but does not ``go wrong''. If the program does admit some wrong behaviour (like undefined behaviour in C), the correctness theorem does not apply. A behaviour, then, is either a final state (in the case of convergence) or divergence. In \compcert{}, a behaviour is also associated with a trace of I/O events, but since external function calls are not supported in \vericert{}, this trace will always be empty. This correctness theorem is also appropriate for HLS \JW{Perhaps it would be worth first explaining why somebody might think this correctness theorem might \emph{not} be appropriate for HLS. At the moment, it feels like you're giving the answer without saying the question. Is it to do with the fact that hardware tends to run forever?}, as HLS is often used as a part of a larger hardware design that is connected together using a hardware description language like Verilog. This means that HLS designs are normally triggered multiple times, with a result returned each time the computation terminates, which is precisely the property that the correctness theorem states. Note that the compiler is allowed to fail and not produce any output; the correctness theorem only applies when the translation succeeds.
%The following `backwards simulation' theorem describes the correctness theorem, where $\Downarrow$ stands for simulation and execution respectively.
@@ -16,20 +30,9 @@ The main correctness theorem is analogous to that stated in \compcert{}~\cite{le
The theorem is a `backwards simulation' result (from target to source), following the terminology used in the \compcert{} literature~\cite{leroy09_formal_verif_realis_compil}. The theorem does not demand the `if' direction too, because compilers are permitted to resolve any non-determinism present in their source programs.
In practice, Clight programs are all deterministic, as are the Verilog programs in the fragment we consider. This means that we can prove the correctness theorem above by first inverting it to become a forwards simulation result, following standard \compcert{} practice.
-Furthermore, to prove the forward simulation, it suffices to prove forward simulations between each intermediate language, as these results can be composed to prove the correctness of the whole HLS tool.
+Furthermore, to prove the forward simulation, it suffices to prove forward simulations between each pair of consecutive intermediate languages, as these results can be composed to prove the correctness of the whole HLS tool.
The forward simulation from 3AC to HTL is stated in Lemma~\ref{lemma:htl} (Section~\ref{sec:proof:3ac_htl}), the forward simulation for the RAM insertion in Lemma~\ref{lemma:htl_ram} (Section~\ref{sec:proof:ram_insertion}), and the forward simulation between HTL and Verilog in Lemma~\ref{lemma:verilog} (Section~\ref{sec:proof:htl_verilog}); finally, the proof that Verilog is deterministic is given in Lemma~\ref{lemma:deterministic} (Section~\ref{sec:proof:deterministic}).
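+Schematically, with each arrow denoting a forward simulation (the middle arrow is the HTL-to-HTL RAM insertion pass), the composition can be pictured as:
+\begin{equation*}
+ \text{3AC} \xrightarrow{\;\text{Lemma}~\ref{lemma:htl}\;} \text{HTL} \xrightarrow{\;\text{Lemma}~\ref{lemma:htl_ram}\;} \text{HTL} \xrightarrow{\;\text{Lemma}~\ref{lemma:verilog}\;} \text{Verilog}
+\end{equation*}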
-\subsection{Main challenges in the proof}
-
-The proof of correctness of the Verilog back end is quite different to the usual proofs performed in CompCert, mainly because of the difference in Verilog semantics compared to the standard CompCert intermediate languages and because of the translation of the memory model.
-
-First, because the memory model in our Verilog semantics is finite and concrete, but the CompCert memory model is more abstract and infinite with additional bounds, the equivalence of both these models needs to be proven. Moreover, our memory is word-addressed for efficiency reasons, whereas CompCert's memory is byte-addressed.
-
-Second, the Verilog semantics operates quite differently to the usual intermediate languages in the backend. All the CompCert intermediate languages use a map from control-flow nodes to instructions. An instruction can therefore be selected using an abstract program pointer. On the other hand, in the Verilog semantics the whole design is executed at every clock cycle, because hardware is inherently parallel. The program pointer is part of the design as well, not just part of an abstract state. This makes the semantics of Verilog simpler, but comparing it to the semantics of 3AC becomes more challenging, as one has to map the abstract notion of the state to concrete values in registers.
-
-Both these differences mean that translating 3AC directly to Verilog is infeasible, as the differences in the semantics is too large. Instead, a new intermediate language needs to be introduced, called HTL, which bridges the gap in the semantics between the two languages. HTL still consists of maps, like many of the other CompCert languages, however, each state corresponds to a Verilog statement.
-
-
\subsection{Forward simulation from 3AC to HTL}\label{sec:proof:3ac_htl}
As HTL is quite far removed from 3AC, this first translation is the most involved and therefore requires a larger proof, because the translation from 3AC instructions to Verilog statements needs to be proven correct in this step. In addition, the semantics of HTL are quite different to the 3AC semantics: instead of defining small-step rules for each construct, the semantics are defined over one clock cycle and mirror the semantics defined for Verilog. Lemma~\ref{lemma:htl} states the result that needs to be proven in this subsection.
diff --git a/references.bib b/references.bib
index 44e8be8..1100fef 100644
--- a/references.bib
+++ b/references.bib
@@ -582,6 +582,7 @@ author = {Gauthier, Stephane and Wadood, Zubair},
title = {High-Level Synthesis:
Can it outperform
hand-coded HDL?},
+note = {White paper},
url = {https://bit.ly/2IDhKBR},
year = {2020},
}
diff --git a/related.tex b/related.tex
index 29c335d..80b54e5 100644
--- a/related.tex
+++ b/related.tex
@@ -44,12 +44,12 @@
A summary of the related works can be found in Figure~\ref{fig:related_euler}, which is represented as an Euler diagram. The categories chosen for the Euler diagram are: whether the tool is usable, whether it takes a high-level software language as input, whether it has a correctness proof, and finally whether that proof is mechanised. The goal of \vericert{} is to cover all of these categories.
-Most practical HLS tools~\citep{canis11_legup,xilinx20_vivad_high_synth,intel20_sdk_openc_applic,nigam20_predic_accel_desig_time_sensit_affin_types} fit into the category of usable tools that take high-level inputs. On the other spectrum, there are tools such as BEDROC~\citep{chapman92_verif_bedroc} for which there is no practical tool, and even though it is described as high-level synthesis, it more closely resembles today's hardware synthesis tools.
+Most practical HLS tools~\citep{canis11_legup,xilinx20_vivad_high_synth,intel20_sdk_openc_applic,nigam20_predic_accel_desig_time_sensit_affin_types} fit into the category of usable tools that take high-level inputs. At the other end of the spectrum are approaches such as BEDROC~\citep{chapman92_verif_bedroc} for which no practical tool exists, and which, despite being described as high-level synthesis, more closely resemble today's hardware synthesis tools.
Ongoing work in translation validation~\citep{pnueli98_trans} seeks to prove equivalence between the hardware generated by an HLS tool and the original behavioural description in C. An example of a tool that implements this is Mentor's Catapult~\citep{mentor20_catap_high_level_synth}, which tries to match the states in the 3AC description to states in the original C code after an unverified translation. Translation validation is quite effective for verifying complex optimisations such as scheduling~\citep{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth} or code motion~\citep{banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops}, but the validation has to be run every time HLS is performed. Moreover, the proofs are often neither mechanised nor directly related to the actual implementation, meaning the validating algorithm itself might be wrong and could give false positives or false negatives.
-Finally, there are a few relevant mechanically verified tools. First, K\^{o}ika is a formally verified translation from a core fragment of BlueSpec into a circuit representation which can then be printed as a Verilog design. This is a translation from a high-level hardware description language into an equivalent circuit representation, so is a different approach to HLS. \citet{loow19_proof_trans_veril_devel_hol} used a verified translation from HOL4 code describing state transitions into Verilog to design a verified processor, which is described in \citet{loow19_verif_compil_verif_proces}. In addition to that, there is also work on formally verifying a synthesis tool to transform, which can transform hardware descriptions into low-level netlists~\cite{10.1145/3437992.3439916}. Their approach translated a shallow embedding in HOL4 into a deep embedding of Verilog.
-\citet{perna12_mechan_wire_wise_verif_handel_c_synth,perna11_correc_hardw_synth} designed a formally verified translation from a deep embedding of Handel-C~\citep{aubury1996handel}, which is translated to a deep embedding of a circuit.
+Finally, there are a few relevant mechanically verified tools. First, K\^{o}ika provides a formally verified translation from a core fragment of BlueSpec into a circuit representation, which can then be printed as a Verilog design. This is a translation from a high-level hardware description language into an equivalent circuit representation, and so takes a different approach from HLS. \citet{loow19_proof_trans_veril_devel_hol} used a verified translation from HOL4 code describing state transitions into Verilog to design a verified processor, which is described by \citet{loow19_verif_compil_verif_proces}. \citet{10.1145/3437992.3439916} has also worked on formally verifying a synthesis tool that can transform hardware descriptions into low-level netlists; this approach translates a shallow embedding in HOL4 into a deep embedding of Verilog.
+Perna et al. designed a formally verified translation from a deep embedding of Handel-C~\citep{aubury1996handel} into a deep embedding of a circuit~\cite{perna12_mechan_wire_wise_verif_handel_c_synth,perna11_correc_hardw_synth}.
Finally, \citet{ellis08} used Isabelle to implement and reason about intermediate languages for software/hardware compilation, where parts could be implemented in hardware and the correctness could still be shown.
%%% Local Variables:
diff --git a/verilog.tex b/verilog.tex
index c169aaf..21fadfd 100644
--- a/verilog.tex
+++ b/verilog.tex
@@ -8,7 +8,7 @@ The Verilog semantics we use is ported to Coq from a semantics written in HOL4 b
This semantics is quite practical as it is restricted to a small subset of Verilog, which can nonetheless be used to model the hardware constructs required for HLS. The main features that are excluded are continuous assignment and combinational \alwaysblock{}s; these are modelled in other semantics such as that by~\citet{meredith10_veril}. %however, these are not necessarily needed, but require more complicated event queues and execution model.
The semantics of Verilog differs from regular programming languages, as it is used to describe hardware directly, which is inherently parallel, rather than an algorithm, which is usually sequential. The main construct in Verilog is the \alwaysblock{}.
-A module can contain multiple \alwaysblock{}s, all of which run in parallel. These \alwaysblock{}s further contain statements such as if-statements or assignments to variables. We support only \emph{synchronous} logic, which means that the \alwaysblock{} is triggered on (and only on) the rising or falling edge of a clock signal.
+A module can contain multiple \alwaysblock{}s, all of which run in parallel. These \alwaysblock{}s further contain statements such as if-statements or assignments to variables. We support only \emph{synchronous} logic, which means that the \alwaysblock{} is triggered on (and only on) the positive or negative edge of a clock signal.
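+For example, the following small module (illustrative only; names and widths are ours) contains two \alwaysblock{}s that execute in parallel, one triggered on each edge of the clock:
+\begin{verbatim}
+module example(input clk, input [7:0] a,
+               output reg [7:0] b, output reg [7:0] c);
+  // Both blocks are synchronous and run concurrently.
+  always @(posedge clk) b <= a + 1;
+  always @(negedge clk) c <= a - 1;
+endmodule
+\end{verbatim}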
%\NR{We should mention that variables cannot be driven by multiple \alwaysblock{}s, since one might get confused with data races when relating to concurrent processes in software.} \JW{Given the recent discussion on Teams, it seems to me that we perhaps don't need to mention here what happens if a variable is driven multiple times per clock cycle, especially since \vericert{} isn't ever going to do that.}
The semantics combines the big-step and small-step styles. The overall execution of the hardware is described using a small-step semantics, with one small step per clock cycle; this is appropriate because hardware is routinely designed to run for an unlimited number of clock cycles and the big-step style is ill-suited to describing infinite executions. Then, within each clock cycle, a big-step semantics is used to execute all the statements.
@@ -51,13 +51,13 @@ always @(posedge clk) begin x[0] = 1; x[1] <= 1; end
which modifies one array element using blocking assignment and then a second using nonblocking assignment. If the existing semantics were used to update the array, then during the merge, the entire array \texttt{x} from the nonblocking association map would replace the entire array from the blocking association map. This would replace \texttt{x[0]} with its original value and therefore behave incorrectly. Accordingly, we modified the maps so they record updates on a per-el\-em\-ent basis. Our state $\Gamma$ is therefore split up into $\Gamma_{r}$ for instantaneous updates to variables, and $\Gamma_{a}$ for instantaneous updates to arrays; $\Delta$ is split similarly. The merge function then ensures that only the modified indices get updated when $\Gamma_{a}$ is merged with the nonblocking map equivalent $\Delta_{a}$.
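+To illustrate, the per-element merge can be sketched as follows (the precise Coq definition may differ):
+\begin{equation*}
+ (\Gamma_{a}\ //\ \Delta_{a})(x)[i] =
+ \begin{cases}
+ \Delta_{a}(x)[i] & \text{if } \Delta_{a}(x)[i] \neq \yhconstant{None}\\
+ \Gamma_{a}(x)[i] & \text{otherwise}
+ \end{cases}
+\end{equation*}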
\paragraph{Adding negative edge support}
-To support memory inference efficiently and create and reason about a circuit that executes at the negative edge of the clock, support for the negative edge triggered \alwaysblock{}s was added to the semantics. This is shown in the modifications of the \textsc{Module} rule shown below:
+To reason about circuits that execute on the negative edge of the clock (such as our RAM interface described in Section~\ref{sec:algorithm:optimisation:ram}), support for negative-edge-triggered \alwaysblock{}s was added to the semantics. This is shown in the modifications of the \textsc{Module} rule shown below:
\begin{equation*}
\inferrule[Module]{(\Gamma, \epsilon, \vec{m})\ \downarrow_{\text{module}^{+}} (\Gamma', \Delta') \\ (\Gamma'\ //\ \Delta', \epsilon, \vec{m}) \downarrow_{\text{module}^{-}} (\Gamma'', \Delta'')}{(\Gamma, \yhkeyword{module } \yhconstant{main} \yhkeyword{(...);}\ \vec{m}\ \yhkeyword{endmodule}) \downarrow_{\text{program}} (\Gamma''\ //\ \Delta'')}
\end{equation*}
-The main execution of the module $\downarrow_{\text{module}}$ is split into $\downarrow_{\text{module}^{+}}$ and $\downarrow_{\text{module}^{-}}$, which are rules that only execute always blocks triggered at the positive and at the negative edge respectively. As in the initial \textsc{Module} rule, the positive edge triggered \alwaysblock{}s are processed in the same way. The output maps $\Gamma'$ and $\Delta'$ are then merged and passed as the blocking assignments map into the negative edge execution, so that all the blocking and nonblocking assignments are present. Finally, all the negative edge triggered \alwaysblock{}s are processed and merged to give the final state.
+The main execution of the module $\downarrow_{\text{module}}$ is split into $\downarrow_{\text{module}^{+}}$ and $\downarrow_{\text{module}^{-}}$, which are rules that only execute \alwaysblock{}s triggered at the positive and at the negative edge respectively. The positive-edge-triggered \alwaysblock{}s are processed in the same way as in the original \textsc{Module} rule. The output maps $\Gamma'$ and $\Delta'$ are then merged and passed as the blocking assignments map into the negative edge execution, so that all the blocking and nonblocking assignments are present. Finally, all the negative-edge-triggered \alwaysblock{}s are processed and merged to give the final state.
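+As an illustrative sketch of the kind of circuit this rule enables (the signal names u_en, en, wr_en, d_in and d_out are ours, not necessarily those generated by \vericert{}), a RAM serving requests on the negative edge might be written as:
+\begin{verbatim}
+module ram(input clk, input u_en, input wr_en,
+           input [5:0] addr, input [31:0] d_in,
+           output reg [31:0] d_out);
+  reg en;                            // last acknowledged request
+  reg [31:0] stk [0:63];
+  always @(negedge clk)
+    if (u_en != en) begin            // a toggled u_en signals a new request
+      if (wr_en) stk[addr] <= d_in;  // write request
+      else d_out <= stk[addr];       // read request
+      en <= u_en;                    // acknowledge the request
+    end
+endmodule
+\end{verbatim}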
\paragraph{Adding declarations} Explicit support for declaring inputs, outputs and internal variables was added to the semantics to make sure that the generated Verilog also contains the correct declarations; this guarantees that the generated design synthesises and simulates correctly.
@@ -99,13 +99,13 @@ To support this computational model, we extend the Verilog module we generate wi
\begin{description}
\item[program counter] The program counter can be modelled using a register that keeps track of the state, denoted as $\sigma$.
\item[function entry point] When a function is called, the entry point denotes the first instruction that will be executed. This can be modelled using a reset signal that sets the state accordingly, denoted as \textit{rst}.
- \item[return value] The return value can be modelled by setting a finished flag to 1 when the result is ready, and putting the result into a 32-bit output register. These are denoted as \textit{fin} and \textit{rtrn} respectively.
- \item[stack] The function stack can be modelled as a RAM block, which is implemented using an array in the module, and denoted as \textit{stk}.
+ \item[return value] The return value can be modelled by setting a finished flag to 1 when the result is ready, and putting the result into a 32-bit output register. These are denoted as \textit{fin} and \textit{rtrn} respectively. \JW{Is there a mismatch between `ret' in the figure and `rtrn' in the text?}
+ \item[stack] The function stack can be modelled as a RAM block, which is implemented using an array in the module, and denoted as \textit{stk}. \JW{Is there a mismatch between `st' in the figure and `stk' in the text?}
\end{description}
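+Putting these together, the skeleton of a generated module might look roughly as follows (this sketch is illustrative: the register names follow the text, the program counter $\sigma$ is called state here, and the actual output of \vericert{} differs in detail):
+\begin{verbatim}
+module main(input clk, input rst,
+            output reg fin, output reg [31:0] rtrn);
+  reg [31:0] stk [0:31];  // function stack modelled as a RAM block
+  reg [31:0] state;       // program counter register (sigma)
+  always @(posedge clk)
+    if (rst) begin state <= 32'd0; fin <= 1'b0; end // reset to entry point
+    else case (state)
+      32'd0: begin rtrn <= 32'd3; fin <= 1'b1; end  // return 3, set finished flag
+      default: ;
+    endcase
+endmodule
+\end{verbatim}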
Figure~\ref{fig:inference_module} shows the inference rules for moving between the computational states. The first, \textsc{Step}, is the normal rule of execution. It defines one step in the \texttt{State} state, assuming that the module is not being reset, that the finish state has not been reached yet, that the current and next state are $v$ and $v'$, and that the module runs from state $\Gamma$ to $\Gamma'$ using the \textsc{Module} rule defined above. The \textsc{Finish} rule returns the final value of running the module and is applied when the \textit{fin} register is set; the return value is then taken from the \textit{ret} register.
-Note that there is no step from \texttt{State} to \texttt{Callstate}; this is because function calls are not supported, and it is therefore impossible in our semantics to ever reach a \texttt{Callstate} except for the initial call to main. So the \textsc{Call} rule is only used at the very beginning of execution; likewise, the \textsc{Return} rule is only matched for the final return value from the main function. %as there is no rule that allocates another stack frame \textit{sf} except for the initial call to main.
+Note that there is no step from \texttt{State} to \texttt{Callstate}; this is because function calls are not supported, and it is therefore impossible in our semantics ever to reach a \texttt{Callstate} except for the initial call to main. So the \textsc{Call} rule is only used at the very beginning of execution; likewise, the \textsc{Return} rule is only matched for the final return value from the main function. %as there is no rule that allocates another stack frame \textit{sf} except for the initial call to main.
Therefore, in addition to the rules shown in Figure~\ref{fig:inference_module}, an initial state and final state need to be defined:
\begin{gather*}
@@ -117,7 +117,7 @@ Therefore, in addition to the rules shown in Figure~\ref{fig:inference_module},
\subsection{Memory Model}\label{sec:verilog:memory}
-The Verilog semantics do not define a memory model for Verilog, as this is not needed for a hardware description language. There is no preexisting architecture that Verilog will produce, it can describe any memory layout that is needed. Instead of having specific semantics for memory, the semantics only needs to support the language features that can produce these different memory layouts, these being Verilog arrays. We therefore define semantics for updating Verilog arrays using blocking and nonblocking assignment. We then have to prove that the C memory model that \compcert{} uses matches with the interpretation of arrays that are used in Verilog. The \compcert{} memory model is infinite, whereas our representation of arrays in Verilog is inherently finite. There have already been various efforts to define a finite memory model for all compiler passes in \compcert{}, such as Comp\-Cert\-S~\cite{besson18_compc}, Comp\-Cert\-ELF~\cite{wang20_compc} and Comp\-Cert\-TSO~\cite{sevcik13_compc}, however, we define the translation from \compcert{}'s standard infinite memory model to finitely sized arrays that can be represented in Verilog, leaving the compiler passes intact.
+The Verilog semantics does not define a memory model for Verilog, as this is not needed for a hardware description language. There is no preexisting architecture that Verilog will produce; it can describe any memory layout that is needed. Instead of having specific semantics for memory, the semantics only needs to support the language features that can produce these different memory layouts, namely Verilog arrays. We therefore define semantics for updating Verilog arrays using blocking and nonblocking assignment. We then have to prove that the C memory model that \compcert{} uses matches the interpretation of arrays used in Verilog. The \compcert{} memory model is infinite, whereas our representation of arrays in Verilog is inherently finite. There have already been various efforts to define a finite memory model for all compiler passes in \compcert{}, such as Comp\-Cert\-S~\cite{besson18_compc}, Comp\-Cert\-ELF~\cite{wang20_compc} and Comp\-Cert\-TSO~\cite{sevcik13_compc}. We take a different approach: the front-end passes keep \compcert{}'s standard infinite memory model, and we only translate to finitely sized arrays that can be represented in Verilog when moving from 3AC to HTL, leaving the existing compiler passes intact.
\begin{figure}
\centering
@@ -193,7 +193,7 @@ The Verilog semantics do not define a memory model for Verilog, as this is not n
%\JW{It's not completely clear what the relationship is between your work and those works. The use of `only' suggests that you've re-done a subset of work that has already been done -- is that the right impression?}\YH{Hopefully that's more clear.}
-This translation is represented in Figure~\ref{fig:memory_model_transl}, where \compcert{} defines a map from blocks to maps from memory address to memory contents. Each block represents an area in memory, for example, a block can represent a global variable or a stack for a function. Instead, our Verilog semantics define two finitely sized arrays of optional values, one for the blocking assignments map $\Gamma_{\rm a}$ and one for the nonblocking assignments map $\Delta_{\rm a}$. The optional values are present to ensure correct merging of the two association maps at the end of the clock cycle. During our translation we only convert block 0 to a Verilog memory, and ensure that it is the only block that is present. This means that the block necessarily represents the stack of the main function. The invariant that then has to hold in the proofs, is that block 0 should be equivalent to the merged representation of the $\Gamma_{\rm a}$ and $\Delta_{\rm a}$ maps.
+This translation is represented in Figure~\ref{fig:memory_model_transl}. \compcert{} defines a map from blocks to maps from memory addresses to memory contents. Each block represents an area in memory; for example, a block can represent a global variable or a stack frame for a function. \JW{So the stack frame for a function called by main would be in a different block, is that the idea? Seems unusual not to have a single stack.} \JW{Would global variables normally be put in blocks 1, 2, etc.?} Meanwhile, our Verilog semantics defines two finite arrays of optional values, one for the blocking assignments map $\Gamma_{\rm a}$ and one for the nonblocking assignments map $\Delta_{\rm a}$. The optional values are present to ensure correct merging of the two association maps at the end of the clock cycle. During our translation we convert only block 0 to a Verilog memory, and ensure that it is the only block present; the block therefore necessarily represents the stack of the main function. The invariant that then has to hold in the proofs is that block 0 should be equivalent to the merged representation of the $\Gamma_{\rm a}$ and $\Delta_{\rm a}$ maps.
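+This invariant might be sketched as follows (notation ours), where $\mathit{Mem}$ is the \compcert{} memory, $b_{0}$ is block 0, and the division by four again reflects the byte- versus word-addressing mismatch:
+\begin{equation*}
+ \forall\, o,\ \mathit{Mem}(b_{0})(o) = (\Gamma_{\rm a}\ //\ \Delta_{\rm a})(\mathit{stk})[o/4]
+\end{equation*}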
%However, in practice, assigning and reading from an array directly in the state machine will not produce a memory in the final hardware design, as the synthesis tool cannot identify the array as having the necessary properties that a RAM needs, even though this is the most natural formulation of memory. Even though theoretically the memory will only be read from once per clock cycle, the synthesis tool cannot ensure that this is true, and will instead create a register for each memory location. This increases the size of the circuit dramatically, as the RAM on the FPGA chip will not be reused. Instead, the synthesis tool expects a specific interface that ensures these properties, and will then transform the interface into a proper RAM during synthesis. Therefore, a translation has to be performed from the naive use of memory in the state machine, to a proper use of a memory interface.