-rw-r--r--  algorithm.tex    | 12
-rw-r--r--  evaluation.tex   |  5
-rw-r--r--  introduction.tex |  6
-rw-r--r--  limitations.tex  | 16
-rw-r--r--  main.tex         |  2
-rw-r--r--  proof.tex        | 15
-rw-r--r--  related.tex      |  2
7 files changed, 30 insertions(+), 28 deletions(-)
diff --git a/algorithm.tex b/algorithm.tex
index 66fb788..53ffaaa 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -302,7 +302,7 @@ A high-level overview of the architecture and of the RAM interface can be seen i
\paragraph{Translating instructions}
-Each 3AC instruction either corresponds to a hardware construct or does not have to be handled by the translation, such as function calls (because of inlining). \JW{Are function calls the only 3AC instruction that we ignore? (And I guess return statements too for the same reason.)}
+Each 3AC instruction either corresponds to a hardware construct or does not have to be handled by the translation, such as function calls (because of inlining). \JW{Are function calls the only 3AC instruction that we ignore? (And I guess return statements too for the same reason.)}\YH{Actually, return instructions are translated (because you can return from main whenever), so call instructions (Icall, Ibuiltin and Itailcall) are the only instructions that are not handled.}
For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in the Verilog generated from HTL at state 8 in both the control logic and data-path always-blocks, shown in Figure~\ref{fig:accumulator_v}. Simple operator instructions are translated in a similar way. For example, the add instruction is translated to the built-in add operator, and similarly for the multiply operator. All 32-bit instructions can be translated in this way, but some special instructions require extra care. One such instruction is \texttt{Oshrximm}, which is discussed further in Section~\ref{sec:algorithm:optimisation:oshrximm}. Another is \texttt{Oshldimm}, which is a left rotate instruction that has no Verilog equivalent and therefore has to be implemented in terms of other operations and proven to be equivalent.
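As a concrete illustration, the FSMD fragment below sketches how such an initialisation might appear in the generated Verilog (a minimal sketch only: the state number and register name follow the accumulator example, but the exact generated code is an assumption):

    module init_sketch(input clk);
      reg [31:0] state = 8;
      reg [31:0] x5;
      always @(posedge clk)      // control logic: choose the next state
        case (state)
          8: state <= 3;
          default: ;
        endcase
      always @(posedge clk)      // data-path: the state's register update
        case (state)
          8: x5 <= 32'd3;
          default: ;
        endcase
    endmodule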
\subsubsection{Translating HTL to Verilog}
@@ -330,14 +330,16 @@ However, the memory model that \compcert{} uses for its intermediate languages i
\subsubsection{Implementation of RAM interface}
The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.
-Interestingly, the syntax for the RAM interface is quite strict, as the synthesis tool will pattern-match on it and only work for a predefined set of interfaces. \JW{I think the following sentence could be cut as we've said this kind of thing a couple of times already.} Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.
+Interestingly, the syntax for the RAM interface is quite strict, as the synthesis tool will pattern-match on it and only work for a predefined set of interfaces.
+
+%\JW{I think the following sentence could be cut as we've said this kind of thing a couple of times already.} Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.
Therefore, an extra compiler pass is added from HTL to HTL to extract all the direct accesses to the Verilog array and replace them by signals that access the RAM interface in a separate always-block. The translation is performed by going through all the instructions and replacing each load and store expression in turn. Stores can simply be replaced by the necessary wires directly. Loads are a little more subtle: loads that use the RAM interface take two clock cycles where a direct load from an array takes only one, so this pass inserts an extra state after each load.
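To illustrate, the data-path side of a single load might look as follows after this pass (a sketch only; \texttt{u\_en} is taken from the text, while \texttt{addr}, \texttt{d\_out} and the state numbers are assumed names):

    module load_sketch(input clk, input [31:0] d_out);
      reg [31:0] state, addr, x9;
      reg u_en;
      always @(posedge clk)
        case (state)
          5: begin             // original load state: issue the request
            addr <= 32'd3;     //   word address to read from
            u_en <= ~u_en;     //   toggling u_en activates the RAM interface
          end                  //   (the control logic then moves to state 6)
          6: x9 <= d_out;      // inserted state: the loaded word is ready
          default: ;
        endcase
    endmodule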
%\JW{I've called that negedge always-block the `RAM driver' in my proposed text above -- that feels like quite a nice a word for it to my mind -- what do you think?}\YH{Yes I quite like it!}
%Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs. If, for example, arrays in Verilog are accessed immediately in the data-path, then the synthesis tool is not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers. This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit. The synthesis tools therefore provide code snippets that they know how to transform into various constructs, including snippets that will generate proper RAMs in the final hardware. This process is called memory inference. The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and logic. We therefore have another pass from HTL to itself which performs the translation from this na\"ive use of arrays to a representation which always allows for memory inference. This pass creates a separate always block to perform the loads and stores to the memory, and adds the necessary data, address and enable signals to communicate with that always-block from other always-blocks. This always-block is shown between lines 10-15 in Figure~\ref{fig:accumulator_v}.
-There are two interesting parts to the inserted RAM interface. Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design which is triggered on the positive edge of the clock. The main advantage is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they only take two clock cycles and one clock cycle instead, greatly improving their performance. \JW{Is this a standard `trick' in hardware design? If so it might be nice to cite it.} In addition to that, using the negative edge for the clock is supported by many synthesis tools, and does not affect the maximum frequency of the final design.
+There are two interesting parts to the inserted RAM interface. Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design which is triggered on the positive edge of the clock. The main advantage is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they take only two clock cycles and one clock cycle, greatly improving their performance. \JW{Is this a standard `trick' in hardware design? If so it might be nice to cite it.}\YH{Hmm, not really, because it has the downside of kind of halving your available clock period. However, RAMs normally come in both forms on the FPGA (Page 12, Figure 2, \url{https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug_ram_rom.pdf})} In addition to that, using the negative edge for the clock is supported by many synthesis tools, and does not affect the maximum frequency of the final design.
Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical. In existing hardware designs, enable signals are normally manually controlled and inserted into the appropriate states, by using a check like the following in the RAM:\@ \texttt{en == 1}. This means that the RAM only turns on when the enable signal is set. However, to make the proof simpler and to not have to reason about possible side effects introduced by the RAM being enabled but not used, a RAM which disables itself after every use would be ideal. One method for implementing this would be to insert an extra state after each load or store that disables the RAM accordingly, but this would eliminate the speed advantage of the negative-edge-triggered RAM. Another method would be to determine the next state after each load or store and add logic to disable the RAM in that state, but this could quickly become complicated, especially in the case where the next state contains another memory operation, and hence the disable signal should not be added. The method we ultimately chose was to have the RAM become enabled not when the enable signal is high, but when it toggles its value. This can be arranged by keeping track of the old value of the enable signal in \texttt{en} and comparing it to the current value \texttt{u\_en} set by the data-path. When the values are different, the RAM gets enabled, and then \texttt{en} is set to the value of \texttt{u\_en}. This ensures that the RAM will always be disabled directly after it was used without having to modify any extra states.
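Putting both ideas together, a RAM driver in the style described above might be sketched as follows (illustrative only; \texttt{u\_en} and \texttt{en} are from the text, while the remaining signal names and sizes are assumptions):

    module ram_driver(input clk, input u_en, input wr_en,
                      input [7:0] addr, input [31:0] d_in,
                      output reg [31:0] d_out);
      reg [31:0] mem [0:255];  // inferred as a RAM block by synthesis
      reg en;                  // registered copy of the enable signal
      always @(negedge clk)    // out of phase with the rest of the design
        if (en != u_en) begin  // enabled only when u_en has been toggled
          if (wr_en) mem[addr] <= d_in;
          else d_out <= mem[addr];
          en <= u_en;          // self-disable until the next toggle
        end
    endmodule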
@@ -397,8 +399,6 @@ Many of the \compcert{} instructions map well to hardware, but \texttt{Oshrximm}
\end{equation*}
(where $x, y \in \mathbb{Z}$, $0 \leq y < 31$, and $-2^{31} \leq x < 2^{31}$) and instantiating divider circuits in hardware is well known to cripple performance. Moreover, since \vericert{} requires the result of a divide operation to be ready within a single clock cycle, the divide circuit needs to be entirely combinational. This is inefficient not only in terms of area, but also in terms of latency, because it means that the maximum frequency of the hardware must be reduced dramatically so that the divide circuit has enough time to finish.
-\compcert{} eventually performs a similar translation \JW{I'm not sure which `translation' this is referring to.} when generating the assembly code, however, the specification of the instruction itself still uses division instead of shifts, meaning the proof of the translation cannot be reused. In \vericert{}, the equivalence of the representation in terms of divisions and shifts is proven over the integers and the specification, thereby \JW{missing word?} it simpler to prove the correctness of the Verilog implementation in terms of shifts.
-
%\JP{Multi-cycle paths might be something worth exploring in future work: fairly error-prone/dangerous for hand-written code, but might be interesting in generated code.}\YH{Definitely is on the list for next things to look into, will make divide so much more efficient.}
%These small optimisations were found to be the most error prone, and guaranteeing that the new representation is equivalent to representation used in the \compcert{} semantics is difficult without proving this for all possible inputs.
@@ -414,6 +414,8 @@ One might hope that the synthesis tool consuming our generated Verilog would con
where $\gg$ stands for a logical right shift. %Once this equivalence about the shifts and division operator is proven correct, it can be used to implement the \texttt{Oshrximm} using the efficient shift version instead of how the \compcert{} semantics described it.
When proving this equivalence, we actually found a bug in our original implementation that was due to the fact that a na\"{i}ve shift rounds towards $-\infty$.
+\compcert{} eventually performs a translation from this representation into assembly code which uses shifts to implement the division; however, the specification of the instruction itself still uses division instead of shifts, meaning the proof of the translation cannot be reused. In \vericert{}, the equivalence of the representation in terms of divisions and shifts is proven over the integers and the specification, thereby making it simpler to prove the correctness of the Verilog implementation in terms of shifts.
+
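For illustration, the standard identity for round-towards-zero division by a power of two can be written as follows (a sketch of the general identity, not necessarily the exact formulation proven in \vericert{}), where $\gg_a$ denotes an \emph{arithmetic} right shift:

\begin{equation*}
  x \div 2^y =
  \begin{cases}
    x \gg_a y             & \text{if } x \geq 0,\\
    (x + 2^y - 1) \gg_a y & \text{if } x < 0,
  \end{cases}
\end{equation*}

where adding $2^y - 1$ before shifting turns the shift's rounding towards $-\infty$ into rounding towards zero for negative operands.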
%The \compcert{} semantics for the \texttt{Oshrximm} instruction expresses its operation exactly as shown in the equation above, even though in hardware the computation that would be performed would be different. In \vericert{}, if the same operation would be implemented using Verilog operators, it is not guaranteed to be optimised correctly by the synthesis tools that convert the Verilog into a circuit. To guarantee an output that does not include divides, we therefore have to express it in Verilog using shifts, and then prove that this representation is equivalent to the divide representation used in the \compcert{} semantics. While conducting the proof, we discovered quite a few bugs in our initial implementation of optimisations, which rounded to $-\infty$ instead of 0.
%%% Local Variables:
diff --git a/evaluation.tex b/evaluation.tex
index ef43a9a..82f79d2 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -10,13 +10,14 @@ Our evaluation is designed to answer the following three research questions.
\subsection{Experimental Setup}
\label{sec:evaluation:setup}
-\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 4.0, because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}. We also compare against \legup{} with different optimisation levels in an effort to understand which optimisations have the biggest impact on the performance discrepancies between \legup{} and \vericert{}. The baseline \legup{} version has all the default automatic optimisations turned on. The benchmarks are also not manually optimised to run through \legup{} optimally, such as adding pragmas and other manual indications to add further more advanced optimisations. \vericert{} is also compared with other optimisation levels of \legup{}. First, we only turn off the LLVM optimisations in \legup{}, to eliminate all the optimisations that are common to standard software compilers, referred to as \legup{} w/o opt. Secondly, we also compare against \legup{} with LLVM optimisations and operation chaining turned off, referred to as \legup{} w/o opt+chain. Operation chaining is an HLS-specific optimisation that combines data-dependent operations into one clock cycle, and therefore dramatically reduces the number of cycles, without necessarily decreasing the clock speed.
+\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 4.0, because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}. We also compare against \legup{} with different optimisation levels in an effort to understand which optimisations have the biggest impact on the performance discrepancies between \legup{} and \vericert{}. The baseline \legup{} version has all the default automatic optimisations turned on. % \vericert{} is also compared with other optimisation levels of \legup{}. %JW: removed because we said that a couple of sentences ago.
+First, we turn off only the LLVM optimisations in \legup{}, to eliminate all the optimisations that are common to standard software compilers; this configuration is referred to as \legup{} w/o opt. Second, we also compare against \legup{} with both LLVM optimisations and operation chaining turned off, referred to as \legup{} w/o opt+chain. Operation chaining is an HLS-specific optimisation that combines data-dependent operations into one clock cycle, and therefore dramatically reduces the number of cycles, without necessarily decreasing the clock speed.
\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the \polybench{} benchmark suite (version 4.2.1)~\cite{polybench}, which is a collection of 30 numerical kernels. \polybench{} is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, requiring floats, which we do not support.
% Interestingly, we were also unable to evaluate \texttt{cholesky} on \legup{}, since it produce an error during its HLS compilation.
%In summary, we evaluate 27 programs from the latest Polybench suite.
-We configured \polybench{}'s parameters so that only integer types are used. We use \polybench{}'s smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
+We configured \polybench{}'s parameters so that only integer types are used. We use \polybench{}'s smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses. We have not modified the benchmarks to make them run through \legup{} optimally, e.g. by adding pragmas that trigger more advanced optimisations.
\vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including LegUp, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. Supporting pipelined operators such as division and modulo requires scheduling the instructions so that they can execute in parallel, which is the main optimisation that needs to be added to \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is replaced with our own implementation that uses repeated division and multiplications by 2. Figure~\ref{fig:polybench-div} shows the results of comparing \vericert{} with optimised LegUp 4.0 on the \polybench{} benchmarks, where divisions have been left intact. Figure~\ref{fig:polybench-nodiv} performs the comparison where the division/modulo operations have been replaced by the iterative algorithm.
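To illustrate the clock-frequency penalty, a single C division currently compiles to something like the following (a hedged sketch with assumed operand names), in which the entire 32-bit divide must settle within one clock period:

    module div_sketch(input clk, input [31:0] a, b,
                      output reg [31:0] q);
      always @(posedge clk)
        q <= a / b;   // fully combinational divider: its delay bounds
                      // the maximum clock frequency of the whole design
    endmodule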
diff --git a/introduction.tex b/introduction.tex
index 90ef832..2cdada9 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -29,7 +29,7 @@ Aware of the reliability shortcomings of HLS tools, hardware designers routinely
An alternative is to use \emph{translation validation}~\cite{pnueli98_trans} to prove equivalence between the input program and the output design. Translation validation has been successfully applied to several HLS optimisations~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth,banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops}.
But it is an expensive task, especially for large designs, and it must be repeated every time the compiler is invoked.
-For example, the translation validation for Catapult C~\cite{mentor20_catap_high_level_synth} may require several rounds of expert `adjustments'~\cite[p.~3]{slec_whitepaper} to the input C program before validation succeeds. And even when it succeeds, translation validation does not provide watertight guarantees unless the validator itself has been mechanically proven correct~\cite{tristan08_formal_verif_trans_valid}, \JW{which has not been the case in HLS tools to date}.
+For example, the translation validation for Catapult C~\cite{mentor20_catap_high_level_synth} may require several rounds of expert `adjustments'~\cite[p.~3]{slec_whitepaper} to the input C program before validation succeeds. And even when it succeeds, translation validation does not provide watertight guarantees unless the validator itself has been mechanically proven correct~\cite{tristan08_formal_verif_trans_valid}, which has not been the case in HLS tools to date.
%\NR{There is also use of the word `input' in this paragraph for a different context.} %JW: Yes it was used in two different ways in two consecutive paragraphs. Thanks, fixed now.
%\JW{Having nuanced our discussion of TV above, I feel like the text below belongs more in a `future directions' paragraph at the end of the paper than in an `existing workarounds' section.} Nevertheless translation validation has many benefits in a mechanically verified setting as well to simplify the proofs to depend less on the exact implementation of the optimisation. It has also already been used to prove certain passes in \compcert{} correct. The main issue with the translation validation methods applied in HLS tools normally is that they \NR{\sout{try and}} generalise over all optimisations that are performed and \NR{\sout{try to}} compare the generated hardware directly to the high-level input. \NR{The word input used here again.} However, verification requires optimisations to be proven correct incrementally and separately, making translation validation more viable. By proving specific optimisations with a constraint on the kinds of transformations it can perform, it is possible to write a verified validator that is also believed to be complete and should not fail on valid transformations unless bugs are present.
@@ -46,8 +46,8 @@ The contributions of this paper are as follows:
\item We present \vericert{}, the first mechanically verified HLS tool that compiles C to Verilog. In Section~\ref{sec:design}, we describe the design of \vericert{}, including a few optimisations related to memory accesses and division.
\item We state the correctness theorem of \vericert{} with respect to an existing semantics for Verilog due to \citet{loow19_proof_trans_veril_devel_hol}. In Section~\ref{sec:verilog}, we describe how we extended this semantics to make it suitable as an HLS target. We also describe how the Verilog semantics are integrated into CompCert's model of the semantics, and how CompCert's memory model is translated to Verilog's low-level and finite memory model.
\item In Section~\ref{sec:proof}, we describe how we proved the correctness theorem. The proof follows standard \compcert{} techniques -- forward simulations, intermediate specifications, and determinism results -- but we encountered several challenges peculiar to our hardware-oriented setting. %\NR{`specific' is better than `peculiar'?} %JW: I think this is a nice use of peculiar. Note that it means `distinctive' here, not `weird' -- the third meaning at https://www.dictionary.com/browse/peculiar
- These include handling discrepancies between byte- and word-addressable memories which other \compcert{} back ends do not change, different handling of unsigned comparisons between C and Verilog, correctly mapping \compcert{}'s \NR{`infinite'?} \JW{Fine by me; YH to make the call.} memory model onto a finite Verilog array and finally correctly rearranging memory reads and writes so that these behave properly as a RAM in hardware.
- \item In Section~\ref{sec:evaluation}, we evaluate \vericert{} on the \polybench{} benchmark suite~\cite{polybench}, and compare the performance of our generated hardware against an existing, unverified HLS tool called \legup{}~\cite{canis11_legup}. We show that \vericert{} generates hardware that is \slowdownOrig$\times$ slower (\slowdownDiv$\times$ slower in the absence of division) and \areaIncr$\times$ larger than that generated by \legup{}. \JW{This performance gap can be largely attributed to \vericert{}'s current lack of support for instruction-level parallelism and the absence of an efficient, pipelined division operator.} We intend to close this gap in the future by introducing (and verifying) HLS optimisations of our own, such as scheduling and memory analysis. \NR{Question rather than comment: Will there be verification issues to add support for hard IPs like division blocks?}
+ These include handling discrepancies between byte- and word-addressable memories, which other \compcert{} back ends do not change; different handling of unsigned comparisons between C and Verilog; correctly mapping \compcert{}'s infinite memory model onto a finite Verilog array; and finally correctly rearranging memory reads and writes so that they behave properly as a RAM in hardware.
+ \item In Section~\ref{sec:evaluation}, we evaluate \vericert{} on the \polybench{} benchmark suite~\cite{polybench}, and compare the performance of our generated hardware against an existing, unverified HLS tool called \legup{}~\cite{canis11_legup}. We show that \vericert{} generates hardware that is \slowdownOrig$\times$ slower (\slowdownDiv$\times$ slower in the absence of division) and \areaIncr$\times$ larger than that generated by \legup{}. This performance gap can be largely attributed to \vericert{}'s current lack of support for instruction-level parallelism and the absence of an efficient, pipelined division operator. We intend to close this gap in the future by introducing (and verifying) HLS optimisations of our own, such as scheduling and memory analysis. %\NR{Question rather than comment: Will there be verification issues to add support for hard IPs like division blocks?}\YH{Not really any issues, just many different levels of reliability. You don't have to prove IP correct, but theoretically could.}
\end{itemize}
%\JW{This sentence seems pretty good to me; is it up-to-date with the latest `challenges' you've faced?}
\vericert{} is fully open source and available online.
diff --git a/limitations.tex b/limitations.tex
index c653a23..1979b36 100644
--- a/limitations.tex
+++ b/limitations.tex
@@ -1,22 +1,20 @@
-\section{Limitations}
+\section{Limitations and Future Work}
-There are various limitations in \vericert{} compared to other HLS tools due to the fact that our main focus was on formally verifying the translation from 3AC to Verilog. \JW{In this section, we outline the current limitations of our tool and suggest how they can be overcome in future work.}
+\vericert{} has various limitations compared to other HLS tools because our main focus was on formally verifying the translation from 3AC to Verilog. In this section, we outline the current limitations of our tool and suggest how they can be overcome in future work.
\paragraph{Lack of instruction-level parallelism}
-The main limitation of \vericert{} is that it does not perform instruction scheduling, meaning that instructions cannot be gathered into the same state and executed in parallel. However, the design of the intermediate languages in \vericert{} take this optimisation into account and are designed to support scheduling in the future. For instance, our HTL language allows arbitrary Verilog to appear in each state of the FSMD, including parallel assignments to registers.
+The main limitation of \vericert{} is that it does not perform instruction scheduling, meaning that instructions cannot be gathered into the same state and executed in parallel. However, the intermediate languages in \vericert{} take this optimisation into account and are designed to support scheduling in the future. For instance, our HTL language allows arbitrary Verilog to appear in each state of the FSMD, including parallel assignments to registers. Our plan for adding scheduling support involves adding a new intermediate language after 3AC, tentatively called 3ACPar. This would be similar to 3AC but rather than mapping program counters to instructions, it would map program counters to \emph{lists} of instructions that can all be executed in parallel. The translation from 3AC to 3ACPar would be performed by a scheduling tool. Following \citet{tristan08_formal_verif_trans_valid} and \citet{six+20}, we expect to use translation validation to verify that each generated schedule is correct (rather than verifying the scheduling tool itself). The translation from 3ACPar to HTL would not change conceptually, except for the fact that multiple instructions can now be translated into the same state.
-Our plan for adding scheduling support involves adding a new intermediate language after 3AC, tentatively called 3ACPar. This would be similar to 3AC but rather than mapping program counters to instructions, it would map program counters to \emph{lists} of instructions that can all be executed in parallel. The translation from 3AC to 3ACPar would be performed by a scheduling algorithm. Following \cite{tristan08_formal_verif_trans_valid} and \citet{six+20}, we expect to use verified translation validation to check that the output of the scheduling algorithm is correct (rather than verify the scheduling algorithm itself).
-
-
-To simplify the proof of the scheduling algorithm, and to minimise the changes necessary for the current translation from 3AC to HTL, a new language must be introduced, called 3ACPar, \JW{I think this is getting too far into the details of how to add scheduling, and risks stealing the thunder of our next submission. I think the basic point to make is: Vericert is currently all sequential, that's pretty bad, but it's not a fundamental limitation. So it's good to point out that HTL is a very general language -- anything that can be put into an always-block in Verilog can be put into an HTL state -- but I wouldn't go into details about new languages like 3ACPar here.} which would be equivalent to 3AC but instead of consisting of a map from program counter to instruction, it would consist of a map from program counter to list of instructions, which can all be executed in parallel. A new proof for the scheduling algorithm would have to be written for the translation from 3AC to 3ACPar, for which a verified translation validation approach might be appropriate, however, the translation form 3ACPar to HTL would not change conceptually, except for the fact that multiple instructions can now be translated into the same state. This small difference means that most of the proof can be reused without any changes, as the translation of instructions was the main body of the proof and did not change.
+%To simplify the proof of the scheduling algorithm, and to minimise the changes necessary for the current translation from 3AC to HTL, a new language must be introduced, called 3ACPar, which would be equivalent to 3AC but instead of consisting of a map from program counter to instruction, it would consist of a map from program counter to list of instructions, which can all be executed in parallel. A new proof for the scheduling algorithm would have to be written for the translation from 3AC to 3ACPar, for which a verified translation validation approach might be appropriate, however, the translation form 3ACPar to HTL would not change conceptually, except for the fact that multiple instructions can now be translated into the same state. This small difference means that most of the proof can be reused without any changes, as the translation of instructions was the main body of the proof and did not change.
\paragraph{Lack of pipelined division}
-In a similar vein, the introduction of pipelined operators, especially for division, can alleviate many of the timing issues shown in the \polybench{} benchmarks with divisions included in Section~\ref{sec:evaluation}. These operators can execute different stages of the operation in parallel, and therefore run long operations in parallel while sharing the same hardware. In HTL, operations like this can be represented in a similar fashion to the load and store instructions to the RAM, by using wires to communicate with an abstract computation block modelled in HTL and later replaced by a hardware implementation. However, 3ACPar would have to be modified to also describe such instructions so that these can be placed optimally using the external scheduling algorithm.
+In a similar vein, the introduction of pipelined operators, especially for division, would alleviate the slow clock speeds observed in Section~\ref{sec:evaluation} for the \polybench{} benchmarks that contain divisions. Pipelined operators can execute different stages of an operation simultaneously, and can therefore overlap long-running operations while sharing the same hardware. In HTL, operations like this can be represented in a similar fashion to the load and store instructions by using wires to communicate with an abstract computation block modelled in HTL and later replaced by a hardware implementation. %JW I've chopped the following sentence because it felt like it was going into too much detail.
+%However, 3ACPar would have to be modified to also describe such instructions so that these can be placed optimally using the external scheduling algorithm.
\paragraph{Limitations with I/O}
-\vericert{} is currently limited in terms of I/O as well, because the main function cannot accept any arguments for the Clight program to be well-formed. \JW{Is it worth mentioning that it will compile programs with main(a,b), just as CompCert does, but just not prove them correct? This might be a nice `little-known fact' about CompCert?} In addition to that, external function calls that produce traces have also not been implemented yet, however, these could enable the C program to read and write values to a bus that is shared by various other components in the hardware design.
+\vericert{} is currently limited in terms of I/O as well, because the main function cannot accept any arguments for the Clight program to be well-formed. Just like \compcert{}, \vericert{} can compile main functions with arbitrary arguments and will handle the inputs appropriately; however, the main correctness theorem in \compcert{} assumes that the main function has no arguments, so unexpected behaviour may be introduced for such programs. In addition to that, external function calls that produce traces have not been implemented yet; these could enable the C program to read and write values on a bus that is shared by various other components in the hardware design.
\paragraph{Lack of support for global variables}
In \compcert{}, each global variable is stored in its own memory. A generalisation of the stack translation into a RAM block could therefore be performed to translate global variables in the same manner. This would require a slight generalisation of pointers so that they store provenance information to ensure that each pointer accesses the right RAM. It would also be necessary to generalise the RAM interface so that it decodes the provenance information and indexes the right array.
diff --git a/main.tex b/main.tex
index e852294..b3f9871 100644
--- a/main.tex
+++ b/main.tex
@@ -178,8 +178,8 @@
\input{verilog}
\input{proof}
\input{evaluation}
-\input{limitations}
\input{related}
+\input{limitations}
\input{conclusion}
%% Bibliography
diff --git a/proof.tex b/proof.tex
index 804c10d..4566619 100644
--- a/proof.tex
+++ b/proof.tex
@@ -2,7 +2,7 @@
Now that the Verilog semantics have been adapted to the CompCert model, we are in a position to formally prove the correctness of our C-to-Verilog compilation. This section describes the main correctness theorem that was proven and the main ideas behind the proof. The full Coq proof is available in auxiliary material.
-The main correctness theorem is analogous to that stated in \compcert{}~\cite{leroy09_formal_verif_realis_compil}: for all Clight source programs $C$, if the translation to the target Verilog code succeeds, and $C$ has safe observable behaviour $B$ when executed, then the target Verilog code will have the same behaviour $B$. Here, a `safe' execution is one that either converges or diverges, but does not ``go wrong''. If the program does admit some wrong behaviour (like undefined behaviour in C), the correctness theorem does not apply. A behaviour, then, is either a final state (in the case of convergence) or divergence. In \compcert{}, a behaviour is also associated with a trace of I/O events, but since external function calls are not supported in \vericert{}, this trace will always be empty. This correctness theorem is also appropriate for HLS, as HLS is often used as a part of a larger hardware design that is connected together using a hardware description language like Verilog. This means that HLS designs are normally triggered multiple times and results are returned each time when the computation terminates, which is the property that the correctness theorem states. Note that the compiler is allowed to fail and not produce any output; the correctness theorem only applies when the translation succeeds.
+The main correctness theorem is analogous to that stated in \compcert{}~\cite{leroy09_formal_verif_realis_compil}: for all Clight source programs $C$, if the translation to the target Verilog code succeeds, and $C$ has safe observable behaviour $B$ when executed, then the target Verilog code will have the same behaviour $B$. Here, a `safe' execution is one that either converges or diverges, but does not ``go wrong''. If the program does admit some wrong behaviour (like undefined behaviour in C), the correctness theorem does not apply. A behaviour, then, is either a final state (in the case of convergence) or divergence. In \compcert{}, a behaviour is also associated with a trace of I/O events, but since external function calls are not supported in \vericert{}, this trace will always be empty. This correctness theorem is also appropriate for HLS \JW{Perhaps it would be worth first explaining why somebody might think this correctness theorem might \emph{not} be appropriate for HLS. At the moment, it feels like you're giving the answer without saying the question. Is it to do with the fact that hardware tends to run forever?}, as HLS is often used as a part of a larger hardware design that is connected together using a hardware description language like Verilog. This means that HLS designs are normally triggered multiple times and results are returned each time the computation terminates, which is the property that the correctness theorem states. Note that the compiler is allowed to fail and not produce any output; the correctness theorem only applies when the translation succeeds.
%The following `backwards simulation' theorem describes the correctness theorem, where $\Downarrow$ stands for simulation and execution respectively.
@@ -13,7 +13,7 @@ The main correctness theorem is analogous to that stated in \compcert{}~\cite{le
\end{equation*}
\end{theorem}
-The theorem is a `backwards simulation' result (from target to source), following the terminology used in the \compcert{} literature. The theorem does not demand the `if' direction too, because compilers are permitted to resolve any non-determinism present in their source programs.
+The theorem is a `backwards simulation' result (from target to source), following the terminology used in the \compcert{} literature~\cite{leroy09_formal_verif_realis_compil}. The theorem does not demand the `if' direction too, because compilers are permitted to resolve any non-determinism present in their source programs.
In practice, Clight programs are all deterministic, as are the Verilog programs in the fragment we consider. This means that we can prove the correctness theorem above by first inverting it to become a forwards simulation result, following standard \compcert{} practice.
Furthermore, to prove the forward simulation, it suffices to prove forward simulations between each intermediate language, as these results can be composed to prove the correctness of the whole HLS tool.
@@ -21,14 +21,15 @@ The forward simulation from 3AC to HTL is stated in Lemma~\ref{lemma:htl} (Secti
\subsection{Main challenges in the proof}
-The proof of correctness of the Verilog back end is quite different to the usual proofs performed in CompCert, mainly because the difference in Verilog semantics compared to the standard CompCert intermediate languages and because of the translation of the memory model.
+The proof of correctness of the Verilog back end is quite different to the usual proofs performed in CompCert, mainly because of the difference in Verilog semantics compared to the standard CompCert intermediate languages and because of the translation of the memory model.
-Because the memory model in our Verilog semantics is finite and concrete, whereas the CompCert memory model is more abstract and infinite with additional bounds, the equivalence of both these models needs to be proven. Moreover, our memory is word-addressed for efficiency reasons, whereas CompCert's memory is byte-addressed.
+First, because the memory model in our Verilog semantics is finite and concrete, but the CompCert memory model is more abstract and infinite with additional bounds, the equivalence of both these models needs to be proven. Moreover, our memory is word-addressed for efficiency reasons, whereas CompCert's memory is byte-addressed.
-The Verilog semantics operates quite differently to the usual intermediate languages in the backend. All the CompCert intermediate languages use a map from control-flow nodes to instructions. An instruction can therefore be selected using an abstract program pointer. On the other hand, in the Verilog semantics the whole design is executed at every clock cycle, because hardware is inherently parallel. The program pointer is part of the design as well, not just part of an abstract state. This makes the semantics of Verilog simpler, but comparing it to the semantics of 3AC becomes more challenging, as one has to map the abstract notion of the state to concrete values in registers.
+Second, the Verilog semantics operates quite differently to the usual intermediate languages in the back end. All the CompCert intermediate languages use a map from control-flow nodes to instructions. An instruction can therefore be selected using an abstract program pointer. On the other hand, in the Verilog semantics the whole design is executed at every clock cycle, because hardware is inherently parallel. The program pointer is part of the design as well, not just part of an abstract state. This makes the semantics of Verilog simpler, but comparing it to the semantics of 3AC becomes more challenging, as one has to map the abstract notion of the state to concrete values in registers.
Both these differences mean that translating 3AC directly to Verilog is infeasible, as the differences in the semantics are too large. Instead, a new intermediate language needs to be introduced, called HTL, which bridges the gap in the semantics between the two languages. HTL still consists of maps, like many of the other CompCert languages; however, each state corresponds to a Verilog statement.
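To make the first difference concrete: with 4-byte words, the simulation relation must relate every \compcert{} byte address to a Verilog array index, along the lines of the following (a sketch of the intended relationship, assuming aligned 4-byte accesses):

\begin{equation*}
  i_{\mathit{word}} = a_{\mathit{byte}} / 4, \qquad a_{\mathit{byte}} \equiv 0 \pmod{4},
\end{equation*}

so that a load or store at byte offset $a_{\mathit{byte}}$ in \compcert{}'s memory corresponds to an access of element $i_{\mathit{word}}$ of the finite Verilog array.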
+
\subsection{Forward simulation from 3AC to HTL}\label{sec:proof:3ac_htl}
As HTL is quite far removed from 3AC, this first translation is the most involved and therefore requires a larger proof, because the translation from 3AC instructions to Verilog statements needs to be proven correct in this step. In addition to that, the semantics of HTL are also quite different to the 3AC semantics: instead of defining small-step semantics for each construct in Verilog, the semantics are defined over one clock cycle and mirror the semantics defined for Verilog. Lemma~\ref{lemma:htl} shows the result that needs to be proven in this subsection.
@@ -227,9 +228,9 @@ The final lemma we need is that the Verilog we generate is deterministic. This r
\label{tab:proof_statistics}
\end{table*}
-The lines of code for the implementation and proof of \vericert{} can be found in Table~\ref{tab:proof_statistics}. Overall, it took about 1.5 person-years to build \vericert{} -- about three person-months on implementation and 15 person-months on proofs. The largest proof is the correctness proof for the HTL generation, which required equivalence proofs between all integer operations supported by \compcert{} and those supported in hardware. From the 3069 lines of proof code in the HTL generation, 1189 are for the correctness proof of just the load and store instructions. These were tedious to prove correct because of the substantial difference between the memory models used, and the need to prove properties such as stores outside of the allocated memory being undefined, so that a finite array could be used. In addition to that, since pointers in HTL and Verilog are represented as integers, whereas there is a separate pointer value in the \compcert{} semantics, it was painful to reason about them and many new theorems had to be proven about integers and pointers in \vericert{}. In addition to that, the second largest proof of the correct RAM generation includes many proofs about the extensional equality of array operations, such as merging arrays with different assignments. As the negative edge implies two merges take place every clock cycle, the proofs about the equality of the arrays becomes more tedious as well.
+The lines of code for the implementation and proof of \vericert{} can be found in Table~\ref{tab:proof_statistics}. Overall, it took about 1.5 person-years to build \vericert{} -- about three person-months on implementation and 15 person-months on proofs. The largest proof is the correctness proof for the HTL generation, which required equivalence proofs between all integer operations supported by \compcert{} and those supported in hardware. From the 3069 lines of proof code in the HTL generation, 1189 are for the correctness proof of just the load and store instructions. These were tedious to prove correct because of the substantial difference between the memory models used, and the need to prove properties such as stores outside of the allocated memory being undefined, so that a finite array could be used. In addition to that, since pointers in HTL and Verilog are represented as integers, whereas there is a separate `pointer' value in the \compcert{} semantics, it was painful to reason about them and many new theorems had to be proven about integers and pointers in \vericert{}. Moreover, the second-largest proof of the correct RAM generation includes many proofs about the extensional equality of array operations, such as merging arrays with different assignments. As the negative edge implies two merges take place every clock cycle, the proofs about the equality of the arrays become more tedious as well.
-Looking at the trusted base of \vericert{}, the Verilog semantics are 431 lines of code. This, together with the Clight semantics from \compcert{}, are the only parts of the compiler that need to be trusted. Compared to the 1721 lines of the implementation that are written in Coq, which are the verified parts of the HLS tool, this is larger than the 431 lines of Verilog semantics specification, even if the Clight semantics are added. In addition to that, reading semantics specifications is simpler than trying to understand algorithms, meaning the trusted base has been successfully reduced.
+Looking at the trusted computing base of \vericert{}, the Verilog semantics are 431 lines of code. These, together with the Clight semantics from \compcert{}, are the only parts of the compiler that need to be trusted. By contrast, the verified implementation consists of 1721 lines of Coq, considerably more than the 431 lines of Verilog semantics specification, even once the Clight semantics are added to the trusted base. In addition to that, reading semantics specifications is simpler than trying to understand algorithms, meaning the trusted base has been successfully reduced.
%\JW{Can we include a comment about the size of the trusted base, in case we get that reviewer again?}
diff --git a/related.tex b/related.tex
index c2cd8ff..29c335d 100644
--- a/related.tex
+++ b/related.tex
@@ -42,7 +42,7 @@
\caption{Summary of related work}\label{fig:related_euler}
\end{figure}
-A summary of the related works can be found in Figure~\ref{fig:related_euler}, which is represented as an Euler diagram. The categories that were chosen for the Euler diagram are: if the tool is usable and available, if it takes a high-level software language as input, if it has a correctness proof and finally if that proof is mechanised. The goal of \vericert{} is to cover all of these categories.
+Figure~\ref{fig:related_euler} summarises the related work in the form of an Euler diagram. The categories chosen for the Euler diagram are: whether the tool is usable \JWcouldcut{and available}, whether it takes a high-level software language as input, whether it has a correctness proof, and finally whether that proof is mechanised. The goal of \vericert{} is to cover all of these categories.
Most practical HLS tools~\citep{canis11_legup,xilinx20_vivad_high_synth,intel20_sdk_openc_applic,nigam20_predic_accel_desig_time_sensit_affin_types} fit into the category of usable tools that take high-level inputs. At the other end of the spectrum are tools such as BEDROC~\citep{chapman92_verif_bedroc}, for which no practical tool is available, and which, even though it is described as high-level synthesis, more closely resembles today's hardware synthesis tools.