author:    John Wickerson <j.wickerson@imperial.ac.uk> 2020-11-20 22:06:21 +0000
committer: overleaf <overleaf@localhost> 2020-11-20 22:07:13 +0000
commit:    1866d48e1303005462f60a629370beef28d51fe3 (patch)
tree:      80ae03cac123291ea15797e5a90c8b6cb24a3019
parent:    7cb9bf05e91519211e4e526467029891d05ab25f (diff)

Update on Overleaf.
-rw-r--r--  algorithm.tex            |  70
-rw-r--r--  archive/related_work.tex |  13
-rw-r--r--  conclusion.tex           |   8
-rw-r--r--  evaluation.tex           |  23
-rw-r--r--  main.tex                 |   4
-rw-r--r--  proof.tex                | 124
-rw-r--r--  references.bib           |  83
-rw-r--r--  related.tex              |  32
-rw-r--r--  verilog.tex              |  15
9 files changed, 207 insertions, 165 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 0749e88..f61368d 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -1,10 +1,10 @@
\section{Designing a verified HLS tool}
\label{sec:design}
-This section covers the main architecture of the HLS tool, and the way in which the Verilog back end was added to \compcert{}. This section will also cover an example of converting a simple C program into hardware, expressed in the Verilog language.
+This section describes the main architecture of the HLS tool, and the way in which the Verilog back end was added to \compcert{}. It also covers an example of converting a simple C program into hardware, expressed in the Verilog language.
\paragraph{Choice of source language}
-First of all, the choice of C as the input language of \vericert{} is simply because it is what most major HLS tools use~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because C is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}.
+C was chosen as the source language as it remains the most common source language amongst production-quality HLS tools~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because it is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}, lending a degree of practicality. The availability of \compcert{} also provides a solid basis for formally verified C compilation.
%Since a lot of existing code for HLS is written in C, supporting C as an input language, rather than a custom domain-specific language, means that \vericert{} is more practical.
%An alternative was to support LLVM IR as an input language, however, to get a full work flow from a higher level language to hardware, a front end for that language to LLVM IR would also have to be verified. \JW{Maybe save LLVM for the `Choice of implementation language'?}
We considered Bluespec~\cite{nikhil04_blues_system_veril}, but decided that although it ``can be classed as a high-level language''~\cite{greaves_note}, it is too hardware-oriented to be used for traditional HLS.
@@ -15,14 +15,14 @@ We also considered using a language with built-in parallel constructs that map w
\paragraph{Choice of target language}
-Verilog~\cite{06_ieee_stand_veril_hardw_descr_languag} is an HDL that can be synthesised into logic cells which can be either placed onto a field-programmable gate array (FPGA) or turned into an application-specific integrated circuit (ASIC). Verilog was chosen as the output language for \vericert{} because it is one of the most popular HDLs and there already exist a few formal semantics for it that could be used as a target~\cite{loow19_verif_compil_verif_proces, meredith10_veril}. Other possible targets could have been Bluespec, from which there exists a formally verified translation to circuits using K\^{o}ika~\cite{bourgeat20_essen_blues}.% but targeting this language would not be trivial as it is not meant to be targeted by an automatic tool, instead strives to a formally verified high-level hardware description language instead.
+Verilog~\cite{06_ieee_stand_veril_hardw_descr_languag} is an HDL that can be synthesised into logic cells which can be either placed onto a field-programmable gate array (FPGA) or turned into an application-specific integrated circuit (ASIC). Verilog was chosen as the output language for \vericert{} because it is one of the most popular HDLs and there already exist a few formal semantics for it that could be used as a target~\cite{loow19_verif_compil_verif_proces, meredith10_veril}. Bluespec, ruled out above as a source language, is another possible target, since there exists a formally verified translation from it to circuits using K\^{o}ika~\cite{bourgeat20_essen_blues}.% but targeting this language would not be trivial as it is not meant to be targeted by an automatic tool, instead strives to a formally verified high-level hardware description language instead.
%\JW{Can we mention one or two alternatives that we considered? Bluespec or Chisel or one of Adam Chlipala's languages, perhaps?}
\paragraph{Choice of implementation language}
-We chose Coq as the implementation language because of its mature support for code extraction; that is, its ability to generate OCaml programs directly from the definitions used in the theorems.
+We chose Coq as the implementation language because of its mature support for code extraction; that is, its ability to generate OCaml programs directly from the definitions used in the theorems.
We note that other authors have had some success reasoning about the HLS process using other theorem provers such as Isabelle~\cite{ellis08}.
-The framework that was chosen for the front end was \compcert{}, as it is a mature framework for simulation proofs about intermediate languages, and it already provides a validated C parser~\cite{jourdan12_valid_lr_parser}.
+\compcert{} was chosen as the front-end framework, as it is a mature framework for simulation proofs about intermediate languages, and it already provides a validated C parser~\cite{jourdan12_valid_lr_parser}.
The Vellvm~\cite{zhao12_formal_llvm_inter_repres_verif_progr_trans} framework was also considered because several existing HLS tools are already LLVM-based, but additional work would be required to support a high-level language like C as input.
The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\cite{kiwi}, and LLHD~\cite{schuiki20_llhd} has been recently proposed as an intermediate language for hardware design, but neither are suitable for us because they lack formal semantics.
@@ -59,13 +59,13 @@ The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\c
\end{figure}
\paragraph{Architecture of \vericert{}}
-The main work flow of \vericert{} is given in Figure~\ref{fig:rtlbranch}, which shows the parts of the translation that are performed in \compcert{}, and which have been added.
+The main work flow of \vericert{} is given in Figure~\ref{fig:rtlbranch}, which shows those parts of the translation that are performed in \compcert{}, and those that have been added.
-\compcert{} is made up of 11 intermediate languages in between the Clight input and the assembly output, so the first thing we must decide is where best to branch off to our Verilog back end.
+\compcert{} translates Clight\footnote{A deterministic subset of C with pure expressions.} input into assembly output via a sequence of intermediate languages; we must decide which of these languages is the most suitable starting point for the HLS-specific translation stages.
-We select CompCert's three-address code (3AC)\footnote{This is known as register transfer language (RTL) in the \compcert{} literature. We use `3AC' is used in this paper instead to avoid confusion with register-transfer level (RTL), which is another name for the final hardware target of the HLS tool.} as the branching off point. If we branch off before this point (at CminorSel or earlier), then CompCert has not had the opportunity to perform such optimisations as constant propagation and dead code elimination, which have been shown to be valuable not just in conventional compilation but also in HLS~\cite{cong+11}. And if we branch off after this point (at LTL or later) then CompCert has already performed register allocation to reduce the number of registers and spill some variables to the stack; this transformation is not required in HLS because there are many more registers available, and these should be used instead of RAM whenever possible.
+We select CompCert's three-address code (3AC)\footnote{This is known as register transfer language (RTL) in the \compcert{} literature. `3AC' is used in this paper instead to avoid confusion with register-transfer level (RTL), which is another name for the final hardware target of the HLS tool.} as the starting point. Branching off before this point (at CminorSel or earlier) denies \compcert{} the opportunity to perform optimisations such as constant propagation and dead code elimination, which have been found useful in HLS tools as well as software compilers~\cite{cong+11}. Branching off after this point (at LTL or later) is also undesirable: \compcert{} performs register allocation during the translation to LTL, spilling some variables onto the stack, but this transformation is unnecessary in HLS, where as many registers as required can be described in the output, and these should be used instead of RAM whenever possible.
-3AC is also attractive because it is the closest intermediate language to LLVM IR, which is used by several existing HLS compilers. It has an unlimited number of pseudo-registers, and is represented as a control flow graph (CFG) where each instruction is a node with links to the instructions that can follow it. One difference between LLVM IR and 3AC is that 3AC includes operations that are specific to the chosen target architecture; we chose x86\_32 because each instruction maps well to hardware.
+3AC is also attractive because it is the closest intermediate language to LLVM IR, which is used by several existing HLS compilers (although, as discussed above, LLVM IR itself was not a suitable starting point for a fully verified flow). It has an unlimited number of pseudo-registers, and is represented as a control flow graph (CFG) where each instruction is a node with links to the instructions that can follow it. One difference between LLVM IR and 3AC is that 3AC includes operations that are specific to the chosen target architecture; we chose x86\_32 because its more complex addressing modes generally produce relatively dense 3AC, which helps to reduce cycle counts.
\begin{figure}
\centering
@@ -172,18 +172,20 @@ module main(reset, clk, finish, return_val);
endcase
endmodule
\end{minted}
-\caption{Verilog output for our running example, as generated by \vericert{}. The left column contains the datapath and the right column contains the control logic.}\label{fig:accumulator_v}
+\caption{Verilog produced by \vericert{}. The left column contains the datapath and the right column contains the control logic.}\label{fig:accumulator_v}
\end{subfigure}
- \caption{Using \compcert{} to translate a simple program from C to three address code (3AC).}\label{fig:accumulator_c_rtl}
+ \caption{Translating a simple program from C to Verilog.}\label{fig:accumulator_c_rtl}
\end{figure}
\subsection{Translating C to Verilog, by example}
-
-Using the simple program in Figure~\ref{fig:accumulator_c} as a worked example, this section describes how \vericert{} translates a behavioural description in C into a hardware design in Verilog.
+Figure~\ref{fig:accumulator_c_rtl} illustrates the translation of a program that sums the elements of an array.
+In this section we describe the stages of the \vericert{} translation process using this program as an example.
\subsubsection{Translating C to 3AC}
-The first step of the translation is to use \compcert{} to transform the input C code into the 3AC shown in Figure~\ref{fig:accumulator_rtl}. As part of this, \compcert{} performs such optimisations as constant propagation and dead-code elimination. Function inlining is also performed, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction. Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency. Moreover, inlining excludes support for recursive function calls, but this feature isn't supported in most other HLS tools either~\cite{davidthomas_asap16}.
+The first stage of the translation uses unmodified \compcert{} to transform the C input, shown in Figure~\ref{fig:accumulator_c}, into a 3AC intermediate representation, shown in Figure~\ref{fig:accumulator_rtl}.
+
+As part of this translation, function inlining is performed, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction. Although the duplication of function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency. Inlining does, however, preclude support for recursive function calls, but this feature is not supported in most other HLS tools either~\cite{davidthomas_asap16}.
%\JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling. With scheduling I think that's true, inlining actually becomes quite good.}
@@ -197,9 +199,11 @@ The first step of the translation is to use \compcert{} to transform the input C
%\JW{I feel like this could use some sort of citation, but I'm not sure what. I guess this is all from "Hardware Design 101", right?}\YH{I think I found a good one actually, which goes over the basics.}
%\JW{I think it would be worth having a sentence to explain how the C model of memory is translated to a hardware-centric model of memory. For instance, in C we have global variables/arrays, stack-allocated variables/arrays, and heap-allocated variables/arrays (anything else?). In Verilog we have registers and RAM blocks. So what's the correspondence between the two worlds? Globals and heap-allocated are not handled, stack-allocated variables become registers, and stack-allocated arrays become RAM blocks? Am I close?}\YH{Stack allocated variables become RAM as well, so that we can deal with addresses easily and take addresses of any variable.} \JW{I see, thanks. So, in short, the only registers in your hardware designs are those that store things like the current state, etc. You generate a fixed number of registers every time you synthesis -- you don't generate extra registers to store any of the program variables. Right?}
-The first translation performed in \vericert{} is from 3AC to a new hardware translation language (HTL). %, which is one step towards being completely translated to hardware described in Verilog.
-This involves going from a CFG representation of the computation to a finite state machine with datapath (FSMD) representation~\cite{hwang99_fsmd}. The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers. %This means that the state transitions can be translated into a simple finite state machine (FSM) where each state contains data operations that update the memory and registers.
-An HTL program thus consists of two maps: a control map that describes how to calculate the next state from the current state, and a datapath map that describes how to update the registers and RAM given the current state. Figure~\ref{fig:accumulator_diagram} shows the resulting architecture of the FSMD.
+The next translation is from 3AC to a new hardware translation language (HTL). %, which is one step towards being completely translated to hardware described in Verilog.
+This involves going from a CFG representation of the computation to a finite state machine with datapath (FSMD) representation~\cite{hwang99_fsmd}. The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers.
+%This means that the state transitions can be translated into a simple finite state machine (FSM) where each state contains data operations that update the memory and registers.
+An HTL program thus consists of two maps: a control map that expresses state transitions and a datapath map that expresses computations.
+Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The right-hand block is the control logic that computes the next state, whereas the left-hand block updates all the registers and RAM based on the current program state.
\begin{figure*}
\centering
@@ -290,29 +294,41 @@ An HTL program thus consists of two maps: a control map that describes how to ca
\end{figure*}
\paragraph{Translating memory}
-Hardware does not have the same memory model as C, so the memory model needs to be translated, as follows. Global variables are not translated in \vericert{} at the moment. The stack of the main function becomes a block of RAM, \JW{check this -- maybe change to 2D array of registers?} as seen in Figure~\ref{fig:accumulator_diagram}. Program variables that have their address taken are stored in this RAM, as are any arrays or structs defined in the function. Variables that do not have their address taken are kept in registers.
+The Verilog output of an HLS tool typically describes program state using a mixture of registers and arrays, where a synthesis tool may then implement the arrays using RAM blocks.
+This view of memory is very different from the C memory model, so we perform the following translation.
+Variables that do not have their address taken are kept in registers.
+All address-taken variables, arrays and structs are stored in memory.
+The stack of the main function becomes an unpacked array of 32-bit vectors, as seen in Figure~\ref{fig:accumulator_diagram}, which a synthesis tool is free to implement as a block of RAM.
+Finally, global variables are not currently translated in \vericert{}.
\paragraph{Translating instructions}
-Each 3AC instruction either corresponds to a hardware construct, or does not have to be handled by the translation, such as function calls.
-For example, in state 15 in figure~\ref{fig:accumulator_rtl}, the register \texttt{x8} is initialised to 1, after which the control flow moves to state 14. This is encoded in HTL by initialising a 32-bit register \texttt{reg\_8} to 1 in the data-flow section, and also adding a transition to the state 14 in the control logic section. Simple operator instructions are translated in a similar way. For example, in state 5, the value in the array is added to the current value of the accumulated sum, which is simply translated to an addition of the equivalent registers in the HTL code.
+Each 3AC instruction either corresponds to a hardware construct or does not have to be handled by the translation; for instance, function calls never appear after inlining, so the \texttt{Icall} instruction need not be translated.
+For example, state 15 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x8} being initialised to 1, after which the control flow moves to state 14. This initialisation is also encoded in HTL at state 15 in both the control and datapath always blocks, as shown in Figure~\ref{fig:accumulator_v}. Simple operator instructions are translated in a similar way. For example, in state 5, the value of the array element is added to the current sum value, which is simply translated to an addition of the equivalent registers in the HTL code.
\paragraph{Key challenge: signedness} Note that the comparison in state 3 is signed. This is because C and Verilog handle signedness quite differently. By default, all operators and registers in Verilog (and HTL) are unsigned, so to force an operation to be treated as signed, both operands have to be explicitly marked as signed. In addition, Verilog implicitly resizes expressions to the largest needed size, which can affect the result of a computation. This implicit resizing is not supported by the Verilog semantics we adopted, so, to avoid a mismatch between our semantics and the behaviour prescribed by the Verilog standard, braces are placed around expressions to stop the simulator or synthesis tool from resizing anything inside them. Instead, explicit resizing is used in the semantics, and operations can only be performed on two registers that have the same size.
-\subsection{Translating HTL to Verilog}
+\subsubsection{Translating HTL to Verilog}
-Finally, we have to translate the HTL code into proper Verilog. % and prove that it behaves the same as the 3AC according to the Verilog semantics.
-Whereas HTL is a language that is specifically designed to represent the FSMDs we are interested in, Verilog is a general-purpose HDL.\@ So the challenge here is to translate our FSMD representation into a Verilog AST. However, as all the instructions are already expressed in Verilog, only the maps need to be translated to valid Verilog, and correct declarations for all the variables in the program need to be added as well.
+Finally, we have to translate the HTL code into proper Verilog. % and prove that it behaves the same as the 3AC according to the Verilog semantics.
+Whereas HTL is designed specifically to represent the FSMDs we are interested in, Verilog is a general-purpose HDL.\@ The challenge here is therefore to translate our FSMD representation into a Verilog AST. However, as all the instructions are already expressed in Verilog, only the control and datapath maps need to be translated to valid Verilog, together with correct declarations for all the variables in the program.
-This translation seems quite straightforward, but proving it correct is not that simple, as all the implicit assumptions that were made in HTL need to be translated explicitly to Verilog statements and it needs to be shown that these explicit behaviours are equivalent to the assumptions made in the HTL semantics.
-Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated for our worked example. In general, the structure is similar to that of the HTL code, but the control and datapath maps become Verilog case-statements. The other main addition to the code is the initialisation of all the variables in the code to the correct bitwidths and the declaration of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.
+Although this translation seems quite straightforward, proving it correct is not so simple.
+All the implicit assumptions made in HTL need to be translated into explicit Verilog statements, and these explicit behaviours need to be shown equivalent to the assumptions made in the HTL semantics.
+We discuss these proofs in the upcoming sections.
+
+Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated for our example. In general, the structure of the generated Verilog is similar to that of the HTL code.
+The key difference is that the control and datapath maps become Verilog case statements.
+The other main additions are the declarations of all the variables with the correct bitwidths, and of the inputs and outputs of the module, so that the module can be used inside a larger hardware design.
\subsection{Optimisations}
-Although we would not claim that \vericert{} is a proper `optimising' HLS compiler yet, we have nonetheless implemented a few optimisations that aim to improve the quality of the hardware designs it produces.
+Although we would not claim that \vericert{} is a proper `optimising' HLS compiler yet, we have nonetheless made several design choices that aim to improve the quality of the hardware designs it produces.
\subsubsection{Byte- and word-addressable memories}
+One big difference between C and Verilog is how memory is represented. Although Verilog arrays might seem to mirror their C counterparts directly, they must be treated quite differently. To reduce the design area and avoid issues meeting timing, it is beneficial if Verilog arrays can be synthesised as RAMs, but this imposes various constraints on how the arrays are used. In particular, RAMs often allow only one read and one write operation per clock cycle, so, to make loads and stores as efficient as possible, the RAM needs to be word-addressable, meaning that an entire integer can be loaded or stored in one clock cycle.
+
However, the memory model that \compcert{} uses for its intermediate languages is byte-addressable~\cite{blazy05_formal_verif_memor_model_c}. It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware. Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way. Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four. If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
diff --git a/archive/related_work.tex b/archive/related_work.tex
new file mode 100644
index 0000000..3b880b3
--- /dev/null
+++ b/archive/related_work.tex
@@ -0,0 +1,13 @@
+\JW{Some papers to cite somewhere:
+ \begin{itemize}
+ \item \citet{kundu08_valid_high_level_synth} have done translation validation on an HLS tool called SPARK. They don't have \emph{very} strong evidence that HLS tools are buggy; they just say ``HLS tools are large and complex software systems, often with hundreds of
+ thousands of lines of code, and as with any software of this scale, they are prone
+ to logical and implementation errors''. They did, however, find two bugs in SPARK as a result of their efforts, so that provides a small bit of evidence that HLS tools are buggy. \@
+ \item \citet{chapman92_verif_bedroc} have proved (manually, as far as JW can tell) that the front-end and scheduler of the BEDROC synthesis system is correct. (NB:\@ Chapman also wrote a PhD thesis about this (1994) but it doesn't appear to be online.)
+ \item \citet{ellis08} wrote a PhD thesis that was supervised by Cliff Jones whom you met at the VeTSS workshop thing last year. At a high level, it's about verifying a high-level synthesis tool using Isabelle, so very relevant to this project, though scanning through the pdf, it's quite hard to ascertain exactly what the contributions are. Strange that they never published a paper about the work -- it makes it very hard to find!
+ \end{itemize}
+}
+
+\YH{Amazing, thank you, yes it seems like Kundu \textit{et al.} have quite a few papers on formal verification of HLS algorithms using translation validation, as well as~\citet{karfa06_formal_verif_method_sched_high_synth}, all using the SPARK~\cite{gupta03_spark} synthesis tool. And yes thank you, will definitely cite that paper. There seem to be similar early proofs of high-level synthesis like \citet{hwang91_formal_approac_to_sched_probl} or \citet{grass94_high}. There are also the Occam papers that I need to mention too, like \citet{page91_compil_occam}, \citet{jifeng93_towar}, \citet{perna11_correc_hardw_synth} and \citet{perna12_mechan_wire_wise_verif_handel_c_synth}.}
+\JW{Well it's a good job there's no page limit on bibliographies these days then! I'll keep an eye out for more papers we could cite when I get a free moment. (No need to cite \emph{everything} of course.)}
+\YH{Yes that's true, there are too many to cite! And great thank you, that would help a lot, I have quite a few papers I still want to cite, but have to think about where they will fit in.}
diff --git a/conclusion.tex b/conclusion.tex
index 7c06245..a53c268 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -1,6 +1,12 @@
\section{Conclusion}
-\vericert{} is the first fully mechanised proof of correctness for HLS, translating C into Verilog. However, to make the tool more usable there are many more optimisations that could be implemented to get the performance closer to \legup{}. The main optimisation that should be performed is scheduling operations into the same clock cycle, so that these can run in parallel. In addition to that, another issue that \vericert{} faces is that \compcert{}'s 3AC uses many different registers, even if these are not anymore later on. It would therefore be useful to have a register allocation algorithm that could ensure that registers could be reused appropriately. Finally, another optimisation would be to support resource sharing to reduce the area that the final design uses.
+\vericert{} is the first fully mechanised proof of correctness for HLS in Coq, translating C into Verilog.
+
+However, to make the tool more usable, many more optimisations could be implemented to bring its performance closer to \legup{}. The main optimisation that should be performed is scheduling operations into the same clock cycle, so that these can run in parallel. In addition to that, another issue that \vericert{} faces is that \compcert{}'s 3AC uses many different registers, even if they are no longer used later on. It would therefore be useful to have a register allocation algorithm that finds the minimum number of registers needed and ensures that registers are reused appropriately. Finally, another optimisation would be to support resource sharing to reduce the area that the final design uses; this could include multi-cycle operations and pipelining optimisations so that division and multiplication operators also become more efficient.
+
+Finally, it is worth considering how trustworthy \vericert{} is compared to other HLS tools. A formally verified tool does not guarantee an output for every input, and is therefore allowed to fail; however, the subset of C that will compile is well described, so the tool should not fail in surprising ways. The guarantee that the output behaves the same as the input, on the other hand, is strong, as it uses well-known C semantics and our Verilog semantics to prove the equivalence.
+
+
%%% Local Variables:
%%% mode: latex
diff --git a/evaluation.tex b/evaluation.tex
index cd0a278..edb5795 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -3,8 +3,8 @@
Our evaluation is designed to answer the following three research questions.
\begin{description}
-\item[RQ1] How fast is the hardware generated by \vericert{}, and how does this compare to existing HLS tools?
-\item[RQ2] How area-efficient is the hardware generated by \vericert{}, and how does this compare to existing HLS tools?
+\item[RQ1] How fast is the hardware generated by \vericert{}?
+\item[RQ2] How area-efficient is the hardware generated by \vericert{}?
\item[RQ3] How long does \vericert{} take to produce hardware?
\end{description}
@@ -12,21 +12,22 @@ Our evaluation is designed to answer the following three research questions.
\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 5.1 because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}.
-\paragraph{Choice of benchmarks.} We evaluate \vericert{} using the PolyBench/C benchmark suite~\cite{polybench}, consisting of a collection of well-known numerical kernels. PolyBench/C is widely-used in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it consists of affine loop bounds, making it attractive for regular and streaming computation on FPGA architectures.
-We chose Polybench 4.2.1 for our experiments, which consists of 30 programs.
-Out of these 30 programs, three programs utilise square root functions: \texttt{corelation},~\texttt{gramschmidt} and~\texttt{deriche}.
-Hence, we were unable evaluate these programs, since they mandatorily require \texttt{float}s.
+\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the PolyBench/C benchmark suite (version 4.2.1)~\cite{polybench}, which consists of a collection of 30 numerical kernels. PolyBench/C is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
+We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots and therefore require floats, which we do not support.
% Interestingly, we were also unable to evaluate \texttt{cholesky} on \legup{}, since it produce an error during its HLS compilation.
-In summary, we evaluate 27 programs from the latest Polybench suite.
+%In summary, we evaluate 27 programs from the latest Polybench suite.
-\paragraph{Configuring Polybench for experimentation}
-We configure Polybench's metadata and slightly modified the source code to suit our purposes.
+We configured Polybench's parameters so that only integer types are used, since we do not support floats or doubles currently. We use Polybench's smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
-Secondly, we utilise Polybench's smallest data set size for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
 Furthermore, the C divide and modulo operators translate directly to the built-in Verilog divide and modulo operators.
 Unfortunately, the built-in operators are designed as single-cycle operations, causing large penalties in clock frequency.
 To work around this issue, we use a C implementation of the divide and modulo operations, which indirectly compiles them to multi-cycle operations on the FPGA.
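The workaround can be sketched as follows; this is an illustrative software divider (function names hypothetical, not necessarily the exact code used in the benchmarks), whose loop body an HLS tool can schedule over multiple clock cycles instead of emitting a single-cycle `/` operator:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical multi-cycle software division: a simple shift-and-subtract
   (restoring) divider. Each loop iteration can be scheduled into its own
   clock cycle, avoiding the frequency penalty of a single-cycle divide. */
static uint32_t sw_divide(uint32_t dividend, uint32_t divisor)
{
    uint32_t quotient = 0, remainder = 0;
    for (int i = 31; i >= 0; i--) {
        /* shift in the next bit of the dividend */
        remainder = (remainder << 1) | ((dividend >> i) & 1u);
        if (remainder >= divisor) {   /* subtract divisor if it fits */
            remainder -= divisor;
            quotient |= 1u << i;
        }
    }
    return quotient;
}

static uint32_t sw_modulo(uint32_t dividend, uint32_t divisor)
{
    return dividend - sw_divide(dividend, divisor) * divisor;
}
```

The same trick applies to modulo, recovered from the quotient as shown; the loop structure is what lets the scheduler spread the work across cycles.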
-In addition, we initial the input arrays and check the output arrays of all programs entirely on-chip.
+%In addition, we initial the input arrays and check the output arrays of all programs entirely on-chip.
% For completeness, we use the full set of 24 benchmarks. We set the benchmark parameters so that all datatypes are integers (since \vericert{} only supports integers) and all datasets are `small' (to fit into the small on-chip memories). A current limitation of \vericert{}, as discussed in Section~\ref{?}, is that it does not support addition and subtraction operations involving integer literals not divisible by 4. To work around this, we lightly modified each benchmark program so that literals other than multiples of 4 are stored into variables before being added or subtracted. \JW{Any other notable changes to the benchmarks?}
diff --git a/main.tex b/main.tex
index 81fed60..9cfe5f1 100644
--- a/main.tex
+++ b/main.tex
@@ -80,7 +80,7 @@
\newcommand\yhconstant[1]{\texttt{\textcolor{constantcolour}{#1}}}
\ifANONYMOUS
-\newcommand{\vericert}{Coq\-Up}
+\newcommand{\vericert}{HLS\-Cert}
\else
\newcommand{\vericert}{Veri\-cert}
\fi
@@ -152,7 +152,7 @@
\begin{abstract}
High-level synthesis (HLS), which refers to the automatic compilation of software into hardware, is rapidly gaining popularity. In a world increasingly reliant on application-specific hardware accelerators, HLS promises hardware designs of comparable performance and energy efficiency to those coded by hand in a hardware description language like Verilog, while maintaining the convenience and the rich ecosystem of software development. However, current HLS tools cannot always guarantee that the hardware designs they produce are equivalent to the software they were given, thus undermining any reasoning conducted at the software level. Worse, there is mounting evidence that existing HLS tools are quite unreliable, sometimes generating wrong hardware or crashing when given valid inputs.
- To address this problem, we present the first HLS tool that is mechanically verified to preserve the behaviour of its input software. Our tool, called \vericert{}, extends the \compcert{} verified C compiler with a new hardware-esque intermediate language and a Verilog back end, and has been proven correct in Coq. \vericert{} supports all C constructs except for function pointers, recursive function calls, non-integer types and global variables. We evaluate \vericert{} on the PolyBench/C benchmark suite. We can generate Polybench hardware that is guaranteed to be translated correctly from C to Verilog. Our generated hardware is a magnitude slower and larger than hardware generated by an existing, optimised but unverified HLS tool.
+ To address this problem, we present the first HLS tool that is mechanically verified to preserve the behaviour of its input software. Our tool, called \vericert{}, extends the \compcert{} verified C compiler with a new hardware-esque intermediate language and a Verilog back end, and has been proven correct in Coq. \vericert{} supports all C constructs except for function pointers, recursive function calls, non-integer types and global variables. We evaluate \vericert{} on the PolyBench/C benchmark suite. We can generate Polybench hardware that is guaranteed to be translated correctly from C to Verilog. Our generated hardware is around one order of magnitude slower and larger than hardware generated by an existing, optimising (but unverified) HLS tool.
\end{abstract}
%% 2012 ACM Computing Classification System (CSS) concepts
diff --git a/proof.tex b/proof.tex
index fc15f42..4e4c721 100644
--- a/proof.tex
+++ b/proof.tex
@@ -1,147 +1,147 @@
\section{Proof}\label{sec:proof}
-Now that the Verilog semantics have been adapted to the CompCert model, we are in a position to formally prove the correctness of our C to Verilog compilation. This section describes the main correctness theorem that was proven and the main ideas behind the proof.
+Now that the Verilog semantics have been adapted to the CompCert model, we are in a position to formally prove the correctness of our C-to-Verilog compilation. This section describes the main correctness theorem that was proven and the main ideas behind the proof. The full Coq proof is available in auxiliary material.
-The main correctness theorem is the exact same correctness theorem as stated in \compcert{}~\cite{leroy09_formal_verif_realis_compil} which states that for all Clight source programs $C$, if the translation from the source to the target Verilog code succeeds, assuming that $C$ has safe observable behaviour $B$ when executed, then the target Verilog code will simulate with the same behaviour $B$, meaning execution should either converge or diverge, but should not ``go wrong''. Behaviour is also associated with a trace $t$ of any I/O events, however, as external function calls are not supported, this trace $t$ will always be empty for programs passing through \vericert{}. The following backwards simulation theorem describes the correctness theorem, where $\Downarrow$ stands for simulation and execution respectively.
+The main correctness theorem is analogous to that stated in \compcert{}~\cite{leroy09_formal_verif_realis_compil}: for all Clight source programs $C$, if the translation to the target Verilog code succeeds, and $C$ has safe observable behaviour $B$ when executed, then the target Verilog code will have the same behaviour $B$. Here, a `safe' execution is one that either converges or diverges, but does not ``go wrong''. If the program does admit some wrong behaviour (like undefined behaviour in C), the correctness theorem does not apply. A behaviour, then, is either a final state (in the case of convergence) or divergence. In \compcert{}, a behaviour is also associated with a trace of I/O events, but since external function calls are not supported in \vericert{}, this trace will always be empty for us. Note that the compiler is allowed to fail and not produce any output; the correctness theorem only applies when the translation succeeds.
-\begin{theorem}
- Assuming that the translation from $C$ to Verilog $V$ succeeds, then if $V$ has behaviour $B$, $C$ should have the same behaviour $B$.
+%The following `backwards simulation' theorem describes the correctness theorem, where $\Downarrow$ stands for simulation and execution respectively.
+\begin{theorem}
+ For any safe behaviour $B$, whenever the translation from $C$ succeeds and produces Verilog $V$, then $V$ has behaviour $B$ only if $C$ has behaviour $B$.
\begin{equation*}
\forall C, V, B \in \texttt{Safe},\, \yhfunction{HLS} (C) = \yhconstant{OK} (V) \land V \Downarrow B \implies C \Downarrow B.
\end{equation*}
\end{theorem}
-To prove this the first observation that should be made is that instead of proving the backwards simulation, we can prove the forwards simulation, followed by a proof that the target semantics of the translation, which in our case is Verilog, is deterministic. This means that there is only one possible behaviour $B$ for $V$, and that therefore the backwards simulation holds as well.
+The theorem is a `backwards simulation' result (from target to source). It does not demand the `if' direction too, because compilers are permitted to resolve any non-determinism present in their source programs.
-The second observation that needs to be made is that to prove the forward simulation, it suffices to prove the forward simulations between each intermediate language, as they can be composed to prove the correctness of the whole HLS tool.
+In practice, Clight programs are all deterministic, as are the Verilog programs in the fragment we consider. This means that we can prove the correctness theorem above by first inverting it to become a forwards simulation result, following standard \compcert{} practice.
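This inversion relies on the target semantics being deterministic (Lemma~\ref{lemma:deterministic}); a sketch of the required property, in the notation used above:

```latex
\begin{equation*}
  \forall V, B_{1}, B_{2},\ V \Downarrow B_{1} \land V \Downarrow B_{2} \implies B_{1} = B_{2}.
\end{equation*}
```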
-The forward simulation between 3AC and HTL is stated in Lemma~\ref{lemma:htl} in Section~\ref{sec:proof:3ac_htl}, next the forward simulation between HTL and Verilog is shown in Lemma~\ref{lemma:verilog} in Section~\ref{sec:proof:htl_verilog} and finally, the proof that Verilog is deterministic is given in Lemma~\ref{lemma:deterministic} in Section~\ref{sec:proof:deterministic}.
+The second observation that needs to be made is that to prove this forward simulation, it suffices to prove forward simulations between each intermediate language, as these results can be composed to prove the correctness of the whole HLS tool.
+The forward simulation from 3AC to HTL is stated in Lemma~\ref{lemma:htl} (Section~\ref{sec:proof:3ac_htl}), then the forward simulation between HTL and Verilog is shown in Lemma~\ref{lemma:verilog} (Section~\ref{sec:proof:htl_verilog}) and finally, the proof that Verilog is deterministic is given in Lemma~\ref{lemma:deterministic} (Section~\ref{sec:proof:deterministic}).
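These per-pass results chain together; schematically (a sketch in the notation of the lemmas below, assuming both translations succeed):

```latex
\begin{equation*}
  \yhfunction{tr\_htl} (c) = \yhconstant{OK} (h) \land \yhfunction{tr\_verilog} (h) = V \land c \Downarrow B \implies V \Downarrow B.
\end{equation*}
```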
-\subsection{Forward simulation between 3AC and HTL}\label{sec:proof:3ac_htl}
+\subsection{Forward simulation from 3AC to HTL}\label{sec:proof:3ac_htl}
-As HTL is quite different to 3AC, this first translation is the most involved and therefore requires a larger proof, because the translation from 3AC instructions to Verilog statements needs to be proven correct in this step. In addition to that, the semantics of HTL are also quite different to the 3AC semantics, as instead of defining small-step semantics for each construct in Verilog, the semantics are instead defined over one clock cycle and mirror the semantics defined for Verilog. Lemma~\ref{lemma:htl} shows the theorem that needs to be proven in this section.
-
-\YH{Should this section state the lemma at the end maybe? With the proof complete? Stating the forward simulation and specification first and then proving the top level lemma? Or is it better like this and I get rid of the proof of this lemma?}
-\begin{lemma}\label{lemma:htl}
- Forward simulation of the 3AC and HTL intermediate language, assuming that the translation from 3AC to HTL succeeded using \texttt{tr\_htl} for the translation.
+As HTL is quite far removed from 3AC, this first translation is the most involved and therefore requires a larger proof, because the translation from 3AC instructions to Verilog statements needs to be proven correct in this step. In addition to that, the semantics of HTL are also quite different to the 3AC semantics, as instead of defining small-step semantics for each construct in Verilog, the semantics are instead defined over one clock cycle and mirror the semantics defined for Verilog. Lemma~\ref{lemma:htl} shows the result that needs to be proven in this subsection.
+\begin{lemma}[Forward simulation from 3AC to HTL]\label{lemma:htl}
+ We write \texttt{tr\_htl} for the translation from 3AC to HTL.
\begin{equation*}
\forall c, h, B \in \texttt{Safe}, \yhfunction{tr\_htl} (c) = \yhconstant{OK} (h) \land c \Downarrow B \implies h \Downarrow B.
\end{equation*}
\end{lemma}
-\begin{proof}
- Follows by using a specification of the important properties $\yhfunction{tr\_htl}$ which can be used as an assumption in the proof instead. The forward simulation is then proven by showing that the initial states and final states between the 3AC semantics and HTL semantics match, and then showing that the simulation diagram in Lemma~\ref{lemma:simulation_diagram} holds.
-\end{proof}
+We prove this lemma by first establishing a specification of the translation function $\yhfunction{tr\_htl}$ that captures its important properties, and then splitting the proof into two parts: one to show that the translation function does indeed meet its specification, and one to show that the specification implies the desired simulation result. This strategy is in keeping with standard \compcert{} practice. % The forward simulation is then proven by showing that the initial states and final states between the 3AC semantics and HTL semantics match, and then showing that the simulation diagram in Lemma~\ref{lemma:simulation_diagram} holds.
-\subsubsection{Specification}\label{sec:proof:3ac_htl:specification}
-
-To simplify the proof, instead of using the translation algorithm as an assumption, as was done in Lemma~\ref{lemma:htl}, a specification of the translation can be constructed instead which contains all the properties that are needed to prove the correctness. For example, for the translation from 3AC to HTL, the specification for the translation of 3AC instructions into HTL data path and control logic can be defined as follows:
+\subsubsection{From Implementation to Specification}\label{sec:proof:3ac_htl:specification}
+%To simplify the proof, instead of using the translation algorithm as an assumption, as was done in Lemma~\ref{lemma:htl}, a specification of the translation can be constructed instead which contains all the properties that are needed to prove the correctness. For example, for the translation from 3AC to HTL,
+The specification for the translation of 3AC instructions into HTL data path and control logic can be defined by the following predicate:
\begin{equation*}
\yhfunction{spec\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ i\ \textit{data }\ \textit{control}
\end{equation*}
-\noindent The \textit{control} and \textit{data} parts of the specification are the statements that the current 3AC instruction $i$ should translate to. A specific example of a rule describing the translation of an \texttt{Iop} operation in 3AC is the following:
-
+\noindent Here, the \textit{control} and \textit{data} parameters are the statements that the current 3AC instruction $i$ should translate to. The other parameters are the special registers defined in Section~\ref{sec:verilog:integrating}. An example of a rule describing the translation of an arithmetic/logical operation from 3AC is the following:
\begin{equation*}
\inferrule[Iop]{\yhfunction{tr\_op } \textit{op }\ \vec{a} = \yhconstant{OK } e}{\yhfunction{spec\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ (\yhconstant{Iop } \textit{op }\ \vec{a}\ d\ n)\ (d\ \yhkeyword{<=}\ e)\ (\sigma\ \yhkeyword{<=}\ n)}
\end{equation*}
-\noindent This rule describes the specification of the translation by describing the nonblocking assignments that are added to the data path and control logic, assuming that the translation of the operator was successful and resulted in expression $e$.
+\noindent Assuming that the translation of the operator \textit{op} with operands $\vec{a}$ is successful and results in expression $e$, the rule describes how the destination register $d$ is updated to $e$ via a non-blocking assignment in the data path, and how the program counter $\sigma$ is updated to point to the next CFG node $n$ via another non-blocking assignment in the control path.
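As an illustrative instance (registers $r_{1}$, $r_{2}$ hypothetical; \yhconstant{Oadd} is \compcert{}'s addition operator), an addition instruction is assigned the following data path and control statements:

```latex
\begin{equation*}
  \yhfunction{spec\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ (\yhconstant{Iop } \yhconstant{Oadd }\ [r_{1}; r_{2}]\ d\ n)\ (d\ \yhkeyword{<=}\ r_{1} + r_{2})\ (\sigma\ \yhkeyword{<=}\ n)
\end{equation*}
```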
+
+In the following lemma, $\yhfunction{spec\_htl}$ is the top-level specification predicate, which is built using $\yhfunction{spec\_instr}$ at the level of instructions.
\begin{lemma}\label{lemma:specification}
If a 3AC program $c$ is translated correctly to an HTL program $h$, then the specification of the translation holds.
-
\begin{equation*}
\forall\ c\ h,\ \yhfunction{tr\_htl} (c) = \yhconstant{OK}(h) \implies \yhfunction{spec\_htl}\ c\ h.
\end{equation*}
\end{lemma}
-\begin{proof}
- Follows from the definition of the specification and therefore should match the implementation of the algorithm.
-\end{proof}
+%\begin{proof}
+% Follows from the definition of the specification and therefore should match the implementation of the algorithm.
+%\end{proof}
-\subsubsection{Forward simulation}
+\subsubsection{From Specification to Simulation}
-The first step in proving the forward simulation is to define a relation that matches an 3AC state to an HTL state, which shows when the states are equivalent. This relation also defines assumptions that are made about the 3AC code that we receive, so that these assumptions can be used to prove the translations correct. These assumptions then have to be proven to always hold assuming the HTL code was created by the translation algorithm. Some of these assertions that need to be made about the 3AC and HTL code for a state to match are:
+To prove that the specification predicate implies the desired forward simulation, we must first define a relation that matches each 3AC state to an equivalent HTL state. This relation also captures the assumptions made about the 3AC code that we receive from \compcert{}.% so that these assumptions can be used to prove the translations correct.
+These assumptions then have to be proven to always hold assuming the HTL code was created by the translation algorithm. Some of the assumptions that need to be made about the 3AC and HTL code for a pair of states to match are:
\begin{itemize}
- \item the 3AC register file $R \le \Gamma_{r}$ needs to be ``less defined'' than the HTL register map $\Gamma_{r}$, meaning all entries should be and the RAM values represented by a Verilog array in $\Gamma_{a}$ need to match the 3AC function's stack contents, which are part of the memory $M$: $M \le \Gamma_{a}$,
- \item the state is well formed, meaning the value of the state register matches the current value of the program counter: $\textit{pc} = \Gamma_{r}!\sigma$,
+ \item The 3AC register file $R$ needs to be `less defined' than the HTL register map $\Gamma_{r}$ (written $R \le \Gamma_{r}$). This means that each entry in $R$ should be equal to the corresponding entry in $\Gamma_{r}$, unless the entry in $R$ is undefined, in which case any value can match it.
+ \item The RAM values represented by each Verilog array in $\Gamma_{a}$ need to match the 3AC function's stack contents, which are part of the memory $M$; that is: $M \le \Gamma_{a}$.
+ \item The state is well formed, which means that the value of the state register matches the current value of the program counter; that is: $\textit{pc} = \Gamma_{r}!\sigma$.
\end{itemize}
-As well as some invariants $\mathcal{I}$ that have to hold for the current state to be valid:
+We also define the following set $\mathcal{I}$ of invariants that must hold for the current state to be valid:
\begin{itemize}
- \item all pointers in the program use the stack as a base pointer,
- \item that a load and store to a location outside of the bounds of the stack does not occur, and does not modify the \compcert{} memory. Even if it occurs in the program, as it is undefined behaviour we can prove that our behaviour is still correct given the input.
- \item that \textit{rst} and \textit{fin} are not modified and therefore stay at a constant 0 throughout execution.
+ \item that all pointers in the program use the stack as a base pointer,
+ \item that any loads or stores to locations outside of the bounds of the stack result in undefined behaviour (and hence we do not need to handle them),
+ \item that \textit{rst} and \textit{fin} are not modified and therefore stay at a constant 0 throughout execution, and
\item that the stack frames match.
\end{itemize}
-We can then define the simulation diagram for the translation, where the 3AC state can be represented by $(R,M,\textit{pc})$. The state of HTL can be represented by the maps $\Gamma_{r}$ for the current state of all registers in the module, and $\Gamma_{a}$, for the state of all arrays in the Verilog module, which represents the stack. Finally, $\mathcal{I}$ stands for the other invariants that need to hold for the states to match:
+We can now define the simulation diagram for the translation. The 3AC state can be represented by the tuple $(R,M,\textit{pc})$, which captures the register file, memory, and program counter. The HTL state can be represented by the pair $(\Gamma_{r}, \Gamma_{a})$, which captures the states of all the registers and arrays in the module. Finally, $\mathcal{I}$ stands for the other invariants that need to hold for the states to match.
\begin{lemma}\label{lemma:simulation_diagram}
- Given the 3AC state $(R,M,\textit{pc})$ and the matching HTL state $(\Gamma_{r}, \Gamma_{a})$, assuming one step in the 3AC semantics produces state $(R',M',\textit{pc}')$, there exists one or more steps in the HTL semantics that result in matching states $(\Gamma_{r}', \Gamma_{a}')$. This is all under the assumption that the specification $\yhfunction{tr\_{htl}}$ holds for the translation.
+ Given the 3AC state $(R,M,\textit{pc})$ and the matching HTL state $(\Gamma_{r}, \Gamma_{a})$, assuming one step in the 3AC semantics produces state $(R',M',\textit{pc}')$, there exist one or more steps in the HTL semantics that result in matching states $(\Gamma_{r}', \Gamma_{a}')$. This is all under the assumption that the specification $\yhfunction{tr\_{htl}}$ holds for the translation.
\begin{center}
\begin{tikzpicture}
\begin{scope}
\node[circle] (s1) at (0,1.5) {$R, M, \textit{pc}$};
- \node[circle] (r1) at (6.5,1.5) {$\Gamma_{r}, \Gamma_{a}$};
+ \node[circle] (r1) at (7.2,1.5) {$\Gamma_{r}, \Gamma_{a}$};
\node[circle] (s2) at (0,0) {$R', M', \textit{pc}'$};
- \node[circle] (r2) at (6.5,0) {$\Gamma_{r}', \Gamma_{a}'$};
- \node at (6.8,0.75) {+};
- \draw (s1) -- node[above] {$\mathcal{I} \land R \le \Gamma_{r} \land M \le \Gamma_{a} \land \textit{pc} = \Gamma_{r}!\sigma$} ++ (r1);
+ \node[circle] (r2) at (7.2,0) {$\Gamma_{r}', \Gamma_{a}'$};
+ %\node at (6.8,0.75) {+};
+ \draw (s1) -- node[above] {$\mathcal{I} \land (R \le \Gamma_{r}) \land (M \le \Gamma_{a}) \land (\textit{pc} = \Gamma_{r}!\sigma)$} ++ (r1);
\draw[-{Latex}] ($(s1.south) + (0,0.4)$) -- ($(s2.north) - (0,0.4)$);
- \draw[-{Latex},dashed] ($(r1.south) + (0,0.2)$) -- ($(r2.north) - (0,0.2)$);
- \draw[dashed] (s2) -- node[above] {$\mathcal{I} \land R' \le \Gamma_{r}' \land M' \le \Gamma_{a}' \land \textit{pc}' = \Gamma_{r}'!\sigma$} ++ (r2);
+ \draw[-{Latex},dashed] ($(r1.south) + (0,0.2)$) to[auto, pos=0.7] node {+} ($(r2.north) - (0,0.2)$);
+ \draw[dashed] (s2) -- node[above] {$\mathcal{I} \land (R' \le \Gamma_{r}') \land (M' \le \Gamma_{a}') \land (\textit{pc}' = \Gamma_{r}'!\sigma)$} ++ (r2);
\end{scope}
\end{tikzpicture}
\end{center}
\end{lemma}
-\begin{proof}
+\begin{proof}[Proof sketch]
This simulation diagram is proven by induction over the operational semantics of 3AC, which allows us to find one or more steps in the HTL semantics that will produce the same final matching state.
\end{proof}
-\YH{Need to add explanation of proof of initial and final states, as these are needed to prove the full forward simulation as well}
-
-\subsection{HTL to Verilog forward simulation}\label{sec:proof:htl_verilog}
+\subsection{Forward simulation from HTL to Verilog}\label{sec:proof:htl_verilog}
-The HTL to Verilog simulation is conceptually simple, as the only transformation is from the map representation of the code to the case statement representation. As the representations are quite different though, the proof is more involved, as the semantics of a map structure are quite different to the semantics of the case statement they are converted to.
-
-\YH{Maybe want to split this up into two lemmas? One which states the proof about the map property of uniqueness of keys, and another proving the final theorem?}
-\begin{lemma}\label{lemma:verilog}
- The forward simulation for this translation assumes that the Verilog $V$ was obtained from the HTL to Verilog translation function $\yhfunction{tr\_verilog}$, note that the translation can not fail for this step.
+The HTL-to-Verilog simulation is conceptually simple, as the only transformation is from the map representation of the code to the case-statement representation. The proof is more involved, as the semantics of a map structure are quite different to the semantics of the case-statement they are converted to.
+%\YH{Maybe want to split this up into two lemmas? One which states the proof about the map property of uniqueness of keys, and another proving the final theorem?}
+\begin{lemma}[Forward simulation from HTL to Verilog]\label{lemma:verilog}
+ We write $\yhfunction{tr\_verilog}$ for the translation from HTL to Verilog. (Note that this translation cannot fail, so we do not need the \yhconstant{OK} constructor here.)
\begin{align*}
- &\forall h, V, B \in \texttt{Safe}, \\
- &\quad\yhfunction{tr\_verilog} (h) = V \land h \Downarrow B \implies V \Downarrow B.
+ &\forall h, V, B \in \texttt{Safe}, \yhfunction{tr\_verilog} (h) = V \land h \Downarrow B \implies V \Downarrow B.
\end{align*}
\end{lemma}
-\begin{proof}
- The translation from maps to case statements is done by turning each node of the tree into a case expression with the statements in each being the same. The main difficulty for the proof is that a structure that can be directly accessed is transformed into an inductive structure where a certain number of constructors need to be called to get to the correct case. The proof of the translation from maps to case statements follows by induction over the list of elements in the map and the fact that each key will be unique. In addition to that, the statement that is currently being evaluated is guaranteed by the correctness of the list of elements to be in that list. The latter fact therefore eliminates the base case, as an empty list does not contain the element we know is in the list. The other two cases follow by the fact that either the key is equal to the evaluated value of the case expression, or it isn't. In the first case we can then evaluate the statement and get the state after the case expression, as the uniqueness of the key tells us that the key cannot show up in the list anymore. In the other case we can just apply the inductive hypothesis and remove the current case from the case statement, as it did not match.
-
- Another problem with the representation of the state as an actual register is that we have to make sure that the state does not overflow. Currently, the state register will always be 32 bits, meaning the maximum number of states can only be $2^{32} - 1$. We therefore have to prove that the state value will never go over that value. This means that during the translation we have to check for each state that it can fit into an integer. Finally, as we have to assume that there are $2^{32} - 1$ states, \vericert{} will error out when there are more instructions to be translated, which allows us to satisfy and prove that assumption correct.
+\begin{proof}[Proof sketch]
+ The translation from maps to case-statements turns each node of the tree into a case-expression containing the same statements. The main difficulty for the proof is that a random-access structure is transformed into an inductive structure, in which a certain number of constructors must be traversed to reach the correct case.
\end{proof}
+One problem with our representation of the state as an actual register is that we have to make sure that the state does not overflow. The state register always has 32 bits, and since states are numbered with positive integers starting from 1, at most $2^{32} - 1$ states can be represented.
+This means that during the translation we have to check that each state can fit into an integer. \vericert{} errors out if there are more than $2^{32} - 1$ states to be translated, thus satisfying the correctness theorem vacuously.
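As an illustrative sketch of this translation step (in Python, not the actual Coq development; `translate`, `run_case`, and `lookup` are hypothetical names), the essence of the map-to-case conversion and the state-register bound check can be modelled as follows:

```python
# Hypothetical sketch: the HTL-to-Verilog step turns a map from state
# numbers to statements into a chain of nested cases, and errors out
# if a state number does not fit the 32-bit state register.

MAX_STATE = 2**32 - 1  # states are positive, so at most 2^32 - 1 of them

def translate(state_map):
    """Turn {state: stmt} into a nested case structure
    ('case', key, stmt, rest), ending in ('default',)."""
    case = ('default',)
    for key, stmt in sorted(state_map.items(), reverse=True):
        if not (1 <= key <= MAX_STATE):
            raise ValueError(f"state {key} does not fit the state register")
        case = ('case', key, stmt, case)
    return case

def run_case(state, case):
    """Walk the inductive case structure, one constructor at a time,
    as the Verilog semantics does."""
    while case[0] == 'case':
        _, key, stmt, rest = case
        if state == key:
            return stmt
        case = rest
    return None  # no case matched

def lookup(state, state_map):
    """Direct, random-access lookup in the map."""
    return state_map.get(state)
```

The simulation proof corresponds to showing that `run_case` agrees with `lookup` for every state, using the uniqueness of the keys.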
+
\subsection{Deterministic Semantics}\label{sec:proof:deterministic}
-Finally, to prove the backward simulation given the forward simulation, it has to be shown that if we generate hardware with a specific behaviour, that it is the only possible program with that behaviour. This only has to be performed for the final intermediate language, which is Verilog, so that the backward simulation holds for the whole chain from Clight to Verilog.
+The final lemma we need is that the Verilog we generate is deterministic. This result allows us to replace the forward simulation we have proved with the backward simulation we desire.
\begin{lemma}\label{lemma:deterministic}
- Semantics are deterministic if it can be shown that two behaviours $B_{1}$ and $B_{2}$ for the same Verilog program $V$ implies that the behaviours are the same.
+ If a Verilog program $V$ admits both behaviours $B_1$ and $B_2$, then $B_1$ and $B_2$ must be the same.
\begin{equation*}
\forall V, B_{1}, B_{2}, V \Downarrow B_{1} \land V \Downarrow B_{2} \implies B_{1} = B_{2}.
\end{equation*}
\end{lemma}
-\begin{proof}
+\begin{proof}[Proof sketch]
  The Verilog semantics we use is deterministic: the order of evaluation of all constructs is fully defined, so there is only one way to evaluate the module and hence only one possible behaviour. This determinism was proven for the small-step semantics shown in Figure~\ref{fig:inferrence_module}.
\end{proof}
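The shape of this argument can be illustrated with a small sketch (in Python, hypothetical, not part of the Coq development): when the semantics is given by a total evaluation function with a fixed order of evaluation, any two behaviours of the same program necessarily coincide.

```python
# Hypothetical sketch: a semantics given by a total evaluation *function*
# (fixed evaluation order, as in our Verilog semantics) is deterministic
# by construction -- the same program admits exactly one behaviour.

def eval_expr(e):
    """Evaluate a tiny expression language with a fixed evaluation order."""
    tag = e[0]
    if tag == 'lit':
        return e[1]
    if tag == 'neg':
        return -eval_expr(e[1])
    if tag == 'add':  # left operand is always evaluated first
        return eval_expr(e[1]) + eval_expr(e[2])
    raise ValueError(f"unknown construct {tag!r}")

def admits(e, behaviour):
    """Relational view of the semantics: e admits this behaviour."""
    return eval_expr(e) == behaviour

def deterministic(e, b1, b2):
    """Shape of the lemma: V ⇓ B1 ∧ V ⇓ B2 ⟹ B1 = B2."""
    return not (admits(e, b1) and admits(e, b2)) or b1 == b2
```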
@@ -168,11 +168,11 @@ Finally, to prove the backward simulation given the forward simulation, it has t
\textbf{Total} & 1184 & 747 & 1007 & 5526 & 8464 \\
\bottomrule
\end{tabular}
- \caption{Statistics about the proof and implementation of \vericert{}.}
+  \caption{Statistics about the number of lines of code in the proof and implementation of \vericert{}.}
\label{tab:proof_statistics}
\end{table*}
-The lines of code for the implementation and proof of \vericert{} can be found in Table~\ref{tab:proof_statistics}. In general it took 1 person year to finish the implementation and proofs of \vericert{}. The main proof is the correctness proof for the HTL generation, which required the equivalence proofs between all integer operations supported by \compcert{} and the ones supported in hardware. From the 3349 lines of proof code in the HTL generation, 1189 are only for the correctness proof of the load and store instructions. These were tedious to prove correct because of the large difference in memory models used, and the need to prove properties such as writes to the outside of the allocated memory being undefined, so that a finite sized array could be used. In addition to that, as pointers in HTL and Verilog are represented as integers, whereas there is a separate pointer value in the \compcert{} semantics, it was painful to reason about them and a many new theorems had to be proven about integers and pointers in \vericert{}.
+The lines of code for the implementation and proof of \vericert{} can be found in Table~\ref{tab:proof_statistics}. Overall, it took about one person-year to build \vericert{}: about two person-months on the implementation and ten person-months on the proofs. The largest proof is the correctness proof for the HTL generation, which required equivalence proofs between all integer operations supported by \compcert{} and those supported in hardware. Of the 3349 lines of proof code for the HTL generation, 1189 are for the correctness proof of just the load and store instructions. These were tedious to prove correct because of the substantial difference between the memory models used, and the need to prove properties such as stores outside of the allocated memory being undefined, so that a finite array could be used. In addition, since pointers in HTL and Verilog are represented as integers, whereas the \compcert{} semantics has a separate pointer value, reasoning about them was painful and many new theorems about integers and pointers had to be proven in \vericert{}.
%%% Local Variables:
%%% mode: latex
diff --git a/references.bib b/references.bib
index d4cd81f..95014da 100644
--- a/references.bib
+++ b/references.bib
@@ -65,7 +65,8 @@ year = {2020},
booktitle = {{ASAP}},
pages = {91--98},
publisher = {{IEEE} Computer Society},
- year = {2016}
+ year = {2016},
+ doi = {10.1109/ASAP.2016.7760777}
}
@article{cong+11,
@@ -80,7 +81,8 @@ year = {2020},
volume = {30},
number = {4},
pages = {473--491},
- year = {2011}
+ year = {2011},
+ doi = {10.1109/TCAD.2011.2110592}
}
@inproceedings{kiwi,
@@ -90,7 +92,8 @@ year = {2020},
booktitle = {{FCCM}},
pages = {3--12},
publisher = {{IEEE} Computer Society},
- year = {2008}
+ year = {2008},
+ doi={10.1109/FCCM.2008.46}
}
@inproceedings{spatial,
@@ -105,11 +108,12 @@ year = {2020},
Ardavan Pedram and
Christos Kozyrakis and
Kunle Olukotun},
- title = {Spatial: a language and compiler for application accelerators},
+ title = {Spatial: A Language and Compiler for Application Accelerators},
booktitle = {{PLDI}},
pages = {296--311},
publisher = {{ACM}},
- year = {2018}
+ year = {2018},
+ doi = {10.1145/3192366.3192379}
}
@article{greaves_note,
@@ -211,6 +215,7 @@ year = {2020},
editor = "Steffen, Bernhard",
isbn = "978-3-540-69753-4",
publisher = "Springer",
+ doi = "10.1007/BFb0054170"
}
@article{chouksey19_trans_valid_code_motion_trans_invol_loops,
@@ -327,6 +332,7 @@ year = {2020},
editor = "Gupta, Aarti and Malik, Sharad",
isbn = "978-3-540-70545-1",
publisher = "Springer",
+ doi = "10.1007/978-3-540-70545-1_44"
}
@inproceedings{chapman92_verif_bedroc,
@@ -378,7 +384,7 @@ year = {2020},
@inproceedings{jifeng93_towar,
author = "Jifeng, He and Page, Ian and Bowen, Jonathan",
- title = "Towards a provably correct hardware implementation of occam",
+ title = "Towards a provably correct hardware implementation of Occam",
booktitle = "Correct Hardware Design and Verification Methods",
year = 1993,
pages = "214--225",
@@ -462,17 +468,19 @@ year = {2020},
year = 2011,
pages = {33--36},
publisher = {{ACM}},
+ doi = {10.1145/1950413.1950423}
}
-@inproceedings{choi+18,
- author = {Young{-}kyu Choi and
- Jason Cong},
- title = {{HLS}-based optimization and design space exploration for applications
- with variable loop bounds},
- booktitle = {{ICCAD}},
- publisher = {{ACM}},
- year = {2018}
-}
+@inproceedings{choi+18,
+ author = {Young{-}kyu Choi and
+ Jason Cong},
+ title = {{HLS}-Based Optimization and Design Space Exploration for Applications with Variable Loop Bounds},
+ booktitle = {{ICCAD}},
+ pages = {1--8},
+ publisher = {{ACM}},
+ year = {2018},
+ doi = {10.1145/3240765.3240815}
+}
+
@book{micheli94_synth_optim_digit_circuit,
author = {De Micheli, Giovanni},
@@ -547,7 +555,8 @@ year = {2020},
booktitle = {ReConFig},
pages = {1--8},
publisher = {{IEEE}},
- year = {2014}
+ year = {2014},
+ doi={10.1109/ReConFig.2014.7032504}
}
@INPROCEEDINGS{7818341,
@@ -577,7 +586,8 @@ year = {2020},
booktitle = {{FPL}},
pages = {1--4},
publisher = {{IEEE}},
- year = {2013}
+ year = {2013},
+ doi = {10.1109/FPL.2013.6645550}
}
@inproceedings{nigam20_predic_accel_desig_time_sensit_affin_types,
@@ -648,7 +658,8 @@ year = {2020},
address = "Berlin, Heidelberg",
editor = "Lau, Kung-Kiu and Banach, Richard",
isbn = "978-3-540-32250-4",
- publisher = "Springer Berlin Heidelberg"
+ publisher = "Springer Berlin Heidelberg",
+ doi = {10.1007/11576280\_20}
}
@inproceedings{slind08_brief_overv_hol4,
@@ -688,7 +699,8 @@ year = {2020},
address = "Berlin, Heidelberg",
editor = "Seidl, Helmut",
isbn = "978-3-642-28869-2",
- publisher = "Springer Berlin Heidelberg"
+ publisher = "Springer Berlin Heidelberg",
+ doi = "10.1007/978-3-642-28869-2_20"
}
@article{zhao12_formal_llvm_inter_repres_verif_progr_trans,
@@ -718,6 +730,7 @@ year = {2020},
booktitle = {Proceedings of the conference on Design, automation and test in Europe},
year = 1999,
pages = {7--es},
+ doi = {10.1109/DATE.1999.761092}
}
@inproceedings{loow19_proof_trans_veril_devel_hol,
@@ -787,7 +800,8 @@ year = {2020},
booktitle={2013 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ ISSS)},
pages={1--10},
year={2013},
- organization={IEEE}
+ organization={IEEE},
+ doi={10.1109/CODES-ISSS.2013.6659002}
}
@inproceedings{poly_hls_zhao2017,
@@ -804,7 +818,8 @@ year = {2020},
author={Pouchet, Louis-Noel and Zhang, Peng and Sadayappan, Ponnuswamy and Cong, Jason},
booktitle={Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays},
pages={29--38},
- year={2013}
+ year={2013},
+ doi={10.1145/2435264.2435273}
}
@misc{quartus,
@@ -818,24 +833,22 @@ year = {2020},
@unpublished{du_fuzzin_high_level_synth_tools,
author = {Du, Zewei and Herklotz, Yann and Ramanathan, Nadesh and Wickerson, John},
note = {unpublished},
- title = {Fuzzing High-Level Synthesis Tools},
-}
-
-
-@inproceedings{kiwi_hls,
- title={Kiwi: Synthesis of FPGA circuits from parallel programs},
- author={Singh, Satnam and Greaves, David J},
- booktitle={2008 16th International Symposium on Field-Programmable Custom Computing Machines},
- pages={3--12},
- year={2008},
- organization={IEEE}
+ title = {{Fuzzing High-Level Synthesis Tools}},
}
@inproceedings{chisel,
- title={Chisel: constructing hardware in a scala embedded language},
+ title={{Chisel: Constructing hardware in a Scala embedded language}},
author={Bachrach, Jonathan and Vo, Huy and Richards, Brian and Lee, Yunsup and Waterman, Andrew and Avi{\v{z}}ienis, Rimas and Wawrzynek, John and Asanovi{\'c}, Krste},
booktitle={DAC Design Automation Conference 2012},
pages={1212--1221},
year={2012},
- organization={IEEE}
-} \ No newline at end of file
+ organization={IEEE},
+ doi={10.1145/2228360.2228584},
+}
+
+@article{aubury1996handel,
+ title={Handel-C language reference guide},
+ author={Aubury, Matthew and Page, Ian and Randall, Geoff and Saul, Jonathan and Watts, Robin},
+ journal={Computing Laboratory. Oxford University, UK},
+ year={1996}
+}
diff --git a/related.tex b/related.tex
index 7b519e7..88cd60a 100644
--- a/related.tex
+++ b/related.tex
@@ -12,24 +12,24 @@
\tikzset{myellipse/.style={draw=none, fill opacity=0.2}}
\tikzset{myoutline/.style={draw=white, thick}}
- \draw[myellipse, fill=colorusabletool] (-1.1,1.8) ellipse (2.9 and 2.1);
+ \draw[myellipse, fill=colorusabletool] (-1.1,1.6) ellipse (3.1 and 2.3);
\draw[myellipse, fill=colorhighlevel] (1.1,1.8) ellipse (2.9 and 2.1);
\draw[myellipse, fill=colorproof] (0,0.3) ellipse (3.7 and 1.9);
- \draw[myellipse, fill=colormechanised] (0,0) ellipse (2.7 and 1.0);
+ \draw[myellipse, fill=colormechanised] (0,0) ellipse (3 and 1.0);
- \draw[myoutline] (-1.1,1.8) ellipse (2.9 and 2.1);
+ \draw[myoutline] (-1.1,1.6) ellipse (3.1 and 2.3);
\draw[myoutline] (1.1,1.8) ellipse (2.9 and 2.1);
\draw[myoutline] (0,0.3) ellipse (3.7 and 1.9);
- \draw[myoutline] (0,0) ellipse (2.7 and 1.0);
+ \draw[myoutline] (0,0) ellipse (3 and 1.0);
\node[align=center] at (0,2.8) {Standard HLS \\ tools~\cite{canis11_legup,xilinx20_vivad_high_synth,intel20_sdk_openc_applic,nigam20_predic_accel_desig_time_sensit_affin_types}};
\node[align=center] at (0,1.5) {Translation validation \\ approaches~\cite{mentor20_catap_high_level_synth,kundu08_valid_high_level_synth, clarke_kroening_yorav_03}};
\node at (0,0.5) {\bf \vericert{}};
- \node[align=left] at (-1.5,0.4) {Koika~\cite{bourgeat20_essen_blues}};
- \node[align=left] at (-1.5,0.0) {L\"o\"ow et al.~\cite{loow19_verif_compil_verif_proces}};
+ \node[align=left] at (-1.8,0.4) {Koika~\cite{bourgeat20_essen_blues}};
+ \node[align=left] at (-1.8,0.0) {L\"o\"ow et al.~\cite{loow19_verif_compil_verif_proces}};
- \node at (1.8,0.2) {Ellis~\cite{ellis08}};
- \node at (0,-0.6) {Perna et al.~\cite{perna12_mechan_wire_wise_verif_handel_c_synth}};
+ \node at (2.2,0.2) {Ellis~\cite{ellis08}};
+ \node at (-1.3,-0.4) {Perna et al.~\cite{perna12_mechan_wire_wise_verif_handel_c_synth}};
\node at (0,-1.3) {BEDROC~\cite{chapman92_verif_bedroc}};
\node[align=left] at (-2.9,-1.7) {\color{colorproof}Correctness \\ \color{colorproof}proof};
@@ -44,21 +44,11 @@
A summary of the related works can be found in Figure~\ref{fig:related_venn}, which is represented as a Venn diagram. The four categories chosen for the Venn diagram are: whether the tool is usable and available, whether it takes a high-level software language as input, whether it has a correctness proof, and whether that proof is mechanised. The goal of \vericert{} is to cover all of these categories.
-Work is being done to prove the equivalence between the generated hardware and the original behavioural description in C. An example of a tool that implements this is Mentor's Catapult~\cite{mentor20_catap_high_level_synth}, which tries to match the states in the three address code (3AC) description to states in the original C code after an unverified translation. This technique is called translation validation~\cite{pnueli98_trans}, whereby the translation that the HLS tool performed is proven to have been correct for that input, by showing that they behave in the same way for all possible inputs. Using translation validation is quite effective for proving complex optimisations such as scheduling~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth} or code motion~\cite{banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops}, however, the validation has to be run every time the high-level synthesis is performed. In addition to that, the proofs are often not mechanised or directly related to the actual implementation, meaning the verifying algorithm might be wrong and could give false positives or false negatives.
+Most practical HLS tools~\cite{canis11_legup,xilinx20_vivad_high_synth,intel20_sdk_openc_applic,nigam20_predic_accel_desig_time_sensit_affin_types} fit into the category of usable tools that take high-level inputs. At the other end of the spectrum are approaches such as BEDROC~\cite{chapman92_verif_bedroc} for which no practical tool is available, and which, even though they are described as high-level synthesis, more closely resemble today's hardware description languages.
-\JW{Some papers to cite somewhere:
- \begin{itemize}
- \item \citet{kundu08_valid_high_level_synth} have done translation validation on an HLS tool called SPARK. They don't have \emph{very} strong evidence that HLS tools are buggy; they just say ``HLS tools are large and complex software systems, often with hundreds of
- thousands of lines of code, and as with any software of this scale, they are prone
- to logical and implementation errors''. They did, however, find two bugs in SPARK as a result of their efforts, so that provides a small bit of evidence that HLS tools are buggy. \@
- \item \citet{chapman92_verif_bedroc} have proved (manually, as far as JW can tell) that the front-end and scheduler of the BEDROC synthesis system is correct. (NB:\@ Chapman also wrote a PhD thesis about this (1994) but it doesn't appear to be online.)
- \item \citet{ellis08} wrote a PhD thesis that was supervised by Cliff Jones whom you met at the VeTSS workshop thing last year. At a high level, it's about verifying a high-level synthesis tool using Isabelle, so very relevant to this project, though scanning through the pdf, it's quite hard to ascertain exactly what the contributions are. Strange that they never published a paper about the work -- it makes it very hard to find!
- \end{itemize}
-}
+Work is being done to prove the equivalence between the generated hardware and the original behavioural description in C. An example of a tool that implements this is Mentor's Catapult~\cite{mentor20_catap_high_level_synth}, which tries to match the states in the three-address code (3AC) description to states in the original C code after an unverified translation. This technique is called translation validation~\cite{pnueli98_trans}, whereby the translation that the HLS tool performed is proven to have been correct for that particular input, by showing that the source and the output behave in the same way for all possible inputs. Translation validation is quite effective for proving complex optimisations such as scheduling~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth} or code motion~\cite{banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops}; however, the validation has to be run every time the high-level synthesis is performed. In addition, the proofs are often not mechanised or directly related to the actual implementation, meaning the verifying algorithm might itself be wrong and could give false positives or false negatives.
-\YH{Amazing, thank you, yes it seems like Kundu \textit{et al.} have quite a few papers on formal verification of HLS algorithms using translation validation, as well as~\citet{karfa06_formal_verif_method_sched_high_synth}, all using the SPARK~\cite{gupta03_spark} synthesis tool. And yes thank you, will definitely cite that paper. There seem to be similar early proofs of high-level synthesis like \citet{hwang91_formal_approac_to_sched_probl} or \citet{grass94_high}. There are also the Occam papers that I need to mention too, like \citet{page91_compil_occam}, \citet{jifeng93_towar}, \citet{perna11_correc_hardw_synth} and \citet{perna12_mechan_wire_wise_verif_handel_c_synth}.}
-\JW{Well it's a good job there's no page limit on bibliographies these days then! I'll keep an eye out for more papers we could cite when I get a free moment. (No need to cite \emph{everything} of course.)}
-\YH{Yes that's true, there are too many to cite! And great thank you, that would help a lot, I have quite a few papers I still want to cite, but have to think about where they will fit in.}
+Finally, there are a few relevant mechanically verified tools. First, K\^{o}ika~\cite{bourgeat20_essen_blues} is a formally verified translation from a core subset of BlueSpec into a circuit representation which can then be printed as a Verilog design. This is a translation from a high-level hardware description language into an equivalent circuit representation, which is a different approach from HLS. \citet{loow19_proof_trans_veril_devel_hol} used a verified translation from HOL4 code describing state transitions into Verilog to design a verified processor~\cite{loow19_verif_compil_verif_proces}. Their approach translated a shallow embedding in HOL4 into a deep embedding of Verilog. Perna et al.~\cite{perna12_mechan_wire_wise_verif_handel_c_synth,perna11_correc_hardw_synth} designed a formally verified translation from a deep embedding of Handel-C~\cite{aubury1996handel} into a deep embedding of a circuit. Finally, Ellis~\cite{ellis08} implemented and reasoned about intermediate languages for software/hardware compilation, in which some parts of a program could be implemented in hardware while correctness could still be shown.
%%% Local Variables:
%%% mode: latex
diff --git a/verilog.tex b/verilog.tex
index 380bbcb..ed2e7af 100644
--- a/verilog.tex
+++ b/verilog.tex
@@ -5,7 +5,7 @@
This section describes the Verilog semantics that was chosen for the target language, including the changes that were made to the semantics to make it a suitable HLS target. The Verilog standard is quite large~\cite{06_ieee_stand_veril_hardw_descr_languag,05_ieee_stand_veril_regis_trans_level_synth}, but the syntax and semantics can be reduced to a small subset that \vericert{} needs to target.
The Verilog semantics we use is ported to Coq from a semantics written in HOL4 by \citet{loow19_proof_trans_veril_devel_hol} and used to prove the translation from HOL4 to Verilog~\cite{loow19_verif_compil_verif_proces}. % which was used to create a formal translation from a logic representation encoded in the HOL4~\cite{slind08_brief_overv_hol4} theorem prover into an equivalent Verilog design.
-This semantics is quite practical as it is restricted to a small subset of Verilog, which can nonetheless be used to model all hardware constructs one would want to use. The main features that are excluded are continuous assignment and combinational \alwaysblock{}s; these are modelled in other semantics such as that by~\citet{meredith10_veril}. %however, these are not necessarily needed, but require more complicated event queues and execution model.
+This semantics is quite practical as it is restricted to a small subset of Verilog, which can nonetheless be used to model the hardware constructs required for HLS. The main features that are excluded are continuous assignment and combinational \alwaysblock{}s; these are modelled in other semantics such as that by~\citet{meredith10_veril}.
The semantics of Verilog differs from that of regular programming languages, as Verilog describes hardware directly, which is inherently parallel, rather than an algorithm, which is usually sequential. The main construct in Verilog is the \alwaysblock{}.
A module can contain multiple \alwaysblock{}s, all of which run in parallel. These \alwaysblock{}s further contain statements such as if-statements or assignments to variables. We support only \emph{synchronous} logic, which means that the \alwaysblock{} is triggered on (and only on) the rising edge of a clock signal.
@@ -60,6 +60,7 @@ which modifies one array element using blocking assignment and then a second usi
\paragraph{Simplifying representation of bitvectors} Finally, we use 32-bit integers to represent bitvectors rather than arrays of Booleans. This is because \vericert{} (currently) only supports types represented by 32 bits.
\subsection{Integrating the Verilog Semantics into \compcert{}'s Model}
+\label{sec:verilog:integrating}
\begin{figure*}
\centering
@@ -100,11 +101,13 @@ Figure~\ref{fig:inferrence_module} shows the inference rules for moving between
Note that there is no step from \texttt{State} to \texttt{Callstate}; this is because function calls are not supported, and it is therefore impossible in our semantics to ever reach a \texttt{Callstate} except for the initial call to main. So the \textsc{Call} rule is only used at the very beginning of execution; likewise, the \textsc{Return} rule is only matched for the final return value from the main function. %as there is no rule that allocates another stack frame \textit{sf} except for the initial call to main.
Therefore, in addition to the rules shown in Figure~\ref{fig:inferrence_module}, an initial state and final state need to be defined:
-\begin{align*}
- \texttt{initial:}\quad &\forall P,\ \yhconstant{Callstate } []\ P.\texttt{main } []\\
- \texttt{final:}\quad &\forall n,\ \yhconstant{Returnstate } []\ n \implies n
-\end{align*}
-\noindent where the initial state is the \texttt{Callstate} with an empty stack frame and no arguments for the \texttt{main} function of program $P$, and the final state results in the program output of value $n$ when reaching a \texttt{Returnstate} with an empty stack frame.
+
+\begin{gather*}
+ \inferrule[Initial]{\yhfunction{is\_internal}\ (P.\texttt{main})}{\yhfunction{initial\_state}\ (\yhconstant{Callstate } []\ (P.\texttt{main})\ [])}\\
+ \inferrule[Final]{ }{\yhfunction{final\_state}\ (\yhconstant{Returnstate } []\ n)\ n}
+\end{gather*}
+
+\noindent where the initial state is the \texttt{Callstate} with an empty stack frame and no arguments for the \texttt{main} function of program $P$; this \texttt{main} function must be present in the current translation unit. The final state results in the program output of value $n$ when reaching a \texttt{Returnstate} with an empty stack frame.
%%% Local Variables:
%%% mode: latex