authorYann Herklotz <ymh15@ic.ac.uk>2020-11-21 00:03:19 +0000
committeroverleaf <overleaf@localhost>2020-11-21 00:04:19 +0000
commit15aab23c1e89e9369b631eb3cefc7fc708d6da36 (patch)
tree1fa2ba7cf2b0af2f369918959326555368fd0b1f
parent5f228d3512b2e3e39305e6f53bc873c76439e96f (diff)
Update on Overleaf.
-rw-r--r--algorithm.tex46
-rw-r--r--appendix.tex3
-rw-r--r--conclusion.tex20
-rw-r--r--evaluation.tex64
-rw-r--r--introduction.tex5
-rw-r--r--main.tex8
-rw-r--r--references.bib22
-rw-r--r--related.tex10
-rw-r--r--verilog.tex2
9 files changed, 110 insertions, 70 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 302494f..1d79eee 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -1,10 +1,10 @@
\section{Designing a verified HLS tool}
\label{sec:design}
-This section covers \JP{describes} the main architecture of the HLS tool, and the way in which the Verilog back end \JP{back-end? backend?} was added to \compcert{}. This section will also cover an example of converting a simple C program into hardware, expressed in the Verilog language.
+This section describes the main architecture of the HLS tool, and the way in which the Verilog back end \JP{back-end? backend?}\YH{I actually always used back end} was added to \compcert{}. This section will also cover an example of converting a simple C program into hardware, expressed in the Verilog language.
\paragraph{Choice of source language}
-First of all, the choice of C as the input language of \vericert{} is simply because it is what most major HLS tools use~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because C is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}. \JP{I think this could be reworded: ``C was chosen as the source language as it remains the most common source language amongst production-quality HLS tools~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because it is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}, lending a degree of practicality. The availability of \compcert{} [cite] also provides a solid basis for formally verified C compilation.''}
+C was chosen as the source language as it remains the most common source language amongst production-quality HLS tools~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because it is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}, lending a degree of practicality. The availability of \compcert{}~\cite{??} also provides a solid basis for formally verified C compilation.
%Since a lot of existing code for HLS is written in C, supporting C as an input language, rather than a custom domain-specific language, means that \vericert{} is more practical.
%An alternative was to support LLVM IR as an input language, however, to get a full work flow from a higher level language to hardware, a front end for that language to LLVM IR would also have to be verified. \JW{Maybe save LLVM for the `Choice of implementation language'?}
We considered Bluespec~\cite{nikhil04_blues_system_veril}, but decided that although it ``can be classed as a high-level language''~\cite{greaves_note}, it is too hardware-oriented to be used for traditional HLS.
@@ -15,14 +15,14 @@ We also considered using a language with built-in parallel constructs that map w
\paragraph{Choice of target language}
-Verilog~\cite{06_ieee_stand_veril_hardw_descr_languag} is an HDL that can be synthesised into logic cells which can be either placed onto a field-programmable gate array (FPGA) or turned into an application-specific integrated circuit (ASIC). Verilog was chosen as the output language for \vericert{} because it is one of the most popular HDLs and there already exist a few formal semantics for it that could be used as a target~\cite{loow19_verif_compil_verif_proces, meredith10_veril}. \JP{We've just mentioned Bluespec as a \textit{source} language and now we're saying it could be a target language: this is fine but needs comment.} \JP{``Bluespec, previously ruled out as a source language, is another possible target and there exists a formally verified translation to circuits using K\^{o}ika~\cite{bourgeat20_essen_blues}.''} Other possible targets could have been Bluespec, from which there exists a formally verified translation to circuits using K\^{o}ika~\cite{bourgeat20_essen_blues}. \JP{This needs an extra comment maybe?} % but targeting this language would not be trivial as it is not meant to be targeted by an automatic tool, instead strives to a formally verified high-level hardware description language instead.
+Verilog~\cite{06_ieee_stand_veril_hardw_descr_languag} is an HDL that can be synthesised into logic cells which can be either placed onto a field-programmable gate array (FPGA) or turned into an application-specific integrated circuit (ASIC). Verilog was chosen as the output language for \vericert{} because it is one of the most popular HDLs and there already exist a few formal semantics for it that could be used as a target~\cite{loow19_verif_compil_verif_proces, meredith10_veril}. Bluespec, previously ruled out as a source language, is another possible target and there exists a formally verified translation to circuits using K\^{o}ika~\cite{bourgeat20_essen_blues}. \JP{This needs an extra comment maybe?}\YH{Maybe about bluespec not being an ideal target language because it's quite high-level?} % but targeting this language would not be trivial as it is not meant to be targeted by an automatic tool, instead strives to a formally verified high-level hardware description language instead.
%\JW{Can we mention one or two alternatives that we considered? Bluespec or Chisel or one of Adam Chlipala's languages, perhaps?}
\paragraph{Choice of implementation language}
We chose Coq as the implementation language because of its mature support for code extraction; that is, its ability to generate OCaml programs directly from the definitions used in the theorems.
We note that other authors have had some success reasoning about the HLS process using other theorem provers such as Isabelle~\cite{ellis08}.
-The framework that was chosen for the front end was \compcert{} \JP{``\compcert{} was chosen as the front end framework''}\JP{But careful if we've already mentioned this before. Also needs a citation?}, as it is a mature framework for simulation proofs about intermediate languages, and it already provides a validated C parser~\cite{jourdan12_valid_lr_parser}.
+\compcert{} was chosen as the front end framework\JP{But careful if we've already mentioned this before. Also needs a citation?}\YH{Citation for compcert?}, as it is a mature framework for simulation proofs about intermediate languages, and it already provides a validated C parser~\cite{jourdan12_valid_lr_parser}.
The Vellvm~\cite{zhao12_formal_llvm_inter_repres_verif_progr_trans} framework was also considered because several existing HLS tools are already LLVM-based, but additional work would be required to support a high-level language like C as input.
The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\cite{kiwi}, and LLHD~\cite{schuiki20_llhd} has been recently proposed as an intermediate language for hardware design, but neither are suitable for us because they lack formal semantics.
@@ -59,13 +59,13 @@ The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\c
\end{figure}
\paragraph{Architecture of \vericert{}}
-The main work flow of \vericert{} is given in Figure~\ref{fig:rtlbranch}, which shows those parts of the translation that are performed in \compcert{}, and those which \JP{that?..} have been added.
+The main work flow of \vericert{} is given in Figure~\ref{fig:rtlbranch}, which shows those parts of the translation that are performed in \compcert{}, and those that have been added.
-\compcert{} translates Clight input into assembly output via a sequence of intermediate languages; we must decide which of these eleven languages \JP{I can only count 9 in total (including CompCert C and Asm) now that I look at the docs; I remember having this issue before...} is the most suitable starting point to begin the HLS-specific translation stages. \JP{Should we say what Clight is here? (i.e. deterministic C with pure expressions)} \JW{Yeah maybe a footnote?}
+\compcert{} translates Clight\footnote{A deterministic subset of C with pure expressions.} input into assembly output via a sequence of intermediate languages; we must decide which of these eleven languages \JP{I can only count 9 in total (including CompCert C and Asm) now that I look at the docs; I remember having this issue before...}\YH{I thought we discussed it and found 11, but I count 10 now....: Clight, C\#minor, Cminor, CminorSel, RTL, LTL, LTLin, Linear, Mach, PPC (ASM)} is the most suitable starting point to begin the HLS-specific translation stages. \JP{Should we say what Clight is here? (i.e. deterministic C with pure expressions)} \JW{Yeah maybe a footnote?}\YH{Added}
-We select CompCert's three-address code (3AC)\footnote{This is known as register transfer language (RTL) in the \compcert{} literature. `3AC' is used in this paper instead to avoid confusion with register-transfer level (RTL), which is another name for the final hardware target of the HLS tool.} as the starting point. Branching off before this point (at CminorSel or earlier) denies \compcert{} the opportunity to perform optimisations such as constant propagation and dead code elimination, which have been found useful in HLS tools as well as software compilers~\cite{cong+11}. And if we branch off after this point (at LTL or later) then CompCert has already performed register allocation to reduce the number of registers and spill some variables to the stack; this transformation is not required in HLS because there are many more registers available, and these should be used instead of RAM whenever possible. \JP{``\compcert{} performs register allocation during the translation to LTL, with some registers spilled onto the stack: this is unnecessary in HLS since as many registers as are required may be described in the output RTL.''} \JP{Maybe something about FPGAs being register-dense (so rarely a need to worry about the number of flops)?}
+We select CompCert's three-address code (3AC)\footnote{This is known as register transfer language (RTL) in the \compcert{} literature. `3AC' is used in this paper instead to avoid confusion with register-transfer level (RTL), which is another name for the final hardware target of the HLS tool.} as the starting point. Branching off before this point (at CminorSel or earlier) denies \compcert{} the opportunity to perform optimisations such as constant propagation and dead code elimination, which have been found useful in HLS tools as well as software compilers~\cite{cong+11}. And if we branch off after this point (at LTL or later) then \compcert{} has already performed register allocation to reduce the number of registers and spill some variables to the stack; this transformation is not required in HLS because there are many more registers available, and these should be used instead of RAM whenever possible. \JP{``\compcert{} performs register allocation during the translation to LTL, with some registers spilled onto the stack: this is unnecessary in HLS since as many registers as are required may be described in the output RTL.''} \JP{Maybe something about FPGAs being register-dense (so rarely a need to worry about the number of flops)?}
-3AC is also attractive because it is the closest intermediate language to LLVM IR, which is used by several existing HLS compilers \JP{We already ruled out LLVM as a starting point, so this seems like it needs further qualification.}. It has an unlimited number of pseudo-registers, and is represented as a control flow graph (CFG) where each instruction is a node with links to the instructions that can follow it. One difference between LLVM IR and 3AC is that 3AC includes operations that are specific to the chosen target architecture; we chose x86\_32 because each instruction maps well to hardware.\JP{I'm still not convinced: all instructions map well to hardware (in theory) because they're all implemented on real processors, even if they require micro-ops or multiple cycles. How about ``Targeting x86\_32 generally produces relatively dense RTL thanks to the availability of more complex addressing modes, reducing cycle counts when the HLS tool uses a naive scheduling approach'' or something like that?}
+3AC is also attractive because it is the closest intermediate language to LLVM IR, which is used by several existing HLS compilers \JP{We already ruled out LLVM as a starting point, so this seems like it needs further qualification.}\YH{Well not because it's not a good starting point, but the ecosystem in Coq isn't as good. I think it's still OK here to say that being similar to LLVM IR is an advantage?}. It has an unlimited number of pseudo-registers, and is represented as a control flow graph (CFG) where each instruction is a node with links to the instructions that can follow it. One difference between LLVM IR and 3AC is that 3AC includes operations that are specific to the chosen target architecture; we chose to target the x86\_32 backend, because it generally produces relatively dense 3AC thanks to the availability of more complex addressing modes, reducing cycle counts when the HLS tool uses a na\"{i}ve scheduling approach.
\begin{figure}
\centering
@@ -179,13 +179,12 @@ endmodule
\subsection{Translating C to Verilog, by example}
Figure~\ref{fig:accumulator_c_rtl} illustrates the translation of a program that sums the elements of an array.
-In this section we describe the stages of the \vericert{} translation process using this program as an example.
+In this section, we describe the stages of the \vericert{} translation, referring to this program as an example.
\subsubsection{Translating C to 3AC}
The first stage of the translation uses unmodified \compcert{} to transform the C input, shown in Figure~\ref{fig:accumulator_c}, into a 3AC intermediate representation, shown in Figure~\ref{fig:accumulator_rtl}.
-
-As part of this translation, \compcert{} performs such optimisations as constant propagation and dead-code elimination.\JP{We already wrote about this, remove?} Function inlining is also performed, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction. Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency. Moreover, inlining excludes support for recursive function calls, but this feature isn't supported in most other HLS tools either~\cite{davidthomas_asap16}.
+As part of this translation, all functions are inlined, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction. Although the duplication of function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency. Moreover, inlining precludes support for recursive function calls, but this feature is not supported by most other HLS tools either~\cite{davidthomas_asap16}.
%\JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling. With scheduling I think that's true, inlining actually becomes quite good.}
@@ -200,9 +199,9 @@ As part of this translation, \compcert{} performs such optimisations as constant
%\JW{I think it would be worth having a sentence to explain how the C model of memory is translated to a hardware-centric model of memory. For instance, in C we have global variables/arrays, stack-allocated variables/arrays, and heap-allocated variables/arrays (anything else?). In Verilog we have registers and RAM blocks. So what's the correspondence between the two worlds? Globals and heap-allocated are not handled, stack-allocated variables become registers, and stack-allocated arrays become RAM blocks? Am I close?}\YH{Stack allocated variables become RAM as well, so that we can deal with addresses easily and take addresses of any variable.} \JW{I see, thanks. So, in short, the only registers in your hardware designs are those that store things like the current state, etc. You generate a fixed number of registers every time you synthesis -- you don't generate extra registers to store any of the program variables. Right?}
The next translation is from 3AC to a new hardware translation language (HTL). %, which is one step towards being completely translated to hardware described in Verilog.
-This involves going from a CFG representation of the computation to a finite state machine with datapath (FSMD) representation~\cite{hwang99_fsmd}. The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers. \JP{I've become less comfortable with this term, but it's personal preference so feel free to ignore. I think `generalised finite state machine' (i.e.\ thinking of the entire `datapath' as contributing to the overall state) is more accurate.}
+This involves going from a CFG representation of the computation to a finite state machine with datapath (FSMD) representation~\cite{hwang99_fsmd}. The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers. \JP{I've become less comfortable with this term, but it's personal preference so feel free to ignore. I think `generalised finite state machine' (i.e.\ thinking of the entire `datapath' as contributing to the overall state) is more accurate.}\YH{Hmm, yes, I mainly chose FSMD because there is quite a lot of literature around it. I think for now I'll keep it but for the final draft we could maybe change it.}
%This means that the state transitions can be translated into a simple finite state machine (FSM) where each state contains data operations that update the memory and registers.
-Hence, an HTL program thus consists of two maps: control and datapath maps that express state transitions and computations respectively. \JP{Hence + thus in same sentence referring to the exact same implication.}
+Hence, an HTL program consists of two maps: a control map that expresses state transitions, and a datapath map that expresses computations.
Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The right block is the control logic that computes the next state whereas the left block updates all the registers and RAM based on the current program state.
\begin{figure*}
@@ -293,25 +292,26 @@ Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The
\caption{The FSMD for the example shown in Figure~\ref{fig:accumulator_c_rtl}, split into a data path and control logic for the next state calculation. The Update block takes the current state, current values of all registers and at most one value stored in the array, and calculates a new value that can either be stored back in the array or in a register.}\label{fig:accumulator_diagram}
\end{figure*}
+%\JP{Does it? Verilog has neither physical registers nor RAMs, just language constructs which the synthesiser might implement with registers and RAMs. We should be clear whether we're talking about the HDL representation, or the synthesised result: in our case these can be very different since we don't target any specific architectural features of an FPGA fabric of ASIC process.}
\paragraph{Translating memory}
-Typically, HLS-generated hardware consists of a sea of registers and RAM memories. \JP{Does it? Verilog has neither physical registers nor RAMs, just language constructs which the synthesiser might implement with registers and RAMs. We should be clear whether we're talking about the HDL representation, or the synthesised result: in our case these can be very different since we don't target any specific architectural features of an FPGA fabric of ASIC process.}
+Typically, HLS-generated hardware consists of a sea of registers and RAM memories.
This memory view is very different to the C memory model, so we perform the following translation.
-Variables that do not have their address taken are kept in registers. \JP{CompCert does this for us.}
+Variables that do not have their address taken are kept in registers, which correspond to the registers in 3AC.
All address-taken variables, arrays or structs are kept in RAM.
-The stack of the main function becomes a block of RAM, \JW{check this -- maybe change to 2D array of registers?} \JP{Agree, we describe an unpacked array of 32-bit vectors: the synthesiser might turn this into some kind of RAM, but this isn't guaranteed at all, and I think we found that the synthesis flow we used for evaluation put everything in registers?} as seen in Figure~\ref{fig:accumulator_diagram}.
+The stack of the main function becomes an unpacked array of 32-bit integers, which may be translated to a RAM when the hardware description is passed through a synthesis tool. A high-level overview of the architecture can be seen in Figure~\ref{fig:accumulator_diagram}.
Finally, global variables are not translated in \vericert{} at the moment.
\paragraph{Translating instructions}
-Each 3AC instruction either corresponds to a hardware construct, or does not have to be handled by the translation, such as function calls. \JP{Confusing: we never generate function calls due to inlining. We should be clear that this is because CompCert will never actually codegen a call: if it did, we would have to handle it (or more likely error out).}
+Each 3AC instruction either corresponds to a hardware construct or does not have to be handled by the translation. Function calls, for example, need not be handled, because inlining guarantees that a call instruction is never generated.
For example, state 15 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x8} being initialised to 1, after which the control flow moves to state 14. This initialisation is also encoded in HTL at state 15 in both the control and datapath always blocks, as shown in Figure~\ref{fig:accumulator_v}. Simple operator instructions are translated in a similar way. For example, in state 5, the value of the array element is added to the current sum value, which is simply translated to an addition of the equivalent registers in the HTL code.
-\paragraph{Key challenge: signedness} Note that the comparison in state 3 is signed. This is because C and Verilog handle signedness quite differently. By default, all operators and registers in Verilog (and HTL) are unsigned, so to force an operation to handle the bits as signed, both operators have to be forced to be signed. In addition to that, Verilog resizes expressions to the largest needed size by default, which can affect the result of the computation. This feature is also not supported by the Verilog semantics we adopted, and there would therefore be a mismatch between the Verilog semantics and the actual behaviour of Verilog according to the standard. To bypass this issue, braces are used to stop the Verilog simulator or synthesis tool from resizing anything inside the braces. Instead, explicit resizing is used in the semantics and operations can only be performed on two registers that
-have the same size. \JP{I feel like this is an issue of syntax? We embed a simplified syntax in Coq which elides braces etc. There isn't really a semantic mismatch (which sounds really scary and wrong).}
+\paragraph{Key challenge: signedness} Note that the comparison in state 3 is signed. C and Verilog handle signedness quite differently: by default, all operators and registers in Verilog (and HTL) are unsigned, so to force an operation to be treated as signed, both operands have to be explicitly made signed. In addition, Verilog implicitly resizes expressions to the largest needed size, which can affect the result of the computation. This feature is not supported by the Verilog semantics we adopted, so, to match the semantics to the behaviour of the simulator and synthesis tool, braces are placed around all expressions, as this prevents implicit resizing. Instead, explicit resizing is used in the semantics, and operations can only be performed on two registers that
+have the same size. %\JP{I feel like this is an issue of syntax? We embed a simplified syntax in Coq which elides braces etc. There isn't really a semantic mismatch (which sounds really scary and wrong).}
\subsubsection{Translating HTL to Verilog}
Finally, we have to translate the HTL code into proper Verilog, % and prove that it behaves the same as the 3AC according to the Verilog semantics.
-where HTL is a language that is specifically designed to represent the FSMDs we are interested in and Verilog is a general-purpose HDL.\@ So the challenge here is to translate our FSMD representation into a Verilog AST.\JP{We've already introduced both of these languages.} However, as all the instructions are already expressed in Verilog, only the maps need to be translated to valid Verilog, and correct declarations for all the variables in the program need to be added as well.
+The challenge here is to translate our FSMD representation into a Verilog AST. However, as all the instructions in HTL are already expressed as Verilog statements, only the top-level datapath and control-logic maps need to be translated to valid Verilog. To generate correct Verilog, declarations for all the variables in the program also need to be added.
Although this translation seems quite straightforward, proving that this translation is correct is complex.
All the implicit assumptions that were made in HTL need to be translated explicitly to Verilog statements and it needs to be shown that these explicit behaviours are equivalent to the assumptions made in the HTL semantics.
@@ -323,13 +323,11 @@ Other additions are the initialisation of all the variables in the code to the c
\subsection{Optimisations}
-Although we would not claim that \vericert{} is a proper `optimising' HLS compiler yet, we have nonetheless implemented a few optimisations that aim to improve the quality of the hardware designs it produces. \JP{I think saying `optimisations' in a PLDI paper implies something else? Perhaps ``..we have nonetheless made several design choices that aim..''}
+Although we would not claim that \vericert{} is a proper `optimising' HLS compiler yet, we have nonetheless made several design choices that aim to improve the quality of the hardware designs it produces.
\subsubsection{Byte- and word-addressable memories}
-\JP{``Although Verilog arrays might seem to mirror their C counterparts directly, they must be treated quite differently. To reduce the design area and avoid issues meeting timing, it is beneficial if Verilog arrays can be synthesised as RAMs, but this imposes various constraints on how Verilog arrays are used. RAMs often only allow one read and one write operation per clock cycle: to make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.''}
-
-One big difference between C and Verilog is how memory is represented. In hardware, efficient RAMs are not as available as in software, and need to be explicitly implemented by declaring arrays with specific properties. A major limitation is that RAMs often only allow one read and one write per clock cycle. So, to make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.
+One big difference between C and Verilog is how memory is represented. Although Verilog arrays might seem to mirror their C counterparts directly, they must be treated quite differently. To reduce the design area and avoid issues meeting timing, it is beneficial if Verilog arrays can be synthesised as RAMs, but this imposes various constraints on how Verilog arrays are used. RAMs often only allow one read and one write operation per clock cycle: to make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.
However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}. It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware. Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way. Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four. If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
\subsubsection{Implementing the \texttt{Oshrximm} instruction}
diff --git a/appendix.tex b/appendix.tex
index 11bfe41..defe44d 100644
--- a/appendix.tex
+++ b/appendix.tex
@@ -1,3 +1,4 @@
+\newpage
\appendix
\section{Appendix}
@@ -50,7 +51,7 @@
\inferrule[Nonblocking Array]{\yhkeyword{name}\ \textit{lhs} = \yhkeyword{OK}\ n \\ \textit{erun}\ \Gamma\ \textit{rhs}\ v_{\textit{rhs}}}{\textit{srun}\ (\Gamma_{r}, \Gamma_{a}, \Delta_{r}, \Delta_{a})\ (\textit{lhs} \Leftarrow \textit{rhs})\ (\Gamma_{r}, \Gamma_{a}, \Delta_{r} // \{n \rightarrow v_{\textit{rhs}}\}, \Delta_{a})}
\end{gather*}
\end{minipage}
- \caption{Inferrence rules for statements.}\label{fig:inferrence_statements}
+ \caption{Inference rules for statements.}\label{fig:inferrence_statements}
\end{figure*}
\begin{figure}
diff --git a/conclusion.tex b/conclusion.tex
index a53c268..755dedc 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -1,11 +1,23 @@
\section{Conclusion}
+\label{sec:conclusion}
-\vericert{} is the first fully mechanised proof of correctness for HLS in Coq, translating C into Verilog.
+We have presented \vericert{}, the first mechanically verified HLS tool for translating software in C into hardware in Verilog.
+%In this article, we present \vericert{}, the first fully mechanised proof of correctness for HLS in Coq, translating C into Verilog.
+We built \vericert{} by extending \compcert{} with a new hardware-specific intermediate language and a Verilog back end, and we verified it with respect to a semantics for Verilog due to \citet{loow19_proof_trans_veril_devel_hol}.
+%The behaviour of any C program compiled via \vericert{} is guaranteed to be translated correctly to Verilog.
+We evaluated \vericert{} against the existing \legup{} HLS tool on the Polybench/C benchmark suite.
+%Our Polybench hardware is guaranteed to preserve the C behaviour, whilst \legup{} cannot provide any such guarantee about its generated hardware.
+Currently, our hardware is \slowdownOrig$\times$ slower and \areaIncr$\times$ larger compared to \legup{}, though it is only \slowdownDiv$\times$ slower if inefficient divisions are removed.
-However, to make the tool more usable there are many more optimisations that could be implemented to get the performance closer to \legup{}. The main optimisation that should be performed is scheduling operations into the same clock cycle, so that these can run in parallel. In addition to that, another issue that \vericert{} faces is that \compcert{}'s 3AC uses many different registers, even if these are not used anymore later on. It would therefore be useful to have a register allocation algorithm that could ensure that registers could be reused appropriately, where the register allocation algorithm would find the minimum number of registers needed. Finally, another optimisation would be to support resource sharing to reduce the area that the final design uses, this could include multi-cycle operations and pipelining optimisations so that division and multiplication operators also become more efficient.
-
-Finally, it's worth considering how trustworthy \vericert{} is compared to other HLS tools. A formally verified tool does not guarantee an output for every input and it is therefore allowed to fail. However, the subset of C that will compile is well-described and the tool should therefore not fail in surprising ways. However, the guarantee that the output behaves the same as the input is strong, as it uses well-known C and existing Verilog semantics to prove the equivalence.
+There are abundant opportunities for improving \vericert{}'s performance. For instance, as discussed in Section~\ref{sec:evaluation}, simply replacing the na\"ive single-cycle division and modulo operations with C implementations increases clock frequency by $5.6\times$.
+%Going forward, we envision introducing HLS-specific optimisations that are intended to improve the hardware quality of \vericert{}, whilst maintaining correctness.
+% However, to make the tool more usable there are many more optimisations that could be implemented to get the performance closer to \legup{}.
+Beyond this, we plan to implement scheduling and loop pipelining, since these allow more operations to be packed into fewer clock cycles; recent work by \citet{six+20} indicates how such scheduling algorithms can be implemented in \compcert.
+%Another possibility is re-using registers, since \compcert{}'s 3AC does not include register allocation.
+Other optimisations include resource sharing to reduce the circuit area, and tailored hardware operators that exploit on-chip hard IP blocks and can be pipelined.
+% this could include multi-cycle operations and pipelining optimisations so that division and multiplication operators also become more efficient.
+Finally, it is worth considering how trustworthy \vericert{} is compared to other HLS tools. The guarantee of full functional equivalence between input and output that \vericert{} provides is a strong one, the semantics for the source and target languages are both well-established, and Coq is a mature and thoroughly tested system. However, \vericert{} cannot guarantee to provide an output for every valid input -- for instance, as remarked in Section~\ref{sec:proof:htl_verilog}, \vericert{} will error out if given a program with more than about four million instructions! -- but our evaluation indicates that it does not seem to error out too frequently. And of course, \vericert{} cannot guarantee that the final hardware produced will be correct, because the Verilog it generates must pass through a series of unverified tools along the way. This concern may be allayed in the future by ongoing work we are aware of to produce a verified hardware synthesis tool.
%%% Local Variables:
diff --git a/evaluation.tex b/evaluation.tex
index 1fb7162..ef4883e 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -9,15 +9,15 @@ Our evaluation is designed to answer the following three research questions.
\end{description}
\subsection{Experimental Setup}
+\label{sec:evaluation:setup}
\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 5.1 because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}.
\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the PolyBench/C benchmark suite (version 4.2.1)~\cite{polybench}, which consists of a collection of 30 numerical kernels. PolyBench/C is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
-We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, which require floats, which we do not support.
+We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, requiring floats, which we do not support.
% Interestingly, we were also unable to evaluate \texttt{cholesky} on \legup{}, since it produce an error during its HLS compilation.
%In summary, we evaluate 27 programs from the latest Polybench suite.
-
-We configured Polybench's parameters so that only integer types are used, since we do not support floats or doubles currently. We use Polybench's smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
+We configured Polybench's parameters so that only integer types are used, since we do not support floats. We use Polybench's smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
\vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including LegUp, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is overridden with our own implementation that uses repeated division and multiplications by 2. Where this change makes an appreciable difference to the performance results, we give the results for both benchmark sets.
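The alternative division described above can be sketched in C. The following is a minimal illustration of shift-based unsigned division built only from comparisons, subtractions and multiplications by 2; the function names `divu_shift` and `modu_shift` are hypothetical and not taken from the actual benchmark harness:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of a division routine built from shifts (multiplications
   by 2), subtractions and comparisons only, avoiding the single-cycle Verilog
   `/` operator. It computes the quotient bit by bit, from the most significant
   bit downwards. */
uint32_t divu_shift(uint32_t n, uint32_t d)
{
    assert(d != 0);
    uint32_t q = 0;
    for (int s = 31; s >= 0; s--) {
        /* Only consider d << s when the shift cannot overflow. */
        if (d <= (UINT32_MAX >> s) && (d << s) <= n) {
            n -= d << s;   /* subtract the shifted divisor */
            q |= 1u << s;  /* record quotient bit s */
        }
    }
    return q;
}

/* Modulo follows directly from the quotient. */
uint32_t modu_shift(uint32_t n, uint32_t d)
{
    return n - divu_shift(n, d) * d;
}
```

Each loop iteration maps to cheap shift/subtract hardware, which is why a routine of this shape can reach a much higher clock frequency than a combinational 32-bit divider.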
@@ -26,6 +26,7 @@ We configured Polybench's parameters so that only integer types are used, since
\subsection{RQ1: How fast is \vericert{}-generated hardware?}
\begin{figure}
+\definecolor{cyclecountcol}{HTML}{1b9e77}
\begin{tikzpicture}
\begin{axis}[
xmode=log,
@@ -41,7 +42,7 @@ We configured Polybench's parameters so that only integer types are used, since
%log ticks with fixed point,
]
-\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3]
+\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.6,cyclecountcol]
table [x=legupcycles, y=vericertcycles, col sep=comma]
{results/poly.csv};
@@ -50,11 +51,13 @@ We configured Polybench's parameters so that only integer types are used, since
\end{axis}
\end{tikzpicture}
-\caption{A comparison of the cycle count of hardware designs generated by \vericert{} and by \legup{}, where the diagonal represents $y=x$.}
+\caption{A comparison of the cycle count of hardware designs generated by \vericert{} and by \legup{}.}
\label{fig:comparison_cycles}
\end{figure}
\begin{figure}
+\definecolor{polycol}{HTML}{e6ab02}
+\definecolor{polywocol}{HTML}{7570b3}
\begin{tikzpicture}
\begin{axis}[
xmode=log,
@@ -67,23 +70,29 @@ We configured Polybench's parameters so that only integer types are used, since
xmax=1000000,
ymax=1000000,
ymin=10,
+ legend pos=south east,
%log ticks with fixed point,
]
+
-\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3,color=red]
+\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.8, polycol]
+ table [x expr={\thisrow{legupcycles}/\thisrow{legupfreqMHz}}, y expr={\thisrow{vericertcycles}/\thisrow{vericertoldfreqMHz}}, col sep=comma]
+ {results/poly.csv};
+
+\addlegendentry{PolyBench}
+
+\addplot[draw=none, mark=o, fill opacity=0, polywocol]
table [x expr={\thisrow{legupcycles}/\thisrow{legupfreqMHz}}, y expr={\thisrow{vericertcycles}/\thisrow{vericertfreqMHz}}, col sep=comma]
{results/poly.csv};
-\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3, color=blue]
- table [x expr={\thisrow{legupcycles}/\thisrow{legupfreqMHz}}, y expr={\thisrow{vericertcycles}/\thisrow{vericertoldfreqMHz}}, col sep=comma]
- {results/poly.csv};
+\addlegendentry{PolyBench w/o division}
\addplot[dotted, domain=10:1000000]{x};
%\addplot[dashed, domain=10:10000]{9.02*x + 442};
\end{axis}
\end{tikzpicture}
-\caption{A comparison of the execution time of hardware designs generated by \vericert{} and by \legup{}, where the diagonal represents $y=x$.}
+\caption{A comparison of the execution time of hardware designs generated by \vericert{} and by \legup{}.}
\label{fig:comparison_time}
\end{figure}
@@ -101,15 +110,12 @@ It is notable that even without \vericert{} performing many optimisations, a few
%We are very encouraged by these data points.
As we improve \vericert{} by incorporating further optimisations, this gap should reduce whilst preserving our correctness guarantees.
-Cycle count is one factor in calculating execution times; the other is the clock frequency, which determines the duration of each of these cycles. Figure~\ref{fig:comparison_time} compares the execution times of \vericert{} and \legup{}. Across the original Polybench/C benchmarks, we see that \vericert{} designs are about $60\times$ slower than \legup{} designs. This dramatic discrepancy in performance can be largely attributed to the division
-
-As mentioned earlier, we modify the Polybench programs to utilise C-based divides and modulos. We had to avoid using the built-in Verilog divides and modulos, since Quartus interprets them as single-cycle operations. This interpretation affects clock frequency drastically. On average, when using the built-in Verilog approach, \vericert{}'s clock frequency was 21MHz, compared to \legup{}'s clock frequency of 247MHz. By moving to the C-based approach, our average clock frequency is now 112MHz. Hence, we reduce the frequency gap from approximately $12\times$ to $2\times$. This gap still exists because \legup{} uses various optimisation tactics and Intel-specific IPs, which requires further engineering effort and testing from our side.
-
-% The difference in cycle counts shows the degree of parallelism that \legup{}'s scheduling and memory system can offer. However, their Verilog generation is not guaranteed to be correct. Although the runtime LegUp outputs are tested to be correct for these programs, this does not provide any confidence on the correctness of Verilog generation of any other programs. Our Coq proof mechanisation guarantees that generated Verilog is correct for any input program that uses the subset of \compcert{} instructions that we have proven to be correct.
+Cycle count is one factor in calculating execution times; the other is the clock frequency, which determines the duration of each of these cycles. Figure~\ref{fig:comparison_time} compares the execution times of \vericert{} and \legup{}. Across the original Polybench/C benchmarks, we see that \vericert{} designs are about \slowdownOrig$\times$ slower than \legup{} designs. This dramatic discrepancy in performance can be largely attributed to \vericert{}'s na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 21MHz, while \legup{} managed about 247MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency becomes about 112MHz. This is better, but still substantially below \legup{}, which uses various additional optimisations and Intel-specific IP blocks. Across the modified Polybench/C benchmarks, we see that \vericert{} designs are about \slowdownDiv$\times$ slower than \legup{} designs.
\subsection{RQ2: How area-efficient is \vericert{}-generated hardware?}
\begin{figure}
+\definecolor{resourceutilcol}{HTML}{e7298a}
\begin{tikzpicture}
\begin{axis}[
height=80mm,
@@ -120,7 +126,7 @@ As mentioned earlier, we modify the Polybench programs to utilise C-based divide
xmax=1, ymax=30,
]
-\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3]
+\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.6,resourceutilcol]
table [x expr=(\thisrow{legupluts}/427200*100), y expr=(\thisrow{vericertluts}/427200*100), col sep=comma]
{results/poly.csv};
@@ -128,27 +134,25 @@ As mentioned earlier, we modify the Polybench programs to utilise C-based divide
\end{axis}
\end{tikzpicture}
-\caption{A comparison of the resource utilisation of designs generated by \vericert{} and by \legup{}}
+\caption{A comparison of the resource utilisation of designs generated by \vericert{} and by \legup{}.}
\label{fig:comparison_area}
\end{figure}
-Figure~\ref{fig:comparison_area} compares the resource utilisation of Polybench programs generated by \vericert{} and \legup{}.
-On average, \vericert{} produces hardware that is $21\times$ larger than \legup{} for the same Polybench programs.
-\vericert{} designs are also filling up to 30\% of a large FPGA chip, which indicates that unoptimised Verilog generation can be costly.
-The key reason for this behaviour, is that absence of RAM inference for the \vericert{} designs.
-RAM are dedicated hardware blocks on the FPGA that are customly designed to allow for array synthesis.
-Synthesis tools like Quartus generally require array accesses to be in a certain template or form to enable inference of RAM use.
-\legup{}'s Verilog generation is tailored to enable RAM inference by Quartus.
-\vericert{}'s array accesses in Verilog are more generic, allowing us to target different FPGA synthesis tools and vendors, but underfits for Quartus requirements.
-For a fair comparison, we chose Quartus for these experiments because LegUp supports Quartus efficiently.
+Figure~\ref{fig:comparison_area} compares the resource utilisation of the Polybench programs generated by \vericert{} and \legup{}.
+On average, we see that \vericert{} produces hardware that is about $21\times$ larger than \legup{}. \vericert{} designs fill up to 30\% of a (large) FPGA chip, while \legup{} uses no more than 1\% of the chip.
+The main reason for this is that RAM is not inferred automatically for the Verilog that is generated by \vericert{}; instead, large arrays of registers are synthesised.
+Synthesis tools such as Quartus generally require array accesses to be in a specific form in order for RAM inference to activate.
+\legup{}'s Verilog generation is tailored to enable RAM inference by Quartus, while \vericert{} generates more generic array accesses. This may make \vericert{} more portable across different FPGA synthesis tools and vendors.
+%For a fair comparison, we chose Quartus for these experiments because LegUp supports Quartus efficiently.
% Consequently, on average, \legup{} designs use $XX$ RAMs whereas \vericert{} use none.
-Improving RAM inference is part of our future plans.
+Enabling RAM inference is part of our future plans.
% We see that \vericert{} designs use between 1\% and 30\% of the available logic on the FPGA, averaging at around 10\%, whereas LegUp designs all use less than 1\% of the FPGA, averaging at around 0.45\%. The main reason for this is mainly because RAM is not inferred automatically for the Verilog that is generated by \vericert{}. Other synthesis tools can infer the RAM correctly for \vericert{} output, so this issue could be solved by either using a different synthesis tool and targeting a different FPGA, or by generating the correct template which allows Quartus to identify the RAM automatically.
\subsection{RQ3: How long does \vericert{} take to produce hardware?}
\begin{figure}
+\definecolor{compiltimecol}{HTML}{66a61e}
\begin{tikzpicture}
\begin{axis}[
height=80mm,
@@ -164,7 +168,7 @@ Improving RAM inference is part of our future plans.
ymax=0.20,
]
-\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3]
+\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.6,compiltimecol]
table [x=legupcomptime, y=vericertcomptime, col sep=comma]
{results/poly.csv};
@@ -176,9 +180,9 @@ Improving RAM inference is part of our future plans.
\label{fig:comparison_comptime}
\end{figure}
-Figure~\ref{fig:comparison_comptime} compares the compilation times of \vericert{} and by \legup{}, with each data point corresponding to one of the PolyBench/C benchmarks. On average, \vericert{} compilation is about $47\times$ faster than \legup{} compilation. \vericert{} is much faster because it omits many complext time-consuming HLS optimisations performed by \legup{}, such as scheduling and memory analysis. This comparison also shows our approach do not add any significant overheads in compilation time, since we do not invoke verification for every compilation instance.
+Figure~\ref{fig:comparison_comptime} compares the compilation times of \vericert{} and of \legup{}, with each data point corresponding to one of the PolyBench/C benchmarks. On average, \vericert{} compilation is about $47\times$ faster than \legup{} compilation. \vericert{} is much faster because it omits many of the time-consuming HLS optimisations performed by \legup{}, such as scheduling and memory analysis. This comparison also demonstrates that our fully verified approach does not add substantial overheads in compilation time, since we do not invoke verification for every compilation instance, unlike the approaches based on translation validation that we mentioned in Section~\ref{sec:intro}.
-\NR{Do we want to finish the section off with some highlights or a summary?}
+%\NR{Do we want to finish the section off with some highlights or a summary?}
%%% Local Variables:
%%% mode: latex
diff --git a/introduction.tex b/introduction.tex
index 865dfd1..af148ac 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -1,4 +1,5 @@
\section{Introduction}
+\label{sec:intro}
%% Motivation for why HLS might be needed
@@ -8,7 +9,7 @@
\paragraph{Can you trust your high-level synthesis tool?}
-As latency, throughput and energy efficiency become increasingly important, custom hardware accelerators are being designed for numerous applications~\cite{??}.
+As latency, throughput and energy efficiency become increasingly important, custom hardware accelerators are being designed for numerous applications.
Alas, designing these accelerators can be a tedious and error-prone process using a hardware description language (HDL) such as Verilog.
An attractive alternative is high-level synthesis (HLS), in which hardware designs are automatically compiled from software written in a language like C.
Modern HLS tools such as \legup{}~\cite{canis11_legup}, Vivado HLS~\cite{xilinx20_vivad_high_synth}, Intel i++~\cite{intel_hls}, and Bambu HLS~\cite{bambu_hls} promise designs with comparable performance and energy-efficiency to hand-written HDL designs~\cite{homsirikamol+14, silexicahlshdl, 7818341}, while offering the convenient abstractions and rich ecosystems of software development.
@@ -41,7 +42,7 @@ The contributions of this paper are as follows:
\begin{itemize}
\item We present \vericert{}, the first mechanically verified HLS tool that compiles C to Verilog. In Section~\ref{sec:design}, we describe the design of \vericert{}.
\item We prove \vericert{} correct w.r.t. an existing semantics for Verilog due to \citet{loow19_proof_trans_veril_devel_hol}. In Section~\ref{sec:verilog}, we describe how we lightly extended this semantics to make it suitable as an HLS target. Then, in Section~\ref{sec:proof}, we describe the proof.
- \item We evaluate \vericert{} on the Polybench/C benchmark suite~\cite{polybench}, and compare the performance of our generated hardware against and the well-known but unverified \legup{} HLS tool~\cite{canis11_legup}. In Section~\ref{sec:evaluation}, we show that \vericert{}-generated Polybench hardware is 10x slower and 21x larger than, HLS-optimised but unverified, \legup{} hardware. We intend to bridge this performance gap in the future by introducing HLS optimisations of our own, such as scheduling and memory analysis.
+ \item We evaluate \vericert{} on the Polybench/C benchmark suite~\cite{polybench}, and compare the performance of our generated hardware against the well-known but unverified \legup{} HLS tool~\cite{canis11_legup}. In Section~\ref{sec:evaluation}, we show that \vericert{}-generated Polybench hardware is \slowdownOrig$\times$ slower and \areaIncr$\times$ larger than HLS-optimised but unverified \legup{} hardware. We intend to bridge this performance gap in the future by introducing HLS optimisations of our own, such as scheduling and memory analysis.
\end{itemize}
\vericert{} is fully open source and available online.
diff --git a/main.tex b/main.tex
index 9cfe5f1..06667b5 100644
--- a/main.tex
+++ b/main.tex
@@ -47,6 +47,7 @@
\usepackage{microtype}
\usetikzlibrary{shapes,calc,arrows.meta}
+\pgfplotsset{compat=1.16}
\setminted{fontsize=\small}
\usemintedstyle{manni}
@@ -87,6 +88,11 @@
\newcommand{\compcert}{Comp\-Cert}
\newcommand{\legup}{Leg\-Up}
+\newcommand{\slowdownOrig}{56}
+\newcommand{\slowdownDiv}{10}
+\newcommand{\areaIncr}{21}
+
+
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem*{remark}{Remark}
@@ -214,7 +220,7 @@
%% Appendix
-\input{appendix}
+%\input{appendix}
\end{document}
diff --git a/references.bib b/references.bib
index 7303a48..6551126 100644
--- a/references.bib
+++ b/references.bib
@@ -15,6 +15,14 @@
publisher = {ACM},
}
+@article{six+20,
+ author={Cyril Six and Sylvain Boulm\'e and David Monniaux},
+ title = {{Certified and efficient instruction scheduling: Application to interlocked {VLIW} processors}},
+ journal = {Proc. {ACM} Program. Lang.},
+ number = {{OOPSLA}},
+ year = {2020}
+ }
+
@inproceedings{lidbury15_many_core_compil_fuzzin,
author = {Lidbury, Christopher and Lascu, Andrei and Chong, Nathan and Donaldson,
Alastair F.},
@@ -659,7 +667,7 @@ year = {2020},
editor = "Lau, Kung-Kiu and Banach, Richard",
isbn = "978-3-540-32250-4",
publisher = "Springer Berlin Heidelberg",
- doi = {0.1007/11576280\_20}
+  doi = {10.1007/11576280_20}
}
@inproceedings{slind08_brief_overv_hol4,
@@ -833,7 +841,7 @@ year = {2020},
@unpublished{du_fuzzin_high_level_synth_tools,
author = {Du, Zewei and Herklotz, Yann and Ramanathan, Nadesh and Wickerson, John},
note = {Unpublished},
- url = {https://yannherklotz.com/docs/drafts/fuzzing_hls.pdf}
+ url = {https://yannherklotz.com/docs/drafts/fuzzing_hls.pdf},
title = {{Fuzzing High-Level Synthesis Tools}},
}
@@ -853,3 +861,13 @@ year = {2020},
journal={Computing Laboratory. Oxford University, UK},
year={1996}
}
+
+@inproceedings{clarke03_behav_c_veril,
+ author = {E. {Clarke} and D. {Kroening} and K. {Yorav}},
+ title = {Behavioral consistency of {C} and {Verilog} programs using bounded model checking},
+ booktitle = {Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451)},
+ year = 2003,
+ pages = {368-371},
+ doi = {10.1145/775832.775928},
+ url = {https://doi.org/10.1145/775832.775928},
+}
diff --git a/related.tex b/related.tex
index 88cd60a..e0cefa3 100644
--- a/related.tex
+++ b/related.tex
@@ -23,7 +23,7 @@
\draw[myoutline] (0,0) ellipse (3 and 1.0);
\node[align=center] at (0,2.8) {Standard HLS \\ tools~\cite{canis11_legup,xilinx20_vivad_high_synth,intel20_sdk_openc_applic,nigam20_predic_accel_desig_time_sensit_affin_types}};
- \node[align=center] at (0,1.5) {Translation validation \\ approaches~\cite{mentor20_catap_high_level_synth,kundu08_valid_high_level_synth, clarke_kroening_yorav_03}};
+ \node[align=center] at (0,1.5) {Translation validation \\ approaches~\cite{mentor20_catap_high_level_synth,kundu08_valid_high_level_synth,clarke03_behav_c_veril}};
\node at (0,0.5) {\bf \vericert{}};
\node[align=left] at (-1.8,0.4) {Koika~\cite{bourgeat20_essen_blues}};
\node[align=left] at (-1.8,0.0) {L\"o\"ow et al.~\cite{loow19_verif_compil_verif_proces}};
@@ -42,13 +42,13 @@
\caption{Summary of related work}\label{fig:related_venn}
\end{figure}
-A summary of the related works can be found in Figure~\ref{fig:related_venn}, which is represented as a Venn diagram. The categories that were chosen for the Venn diagram are if the tool is usable and available, if it takes a high-level software language as input, if it has a correctness proof and finally if that proof is mechanised. The goal of \vericert{} is to cover all of these categories.
+A summary of the related work can be found in Figure~\ref{fig:related_venn}, which is presented as a Venn diagram. The categories chosen for the Venn diagram are: whether the tool is usable and available, whether it takes a high-level software language as input, whether it has a correctness proof, and finally whether that proof is mechanised. The goal of \vericert{} is to cover all of these categories.
-Most practical HLS tools~\cite{canis11_legup,xilinx20_vivad_high_synth,intel20_sdk_openc_applic,nigam20_predic_accel_desig_time_sensit_affin_types} fit into the category of usable tools that take high-level inputs. On the other spectrum, there are tools such as BEDROC~\cite{chapman92_verif_bedroc} for which there is no practical tool, and even though it is described as high-level synthesis, it resembles today's hardware description tools more.
+Most practical HLS tools~\cite{canis11_legup,xilinx20_vivad_high_synth,intel20_sdk_openc_applic,nigam20_predic_accel_desig_time_sensit_affin_types} fit into the category of usable tools that take high-level inputs. At the other end of the spectrum are tools such as BEDROC~\cite{chapman92_verif_bedroc}, for which no practical implementation is available, and which, even though it is described as high-level synthesis, more closely resembles today's hardware synthesis tools.
-Work is being done to prove the equivalence between the generated hardware and the original behavioural description in C. An example of a tool that implements this is Mentor's Catapult~\cite{mentor20_catap_high_level_synth}, which tries to match the states in the three address code (3AC) description to states in the original C code after an unverified translation. This technique is called translation validation~\cite{pnueli98_trans}, whereby the translation that the HLS tool performed is proven to have been correct for that input, by showing that they behave in the same way for all possible inputs. Using translation validation is quite effective for proving complex optimisations such as scheduling~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth} or code motion~\cite{banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops}, however, the validation has to be run every time the high-level synthesis is performed. In addition to that, the proofs are often not mechanised or directly related to the actual implementation, meaning the verifying algorithm might be wrong and could give false positives or false negatives.
+Ongoing work in translation validation~\cite{pnueli98_trans} seeks to prove equivalence between the hardware generated by an HLS tool and the original behavioural description in C. An example of a tool that implements this is Mentor's Catapult~\cite{mentor20_catap_high_level_synth}, which tries to match the states in the 3AC description to states in the original C code after an unverified translation. Using translation validation is quite effective for verifying complex optimisations such as scheduling~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth} or code motion~\cite{banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops}, but the validation has to be run every time HLS is performed. In addition, the proofs are often neither mechanised nor directly related to the actual implementation, meaning the verifying algorithm might be wrong and hence could give false positives or false negatives.
-Finally, there are a few relevant mechanically verified tools. First, K\^{o}ika is a formally verified translation from a core subset of BlueSpec into a circuit representation which can then be printed as a Verilog design. This is a translation from a high-level hardware description language into an equivalent circuit representation, which is a different approach to HLS. \citet{loow19_proof_trans_veril_devel_hol} used a verified translation from HOL4 code describing state transitions into Verilog to design a verified processor~\cite{loow19_verif_compil_verif_proces}. Their approach translated a shallow embedding in HOL4 into a deep embedding of Verilog. Perna et al.\cite{perna12_mechan_wire_wise_verif_handel_c_synth,perna11_correc_hardw_synth} designed a formally verified translation from a deep embedding of Handel-C~\cite{aubury1996handel}, which is translated to a deep embedding of a circuit. Finally, Ellis' work~\cite{ellis08} implemented and reasoned about intermediate languages for software/hardware compilation, where parts could be implemented in hardware and the correctness could still be shown.
+Finally, there are a few relevant mechanically verified tools. First, K\^{o}ika is a formally verified translation from a core fragment of BlueSpec into a circuit representation which can then be printed as a Verilog design. This is a translation from a high-level hardware description language into an equivalent circuit representation, so is a different approach to HLS. \citet{loow19_proof_trans_veril_devel_hol} used a verified translation from HOL4 code describing state transitions into Verilog to design a verified processor~\cite{loow19_verif_compil_verif_proces}. Their approach translated a shallow embedding in HOL4 into a deep embedding of Verilog. Perna et al.~\cite{perna12_mechan_wire_wise_verif_handel_c_synth,perna11_correc_hardw_synth} designed a formally verified translation from a deep embedding of Handel-C~\cite{aubury1996handel}, which is translated to a deep embedding of a circuit. Finally, Ellis~\cite{ellis08} used Isabelle to implement and reason about intermediate languages for software/hardware compilation, where parts could be implemented in hardware and the correctness could still be shown.
%%% Local Variables:
%%% mode: latex
diff --git a/verilog.tex b/verilog.tex
index ed2e7af..688a472 100644
--- a/verilog.tex
+++ b/verilog.tex
@@ -5,7 +5,7 @@
This section describes the Verilog semantics that was chosen for the target language, including the changes that were made to the semantics to make it a suitable HLS target. The Verilog standard is quite large~\cite{06_ieee_stand_veril_hardw_descr_languag,05_ieee_stand_veril_regis_trans_level_synth}, but the syntax and semantics can be reduced to a small subset that \vericert{} needs to target.
The Verilog semantics we use is ported to Coq from a semantics written in HOL4 by \citet{loow19_proof_trans_veril_devel_hol} and used to prove the translation from HOL4 to Verilog~\cite{loow19_verif_compil_verif_proces}. % which was used to create a formal translation from a logic representation encoded in the HOL4~\cite{slind08_brief_overv_hol4} theorem prover into an equivalent Verilog design.
-This semantics is quite practical as it is restricted to a small subset of Verilog, which can nonetheless be used to model all hardware constructs one would want to use. \JP{``can nonetheless be used to model the hardware constructs required for HLS.''} The main features that are excluded are continuous assignment and combinational \alwaysblock{}s; these are modelled in other semantics such as that by~\citet{meredith10_veril}. %however, these are not necessarily needed, but require more complicated event queues and execution model.
+This semantics is quite practical as it is restricted to a small subset of Verilog, which can nonetheless be used to model the hardware constructs required for HLS. The main features that are excluded are continuous assignment and combinational \alwaysblock{}s; these are modelled in other semantics such as that by~\citet{meredith10_veril}. %however, these are not necessarily needed, but require more complicated event queues and execution model.
The semantics of Verilog differs from regular programming languages, as it is used to describe hardware directly, which is inherently parallel, rather than an algorithm, which is usually sequential. The main construct in Verilog is the \alwaysblock{}.
A module can contain multiple \alwaysblock{}s, all of which run in parallel. These \alwaysblock{}s further contain statements such as if-statements or assignments to variables. We support only \emph{synchronous} logic, which means that the \alwaysblock{} is triggered on (and only on) the rising edge of a clock signal.