5 files changed, 84 insertions, 34 deletions
diff --git a/algorithm.tex b/algorithm.tex
index c38d7e0..52f3f7a 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -7,15 +7,17 @@ This section covers the main architecture of the HLS tool, and the way in which
 First of all, the choice of C for the input language of \vericert{} is simply because it is what most major HLS tools use~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because C is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}.
 %Since a lot of existing code for HLS is written in C, supporting C as an input language, rather than a custom domain-specific language, means that \vericert{} is more practical. 
 %An alternative was to support LLVM IR as an input language, however, to get a full work flow from a higher level language to hardware, a front end for that language to LLVM IR would also have to be verified. \JW{Maybe save LLVM for the `Choice of implementation language'?}
-\JW{We considered Bluespec~\cite{bluespec}, but decided that although it ``can be classed as a high-level language''~\cite{greaves_note}, it is too hardware-oriented to be used for traditional HLS.
-We also considered using a language with built-in parallel constructs that map well to parallel hardware, such as occam~\cite{page91_compil_occam} or Spatial~\cite{spatial}, but found these languages are too niche.}
+We considered Bluespec~\cite{nikhil04_blues_system_veril}, but decided that although it ``can be classed as a high-level language''~\cite{greaves_note}, it is too hardware-oriented to be used for traditional HLS.
+We also considered using a language with built-in parallel constructs that map well to parallel hardware, such as occam~\cite{page91_compil_occam} or Spatial~\cite{spatial}, but found these languages are too niche.
 % However, this would not qualify as being HLS due to the manual parallelism that would have to be performed. \JW{I don't think the presence of parallelism stops it being proper HLS.} 
 %\JP{I think I agree with Yann here, but it could be worded better. At any rate not many people have experience writing what is essentially syntactic sugar over a process calculus.} 
 %\JW{I mean: there are plenty of software languages that involve parallel constructs. Anyway, perhaps we can just dismiss occam for being too obscure.}
 
 
 \paragraph{Choice of target language}
-Verilog~\cite{06_ieee_stand_veril_hardw_descr_languag} is an HDL that can be synthesised into logic cells which can be either placed onto a field-programmable gate array (FPGA) or turned into an application-specific integrated circuit (ASIC).  Verilog was chosen as the output language for \vericert{} because it is one of the most popular HDLs and there already exist a few formal semantics for it that could be used as a target~\cite{loow19_verif_compil_verif_proces, meredith10_veril}. Other possible targets could have been Bluespec, a higher level hardware description language, for which there exists a formally verified translation to circuits using K\^{o}ika~\cite{bourgeat20_essen_blues}, however, targeting this language would not be trivial as it is not meant to be targeted by an automatic tool.  Finally, a custom circuit language could also have been targeted, which can then be translated to Verilog in an unverified way, however, some guarantees would be lost and it would not be possible to completely trust the output. \JP{Is this meant to be a dig at Koika :p It feels a bit odd mentioning Koika and then saying this without any comment or evaluation.}\YH{Yeah not at all, I guess it could have been another avenue to go down, don't know how to not make it sound like a dig, it's quite likely an author of koika might review this :)} \JP{What about that IR from ETH (I think) at PLDI 2020? Obviously didn't exist at the start, but might be worth comment.}\YH{Yeah, LLHD might be worth mentioning, but I think we can just say lack of formalisation makes it difficult} %\JW{Can we mention one or two alternatives that we considered? Bluespec or Chisel or one of Adam Chlipala's languages, perhaps?}
+Verilog~\cite{06_ieee_stand_veril_hardw_descr_languag} is an HDL that can be synthesised into logic cells which can be either placed onto a field-programmable gate array (FPGA) or turned into an application-specific integrated circuit (ASIC).  Verilog was chosen as the output language for \vericert{} because it is one of the most popular HDLs and there already exist a few formal semantics for it that could be used as a target~\cite{loow19_verif_compil_verif_proces, meredith10_veril}. Other possible targets could have been Bluespec, a higher level hardware description language, for which there exists a formally verified translation to circuits using K\^{o}ika~\cite{bourgeat20_essen_blues}, however, targeting this language would not be trivial as it is not meant to be targeted by an automatic tool, instead strives to a formally verified high-level hardware description language instead.  Finally, an intermediate language like LLHD~\cite{schuiki20_llhd} was also considered, however, currently there are no existing formal semantics.
+
+%\JW{Can we mention one or two alternatives that we considered? Bluespec or Chisel or one of Adam Chlipala's languages, perhaps?}
 
 \paragraph{Choice of implementation language}
 The framework that was chosen for the frontend was \compcert{}, as it is a mature framework for simulation proofs about intermediate languages, in addition to already providing a validated parser~\cite{jourdan12_valid_lr_parser} from C into the internal representation of Clight.  Other frameworks were also considered, such as Vellvm~\cite{zhao12_formal_llvm_inter_repres_verif_progr_trans}, as LLVM IR in particular is often used by HLS tools anyways, however, these would require more work to support a higher level language such as C as input, or even providing a parser for LLVM IR.\@ \JP{No mention of Coq here.} \JW{Could also mention Kiwi here~\cite{kiwi} as an example of an HLS tool built upon the .NET framework.}
@@ -110,7 +112,9 @@ Using the simple accumulator program shown in Figure~\ref{fig:accumulator_c} as
 
 \subsubsection{Translating C to 3AC}
 
-The first step of the translation is to use \compcert{} to transform the input C code into the 3AC shown in Figure~\ref{fig:accumulator_rtl}. As part of this, \compcert{} performs optimisations such as constant propagation and dead-code elimination.  Function inlining is also performed, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  The duplication of the function bodies caused by inlining does affect the total area of the hardware \JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're  not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling.  With scheduling I think that's true, inlining actually becomes quite good.}, however it improves the latency of the hardware.  In addition to that, inlining removes the possibility of supporting recursive function calls, however, this is a feature that isn't supported in most other HLS tools either.
+The first step of the translation is to use \compcert{} to transform the input C code into the 3AC shown in Figure~\ref{fig:accumulator_rtl}. As part of this, \compcert{} performs optimisations such as constant propagation and dead-code elimination.  Function inlining is also performed, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  The duplication of the function bodies caused by inlining does affect the total area of the hardware, however, latency is improved instead.  In addition to that, inlining removes the possibility of supporting recursive function calls, however, this is a feature that isn't supported in most other HLS tools either.
+
+%\JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're  not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling.  With scheduling I think that's true, inlining actually becomes quite good.}
 
 \subsubsection{Translating 3AC to HTL}
 
@@ -119,8 +123,10 @@ The first step of the translation is to use \compcert{} to transform the input C
 %   + TODO Clarify connection between CFG and FSMD
 
 %   + TODO Explain how memory is mapped
+%\JW{I feel like this could use some sort of citation, but I'm not sure what. I guess this is all from "Hardware Design 101", right?}\YH{I think I found a good one actually, which goes over the basics.}
+%\JW{I think it would be worth having a sentence to explain how the C model of memory is translated to a hardware-centric model of memory. For instance, in C we have global variables/arrays, stack-allocated variables/arrays, and heap-allocated variables/arrays (anything else?). In Verilog we have registers and RAM blocks. So what's the correspondence between the two worlds? Globals and heap-allocated are not handled, stack-allocated variables become registers, and stack-allocated arrays become RAM blocks? Am I close?}\YH{Stack allocated variables become RAM as well, so that we can deal with addresses easily and take addresses of any variable.} \JW{I see, thanks. So, in short, the only registers in your hardware designs are those that store things like the current state, etc. You generate a fixed number of registers every time you synthesis -- you don't generate extra registers to store any of the program variables. Right?}
 
-The first translation performed in \vericert{} is from 3AC to a new hardware translation language (HTL), which is one step towards being completely translated to hardware described in Verilog.  The main translation that is performed is going from a CFG representation of the computation to a finite state machine with datapath (FSMD)~\cite{hwang99_fsmd}\JW{I feel like this could use some sort of citation, but I'm not sure what. I guess this is all from "Hardware Design 101", right?}\YH{I think I found a good one actually, which goes over the basics.} representation in HTL.\@  The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers, so that the state transitions can be translated into a simple finite state machine (FSM) and each state then contains data operations that update the memory and registers.  Figure~\ref{fig:accumulator_diagram} shows the resulting architecture of the FSMD. \JW{I think it would be worth having a sentence to explain how the C model of memory is translated to a hardware-centric model of memory. For instance, in C we have global variables/arrays, stack-allocated variables/arrays, and heap-allocated variables/arrays (anything else?). In Verilog we have registers and RAM blocks. So what's the correspondence between the two worlds? Globals and heap-allocated are not handled, stack-allocated variables become registers, and stack-allocated arrays become RAM blocks? Am I close?}\YH{Stack allocated variables become RAM as well, so that we can deal with addresses easily and take addresses of any variable.} \JW{I see, thanks. So, in short, the only registers in your hardware designs are those that store things like the current state, etc. You generate a fixed number of registers every time you synthesis -- you don't generate extra registers to store any of the program variables. Right?} Hardware does not have the same memory model as C, the memory model therefore needs to be translated in the following way.  Global variables are not translated in \vericert{} at the moment, however, the stack of the main function will become the RAM seen in Figure~\ref{fig:accumulator_diagram}.  Variables that have their address is taken will therefore be stored in the RAM, as well as any arrays or structs defined in the function.  Variables that did not have their address taken will be kept in registers.
+The first translation performed in \vericert{} is from 3AC to a new hardware translation language (HTL), which is one step towards being completely translated to hardware described in Verilog.  The main translation that is performed is going from a CFG representation of the computation to a finite state machine with datapath (FSMD)~\cite{hwang99_fsmd} representation in HTL.\@  The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers, so that the state transitions can be translated into a simple finite state machine (FSM) and each state then contains data operations that update the memory and registers.  Figure~\ref{fig:accumulator_diagram} shows the resulting architecture of the FSMD.  Hardware does not have the same memory model as C, the memory model therefore needs to be translated in the following way.  Global variables are not translated in \vericert{} at the moment, however, the stack of the main function will become the RAM seen in Figure~\ref{fig:accumulator_diagram}.  Variables that have their address is taken will therefore be stored in the RAM, as well as any arrays or structs defined in the function.  Variables that did not have their address taken will be kept in registers.
 
 \begin{figure*}
   \centering
@@ -226,7 +232,9 @@ However, the memory model that \compcert{} uses for its intermediate languages~\
 
 % Mention that this optimisation is not performed sometimes (clang -03).
 
-\vericert{} performs some optimisations at the level of the instructions that are generated, so that the hardware performs the instructions as quickly as possible and so that the maximum frequency at which the hardware can run is increased.  One of the main constructs that cripple performance of the generated hardware is the instantiation of divider circuits in the hardware.  In the case of \vericert{}, it requires the result of the divide operation to be ready in the same clock cycle, meaning the divide circuit needs to be implemented fully combinationally.  This is inefficient in terms of hardware size, but also in terms of latency, because it means that the maximum frequency of the hardware needs to be reduced dramatically so that the divide circuit has enough time to finish. \JP{Multi-cycle paths might be something worth exploring in future work: fairly error-prone/dangerous for hand-written code, but might be interesting in generated code.}\YH{Definitely is on the list for next things to look into, will make divide so much more efficient.}
+\vericert{} performs some optimisations at the level of the instructions that are generated, so that the hardware performs the instructions as quickly as possible and so that the maximum frequency at which the hardware can run is increased.  One of the main constructs that cripple performance of the generated hardware is the instantiation of divider circuits in the hardware.  In the case of \vericert{}, it requires the result of the divide operation to be ready in the same clock cycle, meaning the divide circuit needs to be implemented fully combinationally.  This is inefficient in terms of hardware size, but also in terms of latency, because it means that the maximum frequency of the hardware needs to be reduced dramatically so that the divide circuit has enough time to finish.
+
+%\JP{Multi-cycle paths might be something worth exploring in future work: fairly error-prone/dangerous for hand-written code, but might be interesting in generated code.}\YH{Definitely is on the list for next things to look into, will make divide so much more efficient.}
 
 These small optimisations were found to be the most error prone, and guaranteeing that the new representation is equivalent to representation used in the \compcert{} semantics is difficult without proving this for all possible inputs.
 
diff --git a/archive/verilog.tex b/archive/verilog.tex
index 5fc8999..a363596 100644
--- a/archive/verilog.tex
+++ b/archive/verilog.tex
@@ -1,3 +1,8 @@
+%\JW{I think most of the following paragraph could be chopped. We've already mentioned that we're only targeting a small fragment of Verilog, so we could mention at that point that in the fragment we target there is no discrepancy between the simulation behaviour and the synthesis behaviour. I don't know how important it is to talk about the details of evaluation order here -- does the reader really need to be aware of this in order to understand your work?}
+When targeting a hardware description language such as Verilog, it is important to be consistent between simulating the hardware and the behaviour of the synthesised result on actual hardware.  In the target Verilog semantics, only clocked \alwaysblock{}s are supported as these ensure that there are not mismatches between simulation and synthesis, as combinational \alwaysblock{}s that trigger on any change of an internal signal may behave differently in simulation or synthesis based on the order of operations.  This limitation in the semantics therefore helps keep the Verilog correct and consistent.  In addition to that, only nonblocking assignment is used in signals that are used in multiple \alwaysblock{}, which stops any race conditions from occurring as all the signal updates happen deterministically. \NR{So the semantics allows multiple drivers then?}  Finally, a specific order of evaluation for the \alwaysblock{} is chosen, and because of the limitations listed before, choosing an order is guaranteed to have the same behaviour as executing the \alwaysblock{} in any order.\NR{Order is not guaranteed if we have multiple drivers.}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
 Using these constructs it is therefore possible to describe how hardware functions, where always blocks that are triggered by a clock periodically get translated into flip-flops and always blocks triggered by changes in any internal signals are translated into combinational logic.
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
diff --git a/main.tex b/main.tex
index 24b0808..cd0aef6 100644
--- a/main.tex
+++ b/main.tex
@@ -44,6 +44,7 @@
 \usepackage{subcaption}
 \usepackage{tikz}
 \usepackage{minted}
+\usepackage{microtype}
 
 \usetikzlibrary{shapes,calc,arrows.meta}
 
diff --git a/references.bib b/references.bib
index f8a4aff..747d0b3 100644
--- a/references.bib
+++ b/references.bib
@@ -49,7 +49,7 @@ year = {2020},
   pages={857-858},
   doi={10.1145/1837274.1837489}}
   
-@INPROCEEDINGS{bluespec,
+@inproceedings{bluespec,
   author={Nikhil, Rishiyur S. },
   booktitle={Proceedings. Second ACM and IEEE International Conference on Formal Methods and Models for Co-Design, 2004. MEMOCODE '04.}, 
   title={Bluespec System Verilog: efficient, correct RTL from high level specifications}, 
@@ -713,3 +713,45 @@ year = {2020},
 	publisher = {IEEE Press},
 	series = {FormaliSE '19}
 }
+
+@inproceedings{nikhil04_blues_system_veril,
+  author          = {R. {Nikhil}},
+  title           = {Bluespec System Verilog: efficient, correct RTL from high level specifications},
+  booktitle       = {Proceedings. Second ACM and IEEE International Conference on Formal Methods and
+                  Models for Co-Design, 2004. MEMOCODE '04.},
+  year            = 2004,
+  pages           = {69-70},
+  doi             = {10.1109/MEMCOD.2004.1459818},
+  url             = {https://doi.org/10.1109/MEMCOD.2004.1459818},
+}
+
+@inproceedings{schuiki20_llhd,
+  author          = {Schuiki, Fabian and Kurth, Andreas and Grosser, Tobias and Benini, Luca},
+  title           = {LLHD: A Multi-Level Intermediate Representation for Hardware Description
+                  Languages},
+  booktitle       = {Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design
+                  and Implementation},
+  year            = 2020,
+  pages           = {258-271},
+  doi             = {10.1145/3385412.3386024},
+  url             = {https://doi.org/10.1145/3385412.3386024},
+  address         = {New York, NY, USA},
+  isbn            = 9781450376136,
+  location        = {London, UK},
+  numpages        = 14,
+  publisher       = {Association for Computing Machinery},
+  series          = {PLDI 2020},
+}
+
+@InProceedings{zhu13_mechan_approac_linkin_operat_seman,
+  author          = "Zhu, Huibiao and Liu, Peng and He, Jifeng and Qin, Shengchao",
+  title           = "Mechanical Approach to Linking Operational Semantics and Algebraic Semantics
+                  for Verilog Using Maude",
+  booktitle       = "Unifying Theories of Programming",
+  year            = 2013,
+  pages           = "164--185",
+  address         = "Berlin, Heidelberg",
+  editor          = "Wolff, Burkhart and Gaudel, Marie-Claude and Feliachi, Abderrahmane",
+  isbn            = "978-3-642-35705-3",
+  publisher       = "Springer Berlin Heidelberg",
+}
diff --git a/verilog.tex b/verilog.tex
index 5768a43..2ca42dd 100644
--- a/verilog.tex
+++ b/verilog.tex
@@ -2,67 +2,63 @@
 
 \newcommand{\alwaysblock}{\texttt{always}-block}
 
-This section describes the Verilog semantics that were \JW{was} chosen for the target language, including the changes that were made to the semantics to make it \JWcouldcut{a better fit} a \JW{suitable} HLS target.  The Verilog standard is quite large~\cite{06_ieee_stand_veril_hardw_descr_languag,05_ieee_stand_veril_regis_trans_level_synth}, but the syntax and semantics can be reduced to a small subset that \vericert{} needs to target.
-\NR{Have we discussed what our options were and why we chose the HOL4 semantics?} \JW{Good point -- we should cite the Maude semantics too. I think that's the only viable alternative.}\YH{Yes, I should make that more clear, I cited Meredith semantics in the algorithm section, will add maude as well, the main reason is practicality and we don't need continuous assignment, basically the same reasons that Loow used to justify creating his own semantics.}
+This section describes the Verilog semantics that was chosen for the target language, including the changes that were made to the semantics to make it a suitable HLS target.  The Verilog standard is quite large~\cite{06_ieee_stand_veril_hardw_descr_languag,05_ieee_stand_veril_regis_trans_level_synth}, but the syntax and semantics can be reduced to a small subset that \vericert{} needs to target.
 
 The Verilog semantics we use is ported to Coq from a semantics written in HOL4  by \citet{loow19_proof_trans_veril_devel_hol}, used to prove the translation from HOL4 to Verilog~\cite{loow19_verif_compil_verif_proces}.% which was used to create a formal translation from a logic representation encoded in the HOL4~\cite{slind08_brief_overv_hol4} theorem prover into an equivalent Verilog design. 
-This semantics is quite practical as it is restricted to a small subset of Verilog, which can nonetheless be used to model all hardware constructs one would want to design.  The main features that are not supported by the syntax and semantics are continuous assignment and combinational always blocks.
-\NR{Shall we use special font fo always-blocks: maybe \alwaysblock{}}
+This semantics is quite practical as it is restricted to a small subset of Verilog, which can nonetheless be used to model all hardware constructs one would want to design.  The main features that are not supported by the syntax and semantics are continuous assignment and combinational \alwaysblock{}s, which are modelled in other semantics such as~\citet{meredith10_veril}, however, these are not necessarily needed, but require more complicated event queues and execution model.
 
-The semantics of Verilog differs from regular programming languages, as it is used to describe hardware directly, which is inherently parallel, rather than an algorithm, which is usually sequential.  The main construct in Verilog is the always block \JW{consider `always-block' as it's easier to parse}. \JWcouldcut{which is a construct that contains statements.}
-A module can contain multiple always blocks, all of which run in parallel.  These always blocks further contain statements such as if-statements or assignments to variables.  Each always block also contains a list of events at which it should trigger, which could either contain signals that are assigned to other signals in that always block, or a different signal such as a clock which will trigger the always block periodically, only the latter are actually supported in our target semantics. \JW{That sentence is a bit wordy. Can you just say something like: `We support only \emph{synchronous} logic, which means that the always block is triggered on (and only on) the rising edge of a clock signal.'?}  
-\NR{We should mention that variables cannot be driven by multiple \alwaysblock{}s, since one might get confused with data races when relating to concurrent processes in software.} \JW{Given the recent discussion on Teams, it seems to me that we perhaps don't need to mention here what happens if a variable is driven multiple times per clock cycle, especially since Vericert isn't ever going to do that.}
+The semantics of Verilog differs from regular programming languages, as it is used to describe hardware directly, which is inherently parallel, rather than an algorithm, which is usually sequential.  The main construct in Verilog is the \alwaysblock{}.
+A module can contain multiple \alwaysblock{}s, all of which run in parallel.  These \alwaysblock{}s further contain statements such as if-statements or assignments to variables.  We support only \emph{synchronous} logic, which means that the \alwaysblock{} is triggered on (and only on) the rising edge of a clock signal.
+%\NR{We should mention that variables cannot be driven by multiple \alwaysblock{}s, since one might get confused with data races when relating to concurrent processes in software.} \JW{Given the recent discussion on Teams, it seems to me that we perhaps don't need to mention here what happens if a variable is driven multiple times per clock cycle, especially since Vericert isn't ever going to do that.}
 
-As hardware designs normally describe events that will be executed periodically for an infinite amount of time, the top-level of the semantics is best described using small-step semantics, whereas the execution of one small step is described using big-step semantics. 
-\JW{I reckon that can be made clearer by talking about clock cycles. Maybe something like: ``The semantics combines the big-step and small-step styles. The overall execution of the hardware is described using a small-step semantics, with one small step per clock cycle; this is appropriate because hardware is routinely designed to run for an unlimited number of clock cycles and the big-step style is ill-suited to describing infinite executions. Then, within each clock cycle, a big-step semantics is used to execute all the statements.'' }
-An example of a rule \JWcouldcut{in the semantics} for executing an always block is shown below, where $\Sigma$ is the state of the registers in the module and $s$ is the statement inside the always block:
+The semantics combines the big-step and small-step styles. The overall execution of the hardware is described using a small-step semantics, with one small step per clock cycle; this is appropriate because hardware is routinely designed to run for an unlimited number of clock cycles and the big-step style is ill-suited to describing infinite executions. Then, within each clock cycle, a big-step semantics is used to execute all the statements.
+An example of a rule for executing an \alwaysblock{} is shown below, where $\Sigma$ is the state of the registers in the module and $s$ is the statement inside the \alwaysblock{}:
 
 \begin{equation*}
   \inferrule[Always]{(\Sigma, s)\downarrow_{\text{stmnt}} \Sigma'}{(\Sigma, \yhkeyword{always @(posedge clk) } s) \downarrow_{\text{always}} \Sigma'}
 \end{equation*}
 
-\noindent which shows that assuming the statement $s$ in the always block runs with state $\Sigma$ and produces the new state $\Sigma'$, the always block will result in the same final state.  As only clocked always blocks are supported, and one step in the semantics correspond to one clock cycle, it means that this rule is run once per clock cycle \NRcouldcut{which is what it is defined to do}.
-\NR{The mention about small steps being one cycle and 'only clocked/synchronous \alwaysblock{} are supported' can move earlier.} \JW{Actually couldn't this rule be run multiple times per clock cycle in the case where you have multiple always blocks in your module all triggered on the same rising edge?}\YH{Yes, it is run multiple times, which is controlled by the \textsc{Module} rule, which runs the always blocks sequentially. This is controlled by another big step semantics that either takes the Cons of the list and runs the left and right hand sides, or nil which returns the same state.}
+\noindent which shows that assuming the statement $s$ in the \alwaysblock{} runs with state $\Sigma$ and produces the new state $\Sigma'$, the \alwaysblock{} will result in the same final state.  As only clocked \alwaysblock{} are supported, and one step in the semantics correspond to one clock cycle, it means that this rule is run once per clock cycle for each \alwaysblock{}.
 
-Two types of assignments are supported in always blocks: nonblocking and blocking assignment.  Nonblocking assignment modifies the signal at the end of the clock cycle and atomically.  Blocking assignment, on the other hand, assigns the variable \NRreplace{directly}{instantaneously?} in the always block for later signals to pick up.  To model both these assignments, the state $\Sigma$ has to be split into two \NRreplace{parts}{sets?}: $\Gamma$, containing the current values of all variables, and $\Delta$, containing the values that will be assigned to the variables at the end of the clock cycle, we can therefore say that $\Sigma = (\Gamma, \Delta)$.~\NR{Can we say that $\Gamma$ contains instantaneous (ephemeral) updates for the current cycle and $\Delta$ contains periodical updates for the next cycle? Will be good to distinguish those two updates with the same terms across the paper, which can tie in with small-step and big-step distinction maybe asynchronous vs synchronous updates. }\YH{Hmm, so I don't think nonblocking and blocking ties in with big-step and smallstep, but yes it would be better to use the same terms throughoug the paper}  The nonblocking assignment can therefore be expressed as the following:
+Two types of assignments are supported in \alwaysblock{}: nonblocking and blocking assignment.  Nonblocking assignment modifies the signal at the end of the clock cycle and atomically.  Blocking assignment, on the other hand, assigns the variable instantaneously in the \alwaysblock{} for later signals to pick up.  To model both these assignments, the state $\Sigma$ has to be split into two maps: $\Gamma$, containing the current values of all variables, and $\Delta$, containing the values that will be assigned to the variables at the end of the clock cycle, we can therefore say that $\Sigma = (\Gamma, \Delta)$.  The nonblocking assignment can therefore be expressed as the following:
+
+%~\NR{Can we say that $\Gamma$ contains instantaneous (ephemeral) updates for the current cycle and $\Delta$ contains periodical updates for the next cycle? Will be good to distinguish those two updates with the same terms across the paper, which can tie in with small-step and big-step distinction maybe asynchronous vs synchronous updates. }\YH{Hmm, so I don't think nonblocking and blocking ties in with big-step and smallstep, but yes it would be better to use the same terms throughoug the paper}
 
 \begin{equation*}
   \inferrule[Nonblocking Reg]{\yhkeyword{name}\ d = \yhkeyword{OK}\ n \\ (\Gamma, e) \downarrow_{\text{expr}} v}{((\Gamma, \Delta), d\ \yhkeyword{ <= } e) \downarrow_{\text{stmnt}} (\Gamma, \Delta [n \mapsto v])}\\
 \end{equation*}
 
-\noindent where assuming that $\downarrow_{\text{expr}}$ evaluates an expression $e$ to a value $v$, then the nonblocking assignment $d\ \yhkeyword{ <= } e$ will update the future state of the variable $d$ with value $v$.  Finally, at the end of the clock cycle, the registers need to be updated with their future values in $\Delta$, which can be shown by the rule in the semantics that runs a whole module, $\vec{m}$ is a list of variable declarations and always blocks, and the $\downarrow_{\vec{m}}$ evaluates these sequentially and obtains the final state $(\Gamma', \Delta')$. \JW{I'm a little torn. On the one hand, this is interesting stuff, and I almost want to suggest putting in the `Blocking Reg' rule alongside the one above, because seeing the rules side-by-side makes it really clear how they both work. But I'm conscious that talking about the distinction between non-blocking and blocking assignments is all a bit moot because Vericert never produces blocking assignments (right?). And this text is all just explaining Loow's work, not your own work. So, in short, I'm a bit torn. Probably best to keep it as-is for now, but bear this in mind as a candidate for chopping nearer the deadline. What do yout think?}
+\noindent where assuming that $\downarrow_{\text{expr}}$ evaluates an expression $e$ to a value $v$, then the nonblocking assignment $d\ \yhkeyword{ <= } e$ will update the future state of the variable $d$ with value $v$.  Finally, at the end of the clock cycle, the registers need to be updated with their future values in $\Delta$, which can be shown by the rule in the semantics that runs a whole module, $\vec{m}$ is a list of variable declarations and \alwaysblock{}s, and the $\downarrow_{\vec{m}}$ evaluates these sequentially and obtains the final state $(\Gamma', \Delta')$.
+%\JW{I'm a little torn. On the one hand, this is interesting stuff, and I almost want to suggest putting in the `Blocking Reg' rule alongside the one above, because seeing the rules side-by-side makes it really clear how they both work. But I'm conscious that talking about the distinction between non-blocking and blocking assignments is all a bit moot because Vericert never produces blocking assignments (right?). And this text is all just explaining Loow's work, not your own work. So, in short, I'm a bit torn. Probably best to keep it as-is for now, but bear this in mind as a candidate for chopping nearer the deadline. What do yout think?}
 
 
 \begin{equation*}
   \inferrule[Module]{(\Gamma, \epsilon, \vec{m})\ \downarrow_{\text{module}} (\Gamma', \Delta')}{(\Gamma, \yhkeyword{module } \yhconstant{main} \yhkeyword{(...);}\ \vec{m}\ \yhkeyword{endmodule}) \downarrow_{\text{program}} (\Gamma' // \Delta')}
 \end{equation*}
 
-\noindent This rule dictates how the whole module runs in one clock cycle, from an initial state of all the variables $\Gamma$, to a final state $(\Gamma' // \Delta')$, where $//$ defines a merge between the two maps from variable names to values, where the right hand side values take priority if there is a conflict. \JW{Is $//$ a standard notation for this? If so, that's fine. It's just I haven't come across it before.}\YH{It is not, should I use different notation? Is there standard notation?} This rule correctly implements the behaviours for the blocking and nonblocking assignment by overwriting the assignments made through blocking assignment at the end of the clock cycle with the atomic assignments from the nonblocking assignments.
-
-\JW{I think most of the following paragraph could be chopped. We've already mentioned that we're only targeting a small fragment of Verilog, so we could mention at that point that in the fragment we target there is no discrepancy between the simulation behaviour and the synthesis behaviour. I don't know how important it is to talk about the details of evaluation order here -- does the reader really need to be aware of this in order to understand your work?}
-When targeting a hardware description language such as Verilog, it is important to be consistent between simulating the hardware and the behaviour of the synthesised result on actual hardware.  In the target Verilog semantics, only clocked always blocks are supported as these ensure that there are not mismatches between simulation and synthesis, as combinational always blocks that trigger on any change of an internal signal may behave differently in simulation or synthesis based on the order of operations.  This limitation in the semantics therefore helps keep the Verilog correct and consistent.  In addition to that, only nonblocking assignment is used in signals that are used in multiple always blocks, which stops any race conditions from occurring as all the signal updates happen deterministically. \NR{So the semantics allows multiple drivers then?}  Finally, a specific order of evaluation for the always blocks is chosen, and because of the limitations listed before, choosing an order is guaranteed to have the same behaviour as executing the always blocks in any order.\NR{Order is not guaranteed if we have multiple drivers.}
+\noindent This rule dictates how the whole module runs in one clock cycle, from an initial state of all the variables $\Gamma$, to a final state $(\Gamma' // \Delta')$, where $//$ defines a merge between the two maps from variable names to values, where the right hand side values take priority if there is a conflict.  This rule correctly implements the behaviours for the blocking and nonblocking assignment by overwriting the assignments made through blocking assignment at the end of the clock cycle with the atomic assignments from the nonblocking assignments.
+%\JW{Is $//$ a standard notation for this? If so, that's fine. It's just I haven't come across it before.}\YH{It is not, should I use different notation? Is there standard notation?}
 
 \subsection{Changes to the Semantics}
 
 Some small changes were made to the semantics proposed by \citet{loow19_proof_trans_veril_devel_hol} to make it suitable as a HLS target.
 
-The main change is the addition of support for two-dimensional arrays, which are needed to model RAM in Verilog.  RAM is needed to model the stack in C efficiently, without having to declare a variable for each possible stack location.  In the original semantics, RAMs, as well as inputs and outputs to the module could be modelled using a function from variable names (strings) to values, which could be modified accordingly to model inputs to the module.  This is quite an abstract description of memory, which can also be expressed as an array of bitvectors, which is the path we took instead.  This requires the addition of array operators to the semantics and correct reasoning of loads and stores to the array in different always blocks simultaneously.  Consider the following Verilog code:
+The main change is the addition of support for two-dimensional arrays, which are needed to model RAM in Verilog.  RAM is needed to model the stack in C efficiently, without having to declare a variable for each possible stack location.  In the original semantics, RAMs, as well as inputs and outputs to the module could be modelled using a function from variable names (strings) to values, which could be modified accordingly to model inputs to the module.  This is quite an abstract description of memory, which can also be expressed as an array of bitvectors, which is the path we took instead.  This requires the addition of array operators to the semantics and correct reasoning of loads and stores to the array in different \alwaysblock{}s simultaneously.  Consider the following Verilog code:
 
 \begin{minted}{verilog}
 always @(posedge clk) begin
-  x[0] = 23;
+  x[0] = 1;
   x[1] <= 1;
 end
 \end{minted}
 
-The code above modifies one array value using nonblocking assignment and it then modifies a second entry in the array with nonblocking assignment.  If the existing semantics used for values was used to update the array, then during the merge of values for the array, the array \texttt{x} from the nonblocking association map would replace the array from the blocking association map.  This would replace \texttt{x[0]} with it's original value and therefore behave incorrectly.  Instead, updates to arrays need to be recorded for each index in the array, and merging of the blocking and nonblocking queue needs to take into account each index separately.
+The code above modifies one array value using nonblocking assignment and it then modifies a second entry in the array with nonblocking assignment.  If the existing semantics used for values was used to update the array, then during the merge of values for the array, the array \texttt{x} from the nonblocking association map would replace the array from the blocking association map.  This would replace \texttt{x[0]} with it's original value and therefore behave incorrectly.  Instead, updates to arrays need to be recorded for each index in the array, and merging of the blocking and nonblocking queue needs to take into account each index separately.  Our state $\Gamma$ in practice are therefore split up into a map $\Gamma_{r}$, for instantaneous updates to variables, and $\Gamma_{a}$ for instantaneous updates to arrays, where the merge function then ensures that only the modified indices get updated when it is merged with the nonblocking map equivalent $\Delta_{a}$.
 
 Explicit support for declaring inputs, outputs and internal variables was added to the semantics to make sure that the generated Verilog also contains the correct declarations.  This adds some guarantees to the generated Verilog and ensures that it synthesises and simulates correctly.
 
 In addition to adding support for two-dimensional arrays, support for receiving external inputs was removed from the semantics for the case of simplicity, as these are not needed for an HLS target.  The main module in Verilog models the main function in C, and as the inputs to a C function shouldn't change during it's execution, there is no need to add support for external inputs for Verilog modules.  Finally, another simplification to the semantics that was made is to use 32 bit integers instead of arrays of booleans for the bitvector representation.  As the translation only currently has support for \texttt{int}, it is possible to simplify the semantics further and not have to handle bitvectors of an arbitrary size.
 
-\subsection{Integrating the Semantics in \compcert{}'s Model}
-\NR{The \textit{Verilog} semantics perhaps.}
+\subsection{Verilog Semantics in \compcert{}'s Model}
 
 \begin{figure*}
   \centering
@@ -81,12 +77,10 @@ In addition to adding support for two-dimensional arrays, support for receiving
   \label{fig:inferrence_module}
 \end{figure*}
 
-\NR{Would signposting this section at the start: "\compcert{} defines a specific model of computation which all semantics have to follow in order to prove properties about them. In this section, we first describe three of \compcert{}'s original computational states. Then, we describe the five registers and semantic rules we add to Verilog semantics to integrate it to the \compcert{} model."} 
-
-\compcert{} defines a specific model of computation which all semantics have to follow in order to prove properties about them.  \compcert{} has three main states that need to be defined:
+\compcert{} defines a specific model of computation which all semantics have to follow in order to prove properties about them. In this section, we first describe three of \compcert{}'s original computational states. Then, we describe the five registers and semantic rules we add to Verilog semantics to integrate it to the \compcert{} model.  \compcert{} has three main states that need to be defined:
 
 \begin{description}
-  \item[\texttt{State} \textit{sf} $m$ $v$ $\Gamma_{r}$ $\Gamma_{a}$] The main state when executing a function, with stack frame \textit{sf}, current module $m$, current state $v$ and variable states $\Gamma_{r}$ and $\Gamma_{a}$\NR{does r and a mean register and array?}.
+  \item[\texttt{State} \textit{sf} $m$ $v$ $\Gamma_{r}$ $\Gamma_{a}$] The main state when executing a function, with stack frame \textit{sf}, current module $m$, current state $v$ and variable states $\Gamma_{r}$ and $\Gamma_{a}$.
   \item[\texttt{Callstate} \textit{sf} $m$ $\vec{r}$] The state that is reached when a function is called, with the current stack frame \textit{sf}, current module $m$ and arguments $\vec{r}$.
   \item[\texttt{Returnstate} \textit{sf} $r$] The state that is reached when a function returns back to the caller, with stack frame \textit{sf} and return value $r$.
 \end{description}