Fix capitalisation of titles

author: Yann Herklotz <git@yannherklotz.com> 2021-09-09 20:43:47 +0100
committer: Yann Herklotz <git@yannherklotz.com> 2021-09-09 20:43:47 +0100
commit: c8f372795fa8ca90f62f556fc8cf4f17250e99ad (patch)
tree: 17ebbfdb95803b0b43767e6858085073a87105e7
parent: 8507d0413b34fcc2744a922048ce55ca06b7978f (diff)
download: oopsla21_fvhls-c8f372795fa8ca90f62f556fc8cf4f17250e99ad.tar.gz
oopsla21_fvhls-c8f372795fa8ca90f62f556fc8cf4f17250e99ad.zip
5 files changed, 21 insertions, 21 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 5fee4db..81763d7 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -1,4 +1,4 @@
-\section{Designing a verified HLS tool}\label{sec:design}
+\section{Designing a Verified HLS Tool}\label{sec:design}
 
 This section describes the main architecture of the HLS tool, and the way in which the Verilog back end was added to \compcert{}.  This section also covers an example of converting a simple C program into hardware, expressed in the Verilog language.
 
@@ -83,12 +83,12 @@ We select CompCert's three-address code (3AC)\footnote{This is known as register
 3AC is also attractive because it is the closest intermediate language to LLVM IR, which is used by several existing HLS compilers. %\JP{We already ruled out LLVM as a starting point, so this seems like it needs further qualification.}\YH{Well not because it's not a good starting point, but the ecosystem in Coq isn't as good.  I think it's still OK here to say that being similar to LLVM IR is an advantage?} 
 It has an unlimited number of pseudo-registers, and is represented as a control flow graph (CFG) where each instruction is a node with links to the instructions that can follow it. One difference between LLVM IR and 3AC is that 3AC includes operations that are specific to the chosen target architecture; we chose to target the x86\_32 back end because it generally produces relatively dense 3AC thanks to the availability of complex addressing modes.% reducing cycle counts in the absence of an effective scheduling approach.
 
-\subsection{An introduction to Verilog}
+\subsection{An Introduction to Verilog}
 
 This section will introduce Verilog for readers who may not be familiar with the language, concentrating on the features that are used in the output of \vericert{}.  Verilog is a hardware description language (HDL) and is used to design hardware ranging from complete CPUs that are eventually produced as an integrated circuit, to small application-specific accelerators that are placed on an FPGA.  Verilog is a popular language because it allows for fine-grained control over the hardware, and also provides high-level constructs to simplify the development.
 
 Verilog behaves quite differently to standard software programming languages due to it having to express the parallel nature of hardware.  The basic construct to achieve this is the always-block, which is a collection of assignments that are executed every time some event occurs.  In the case of \vericert{}, this event is either a positive (rising) or a negative (falling) clock edge.  All always-blocks triggering on the same event are executed in parallel. Always-blocks can also express control-flow using if-statements and case-statements.
-\NR{Might be useful to talk about registers must be updated only within an always block.} \JW{That's important for Verilog programming in general, but is it necessary for understanding this paper?}\YH{Yeah, I don't think it is too important for this section.}
+\NR{Might be useful to talk about registers must be updated only within an always-block.} \JW{That's important for Verilog programming in general, but is it necessary for understanding this paper?}\YH{Yeah, I don't think it is too important for this section.}
 
 \begin{figure}
   \centering
@@ -223,7 +223,7 @@ endmodule
 % \NR{Is the default in line 41 of (c) supposed to be empty?}\YH{Yep, that's how it's generated.}
 \end{figure}
 
-\subsection{Translating C to Verilog, by example}
+\subsection{Translating C to Verilog, by Example}
 Figure~\ref{fig:accumulator_c_rtl} illustrates the translation of a simple program that stores and retrieves values from an array.
 In this section, we describe the stages of the \vericert{} translation, referring to this program as an example.
 
@@ -395,12 +395,12 @@ We discuss these proofs in upcoming sections.
 
 Although we would not claim that \vericert{} is a proper `optimising' HLS compiler yet, we have nonetheless made several design choices that aim to improve the quality of the hardware designs it produces.
 
-\subsubsection{Byte- and word-addressable memories}
+\subsubsection{Byte- and Word-Addressable Memories}
 
 One big difference between C and Verilog is how memory is represented.  Although Verilog arrays use similar syntax to C arrays, they must be treated quite differently. To make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.
 However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}.  If a byte-addressable memory was used in the target hardware, which is closer to \compcert{}'s memory model, then a load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle.  It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware.  Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way.  Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores will be multiples of four.  Translating from byte-addressed memory to word-addressed memory can then be done by dividing the address by four.
 
-\subsubsection{Implementation of RAM interface}\label{sec:algorithm:optimisation:ram}
+\subsubsection{Implementation of RAM Interface}\label{sec:algorithm:optimisation:ram}
 The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance.  To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.\footnote{Interestingly, the Verilog code shown for the RAM interface must not be modified, because the synthesis tool will only generate a RAM when the code matches a small set of specific patterns.}
 %\JW{I tweaked this slightly in an attempt to clarify; please check.} %\NR{Bring forward this sentence to help with flow.}
 
@@ -409,7 +409,7 @@ The simplest way to implement loads and stores in \vericert{} would be to access
 Therefore, an extra compiler pass is added from HTL to HTL to extract all the direct accesses to the Verilog array and replace them by signals that access the RAM interface in a separate always-block. The translation is performed by going through all the instructions and replacing each load and store expression in turn.  Stores can simply be replaced by the necessary wires directly. Loads are a little more subtle: loads that use the RAM interface take two clock cycles where a direct load from an array takes only one, so this pass inserts an extra state after each load.
 
 %\JW{I've called that negedge always-block the `RAM driver' in my proposed text above -- that feels like quite a nice a word for it to my mind -- what do you think?}\YH{Yes I quite like it!}
-%Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs.  If, for example, arrays in Verilog are accessed immediately in the data-path, then the synthesis tool is not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers.  This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit.  The synthesis tools therefore provide code snippets that they know how to transform into various constructs, including snippets that will generate proper RAMs in the final hardware.  This process is called memory inference.  The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and logic.  We therefore have another pass from HTL to itself which performs the translation from this na\"ive use of arrays to a representation which always allows for memory inference.  This pass creates a separate always block to perform the loads and stores to the memory, and adds the necessary data, address and enable signals to communicate with that always-block from other always-blocks.  This always-block is shown between lines 10-15 in Figure~\ref{fig:accumulator_v}.
+%Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs.  If, for example, arrays in Verilog are accessed immediately in the data-path, then the synthesis tool is not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers.  This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit.  The synthesis tools therefore provide code snippets that they know how to transform into various constructs, including snippets that will generate proper RAMs in the final hardware.  This process is called memory inference.  The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and logic.  We therefore have another pass from HTL to itself which performs the translation from this na\"ive use of arrays to a representation which always allows for memory inference.  This pass creates a separate always-block to perform the loads and stores to the memory, and adds the necessary data, address and enable signals to communicate with that always-block from other always-blocks.  This always-block is shown between lines 10-15 in Figure~\ref{fig:accumulator_v}.
 
 There are two interesting parts to the inserted RAM interface.  Firstly, the memory updates are triggered on the negative (falling) edge of the clock, out of phase with the rest of the design which is triggered on the positive (rising) edge of the clock.  The advantage of this is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they only take two clock cycles and one clock cycle instead, greatly improving their performance. %\JW{Is this a standard `trick' in hardware design? If so it might be nice to cite it.}\YH{Hmm, not really, because it has the downside of kind of halving your available clock period. However, RAMs normally come in both forms on the FPGA (Page 12, Figure 2, \url{https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug_ram_rom.pdf})} 
 % JW: thanks!
@@ -463,7 +463,7 @@ Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is al
 
 Figure~\ref{fig:ram_load_store} gives an example of how the RAM interface behaves when values are loaded and stored.
 
-\subsubsection{Implementing the \texttt{Oshrximm} instruction}\label{sec:algorithm:optimisation:oshrximm}
+\subsubsection{Implementing the \texttt{Oshrximm} Instruction}\label{sec:algorithm:optimisation:oshrximm}
 
 % Mention that this optimisation is not performed sometimes (clang -03).
 
diff --git a/archive/verilog.tex b/archive/verilog.tex
index a363596..e4c7816 100644
--- a/archive/verilog.tex
+++ b/archive/verilog.tex
@@ -3,7 +3,7 @@ When targeting a hardware description language such as Verilog, it is important
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
-Using these constructs it is therefore possible to describe how hardware functions, where always blocks that are triggered by a clock periodically get translated into flip-flops and always blocks triggered by changes in any internal signals are translated into combinational logic.
+Using these constructs it is therefore possible to describe how hardware functions, where always-blocks that are triggered by a clock periodically get translated into flip-flops and always blocks triggered by changes in any internal signals are translated into combinational logic.
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
diff --git a/limitations.tex b/limitations.tex
index 9ed65ec..141ee91 100644
--- a/limitations.tex
+++ b/limitations.tex
@@ -4,7 +4,7 @@ There are various limitations in \vericert{} compared to other HLS tools due to
 
 %\NR{You have very different types of limitations. I wonder if grouped them into software and hardware limitations respectively might simplify this section. Just a suggestion.}
 
-\subsection{Limitations to the generated hardware}
+\subsection{Limitations to the Generated Hardware}
 
 %This section describes the current limitations and possible improvements that could be made to the generated hardware.
 
diff --git a/main.tex b/main.tex
index 110d901..ccbbf8e 100644
--- a/main.tex
+++ b/main.tex
@@ -213,10 +213,10 @@
 \input{conclusion}
 
 \subsection*{Acknowledgements}
-We'd like to thank Sandrine Blazy, Jianyi Cheng, Alastair Donaldson, Andreas Lööw, and the anonymous
-reviewers for their helpful feedback.  This work was financially supported by the EPSRC via the
-Research Institute for Verified Trustworthy Software Systems (VeTSS) and the IRIS Programme Grant
-(EP/R006865/1).
+We would like to thank Sandrine Blazy, Jianyi Cheng, Alastair Donaldson, Andreas Lööw, and the
+anonymous reviewers for their helpful feedback.  This work was financially supported by the EPSRC
+via the Research Institute for Verified Trustworthy Software Systems (VeTSS) and the IRIS Programme
+Grant (EP/R006865/1).
 
 %% Bibliography
 \bibliography{references.bib}
diff --git a/proof.tex b/proof.tex
index 7d142d9..0d6d4b1 100644
--- a/proof.tex
+++ b/proof.tex
@@ -2,7 +2,7 @@
 
 Now that the Verilog semantics have been adapted to the CompCert model, we are in a position to formally prove the correctness of our C-to-Verilog compilation.  This section describes the main correctness theorem that was proven and the main ideas behind the proof. The full Coq proof is available in auxiliary material.
 
-\subsection{Main challenges in the proof}
+\subsection{Main Challenges in the Proof}
 
 The proof of correctness of the Verilog back end is quite different from the usual proofs performed in CompCert, mainly because of the difference in Verilog semantics compared to the standard CompCert intermediate languages and because of the translation of the memory model.
 
@@ -16,7 +16,7 @@ The proof of correctness of the Verilog back end is quite different from the usu
 Together, these differences mean that translating 3AC directly to Verilog is infeasible, as the differences in the semantics are too large.  Instead, HTL, which was introduced in Section~\ref{sec:design}, bridges the gap in the semantics between the two languages.  HTL still consists of maps, like many of the other CompCert languages, however, each state corresponds to a Verilog statement.
 %\JW{This is good text, but the problem is that it reads like you're introducing HTL here for the first time. In fact, the reader has already encountered HTL in Section 2. So this needs acknowledging.}\YH{Added that.}
 
-\subsection{Formulating the correctness theorem}
+\subsection{Formulating the Correctness Theorem}
 
 The main correctness theorem is analogous to that stated in \compcert{}~\cite{leroy09_formal_verif_realis_compil}: for all Clight source programs $C$, if the translation to the target Verilog code succeeds, and $C$ has safe observable behaviour $B$ when executed, then the target Verilog code will have the same behaviour $B$. Here, a `safe' execution is one that either converges or diverges, but does not `go wrong'. If the program does admit some wrong behaviour (such as undefined behaviour in C), the correctness theorem does not apply. A behaviour, then, is either a final state (in the case of convergence) or divergence. In \compcert{}, a behaviour is also associated with a trace of I/O events, but since external function calls are not supported in \vericert{}, this trace will always be empty.  
 
@@ -40,7 +40,7 @@ How can we prove this theorem? First, note that the theorem is a `backwards simu
 To prove this forward simulation, it suffices to prove forward simulations between each pair of consecutive intermediate languages, as these results can be composed to prove the correctness of the whole HLS tool. 
 The forward simulation from 3AC to HTL is stated in Lemma~\ref{lemma:htl} (Section~\ref{sec:proof:3ac_htl}), the forward simulation for the RAM insertion is shown in Lemma~\ref{lemma:htl_ram} (Section~\ref{sec:proof:ram_insertion}), then the forward simulation between HTL and Verilog is shown in Lemma~\ref{lemma:verilog} (Section~\ref{sec:proof:htl_verilog}) and finally, the proof that Verilog is deterministic is given in Lemma~\ref{lemma:deterministic} (Section~\ref{sec:proof:deterministic}).
 
-\subsection{Forward simulation from 3AC to HTL}\label{sec:proof:3ac_htl}
+\subsection{Forward Simulation from 3AC to HTL}\label{sec:proof:3ac_htl}
 
 As HTL is quite far removed from 3AC, this first translation is the most involved and therefore requires a larger proof, because the translation from 3AC instructions to Verilog statements needs to be proven correct in this step.  In addition to that, the semantics of HTL are also quite different to the 3AC semantics, as instead of defining small-step semantics for each construct in Verilog, the semantics are instead defined over one clock cycle and mirror the semantics defined for Verilog.  Lemma~\ref{lemma:htl} shows the result that needs to be proven in this subsection.
 
@@ -127,7 +127,7 @@ We can now define the simulation diagram for the translation. The 3AC state can
   This simulation diagram is proven by induction over the operational semantics of 3AC, which allows us to find one or more steps in the HTL semantics that will produce the same final matching state.
 \end{proof}
 
-\subsection{Forward simulation of RAM insertion}\label{sec:proof:ram_insertion}
+\subsection{Forward Simulation of RAM Insertion}\label{sec:proof:ram_insertion}
 
 \begin{figure}
   \centering
@@ -145,7 +145,7 @@ We can now define the simulation diagram for the translation. The 3AC state can
 
 HTL can only represent a single state machine, it is therefore necessary to model the RAM abstractly to reason about the correctness of replacing the direct read and writes to the array by loads and stores to a RAM.  The specification used to model the RAM is shown in Figure~\ref{fig:htl_ram_spec}, which defines how the RAM $r$ will behave for all the possible combinations of the input signals.
 
-\subsubsection{From implementation to specification}
+\subsubsection{From Implementation to Specification}
 
 The first step in proving the simulation correct is to build a specification of the algorithm.  The three possibilities of the transformation is that for each Verilog statement in the map at location $i$, either the statement is a load or a store, in which case it is translated to the equivalent signal representation, otherwise it is not changed.  An example of the specification for the store instruction is shown below, where $\sigma$ is state register, $r$ is the RAM, $d$ and $c$ are the input data-path and control logic maps and $i$ is the current state.  $n$ is the newly inserted state which only applies to the translation of loads.
 
@@ -156,7 +156,7 @@ The first step in proving the simulation correct is to build a specification of
 
 A similar specification is created for the load.  We then also prove that the implementation of the translation proves that the specification holds.
 
-\subsubsection{From specification to simulation}
+\subsubsection{From Specification to Simulation}
 
 Another simulation proof is performed to prove that the insertion of the RAM is semantics preserving.  As in the simulation proof shown in Lemma~\ref{lemma:simulation_diagram}, some invariants need to be defined, which always hold at the start of the simulation and at the end of the simulation.  The invariants needed for the simulation of the RAM insertion are quite different to the previous invariants, so we can define these invariants $\mathcal{I}_{r}$ to be the following:
 
@@ -175,7 +175,7 @@ The other invariants and assumptions for defining two matching states in HTL are
   \end{align*}
 \end{lemma}
 
-\subsection{Forward simulation from HTL to Verilog}\label{sec:proof:htl_verilog}
+\subsection{Forward Simulation from HTL to Verilog}\label{sec:proof:htl_verilog}
 
 The HTL-to-Verilog simulation is conceptually simple, as the only transformation is from the map representation of the code to the case-statement representation.  The proof is more involved, as the semantics of a map structure are quite different to the semantics of the case-statement they are converted to.
author	Yann Herklotz <git@yannherklotz.com>	2021-09-09 20:43:47 +0100
committer	Yann Herklotz <git@yannherklotz.com>	2021-09-09 20:43:47 +0100
commit	c8f372795fa8ca90f62f556fc8cf4f17250e99ad (patch)
tree	17ebbfdb95803b0b43767e6858085073a87105e7
parent	8507d0413b34fcc2744a922048ce55ca06b7978f (diff)
download	oopsla21_fvhls-c8f372795fa8ca90f62f556fc8cf4f17250e99ad.tar.gz oopsla21_fvhls-c8f372795fa8ca90f62f556fc8cf4f17250e99ad.zip