author     Yann Herklotz <git@yannherklotz.com>   2021-04-16 20:53:57 +0100
committer  Yann Herklotz <git@yannherklotz.com>   2021-04-16 20:53:57 +0100
commit     1479bb42c2877c29376549d768a97676e1b96841 (patch)
tree       42dedad20c175e8c9340316a6abde823c213b44b
parent     7d8150af139d30058a6be3b962f252505fd45d9b (diff)
download   oopsla21_fvhls-1479bb42c2877c29376549d768a97676e1b96841.tar.gz
           oopsla21_fvhls-1479bb42c2877c29376549d768a97676e1b96841.zip
Fix more things
-rw-r--r--  algorithm.tex     | 20
-rw-r--r--  evaluation.tex    |  2
-rw-r--r--  introduction.tex  |  2
-rw-r--r--  proof.tex         |  6
-rw-r--r--  verilog.tex       | 11
5 files changed, 23 insertions, 18 deletions
diff --git a/algorithm.tex b/algorithm.tex
index ba44613..f8514b8 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -114,7 +114,7 @@ module main(reset, clk, finish, return_val);
reg [0:0] en = 0, u_en = 0;
reg [31:0] state = 0, reg_2 = 0, reg_4 = 0, d_out = 0, reg_1 = 0;
reg [31:0] stack [1:0];
- // RAM template
+ // RAM interface
always @(negedge clk)
if ({u_en != en}) begin
if (wr_en) stack[addr] <= d_in;
@@ -162,7 +162,7 @@ In this section, we describe the stages of the \vericert{} translation, referrin
\subsubsection{Translating C to 3AC}
The first stage of the translation uses unmodified \compcert{} to transform the C input, shown in Figure~\ref{fig:accumulator_c}, into a 3AC intermediate representation, shown in Figure~\ref{fig:accumulator_rtl}.
-As part of this translation, function inlining is performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction. Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency\YH{TODO: Add sentence on why inlining is often the better choice}. Inlining precludes support for recursive function calls, but this feature is not supported in most other HLS tools either~\cite{davidthomas_asap16}.
+As part of this translation, function inlining is performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction. Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency, since it avoids the overhead of the call protocol that invoking a separately generated hardware module would require. Inlining precludes support for recursive function calls, but this feature is not supported in most other HLS tools either~\cite{davidthomas_asap16}.
%\JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling. With scheduling I think that's true, inlining actually becomes quite good.}
@@ -293,7 +293,7 @@ Typically, HLS-generated hardware consists of a sea of registers and RAM memorie
This memory view is very different to the C memory model, so we perform the following translation.
Variables that do not have their address taken are kept in registers, which correspond to the registers in 3AC.
All address-taken variables, arrays, and structs are kept in RAM.
-The stack of the main function becomes an unpacked array of 32-bit integers, which is translated to a RAM when the hardware description is passed through a synthesis tool. In this pass, loads and stores are translated to direct accesses to the Verilog array representing memory.
+The stack of the main function becomes an unpacked array of 32-bit integers, which is translated to a RAM when the hardware description is passed through a synthesis tool. In a first pass, memory accesses are translated to direct reads and writes to this array; a subsequent RAM-insertion pass then adds a proper RAM interface, through which all loads and stores are performed.
Finally, global variables are not translated in \vericert{} at the moment.
A high-level overview of the architecture can be seen in Figure~\ref{fig:accumulator_diagram}.
@@ -304,7 +304,7 @@ For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register
C and Verilog handle signedness quite differently. By default, all operators and registers in Verilog (and HTL) are unsigned, so to force an operation to handle the bits as signed, both operands have to be forced to be signed. Moreover, Verilog implicitly resizes expressions to the largest needed size by default, which can affect the result of the computation. This feature is not supported by the Verilog semantics we adopted, so to match the semantics to the behaviour of the simulator and synthesis tool, braces are placed around all expressions to inhibit implicit resizing. Instead, explicit resizing is used in the semantics, and operations can only be performed on two registers that have the same size.
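For illustration, a minimal sketch (not taken from the \vericert{} output; the module and signal names are invented) of how a signed comparison is forced and how braces inhibit implicit resizing:

    module signed_compare_sketch(input [31:0] a, input [31:0] b, output lt);
      // $signed() forces a signed comparison; the braces (a concatenation of a
      // single element) make the expression self-determined, which inhibits
      // Verilog's implicit resizing.
      assign lt = {$signed(a) < $signed(b)};
    endmodule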
-In addition to that, equality between \emph{unsigned} variables is actually not supported, because this requires supporting the comparison of pointers, which should only be performed between pointers with the same provenance. In \vericert{} there is currently no way to determine the provenance of a pointer, and it therefore cannot model the semantics of unsigned comparison in \compcert{}. This is not a severe restriction in practice, however, because, as dynamic allocation is not supported either, equality comparison of pointers is rarely needed, and equality comparison of integers can still be performed by casting them both to signed integers.
+In addition to that, equality checks between \emph{unsigned} variables are not supported, because this requires supporting the comparison of pointers, which should only be performed between pointers with the same provenance. In \vericert{} there is currently no way to determine the provenance of a pointer, and it therefore cannot model the semantics of unsigned comparison in \compcert{}. This is not a severe restriction in practice, however, because, as dynamic allocation is not supported either, equality comparison of pointers is rarely needed, and equality comparison of integers can still be performed by casting them both to signed integers.
Finally, the \texttt{mulhs} and \texttt{mulhu} instructions, which fetch the upper 32 bits of a 64-bit multiplication result, are not translated by \vericert{} either, because 64-bit numbers are not supported. These instructions are only generated to optimise divisions by a constant that is not a power of two, so turning off constant propagation will allow these programs to pass without error.
@@ -331,17 +331,19 @@ Although we would not claim that \vericert{} is a proper `optimising' HLS compil
One big difference between C and Verilog is how memory is represented. Although Verilog arrays use similar syntax to C arrays, they must be treated quite differently. To make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.
However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}. If a byte-addressable memory was used in the target hardware, which is closer to \compcert{}'s memory model, then a load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle. It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware. Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way. Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four. If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
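As a small illustration (a sketch with invented names), converting the byte address used by \compcert{}'s memory model into the word address used by the 32-bit RAM amounts to a division by four:

    module byte_to_word_sketch(input [31:0] byte_addr, output [31:0] word_addr);
      // Integer loads and stores are assumed to be 4-byte aligned, so the byte
      // address can simply be divided by four to index the word-addressed RAM.
      assign word_addr = {byte_addr / 32'd4};
    endmodule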
-\subsubsection{Implementation of RAM templates}
-The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM driver (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate driver like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.
+\subsubsection{Implementation of RAM interface}
+The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block. Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.
-Therefore, an extra compiler pass is added from HTL to HTL to extracts all the direct accesses to the Verilog array implementing memory and replaces them by signals which access the memory in a separate always-block. This ensures that the synthesis tool correctly identifies the array as being a RAM, so that it is not implemented by logic directly. The translation is performed by going through all the instructions and replacing each load and store expression one after another. Stores can simply be replaced by the necessary wires directly, however, loads using the RAM block take two clock cycles instead of a direct load from an array which only takes one clock cycles. This pass therefore creates a extra state which is inserted after each load.
+Therefore, an extra compiler pass is added from HTL to HTL that extracts all the direct accesses to the Verilog array and replaces them with signals that access the RAM interface in a separate always-block. The translation is performed by going through all the instructions and replacing each load and store expression one after another. Stores can simply be replaced directly by the necessary wires; loads through the RAM interface, however, take two clock cycles, whereas a direct load from an array only takes one. This pass therefore inserts an extra state after each load.
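A hypothetical sketch of the result of this pass for a single load (state numbers and register names are invented, loosely following Figure~\ref{fig:accumulator_v}); the first state issues the request and the inserted state captures the result one clock cycle later:

    module load_sketch(input clk, input [31:0] d_out, input [31:0] load_addr,
                       output reg [31:0] addr, output reg u_en, output reg wr_en);
      reg [31:0] state = 32'd10, reg_1 = 0;
      initial begin addr = 0; u_en = 0; wr_en = 0; end
      always @(posedge clk)
        case (state)
          32'd10: begin                   // issue the load to the RAM interface
            wr_en <= 1'd0;
            addr  <= {load_addr / 32'd4};
            u_en  <= ~u_en;               // toggling u_en activates the RAM
            state <= 32'd11;
          end
          32'd11: begin                   // extra state inserted by this pass
            reg_1 <= d_out;               // the loaded value is valid one cycle later
            state <= 32'd12;
          end
        endcase
    endmodule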
%\JW{I've called that negedge always-block the `RAM driver' in my proposed text above -- that feels like quite a nice a word for it to my mind -- what do you think?}\YH{Yes I quite like it!}
%Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs. If, for example, arrays in Verilog are accessed immediately in the data-path, then the synthesis tool is not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers. This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit. The synthesis tools therefore provide code snippets that they know how to transform into various constructs, including snippets that will generate proper RAMs in the final hardware. This process is called memory inference. The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and logic. We therefore have another pass from HTL to itself which performs the translation from this na\"ive use of arrays to a representation which always allows for memory inference. This pass creates a separate always block to perform the loads and stores to the memory, and adds the necessary data, address and enable signals to communicate with that always-block from other always-blocks. This always-block is shown between lines 10-15 in Figure~\ref{fig:accumulator_v}.
-There are two interesting parts to the inserted RAM template. Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design which is triggered on the positive edge of the clock. The main advantage is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they only take two clock cycles and one clock cycle instead, greatly improving their performance. In addition to that, using the negative edge for the clock is supported by many synthesis tools, it therefore does not affect the maximum frequency of the final design.
+There are two interesting parts to the inserted RAM interface. Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design, which is triggered on the positive edge of the clock. The main advantage is that, instead of loads and stores taking three and two clock cycles respectively, they only take two clock cycles and one clock cycle, greatly improving their performance. In addition, using the negative edge of the clock is supported by many synthesis tools, so it does not affect the maximum frequency of the final design.
-Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical. To make the proof simpler, the goal is to create a RAM which disables itself after every use, so that firstly, the proof can assume that the RAM is disabled at the start and end of every clock cycle, and secondly so that only the state which contains the load and store need to be modified to ensure this property. Using a simple enable signal, it would not be possible to disable it in the RAM itself, because this would result in a register being driven twice from two different locations. It has to be enabled from the data-path and disabled from the RAM. The only other solutions are to either insert extra states that disable the RAM accordingly, thereby eliminating the speed advantage of the negative edge triggered RAM, or to check the next state after a load and store and insert disables into that state. The latter solution can quickly become complicated though, especially as this new state could contain another memory operation, in which case the disable signal should not be added to that state. We can instead generate a second enable signal that is set by the user, and the original enable signal is then updated by the RAM to be equal to the value that the user set. This means that the RAM should be enabled whenever the two signals are different, and disabled otherwise. A solution to this problem is to create another enable signal that is controlled by the self-disabling RAM, which is always set to be equal to the enable signal set by the data-path. The RAM is then considered enabled if the data-path enable and the RAM enable are different.
+Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical. To make the proof simpler, the goal is to create a RAM which disables itself after every use, so that, firstly, the proof can assume that the RAM is disabled at the start and end of every clock cycle, and, secondly, only the states which contain a load or store need to be modified to ensure this property. With a single enable signal, the RAM could not disable itself, because this would result in a register being driven from two different locations, once from the RAM interface and once from the data-path. The only other solutions are either to insert extra states that disable the RAM accordingly, thereby eliminating the speed advantage of the negative-edge-triggered RAM, or to check the next state after each load and store and insert the disable into that state. The latter solution can quickly become complicated, especially as this next state could itself contain another memory operation, in which case the disable signal should not be added to it. The solution is instead to use a second enable signal, \texttt{u\_en}, which is driven by the data-path, while the original enable signal, \texttt{en}, is driven by the self-disabling RAM and is always set equal to \texttt{u\_en} after an access. The RAM is then considered enabled whenever the two signals differ.
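Putting this together, a minimal sketch of the self-disabling RAM interface, assuming the signal names of Figure~\ref{fig:accumulator_v}: the data-path toggles \texttt{u\_en} to request an access, and the RAM re-aligns \texttt{en} after servicing it, so each enable register is only ever driven from one always-block:

    module ram_sketch(input clk, input u_en, input wr_en,
                      input [31:0] addr, input [31:0] d_in,
                      output reg [31:0] d_out);
      reg en = 0;                        // the RAM's own copy of the enable toggle
      reg [31:0] stack [1:0];
      always @(negedge clk)
        if ({u_en != en}) begin          // enabled exactly when the toggles differ
          if (wr_en) stack[addr] <= d_in;
          else       d_out <= stack[addr];
          en <= u_en;                    // disable again after every use
        end
    endmodule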
+
+%We can instead generate a second enable signal that is set by the user, and the original enable signal is then updated by the RAM to be equal to the value that the user set. This means that the RAM should be enabled whenever the two signals are different, and disabled otherwise.
\begin{figure}
\centering
diff --git a/evaluation.tex b/evaluation.tex
index c682336..f82c83b 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -154,7 +154,7 @@ This gap does not represent the performance cost that comes with formally provin
Instead, it is simply a gap between an unoptimised \vericert{} versus an optimised \legup{}.
As we improve \vericert{} by incorporating further optimisations, this gap should reduce whilst preserving the correctness guarantees.
-Secondly, looking at the maximum clock frequency that each design can achieve, \vericert{} designs can only achieve 8.2$\times$ the maximum clock frequency of \legup{} \JW{That sounds wrong? Shouldn't it be less than legup's fmax?} when division/modulo operations are present. This is in great contrast to the maximum clock frequency that \vericert{} can achieve when no divide/modulus \JW{modulo?} operations are present, where \vericert{} generates designs that are actually 2$\times$ better than the frequency achieved by \legup{} designs. The dramatic discrepancy in performance for the former case can be largely attributed to \vericert{}'s na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 13MHz, while \legup{} managed about 111MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency becomes about 220MHz. This improvement in frequency can maybe be explained by scheduling trying to pack too many instructions into a cycle, or by the fact that \legup{} uses a more involved RAM template so that the hardware produces a dual-port RAM, which can perform two reads and writes per clock cycle.
+Secondly, looking at the maximum clock frequency that each design can achieve, \vericert{} designs can only achieve 8.2$\times$ the maximum clock frequency of \legup{} \JW{That sounds wrong? Shouldn't it be less than legup's fmax?} when division/modulo operations are present. This is in great contrast to the maximum clock frequency that \vericert{} can achieve when no divide/modulus \JW{modulo?} operations are present, where \vericert{} generates designs that achieve twice the frequency of \legup{} designs. The dramatic discrepancy in performance for the former case can be largely attributed to \vericert{}'s na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 13MHz, while \legup{} managed about 111MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency becomes about 220MHz. This improvement in frequency may be explained by scheduling trying to pack too many instructions into a cycle, or by the fact that \legup{} uses a more involved RAM interface so that the hardware produces a dual-port RAM, which can perform two reads and writes per clock cycle.
Figure~\ref{fig:polybench-nodiv} highlights a few interesting benchmarks. For the trmm benchmark, \vericert{} produces hardware with the same cycle count as \legup{}, but at twice the clock frequency, thereby producing a design that executes twice as fast as \legup{}'s. Another interesting benchmark is \JW{tt formatting for benchmark program names?} doitgen, where \vericert{} is comparable to \legup{} without LLVM optimisations; however, LLVM optimisations seem to have a large effect on the cycle count.
diff --git a/introduction.tex b/introduction.tex
index c96e795..d2ec9cc 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -42,7 +42,7 @@ The contributions of this paper are as follows:
\begin{itemize}
\item We present \vericert{}, the first mechanically verified HLS tool that compiles C to Verilog. In Section~\ref{sec:design}, we describe the design of \vericert{}, including a few optimisations related to memory accesses and division.
\item We state the correctness theorem of \vericert{} with respect to an existing semantics for Verilog due to \citet{loow19_proof_trans_veril_devel_hol}. In Section~\ref{sec:verilog}, we describe how we extended this semantics to make it suitable as an HLS target.
- \item In Section~\ref{sec:proof}, we describe how we proved the correctness theorem. The proof follows standard \compcert{} techniques -- forward simulations, intermediate specifications, and determinism results -- but we encountered several challenges peculiar to our hardware-oriented setting. These include handling discrepancies between byte- and word-addressable memories, different handling of unsigned comparisons between C and Verilog, correctly mapping CompCert's memory model onto a finite Verilog array and finally correctly rearranging memory reads and writes so that these behave properly as a RAM in hardware.
+ \item In Section~\ref{sec:proof}, we describe how we proved the correctness theorem. The proof follows standard \compcert{} techniques -- forward simulations, intermediate specifications, and determinism results -- but we encountered several challenges peculiar to our hardware-oriented setting. These include handling discrepancies between byte- and word-addressable memories, different handling of unsigned comparisons between C and Verilog, correctly mapping \compcert{}'s memory model onto a finite Verilog array, and, finally, correctly rearranging memory reads and writes so that they behave properly as a RAM in hardware.
\item In Section~\ref{sec:evaluation}, we evaluate \vericert{} on the \polybench{} benchmark suite~\cite{polybench}, and compare the performance of our generated hardware against an existing, unverified HLS tool called \legup{}~\cite{canis11_legup}. We show that \vericert{} generates hardware that is \slowdownOrig$\times$ slower (\slowdownDiv$\times$ slower in the absence of division) and \areaIncr$\times$ larger than that generated by \legup{}. We intend to bridge this performance gap in the future by introducing (and verifying) HLS optimisations of our own, such as scheduling and memory analysis.
\end{itemize}
%\JW{This sentence seems pretty good to me; is it up-to-date with the latest `challenges' you've faced?}
diff --git a/proof.tex b/proof.tex
index 39a39d2..bc1d95f 100644
--- a/proof.tex
+++ b/proof.tex
@@ -13,7 +13,8 @@ The main correctness theorem is analogous to that stated in \compcert{}~\cite{le
\end{equation*}
\end{theorem}
-The theorem is a `backwards simulation' result (from target to source). The theorem does not demand the `if' direction too, because compilers are permitted to resolve any non-determinism present in their source programs.
+The theorem is a `backwards simulation' result (from target to source), following the terminology used in the \compcert{} literature. The theorem does not demand the `if' direction too, because compilers are permitted to resolve any non-determinism present in their source programs.
+
In practice, Clight programs are all deterministic, as are the Verilog programs in the fragment we consider. This means that we can prove the correctness theorem above by first inverting it to become a forwards simulation result, following standard \compcert{} practice.
The second observation that needs to be made is that to prove this forward simulation, it suffices to prove forward simulations between each pair of consecutive intermediate languages, as these results can be composed to prove the correctness of the whole HLS tool.
@@ -146,6 +147,9 @@ Another simulation proof is performed to prove that the insertion of the RAM is
The other invariants and assumptions for defining two matching states in HTL are quite similar to the simulation performed in Lemma~\ref{lemma:simulation_diagram}, such as ensuring that the states have the same value, and that the values in the registers are less-defined. In particular, the less-defined relation matches up all the registers, except for the new registers introduced by the RAM.
+\begin{lemma}\label{lemma:simulation_diagram_ram}
+
+
\subsection{Forward simulation from HTL to Verilog}\label{sec:proof:htl_verilog}
The HTL-to-Verilog simulation is conceptually simple, as the only transformation is from the map representation of the code to the case-statement representation. The proof is more involved, as the semantics of a map structure are quite different to the semantics of the case-statement they are converted to.
diff --git a/verilog.tex b/verilog.tex
index 8f03de3..89a40dd 100644
--- a/verilog.tex
+++ b/verilog.tex
@@ -21,8 +21,7 @@ An example of a rule for executing an \alwaysblock{} that is triggered at the po
\noindent This rule says that assuming the statement $s$ in the \alwaysblock{} runs with state $\Sigma$ and produces the new state $\Sigma'$, the \alwaysblock{} will result in the same final state. %Since only clocked \alwaysblock{} are supported, and one step in the semantics correspond to one clock cycle, it means that this rule is run once per clock cycle for each \alwaysblock{}.
Two types of assignments are supported in \alwaysblock{}s: nonblocking and blocking assignment. Nonblocking assignments all take effect simultaneously at the end of the clock cycle, %and atomically.
-while blocking assignments happen instantly so that later assignments in the clock cycle can pick them up. To model both of these assignments, the state $\Sigma$ has to be split into two maps: $\Gamma$, which contains the current values of all variables and arrays, and $\Delta$, which contains the values that will be assigned at the end of the clock cycle. %, we can therefore say that $\Sigma = (\Gamma, \Delta)$.
-$\Gamma$ and $\Delta$ each contain
+while blocking assignments happen instantly so that later assignments in the clock cycle can pick them up. To model both of these assignments, the state $\Sigma$ has to be split into two maps: $\Gamma$, which contains the current values of all variables and arrays, and $\Delta$, which contains the values that will be assigned at the end of the clock cycle. $\Sigma$ can therefore be defined as follows: $\Sigma = (\Gamma, \Delta)$.
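A small sketch (for illustration only, not part of the formal development) of how the two maps come into play within one clock cycle:

    module assign_sketch(input clk, output [31:0] y_out);
      reg [31:0] x = 0, y = 0;
      always @(posedge clk) begin
        x = x + 1;   // blocking: Gamma is updated at once, so the next line
                     // already reads the incremented value of x
        y <= x;      // nonblocking: the value is recorded in Delta and only
                     // merged into Gamma at the end of the clock cycle
      end
      assign y_out = y;
    endmodule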
Nonblocking assignment can therefore be expressed as follows:
\begin{equation*}
\inferrule[Nonblocking Reg]{\yhkeyword{name}\ d = \yhkeyword{OK}\ n \\ (\Gamma, e) \downarrow_{\text{expr}} v}{((\Gamma, \Delta), d\ \yhkeyword{ <= } e) \downarrow_{\text{stmnt}} (\Gamma, \Delta [n \mapsto v])}\\
@@ -86,7 +85,7 @@ The main execution of the module $\downarrow_{\text{module}}$ is split into $\do
\label{fig:inference_module}
\end{figure*}
-The \compcert{} computation model defines a set of states through which execution passes. In this subsection, we explain how we extend our Verilog semantics with five special-purpose registers in order to integrate it into \compcert{}.
+The \compcert{} computation model defines a set of states through which execution passes. In this subsection, we explain how we extend our Verilog semantics with four special-purpose registers in order to integrate it into \compcert{}.
\compcert{} executions pass through three main states:
\begin{description}
@@ -95,7 +94,7 @@ The \compcert{} computation model defines a set of states through which executio
\item[\texttt{Returnstate} \textit{sf} $v$] The state that is reached when a function returns back to the caller, with stack frame \textit{sf} and return value $v$.
\end{description}
-To support this computational model, we extend the Verilog module we generate with the following five registers and a RAM block:
+To support this computational model, we extend the Verilog module we generate with the following four registers and a RAM block:
\begin{description}
\item[program counter] The program counter can be modelled using a register that keeps track of the state, denoted as $\sigma$.
@@ -118,7 +117,7 @@ Therefore, in addition to the rules shown in Figure~\ref{fig:inference_module},
\subsection{Memory Model}\label{sec:verilog:memory}
-The Verilog semantics do not define a memory model for Verilog, as this is not needed for a hardware description language. There is no preexisting architecture that Verilog will produce, it can describe any memory layout that is needed. Instead of having specific semantics for memory, the semantics only needs to support the language features that can produce these different memory layouts, these being Verilog arrays. We therefore define semantics for updating Verilog arrays using blocking and nonblocking assignment. We then have to prove that the C memory model that \compcert{} uses matches with the interpretation of arrays that are used in Verilog. The \compcert{} memory model is infinite, whereas our representation of arrays in Verilog is inherently finite. There have already been various efforts to define a finite memory model for all compiler passes in \compcert{}, such as CompCertS~\cite{besson18_compc}, CompCertELF~\cite{wang20_compc} and CompCertTSO~\cite{sevcik13_compc}, however, we define the translation from \compcert{}'s standard infinite memory model to finitely sized arrays that can be represented in Verilog, leaving the compiler passes intact.
+The Verilog semantics do not define a memory model for Verilog, as this is not needed for a hardware description language. There is no preexisting architecture that Verilog will produce; it can describe any memory layout that is needed. Instead of having specific semantics for memory, the semantics only need to support the language features that can produce these different memory layouts, namely Verilog arrays. We therefore define semantics for updating Verilog arrays using blocking and nonblocking assignment. We then have to prove that the C memory model that \compcert{} uses matches the interpretation of arrays used in Verilog. The \compcert{} memory model is infinite, whereas our representation of arrays in Verilog is inherently finite. There have already been various efforts to define a finite memory model for all compiler passes in \compcert{}, such as Comp\-Cert\-S~\cite{besson18_compc}, Comp\-Cert\-ELF~\cite{wang20_compc} and Comp\-Cert\-TSO~\cite{sevcik13_compc}; however, we define the translation from \compcert{}'s standard infinite memory model to finitely sized arrays that can be represented in Verilog, leaving the compiler passes intact.
%\JW{It's not completely clear what the relationship is between your work and those works. The use of `only' suggests that you've re-done a subset of work that has already been done -- is that the right impression?}\YH{Hopefully that's more clear.}
@@ -130,7 +129,7 @@ This translation is represented in Figure~\ref{fig:memory_model_transl}, where \
\caption{Change in the memory model during the translation of 3AC to HTL. This is immediately after the assignment to the array.\YH{TODO: Update diagram}}\label{fig:memory_model_transl}
\end{figure}
-However, in practice, assigning and reading from an array directly in the state machine will not produce a memory in the final hardware design, as the synthesis tool cannot identify the array as having the necessary properties that a RAM needs, even though this is the most natural formulation of memory. Even though theoretically the memory will only be read from once per clock cycle, the synthesis tool cannot ensure that this is true, and will instead create a register for each memory location. This increases the size of the circuit dramatically, as the RAM on the FPGA chip will not be reused. Instead, the synthesis tool expects a specific template that ensures these properties, and will then transform the template into a proper RAM during synthesis. Therefore, a translation has to be performed from the naive use of memory in the state machine, to a proper use of a memory template.
+%However, in practice, assigning and reading from an array directly in the state machine will not produce a memory in the final hardware design, as the synthesis tool cannot identify the array as having the necessary properties that a RAM needs, even though this is the most natural formulation of memory. Even though theoretically the memory will only be read from once per clock cycle, the synthesis tool cannot ensure that this is true, and will instead create a register for each memory location. This increases the size of the circuit dramatically, as the RAM on the FPGA chip will not be reused. Instead, the synthesis tool expects a specific interface that ensures these properties, and will then transform the interface into a proper RAM during synthesis. Therefore, a translation has to be performed from the naive use of memory in the state machine, to a proper use of a memory interface.
%\begin{figure}
% \centering