Add some more content

author: Yann Herklotz <git@yannherklotz.com> 2020-11-02 00:27:18 +0000
committer: Yann Herklotz <git@yannherklotz.com> 2020-11-02 00:27:18 +0000
commit: 36dc41477a21e15b81b09985c9505ea1b5f10817 (patch)
tree: fccc68fa4f98dcf5d6b27acdf0332b3827563e5a
parent: a5561bd00606bdbbfdc014e720021d3a900b92d4 (diff)
download: oopsla21_fvhls-36dc41477a21e15b81b09985c9505ea1b5f10817.tar.gz
oopsla21_fvhls-36dc41477a21e15b81b09985c9505ea1b5f10817.zip
7 files changed, 376 insertions, 198 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 9db0262..425e1b0 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -1,41 +1,16 @@
-\section{Turning CompCert into an HLS tool}
+\section{Adding an HLS back end to CompCert}
 
-%% Should maybe go in the introduction instead.
+This section covers the main architecture of the HLS tool, and the way in which the backend was added to CompCert.  This section will also cover an example of converting a simple C program into hardware, expressed in the Verilog language.
 
-\begin{figure}
-  \centering
-  \resizebox{0.47\textwidth}{!}{
-  \begin{tikzpicture}
-    [language/.style={fill=white,rounded corners=2pt}]
-    \fill[compcert,rounded corners=3pt] (-1,-1) rectangle (9,1.5);
-    \fill[formalhls,rounded corners=3pt] (-1,-1.5) rectangle (9,-2.5);
-    \node[language] at (0,0) (clight) {Clight};
-    \node[language] at (2,0) (cminor) {C\#minor};
-    \node[language] at (4,0) (rtl) {RTL};
-    \node[language] at (6,0) (ltl) {LTL};
-    \node[language] at (8,0) (ppc) {PPC};
-    \node[language] at (4,-2) (dfgstmd) {HTL};
-    \node[language] at (7,-2) (verilog) {Verilog};
-    \node at (0,1) {CompCert};
-    \node at (0,-2) {Vericert};
-    \draw[->] (clight) -- (cminor);
-    \draw[->,dashed] (cminor) -- (rtl);
-    \draw[->] (rtl) -- (ltl);
-    \draw[->,dashed] (ltl) -- (ppc);
-    \draw[->] (rtl) -- (dfgstmd);
-    \draw[->] (dfgstmd) -- (verilog);
-  \end{tikzpicture}}
-  \caption{Verilog backend branching off at the RTL stage. \JW{It's a real nightmare that `RTL' is so ambiguous, meaning both software and hardware. I think we have to do something about it. One possibility: according to the CompCert webpages, software RTL is `also known as 3-address code', so we could replace (software) `RTL' with `3AC' (and give a footnote to explain why). Is it also possible to replace (hardware) `RTL' with just `Verilog'? I know Verilog could mean RTL or netlist, but since this paper doesn't talk much about RTL-to-netlist synthesis, perhaps that doesn't matter? Or perhaps RTL could be replaced with FSMD? Or even just `hardware design'?}\YH{I like that suggestion, I will actually replace RTL by 3AC then, and at least in this figure I did purposefully replace RTL by Verilog, and wanted to do it throughout the paper.}}%
-  \label{fig:rtlbranch}
-\end{figure}
-
-This section covers the main architecture of the HLS tool, and the way in which the backend was added to CompCert.
+% - Explain main differences between translating C to software and to hardware.
 
-CompCert is made up of 11 intermediate languages in between the Clight input and the assembly output.  These intermediate languages each serve a different purpose and contain various different optimisations.  When designing a new backend for CompCert, it is therefore crucial to know where to branch off and start the hardware generation.  Many of the optimisations that the CompCert backend performs are not necessary when generating custom hardware and not relying on a CPU anymore, such as register allocation or even scheduling.  It is therefore important to find the right intermediate language so that the HLS tool still benefits from many of the generic optimisations that CompCert performs, but does not receive the code transformations that are specific to CPU architectures.
+%   + This can be done by going through the simple example.
 
-Existing HLS compilers usually use LLVM IR as an intermediate representation when performing HLS specific optimisations, as each instruction can be mapped quite well to hardware which performs the same behaviour.  CompCert's register transfer language (RTL) is the intermediate language that resembles LLVM IR the most, as it also has an infinite number of pseudo-registers and each instruction maps well to hardware. \JP{Perhaps this needs some further qualification? RTL uses the operations from the target architecture, and indeed performs architecture specific optimisations prior to RTL gen, so (for sake of example) switching from x86 RTL to RISC-V RTL could have a significant impact on performance.}\YH{Yes will definitely include those, just have to think about where.  I think this could be something that could be mentioned in the RTL section instead.}  In addition to that, many optimisations that are also useful for HLS are performed in RTL, which means that if it is supported as the input language, the HLS algorithm benefits from the same optimisations.  It is therefore a good candidate to be chosen as the input language to the HLS backend. The complete flow that Vericert takes is show in figure~\ref{fig:rtlbranch}.
-
-%%TODO: Maybe add why LTL and the other smaller languages are not that well suited
+\begin{figure*}
+  \centering
+  \includegraphics[scale=0.3,trim={10cm 10cm 7cm 7cm},clip=true]{data/accumulator_fsmd2.pdf}
+  \caption{Accumulator hardware diagram.}\label{fig:accumulator_diagram}
+\end{figure*}
 
 \begin{figure}
   \centering
@@ -47,101 +22,61 @@ Existing HLS compilers usually use LLVM IR as an intermediate representation whe
     \inputminted[fontsize=\footnotesize]{c}{data/accumulator.rtl}
     \caption{Accumulator RTL code.}\label{fig:accumulator_rtl}
   \end{subfigure}
-  \caption{Accumulator example using CompCert to translate from C to RTL.}\label{fig:accumulator_c_rtl}
+  \caption{Accumulator example using CompCert to translate from C to three address code (3AC).}\label{fig:accumulator_c_rtl}
 \end{figure}
 
 \begin{figure}
   \centering
   \begin{subfigure}[b]{0.5\linewidth}
     \inputminted[fontsize=\tiny]{systemverilog}{data/accumulator1.v}
-    \caption{Accumulator HTL code.}\label{fig:accumulator_htl}
+    \caption{Accumulator HTL code.}\label{fig:accumulator_v_1}
   \end{subfigure}%
   \begin{subfigure}[b]{0.5\linewidth}
     \inputminted[fontsize=\tiny]{systemverilog}{data/accumulator2.v}
-    \caption{Accumulator Verilog code.}\label{fig:accumulator_v}
+    \caption{Accumulator Verilog code.}\label{fig:accumulator_v_2}
   \end{subfigure}
-  \caption{Accumulator example using CompCert to translate from HTL to Verilog.\YH{I feel like these examples take up too much space, but don't really know of a different way to show off a complete example without the long code.} \JW{Ok, what about drawing the FSMD in some sort of pictorial way? I think you tried drawing it as a schematic previously, but that was too big and clumsy. What about drawing the FSMD as a state machine, with sixteen states and labelled edges between them? Or might that be too abstract?}}\label{fig:accumulator_htl_v}
+  \caption{Accumulator example using CompCert to translate from HTL to Verilog.\YH{I feel like these examples take up too much space, but don't really know of a different way to show off a complete example without the long code.} \JW{Ok, what about drawing the FSMD in some sort of pictorial way? I think you tried drawing it as a schematic previously, but that was too big and clumsy. What about drawing the FSMD as a state machine, with sixteen states and labelled edges between them? Or might that be too abstract?}}\label{fig:accumulator_v}
 \end{figure}
 
-%\begin{figure}
-%  \centering
-%  \includegraphics[width=0.47\textwidth]{data/accumulator_fsmd}
-%  \caption{Accumulator example translated to a finite state-machine with datapath (FSMD).}\label{fig:accumulator_fsmd}
-%\end{figure}
-
-To describe the translation, we start with an example of how to translate a simple accumulator example, which is shown in figure~\ref{fig:accumulator_c}.  Using this example, the different stages of the translation can be explained, together with the way they were proven.
-
-The first step is to use the CompCert front end passes to convert the C code into RTL, which is the intermediate language that CompCert uses for most of its optimisations.  The translation is shown in figure~\ref{fig:accumulator_c_rtl}, where the RTL code is depicted in figure~\ref{fig:accumulator_rtl}.  At the RTL stage, many of CompCert's normal optimisations passes, such as deadcode elimination or common subexpression elimination (CSE) can be applied to reduce the RTL as much as possible.  The optimised RTL is then translated to HTL, consisting of a finite state-machine representation of the RTL code.  This representation maps directly to hardware and it therefore transformed to Verilog afterwards.  The translation from RTL to HTL is the main step in converting the software representation of the code into a hardware representation of the code.  The following sections will go over the details of the transformation of each pass.
-
-\subsection{CompCert RTL}
-
-RTL is the existing intermediate language in CompCert that was chosen to transform to hardware, as it is quite architecture independent compared to the lower level intermediate languages, and each instruction translates to a hardware construct.  All CompCert intermediate language follow the similar structure below:
-
-\begin{align*}
-  \mathit{program} \quad ::= \{ &\mathbf{variables} : (\mathit{id} * \mathit{data}) \text{ list}, \\
-                 &\mathbf{functions} : (\mathit{id} * \mathit{function\_def}) \text{ list},\\
-                 &\mathbf{main} : \mathit{id} \}
-\end{align*}
-
-\noindent where function definitions can either be internal or external.  External functions are functions that are not defined in the current translation unit, and are therefore not part of the current translation.  The difference in between the CompCert intermediate languages is therefore how the internal function is defined, as that defines the structure of the language itself.
-
-%% Describe RTL
-RTL function definitions are a sequence of instructions encoded in a control-flow graph, with each instruction linking to the next instruction that should be executed.  RTL consists of ten separate instructions in total, which makes this a good candidate for high-level synthesis translation, as these instructions translate quite directly to a hardware equivalent.  This structure makes it ideal for performing optimisations on, as performing static analysis and transformations on this language is much simpler than doing the same optimisation for the C abstract syntax tree instead.  However, even though RTL is quite architecture independent and good for software optimisations, it does not model hardware well and is therefore not suitable for hardware optimisations.  We therefore need a new intermediate language that translates directly into hardware and which can also be used as an intermediate language for more hardware specific optimisations.
-
-However, before this translation can be performed, all functions need to be inlined.  This means that recursive function calls cannot be supported, however, these are difficult to translate to hardware as they require dynamic allocation of the stack.  Dynamic allocation cannot be performed in hardware, because the memory size has to be set statically and then translated to a physical memory in the hardware. \JP{I think we should explain this a little bit more; HW allocation is possible in principle.} Inlining is already an optimisation that is performed at the RTL stage in CompCert, however, to get RTL that is ready to be translated to hardware, this pass was modified so that is forced all functions to be inlined.  This eliminates all the function calls from the RTL, as long as there are no recursive function calls, which means that do not have to be handled by the translation to hardware.
-
-\subsection{HTL}
+Taking a simple accumulator program shown in Figure~\ref{fig:accumulator_c}, we can describe the main translation that is performed in Vericert to go from the behaviour description into a hardware design.  
 
-The first hardware specific intermediate language that is introduced is the hardware transfer language (HTL).  It is a self-contained finite state machine with datapath (FSMD) representation of the RTL code which maps directly to actual hardware.  HTL, like all CompCert intermediate languages, has the same program structure as RTL, but internal functions now contain logic to control the order of execution, and a datapath that transforms the data in the registers.  This is represented by having two maps that link states to the control logic and to the current position in the datapath, which are both expressed using Verilog statements.  The syntax for HTL functions are the following:
+%   + TODO Explain the main mapping in a short simple way
 
-\begin{align*}
-  g \quad &::= \quad n \mapsto s\\
-  d_{\rm r} \quad &::= \quad r \mapsto (io? * n)\\
-  d_{\rm a} \quad &::= \quad r \mapsto (io? * n * n)\\
-  M \quad &::= \quad \big\{\ \texttt{params} : r \text{ list}\\
-                                     &\texttt{datapath} : g\\
-                                     &\texttt{controllogic} : g\\
-                                     &\texttt{entrypoint} : n\\
-                                     &\texttt{st, stk, finish, return, start, reset, clk} : r\\
-                                     &\texttt{scldecls} : d_{\rm r}\\
-                                     &\texttt{arrdecls} : d_{\rm a}\ \big\}
-\end{align*}
+% - Explain why we chose 3AC (RTL) as the branching off point.
 
-\noindent where $d_{\rm a}$ and $d_{\rm r}$ are declaration lists for all the registers used in module $M$, and $g$ is map from states to Verilog statements.
-
-Going back to to the previous accumulator example, the HTL code that is generated is shown in figure~\ref{fig:accumulator_htl}.  The function consists of two main parts, the data-path and the control logic.  The data-path describes how the data registers should change and what computations should be performed in each cycle, whereas the control logic describes how the state changes should be performed.  These two blocks in the main function will execute in parallel, meaning at each state, the control logic will load the next state into the state register, and the data-path will compute a value and store it in the appropriate register.
-
-The translation from RTL to HTL is quite straightforward, as each RTL instruction either matches up quite well to a hardware construct, or does not have to be handled by the translation, such as function calls.  At each instruction, the control flow is separated from the data computation and is then added to the control logic and data-flow map respectively.  For example, in state 16 in the RTL code in figure~\ref{fig:accumulator_rtl}, the register \texttt{x9} is initialised to one and then moves on to state 15.  This is encoded in HTL by initialising a 32 bit register \texttt{reg\_9} to one as well in the data-flow graph section, and also adding a state transition to the state 15 in the controllogic transition.  Simple operator instructions are translated in a similar way.  For example, in state 5, the value in the array is added to the current value of the accumulated sum, which is simply translated to an addition of the equivalent registers in the HTL code.
-
-The translation can be described by a relational specification, which can be described by the following relation, where \textit{fin}, \textit{rtrn}, $\sigma$ and \textit{stk} are the registers for the finished signal, return value, current state and stack respectively, $i$ is the RTL instruction being translated, and \textit{data} and \textit{control} are the data-flow and control logic map respectively.
-
-\begin{equation*}
-  \yhfunction{tr\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ i\ \textit{data }\ \textit{control}
-\end{equation*}
-
-\noindent An example of a rule describing the translation of an operation is the following:
-
-\begin{equation*}
-  \inferrule[Iop]{\yhfunction{tr\_op } \textit{op }\ \vec{a} = \yhconstant{OK } e}{\yhfunction{tr\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ (\yhconstant{Iop } \textit{op }\ \vec{a}\ d\ n)\ (d \Leftarrow e)\ (\sigma \Leftarrow n)}
-\end{equation*}
-
-In the accumulator C code, the for loop is translated to a branch statement in state 3, which can also be translated to equivalent HTL code by placing the comparison into the control logic part of the main function, and not performing any data operations in the data-flow section.  On the next clock cycle, the state will therefore transition to state 7 or state 2 depending on if \texttt{reg\_3} is less than 3 or not.  One thing to note is that in this case, the comparison is signed.  By default, all operators and registers in Verilog and HTL are unsigned, so to force an operation to handle the bits as signed, both operators have to be forced to signed.  In addition to that, Verilog resizes expressions to the largest needed size by default, which can affect the result of the computation.  This feature is also not supported by the Verilog semantics we used, and there would therefore be a mismatch between the Verilog semantics and the actual behaviour of Verilog according to the standard.  To bypass this issue braces are used to stop the Verilog simulator or synthesis tool from resizing anything inside the braces.  Instead, explicit resizing is used in the semantics and operations can only be performed on two registers that have the same size.
-
-Finally, this example also contains an array, which is stored on the stack for the main function.  Translating arrays means that load and store instructions need to be supported.  In hardware RAM is not as available as in software, and needs to be explicitly implemented by declaring a two dimensional array with specific properties, such as only being able to write and read from it once per clock cycle.  Therefore, to make memory as efficient as possible and support reads and writes in one clock cycle, a memory only addressable by word needs to be implemented, instead of the standard RAM which is addressable by byte.  This is because the clock cycles in devices such as FPGAs are normally an order of magnitude slower than CPUs, meaning taking four cycles to load an integer is not feasible.  However, this leads to various problems, the main one being that addresses need to be word-aligned, which is not the case in RTL or in C.  Therefore, addresses need to be divisible by four to also be valid addresses in HTL, which is shown in the load from the stack in state 6 of the RTL.  The equivalent HTL code is shown in the datapath branch where the same addition is done, however, the result is divided by 4.  To prove that these two loads have the same behaviour, we have to prove that the load was word-aligned and therefore could be divided by 4.\YH{Elaborate on our solution.}
-
-\subsection{Verilog}
-
-Finally, we have to translate the HTL code into proper Verilog and prove that it behaves the same as the RTL according to the Verilog semantics.  The Verilog output is modelled as a complete abstract syntax tree (AST) instead of being an abstract map over the instructions that are executed.  However, as all the instructions are already expressed in Verilog, only the maps need to be translated to valid Verilog, and correct declarations for all the variables in the program need to be added as well.
-
-This translation seems quite straightforward, however, proving that it is correct is not that simple, as all the implicit assumptions that were made in HTL need to be translated explicitly to Verilog and needs to have the same behaviour according to the semantics.  Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated.  In general, the structure is similar to the structure of the HTL code, however, the control and datapath maps have been translated to case statement which serve the same purpose.  The other main addition to the code is the initialisation of all the variables in the code to the correct bitwidth and the declaration of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.  The main subtle change that was added to the code is the reset signal which sets the state to the starting state correctly.  In HTL, this was described directly in the semantics, where the entrypoint is stored in the module interface of HTL.  However, in Verilog we also want to verify that the hardware will be placed into the right state when we power it up or reset the design, and it therefore has to be encoded directly in the Verilog code.
-
-To check that the initial state of the Verilog is the same as the HTL code, we therefore have to run the module once, assuming the state is undefined and the reset is set to high.  We then have to compare the abstract starting state stored in the HTL module to the bitvector value we obtain from running the module for one clock cycle and prove that they are the same.  As the value for the state is undefined, the case statements will evaluate to the default state and therefore not perform any computations.
+\begin{figure}
+  \centering
+  \resizebox{0.47\textwidth}{!}{
+  \begin{tikzpicture}
+    [language/.style={fill=white,rounded corners=2pt}]
+    \fill[compcert,rounded corners=3pt] (-1,-1) rectangle (9,1.5);
+    \fill[formalhls,rounded corners=3pt] (-1,-1.5) rectangle (9,-2.5);
+    \node[language] at (0,0) (clight) {Clight};
+    \node[language] at (2,0) (cminor) {C\#minor};
+    \node[language] at (4,0) (rtl) {3AC};
+    \node[language] at (6,0) (ltl) {LTL};
+    \node[language] at (8,0) (ppc) {PPC};
+    \node[language] at (4,-2) (dfgstmd) {HTL};
+    \node[language] at (7,-2) (verilog) {Verilog};
+    \node at (0,1) {CompCert};
+    \node at (0,-2) {Vericert};
+    \draw[->] (clight) -- (cminor);
+    \draw[->,dashed] (cminor) -- (rtl);
+    \draw[->] (rtl) -- (ltl);
+    \draw[->,dashed] (ltl) -- (ppc);
+    \draw[->] (rtl) -- (dfgstmd);
+    \draw[->] (dfgstmd) -- (verilog);
+  \end{tikzpicture}}
+  \caption{Verilog backend branching off at the 3AC stage in CompCert.}%
+  \label{fig:rtlbranch}
+\end{figure}
 
-The translation from maps to case statements is done by turning each node of the tree into a case expression with the statments in each being the same.  The main difficulty for the proof is that a structure that can be directly accessed is transformed into an inductive structure where a certain number of constructors need to be called to get to the correct case.  The proof of the translation from maps to case statements follows by induction over the list of elements in the map and the fact that each key in this list will be unique.  In addition to that, the statement that is currently being evaluated is guaranteed by the correctness of the list of elements to be in that list.  The latter fact therefore eliminates the base case, as an empty list does not contain the element we know is in the list.  The other two cases follow by the fact that either the key is equal to the evaluated value of the case expression, or it isn't.  In the first case we can then evaluate the statement and get the state after the case expression, as the uniqueness of the key tells us that the key cannot show up in the list anymore.  In the other case we can just apply the inductive hypothesis and remove the current case from the case statement, as it did not match.
+%   + TODO Clarify connection between CFG and FSMD
 
-Another problem with the representation of the state as an actual register is that we have to make sure that the state does not overflow.  Currently, the state register will always be 32 bits, meaning the maximum number of states can only be $2^{32} - 1$.  We therefore have to prove that the state value will never go over that value.  This means that during the translation we have to check for each state that it can fit into an integer.
+%   + TODO Explain how memory is mapped
 
 %%% Local Variables:
 %%% mode: latex
 %%% TeX-master: "main"
+%%% TeX-command-extra-options: "-shell-escape"
 %%% End:
diff --git a/archive/algorithm.tex b/archive/algorithm.tex
new file mode 100644
index 0000000..9db0262
--- /dev/null
+++ b/archive/algorithm.tex
@@ -0,0 +1,147 @@
+\section{Turning CompCert into an HLS tool}
+
+%% Should maybe go in the introduction instead.
+
+\begin{figure}
+  \centering
+  \resizebox{0.47\textwidth}{!}{
+  \begin{tikzpicture}
+    [language/.style={fill=white,rounded corners=2pt}]
+    \fill[compcert,rounded corners=3pt] (-1,-1) rectangle (9,1.5);
+    \fill[formalhls,rounded corners=3pt] (-1,-1.5) rectangle (9,-2.5);
+    \node[language] at (0,0) (clight) {Clight};
+    \node[language] at (2,0) (cminor) {C\#minor};
+    \node[language] at (4,0) (rtl) {RTL};
+    \node[language] at (6,0) (ltl) {LTL};
+    \node[language] at (8,0) (ppc) {PPC};
+    \node[language] at (4,-2) (dfgstmd) {HTL};
+    \node[language] at (7,-2) (verilog) {Verilog};
+    \node at (0,1) {CompCert};
+    \node at (0,-2) {Vericert};
+    \draw[->] (clight) -- (cminor);
+    \draw[->,dashed] (cminor) -- (rtl);
+    \draw[->] (rtl) -- (ltl);
+    \draw[->,dashed] (ltl) -- (ppc);
+    \draw[->] (rtl) -- (dfgstmd);
+    \draw[->] (dfgstmd) -- (verilog);
+  \end{tikzpicture}}
+  \caption{Verilog backend branching off at the RTL stage. \JW{It's a real nightmare that `RTL' is so ambiguous, meaning both software and hardware. I think we have to do something about it. One possibility: according to the CompCert webpages, software RTL is `also known as 3-address code', so we could replace (software) `RTL' with `3AC' (and give a footnote to explain why). Is it also possible to replace (hardware) `RTL' with just `Verilog'? I know Verilog could mean RTL or netlist, but since this paper doesn't talk much about RTL-to-netlist synthesis, perhaps that doesn't matter? Or perhaps RTL could be replaced with FSMD? Or even just `hardware design'?}\YH{I like that suggestion, I will actually replace RTL by 3AC then, and at least in this figure I did purposefully replace RTL by Verilog, and wanted to do it throughout the paper.}}%
+  \label{fig:rtlbranch}
+\end{figure}
+
+This section covers the main architecture of the HLS tool, and the way in which the backend was added to CompCert.
+
+CompCert is made up of 11 intermediate languages in between the Clight input and the assembly output.  These intermediate languages each serve a different purpose and contain various different optimisations.  When designing a new backend for CompCert, it is therefore crucial to know where to branch off and start the hardware generation.  Many of the optimisations that the CompCert backend performs are not necessary when generating custom hardware and not relying on a CPU anymore, such as register allocation or even scheduling.  It is therefore important to find the right intermediate language so that the HLS tool still benefits from many of the generic optimisations that CompCert performs, but does not receive the code transformations that are specific to CPU architectures.
+
+Existing HLS compilers usually use LLVM IR as an intermediate representation when performing HLS specific optimisations, as each instruction can be mapped quite well to hardware which performs the same behaviour.  CompCert's register transfer language (RTL) is the intermediate language that resembles LLVM IR the most, as it also has an infinite number of pseudo-registers and each instruction maps well to hardware. \JP{Perhaps this needs some further qualification? RTL uses the operations from the target architecture, and indeed performs architecture specific optimisations prior to RTL gen, so (for sake of example) switching from x86 RTL to RISC-V RTL could have a significant impact on performance.}\YH{Yes will definitely include those, just have to think about where.  I think this could be something that could be mentioned in the RTL section instead.}  In addition to that, many optimisations that are also useful for HLS are performed in RTL, which means that if it is supported as the input language, the HLS algorithm benefits from the same optimisations.  It is therefore a good candidate to be chosen as the input language to the HLS backend. The complete flow that Vericert takes is show in figure~\ref{fig:rtlbranch}.
+
+%%TODO: Maybe add why LTL and the other smaller languages are not that well suited
+
+\begin{figure}
+  \centering
+  \begin{subfigure}[b]{0.5\linewidth}
+    \inputminted{c}{data/accumulator.c}
+    \caption{Accumulator C code.}\label{fig:accumulator_c}
+  \end{subfigure}%
+  \begin{subfigure}[b]{0.5\linewidth}
+    \inputminted[fontsize=\footnotesize]{c}{data/accumulator.rtl}
+    \caption{Accumulator RTL code.}\label{fig:accumulator_rtl}
+  \end{subfigure}
+  \caption{Accumulator example using CompCert to translate from C to RTL.}\label{fig:accumulator_c_rtl}
+\end{figure}
+
+\begin{figure}
+  \centering
+  \begin{subfigure}[b]{0.5\linewidth}
+    \inputminted[fontsize=\tiny]{systemverilog}{data/accumulator1.v}
+    \caption{Accumulator HTL code.}\label{fig:accumulator_htl}
+  \end{subfigure}%
+  \begin{subfigure}[b]{0.5\linewidth}
+    \inputminted[fontsize=\tiny]{systemverilog}{data/accumulator2.v}
+    \caption{Accumulator Verilog code.}\label{fig:accumulator_v}
+  \end{subfigure}
+  \caption{Accumulator example using CompCert to translate from HTL to Verilog.\YH{I feel like these examples take up too much space, but don't really know of a different way to show off a complete example without the long code.} \JW{Ok, what about drawing the FSMD in some sort of pictorial way? I think you tried drawing it as a schematic previously, but that was too big and clumsy. What about drawing the FSMD as a state machine, with sixteen states and labelled edges between them? Or might that be too abstract?}}\label{fig:accumulator_htl_v}
+\end{figure}
+
+%\begin{figure}
+%  \centering
+%  \includegraphics[width=0.47\textwidth]{data/accumulator_fsmd}
+%  \caption{Accumulator example translated to a finite state-machine with datapath (FSMD).}\label{fig:accumulator_fsmd}
+%\end{figure}
+
+To describe the translation, we start with an example of how to translate a simple accumulator example, which is shown in figure~\ref{fig:accumulator_c}.  Using this example, the different stages of the translation can be explained, together with the way they were proven.
+
+The first step is to use the CompCert front end passes to convert the C code into RTL, which is the intermediate language that CompCert uses for most of its optimisations.  The translation is shown in figure~\ref{fig:accumulator_c_rtl}, where the RTL code is depicted in figure~\ref{fig:accumulator_rtl}.  At the RTL stage, many of CompCert's normal optimisations passes, such as deadcode elimination or common subexpression elimination (CSE) can be applied to reduce the RTL as much as possible.  The optimised RTL is then translated to HTL, consisting of a finite state-machine representation of the RTL code.  This representation maps directly to hardware and it therefore transformed to Verilog afterwards.  The translation from RTL to HTL is the main step in converting the software representation of the code into a hardware representation of the code.  The following sections will go over the details of the transformation of each pass.
+
+\subsection{CompCert RTL}
+
+RTL is the existing intermediate language in CompCert that was chosen to transform to hardware, as it is quite architecture independent compared to the lower level intermediate languages, and each instruction translates to a hardware construct.  All CompCert intermediate language follow the similar structure below:
+
+\begin{align*}
+  \mathit{program} \quad ::= \{ &\mathbf{variables} : (\mathit{id} * \mathit{data}) \text{ list}, \\
+                 &\mathbf{functions} : (\mathit{id} * \mathit{function\_def}) \text{ list},\\
+                 &\mathbf{main} : \mathit{id} \}
+\end{align*}
+
+\noindent where function definitions can either be internal or external.  External functions are functions that are not defined in the current translation unit, and are therefore not part of the current translation.  The difference in between the CompCert intermediate languages is therefore how the internal function is defined, as that defines the structure of the language itself.
+
+%% Describe RTL
+RTL function definitions are a sequence of instructions encoded in a control-flow graph, with each instruction linking to the next instruction that should be executed.  RTL consists of ten separate instructions in total, which makes this a good candidate for high-level synthesis translation, as these instructions translate quite directly to a hardware equivalent.  This structure makes it ideal for performing optimisations on, as performing static analysis and transformations on this language is much simpler than doing the same optimisation for the C abstract syntax tree instead.  However, even though RTL is quite architecture independent and good for software optimisations, it does not model hardware well and is therefore not suitable for hardware optimisations.  We therefore need a new intermediate language that translates directly into hardware and which can also be used as an intermediate language for more hardware specific optimisations.
+
+However, before this translation can be performed, all functions need to be inlined.  This means that recursive function calls cannot be supported, however, these are difficult to translate to hardware as they require dynamic allocation of the stack.  Dynamic allocation cannot be performed in hardware, because the memory size has to be set statically and then translated to a physical memory in the hardware. \JP{I think we should explain this a little bit more; HW allocation is possible in principle.} Inlining is already an optimisation that is performed at the RTL stage in CompCert, however, to get RTL that is ready to be translated to hardware, this pass was modified so that is forced all functions to be inlined.  This eliminates all the function calls from the RTL, as long as there are no recursive function calls, which means that do not have to be handled by the translation to hardware.
+
+\subsection{HTL}
+
+The first hardware specific intermediate language that is introduced is the hardware transfer language (HTL).  It is a self-contained finite state machine with datapath (FSMD) representation of the RTL code which maps directly to actual hardware.  HTL, like all CompCert intermediate languages, has the same program structure as RTL, but internal functions now contain logic to control the order of execution, and a datapath that transforms the data in the registers.  This is represented by having two maps that link states to the control logic and to the current position in the datapath, which are both expressed using Verilog statements.  The syntax for HTL functions are the following:
+
+\begin{align*}
+  g \quad &::= \quad n \mapsto s\\
+  d_{\rm r} \quad &::= \quad r \mapsto (io? * n)\\
+  d_{\rm a} \quad &::= \quad r \mapsto (io? * n * n)\\
+  M \quad &::= \quad \big\{\ \texttt{params} : r \text{ list}\\
+                                     &\texttt{datapath} : g\\
+                                     &\texttt{controllogic} : g\\
+                                     &\texttt{entrypoint} : n\\
+                                     &\texttt{st, stk, finish, return, start, reset, clk} : r\\
+                                     &\texttt{scldecls} : d_{\rm r}\\
+                                     &\texttt{arrdecls} : d_{\rm a}\ \big\}
+\end{align*}
+
+\noindent where $d_{\rm a}$ and $d_{\rm r}$ are declaration lists for all the registers used in module $M$, and $g$ is map from states to Verilog statements.
+
+Going back to to the previous accumulator example, the HTL code that is generated is shown in figure~\ref{fig:accumulator_htl}.  The function consists of two main parts, the data-path and the control logic.  The data-path describes how the data registers should change and what computations should be performed in each cycle, whereas the control logic describes how the state changes should be performed.  These two blocks in the main function will execute in parallel, meaning at each state, the control logic will load the next state into the state register, and the data-path will compute a value and store it in the appropriate register.
+
+The translation from RTL to HTL is quite straightforward, as each RTL instruction either matches up quite well to a hardware construct, or does not have to be handled by the translation, such as function calls.  At each instruction, the control flow is separated from the data computation and is then added to the control logic and data-flow map respectively.  For example, in state 16 in the RTL code in figure~\ref{fig:accumulator_rtl}, the register \texttt{x9} is initialised to one and then moves on to state 15.  This is encoded in HTL by initialising a 32 bit register \texttt{reg\_9} to one as well in the data-flow graph section, and also adding a state transition to the state 15 in the controllogic transition.  Simple operator instructions are translated in a similar way.  For example, in state 5, the value in the array is added to the current value of the accumulated sum, which is simply translated to an addition of the equivalent registers in the HTL code.
+
+The translation can be described by a relational specification, which can be described by the following relation, where \textit{fin}, \textit{rtrn}, $\sigma$ and \textit{stk} are the registers for the finished signal, return value, current state and stack respectively, $i$ is the RTL instruction being translated, and \textit{data} and \textit{control} are the data-flow and control logic map respectively.
+
+\begin{equation*}
+  \yhfunction{tr\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ i\ \textit{data }\ \textit{control}
+\end{equation*}
+
+\noindent An example of a rule describing the translation of an operation is the following:
+
+\begin{equation*}
+  \inferrule[Iop]{\yhfunction{tr\_op } \textit{op }\ \vec{a} = \yhconstant{OK } e}{\yhfunction{tr\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ (\yhconstant{Iop } \textit{op }\ \vec{a}\ d\ n)\ (d \Leftarrow e)\ (\sigma \Leftarrow n)}
+\end{equation*}
+
+In the accumulator C code, the for loop is translated to a branch statement in state 3, which can also be translated to equivalent HTL code by placing the comparison into the control logic part of the main function, and not performing any data operations in the data-flow section.  On the next clock cycle, the state will therefore transition to state 7 or state 2 depending on if \texttt{reg\_3} is less than 3 or not.  One thing to note is that in this case, the comparison is signed.  By default, all operators and registers in Verilog and HTL are unsigned, so to force an operation to handle the bits as signed, both operators have to be forced to signed.  In addition to that, Verilog resizes expressions to the largest needed size by default, which can affect the result of the computation.  This feature is also not supported by the Verilog semantics we used, and there would therefore be a mismatch between the Verilog semantics and the actual behaviour of Verilog according to the standard.  To bypass this issue braces are used to stop the Verilog simulator or synthesis tool from resizing anything inside the braces.  Instead, explicit resizing is used in the semantics and operations can only be performed on two registers that have the same size.
+
+Finally, this example also contains an array, which is stored on the stack for the main function.  Translating arrays means that load and store instructions need to be supported.  In hardware RAM is not as available as in software, and needs to be explicitly implemented by declaring a two dimensional array with specific properties, such as only being able to write and read from it once per clock cycle.  Therefore, to make memory as efficient as possible and support reads and writes in one clock cycle, a memory only addressable by word needs to be implemented, instead of the standard RAM which is addressable by byte.  This is because the clock cycles in devices such as FPGAs are normally an order of magnitude slower than CPUs, meaning taking four cycles to load an integer is not feasible.  However, this leads to various problems, the main one being that addresses need to be word-aligned, which is not the case in RTL or in C.  Therefore, addresses need to be divisible by four to also be valid addresses in HTL, which is shown in the load from the stack in state 6 of the RTL.  The equivalent HTL code is shown in the datapath branch where the same addition is done, however, the result is divided by 4.  To prove that these two loads have the same behaviour, we have to prove that the load was word-aligned and therefore could be divided by 4.\YH{Elaborate on our solution.}
+
+\subsection{Verilog}
+
+Finally, we have to translate the HTL code into proper Verilog and prove that it behaves the same as the RTL according to the Verilog semantics.  The Verilog output is modelled as a complete abstract syntax tree (AST) instead of being an abstract map over the instructions that are executed.  However, as all the instructions are already expressed in Verilog, only the maps need to be translated to valid Verilog, and correct declarations for all the variables in the program need to be added as well.
+
+This translation seems quite straightforward, however, proving that it is correct is not that simple, as all the implicit assumptions that were made in HTL need to be translated explicitly to Verilog and needs to have the same behaviour according to the semantics.  Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated.  In general, the structure is similar to the structure of the HTL code, however, the control and datapath maps have been translated to case statement which serve the same purpose.  The other main addition to the code is the initialisation of all the variables in the code to the correct bitwidth and the declaration of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.  The main subtle change that was added to the code is the reset signal which sets the state to the starting state correctly.  In HTL, this was described directly in the semantics, where the entrypoint is stored in the module interface of HTL.  However, in Verilog we also want to verify that the hardware will be placed into the right state when we power it up or reset the design, and it therefore has to be encoded directly in the Verilog code.
+
+To check that the initial state of the Verilog is the same as the HTL code, we therefore have to run the module once, assuming the state is undefined and the reset is set to high.  We then have to compare the abstract starting state stored in the HTL module to the bitvector value we obtain from running the module for one clock cycle and prove that they are the same.  As the value for the state is undefined, the case statements will evaluate to the default state and therefore not perform any computations.
+
+The translation from maps to case statements is done by turning each node of the tree into a case expression with the statments in each being the same.  The main difficulty for the proof is that a structure that can be directly accessed is transformed into an inductive structure where a certain number of constructors need to be called to get to the correct case.  The proof of the translation from maps to case statements follows by induction over the list of elements in the map and the fact that each key in this list will be unique.  In addition to that, the statement that is currently being evaluated is guaranteed by the correctness of the list of elements to be in that list.  The latter fact therefore eliminates the base case, as an empty list does not contain the element we know is in the list.  The other two cases follow by the fact that either the key is equal to the evaluated value of the case expression, or it isn't.  In the first case we can then evaluate the statement and get the state after the case expression, as the uniqueness of the key tells us that the key cannot show up in the list anymore.  In the other case we can just apply the inductive hypothesis and remove the current case from the case statement, as it did not match.
+
+Another problem with the representation of the state as an actual register is that we have to make sure that the state does not overflow.  Currently, the state register will always be 32 bits, meaning the maximum number of states can only be $2^{32} - 1$.  We therefore have to prove that the state value will never go over that value.  This means that during the translation we have to check for each state that it can fit into an integer.
+
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: "main"
+%%% End:
diff --git a/data/accumulator.htl b/data/accumulator.htl
index 16770dc..f7bdb8e 100644
--- a/data/accumulator.htl
+++ b/data/accumulator.htl
@@ -1,41 +1,39 @@
 main() {
   datapath {
-   16:	reg_9 <= 32'd1;
-   15:	stack[32'd0] <= reg_9;
-   14:	reg_8 <= 32'd2;
-   13:	stack[32'd1] <= reg_8;
-   12:	reg_7 <= 32'd3;
-   11:	stack[32'd2] <= reg_7;
-   10:	reg_3 <= 32'd0;
-    9:	;
+   15:	reg_8 <= 32'd1;
+   14:	reg_12[32'd0] <= reg_8;
+   13:	reg_7 <= 32'd2;
+   12:	reg_12[32'd1] <= reg_7;
+   11:	reg_6 <= 32'd3;
+   10:	reg_12[32'd2] <= reg_6;
+    9:	reg_2 <= 32'd0;
     8:	reg_1 <= 32'd0;
-    7:	reg_6 <= 32'd0;
-    6:	reg_5 <= stack[{{{reg_6 + 32'd0}
-                        + {reg_1 * 32'd4}} / 32'd4}];
-    5:	reg_3 <= {reg_3 + {reg_5 + 32'd0}};
+    7:	reg_5 <= 32'd0;
+    6:	reg_4 <= reg_12[{{{reg_5 + 32'd0} + {reg_1 * 32'd4}} / 32'd4}];
+    5:	reg_2 <= {{reg_2 + reg_4} + 32'd0};
     4:	reg_1 <= {reg_1 + 32'd1};
     3:	;
-    2:	reg_4 <= reg_3;
-    1:	finish <= 1'd1; return_val <= reg_4;
+    2:	reg_3 <= reg_2;
+    1:	reg_10 = 32'd1;
+reg_11 = reg_3;
   }
 
   controllogic {
-   16:	state <= 32'd15;
-   15:	state <= 32'd14;
-   14:	state <= 32'd13;
-   13:	state <= 32'd12;
-   12:	state <= 32'd11;
-   11:	state <= 32'd10;
-   10:	state <= 32'd9;
-    9:	state <= 32'd8;
-    8:	state <= 32'd7;
-    7:	state <= 32'd6;
-    6:	state <= 32'd5;
-    5:	state <= 32'd4;
-    4:	state <= 32'd3;
-    3:	state <= ({$signed(reg_1) < $signed(32'd3)}
-                        ? 32'd7 : 32'd2);
-    2:	state <= 32'd1;
+   15:	reg_9 <= 32'd14;
+   14:	reg_9 <= 32'd13;
+   13:	reg_9 <= 32'd12;
+   12:	reg_9 <= 32'd11;
+   11:	reg_9 <= 32'd10;
+   10:	reg_9 <= 32'd9;
+    9:	reg_9 <= 32'd8;
+    8:	reg_9 <= 32'd7;
+    7:	reg_9 <= 32'd6;
+    6:	reg_9 <= 32'd5;
+    5:	reg_9 <= 32'd4;
+    4:	reg_9 <= 32'd3;
+    3:	reg_9 <= ({$signed(reg_1) < $signed(32'd3)} ? 32'd7 : 32'd2);
+    2:	reg_9 <= 32'd1;
     1:	;
   }
 }
+
diff --git a/data/accumulator.v b/data/accumulator.v
index 146176d..799756f 100644
--- a/data/accumulator.v
+++ b/data/accumulator.v
@@ -1,53 +1,145 @@
-module main(reset, clk, finish, return_val);
-  reg [31:0] stack [2:0];
-  input [0:0] clk, reset;
-  output reg [31:0] return_val;
-  output reg [0:0] finish;
-  reg [31:0] reg_8, reg_4, state, reg_6,
-             reg_1, reg_9, reg_5, reg_3, reg_7;
-  always @(posedge clk)
-    case (state)
-      32'd16: reg_9 <= 32'd1;
-      32'd15: stack[32'd0] <= reg_9;
-      32'd14: reg_8 <= 32'd2;
-      32'd13: stack[32'd1] <= reg_8;
-      32'd12: reg_7 <= 32'd3;
-      32'd11: stack[32'd2] <= reg_7;
-      32'd10: reg_3 <= 32'd0;
-      32'd9: ;
-      32'd8: reg_1 <= 32'd0;
-      32'd7: reg_6 <= 32'd0;
-      32'd6: reg_5 <= stack[{{{reg_6 + 32'd0}
-                           + {reg_1 * 32'd4}} / 32'd4}];
-      32'd5: reg_3 <= {reg_3 + {reg_5 + 32'd0}};
-      32'd4: reg_1 <= {reg_1 + 32'd1};
-      32'd3: ;
-      32'd2: reg_4 <= reg_3;
-      32'd1: begin finish = 1'd1; return_val = reg_4; end
-      default:;
-    endcase
-  always @(posedge clk)
-    if ({reset == 1'd1})
-      state <= 32'd16;
-    else
-      case (state)
-        32'd16: state <= 32'd15;
-        32'd15: state <= 32'd14;
-        32'd14: state <= 32'd13;
-        32'd13: state <= 32'd12;
-        32'd12: state <= 32'd11;
-        32'd11: state <= 32'd10;
-        32'd10: state <= 32'd9;
-        32'd9: state <= 32'd8;
-        32'd8: state <= 32'd7;
-        32'd7: state <= 32'd6;
-        32'd6: state <= 32'd5;
-        32'd5: state <= 32'd4;
-        32'd4: state <= 32'd3;
-        32'd3: state <= ({$signed(reg_1) < $signed(32'd3)}
-                               ? 32'd7 : 32'd2);
-        32'd2: state <= 32'd1;
-        32'd1: ;
+module main(reg_13, reg_14, reg_15, reg_10, reg_11);
+  always @(posedge reg_15)
+    if ({reg_14 == 32'd1}) begin
+      reg_9 <= 32'd15;
+    end else begin
+      case (reg_9)
+        32'd8: begin
+          reg_9 <= 32'd7;
+        end
+        32'd4: begin
+          reg_9 <= 32'd3;
+        end
+        32'd12: begin
+          reg_9 <= 32'd11;
+        end
+        32'd2: begin
+          reg_9 <= 32'd1;
+        end
+        32'd10: begin
+          reg_9 <= 32'd9;
+        end
+        32'd6: begin
+          reg_9 <= 32'd5;
+        end
+        32'd14: begin
+          reg_9 <= 32'd13;
+        end
+        32'd1: begin
+          ;
+        end
+        32'd9: begin
+          reg_9 <= 32'd8;
+        end
+        32'd5: begin
+          reg_9 <= 32'd4;
+        end
+        32'd13: begin
+          reg_9 <= 32'd12;
+        end
+        32'd3: begin
+          reg_9 <= ({$signed(reg_1) < $signed(32'd3)} ? 32'd7 : 32'd2);
+        end
+        32'd11: begin
+          reg_9 <= 32'd10;
+        end
+        32'd7: begin
+          reg_9 <= 32'd6;
+        end
+        32'd15: begin
+          reg_9 <= 32'd14;
+        end
         default:;
       endcase
+    end
+  always @(posedge reg_15)
+    case (reg_9)
+      32'd8: begin
+        reg_1 <= 32'd0;
+      end
+      32'd4: begin
+        reg_1 <= {reg_1 + 32'd1};
+      end
+      32'd12: begin
+        reg_12[32'd1] <= reg_7;
+      end
+      32'd2: begin
+        reg_3 <= reg_2;
+      end
+      32'd10: begin
+        reg_12[32'd2] <= reg_6;
+      end
+      32'd6: begin
+        reg_4 <= reg_12[{{{reg_5 + 32'd0} + {reg_1 * 32'd4}} / 32'd4}];
+      end
+      32'd14: begin
+        reg_12[32'd0] <= reg_8;
+      end
+      32'd1: begin
+        reg_10 = 32'd1;
+        reg_11 = reg_3;
+      end
+      32'd9: begin
+        reg_2 <= 32'd0;
+      end
+      32'd5: begin
+        reg_2 <= {{reg_2 + reg_4} + 32'd0};
+      end
+      32'd13: begin
+        reg_7 <= 32'd2;
+      end
+      32'd3: begin
+        ;
+      end
+      32'd11: begin
+        reg_6 <= 32'd3;
+      end
+      32'd7: begin
+        reg_5 <= 32'd0;
+      end
+      32'd15: begin
+        reg_8 <= 32'd1;
+      end
+      default:;
+    endcase
+  reg [31:0] reg_12 [2:0];
+  reg [31:0] reg_8;
+  reg [31:0] reg_4;
+  reg [31:0] reg_2;
+  output reg [0:0] reg_10;
+  reg [31:0] reg_6;
+  input [0:0] reg_14;
+  reg [31:0] reg_1;
+  reg [31:0] reg_9;
+  reg [31:0] reg_5;
+  input [0:0] reg_13;
+  reg [31:0] reg_3;
+  output reg [31:0] reg_11;
+  reg [31:0] reg_7;
+  input [0:0] reg_15;
+endmodule
+
+module testbench;
+   reg start, reset, clk;
+   wire finish;
+   wire [31:0] return_val;
+
+   main m(start, reset, clk, finish, return_val);
+
+   initial begin
+      clk = 0;
+      start = 0;
+      reset = 0;
+      @(posedge clk) reset = 1;
+      @(posedge clk) reset = 0;
+   end
+
+   always #5 clk = ~clk;
+
+   always @(posedge clk) begin
+      if (finish == 1) begin
+         $display("finished: %d", return_val);
+         $finish;
+      end
+   end
 endmodule
diff --git a/data/accumulator_fsmd2.pdf b/data/accumulator_fsmd2.pdf
new file mode 100644
index 0000000..3731936
--- /dev/null
+++ b/data/accumulator_fsmd2.pdf
diff --git a/main.tex b/main.tex
index cdadd2c..9fa2217 100644
--- a/main.tex
+++ b/main.tex
@@ -69,7 +69,7 @@
 \begin{document}
 
 %% Title information
-\title[Formally Verified HLS]{Formally Verified High-Level Synthesis}
+\title[Formal Verification of HLS]{Formal Verification of High-Level Synthesis}
                                         %% when present, will be used in
                                         %% header instead of Full Title.
 %\titlenote{with title note}             %% \titlenote is optional;
diff --git a/verilog.tex b/verilog.tex
index abc990a..c45aa26 100644
--- a/verilog.tex
+++ b/verilog.tex
@@ -27,11 +27,17 @@ The main addition to the Verilog syntax is the explicit declaration of inputs an
 
 An existing operational semantics for Verilog~\cite{loow19_verif_compil_verif_proces} was adapted for the semantics of the language that Vericert eventually targets.  This semantics is a small-step operational semantics at the clock cycle level, as hardware typically does not terminate in any way, however, within each clock cycle the semantics are constructed in a big-step style semantics.  This style of semantics matches the small-step operational semantics of CompCert's register transfer language (RTL) quite well.
 
-At the top-level, always blocks describe logic which is run every time some event occurs.  The only event that is supported by these semantics is detecting the rising rising edge of the clock, so that we can implement synchronous logic.~\NR{So the semantics does not support combinational always blocks or assignments?} As soon as an event occurs, the hardware will be executed, meaning if there are multiple always blocks that get triggered by the event, these will run in parallel.~\NR{Does the semantics also enforce that variables that only be modified within one always block (multiple drivers not allowed)? PL community might think of data races. } However, as the semantics should be deterministic, we impose an order on the always blocks and execute them sequentially.~\NR{Oh, do we sequentialise always execution?} However, to preserve the fact that the statements inside of the always block are executed in parallel, nonblocking assignments to variables need to be kept in a different association map compared to blocking assignments to variables.~\NR{Have you discussed what an association map is?}  This preserves the behaviour that blocking assignments change the value of the variable inside of the clock cycle, whereas the nonblocking assignments only take place at the end of the clock cycle, and in parallel.  We can denote these two association maps as $s = (\Gamma_{\rm r}, \Gamma_{\rm a}, \Delta_{\rm r}, \Delta_{\rm a})$, where $\Gamma_{\rm r}$ is the current value of the registers, $\Gamma_{\rm a}$ is the current value of the array, and $\Delta_{\rm r}$ and $\Delta_{\rm a}$ are the values of the variables and arrays when the clock cycle ends.~\NR{Do blocking and non-blocking assignments each have individual maps? Also, association map is used before definition in this paragraph.}
-
-We can then define how one step in the semantics looks like.  We therefore first need to define the structure of the main module which will contain the logic for the program.  In general, functions that are translated to hardware will require basic handshaking signals so that the translated function can be used in hardware. Firstly, they require an input for the clock, so that all the sequential circuits are run at the right time.  They then require a start and reset input, so that the hardware generated from the function can be reused multiple times.  Finally, they need a finish and return signal, where finish will go high when the result is ready to be read.  In addition to that, the function could take an arbitrary number of inputs which act as arguments to the function, so that the function can be called with different arguments.  In addition to inputs and outputs to the module, we also need to keep track of some internal signals and properties about the module.  Firstly, we need to keep track of the internal variables that contain the current state of the module, and the current contents of the stack.~\NR{Are you referring to the C stack?}  Finally, the module contains the entry point of the module, the list of module items that declare all of the internal registers and the encoding of the state machine that behaves in the same way as the function.  We can therefore declare it in the following way:
-~\NR{We should distinguish handshaking signals of a Verilog module to arguments of a sw function. In the sw world, an argument is either value or pointer. Values are unmodifiable, and pointers can be only be modified by the function body. In contrast, handshaking signals are constantly active, interacting with external modules. These signals can also be trigger events, such as the clock.}
-~\NR{Also, I don't think a Verilog module is equivalent to a Verilog function. I think they both exist in Verilog. Please double-check this before we use module and function interchangably in the text above.}
+%~\NR{So the semantics does not support combinational always blocks or assignments?}\YH{Currently not, as they are not needed.}
+%~\NR{Does the semantics also enforce that variables can only be modified within one always block (multiple drivers not allowed)? PL community might think of data races. }\YH{Will add that.}
+%~\NR{Oh, do we sequentialise always execution?}\YH{Yes}
+%~\NR{Have you discussed what an association map is?}\YH{Will add a section about that.}
+%~\NR{Do blocking and non-blocking assignments each have individual maps? Also, association map is used before definition in this paragraph.}
+At the top-level, always blocks describe logic which is run every time some event occurs.  The only event that is supported by these semantics is detecting the rising rising edge of the clock, so that we can implement synchronous logic.  As soon as an event occurs, the hardware will be executed, meaning if there are multiple always blocks that get triggered by the event, these will run in parallel.  Assignments to a variable in multiple always blocks is not allowed, so that there cannot be any data races during assignments.  However, as the semantics should be deterministic, we impose an order on the always blocks and execute them sequentially.  However, to preserve the fact that the statements inside of the always block are executed in parallel, nonblocking assignments to variables need to be kept in a different association map compared to blocking assignments to variables, where association maps are mappings from a variable name to its current value.  This preserves the behaviour that blocking assignments change the value of the variable inside of the clock cycle, whereas the nonblocking assignments only take place at the end of the clock cycle, and in parallel.  We can denote these two association maps as $s = (\Gamma_{\rm r}, \Gamma_{\rm a}, \Delta_{\rm r}, \Delta_{\rm a})$, where $\Gamma_{\rm r}$ is the current value of the registers, $\Gamma_{\rm a}$ is the current value of the array, and $\Delta_{\rm r}$ and $\Delta_{\rm a}$ are the values of the variables and arrays when the clock cycle ends.  Blocking assignments will therefore modify $\Gamma_{\rm r}$ and $\Gamma_{\rm a}$, whereas nonblocking assignments modify $\Delta_{\rm r}$ and $\Delta_{\rm a}$
+
+%~\NR{Are you referring to the C stack?}\YH{Yes}
+We can then define how one step in the semantics looks like.  We therefore first need to define the structure of the main module which will contain the logic for the program.  In general, functions that are translated to hardware will require basic handshaking signals so that the translated function can be used in hardware.  Firstly, they require an input for the clock, so that all the sequential circuits are run at the right time.  They then require a start and reset input, so that the hardware generated from the function can be reused multiple times.  Finally, they need a finish and return signal, where finish will go high when the result is ready to be read.  In addition to that, the module could take an arbitrary number of inputs which act as arguments to the original function, so that the module may be instantiated with any number of arguments.  In addition to inputs and outputs to the module, we also need to keep track of some internal signals and properties about the module.  Firstly, we need to keep track of the internal variables that contain the current state of the module, and the current contents of the stack.  Next, the module contains the entry point of the module, the list of module items that declare all of the internal registers and the encoding of the state machine that behaves in the same way as the function.  We can therefore declare it in the following way:
+~\NR{We should distinguish handshaking signals of a Verilog module to arguments of a sw function. In the sw world, an argument is either value or pointer. Values are unmodifiable, and pointers can be only be modified by the function body. In contrast, handshaking signals are constantly active, interacting with external modules. These signals can also be trigger events, such as the clock.}\YH{Yes that's true, although currently we don't really support modules and therefore assume that the signals do not change.}
+~\NR{Also, I don't think a Verilog module is equivalent to a Verilog function. I think they both exist in Verilog. Please double-check this before we use module and function interchangably in the text above.}\YH{Clarified that, functions are indeed different to modules in Verilog, did not want to use them interchangably.}
 %\JW{I'd be inclined to write `$r~\mathrm{list}$' rather than $\vec{r}$, as it's a little more readable. (Assuming it's more-or-less the same thing?)}
 
 \begin{align*}
@@ -73,7 +79,7 @@ The two main evaluation functions are then \textit{erun}, which takes in the cur
 Taking the \textsc{CondTrue} rule as an example, this rule will only apply if the Boolean result of running the expression results in a \texttt{true} value.  It then also states that the statement in the true branch of the conditional statement \textit{stt}~\NR{\textit{st} or \textit{stt}?} runs from state $\sigma_{0}$ to state $\sigma_{1}$.  If both of these conditions hold~\NR{How are these conditions checked? Can \textit{erun} or \textit{srun} evaluate to false?}, we then get that the conditional statement will also run from state $\sigma_{0}$ to state $\sigma_{1}$.  The \textsc{Blocking} and \textsc{Nonblocking} rules are a bit more interesting, as these modify the blocking and nonblocking association maps respectively.
 
 One main difference between these semantics and the Verilog semantics by \citet{loow19_verif_compil_verif_proces} is that there is no function for external nondeterministic effects, such as memories and inputs and outputs.  These are instead handled explicitly in the semantics by using two dimensional unpacked arrays to model memories and assuming that inputs to modules cannot change.  Another difference with these semantics is that partial updates to arrays are fully supported, due to the fact that there are two different queues for arrays and variables.  Originally, if there was a blocking assignment to an array, and then a nonblocking assignment to a different region in the array, then the blocking assignment would disappear at the end of the clock cycle.  This is because the complete array would be overwritten with the updated array in the nonblocking association maps.  However, in our semantics, only the values that were changed in the array are actually recorded in the nonblocking assignment queue, meaning once the blocking and nonblocking array association maps are merged, only the actual indices that changed with nonblocking assignment are updated in the blocking assignment map.
-\NR{I did not understand this paragraph. Discuss with Yann.}
+%\NR{I did not understand this paragraph. Discuss with Yann.}
 
 In these semantics, module instantiations are not supported, as they can be modelled by inlining the logic that modules would have produced.  This therefore means that function calls are implemented by inlining all the functions as well.  Consequently, recursive function calls are not supported, however, these are not supported by other high-level synthesis tools either, as allocating memory dynamically is not possible with fixed size RAM blocks.
 \NR{Were these last two paragraphs referring to your assumptions or limitations or simplications you made? Discuss with Yann.}
@@ -102,7 +108,7 @@ We then define the semantics of running the module for one clock cycle in the fo
 
 The \textsc{Module} rule is the main rule for the execution of one clock cycle of the module.  Given that the value of the $s_{t}$ register is the value of the program counter at the current instruction and that the value of the $s_{t}$ register in the resulting association map is equal to the next program counter value, we can then say that if all the module items in the body go from one state to another, that the whole module will step from that state to the other.
 
-\NR{Bigger picture question: Why don't always blocks show up in the rules?}
+%\NR{Bigger picture question: Why don't always blocks show up in the rules?}\YH{Mainly because these are at a different level of abstraction, here only statements are modelled.}
 
 %%\input{verilog_notes}
author	Yann Herklotz <git@yannherklotz.com>	2020-11-02 00:27:18 +0000
committer	Yann Herklotz <git@yannherklotz.com>	2020-11-02 00:27:18 +0000
commit	36dc41477a21e15b81b09985c9505ea1b5f10817 (patch)
tree	fccc68fa4f98dcf5d6b27acdf0332b3827563e5a
parent	a5561bd00606bdbbfdc014e720021d3a900b92d4 (diff)
download	oopsla21_fvhls-36dc41477a21e15b81b09985c9505ea1b5f10817.tar.gz oopsla21_fvhls-36dc41477a21e15b81b09985c9505ea1b5f10817.zip