Add more content

author: Yann Herklotz <git@yannherklotz.com> 2020-11-15 18:17:36 +0000
committer: Yann Herklotz <git@yannherklotz.com> 2020-11-15 18:26:05 +0000
commit: c258a14a48bf5cc7ffe22f26edf03f55a92573d2 (patch)
tree: 42d99a077e6b516e1a4e91891e6b1aaa747f1cc2
parent: d9e7c4e5483f64919f9636288e24e52e855bf05a (diff)
download: oopsla21_fvhls-c258a14a48bf5cc7ffe22f26edf03f55a92573d2.tar.gz
oopsla21_fvhls-c258a14a48bf5cc7ffe22f26edf03f55a92573d2.zip
3 files changed, 67 insertions, 42 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 96743d0..712ad3c 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -40,27 +40,9 @@ The main work flow of \vericert{} is shown in Figure~\ref{fig:rtlbranch}, which
 
 \compcert{} is made up of 11 intermediate languages in between the Clight input and the assembly output.  These intermediate languages each serve a different purpose and contain various optimisations.  When designing a new back end for \compcert{}, it is crucial to know where to branch off, so as to benefit from all the useful optimisations that \compcert{} performs, but not performing optimisations that are not useful, which include optimisations that are specific to the target CPU architecture or.  These optimisations include register allocation, as there is not a fixed number of registers that need to be targeted.
 
-To choose where to branch off at, each intermediate language in \compcert{} can be evaluated to see if
+To choose where to branch off at, each intermediate language in \compcert{} can be evaluated to see if it is suitable to be transformed into hardware.  Existing HLS compilers often use LLVM IR as an intermediate representation when performing HLS-specific optimisations, as each instruction can be mapped quite well to hardware that performs the same behaviour.  Looking at the intermediate languages in \compcert{} shown in Figure~\ref{fig:rtlbranch}, there are many languages to choose from.  Clight and CminorSel are an abstract syntax tree (AST) representation of the C code, which does not correspond to a more assembly like language similar to LLVM IR.\@  In addition to that, looking at the languages from LTL to PPC, even though these languages do contain basic blocks, which are desirable when doing HLS, starting from LTL the number of registers is limited
 
-Existing HLS compilers often use LLVM IR as an intermediate representation when performing HLS-specific optimisations, as each instruction can be mapped quite well to hardware that performs the same behaviour.  \compcert{}'s three-address code (3AC)\footnote{Three-address code (3AC) is also known as register transfer language (RTL) in the \compcert{} literature, however, 3AC is used in this paper instead so as not to confuse it with register-transfer level (RTL), which is another name for the final hardware target of the HLS tool.} is the intermediate language that resembles LLVM IR the most, as it also has an infinite number of pseudo-registers and each instruction maps well to hardware.  However, one difference between the two is that 3AC uses operations of the target architecture and performs architecture specific optimisations as well, which is not the case in LLVM IR where all the instructions are quite abstract.  This can be mitigated by making \compcert{} target a specific architecture such as x86\_32, where most operations translate quite well into hardware.  In addition to that, many optimisations that are also useful for HLS are performed in 3AC, which means that if it is supported as the input language, the HLS algorithm benefits from the same optimisations.  It is therefore a good candidate to be chosen as the input language to the HLS back end. The complete flow that \vericert{} takes is show in figure~\ref{fig:rtlbranch}.
-
-\subsection{Some Key Challenges}
-
-This subsection lists some key challenges that were encountered when implementing the translation from 3AC to HTL and subsequently Verilog.
-
-\subsubsection{Discrepency between C and Verilog w.r.t. signedness}
-
-\subsubsection{Byte- and word-addressable memories}
-
-\subsubsection{Reset signals}
-
-\subsubsection{Implementation of the \texttt{Oshrximm} instruction}
-
-\JW{I wonder if Section 2 could benefit from a `Some Key Challenges' subsection, where you highlight several interesting bits of the translation process, each with their own paragraph heading. These could be something like:\begin{enumerate}\item Discrepancy between C and Verilog w.r.t. signedness \item Deciding between byte- and word-addressable memories \item Adding reset signals \item Implementing the Oshrximm instruction correctly \end{enumerate} For the causal reader, this would immediately signal two things: (1) you can skip this subsection on your initial pass, and (2) proving the HLS tool correct was a non-trivial undertaking.}
-
-% - Explain main differences between translating C to software and to hardware.
-
-%   + This can be done by going through the simple example.
+\compcert{}'s three-address code (3AC)\footnote{Three-address code (3AC) is also known as register transfer language (RTL) in the \compcert{} literature, however, 3AC is used in this paper instead so as not to confuse it with register-transfer level (RTL), which is another name for the final hardware target of the HLS tool.} is the intermediate language that resembles LLVM IR the most, as it also has an infinite number of pseudo-registers and each instruction maps well to hardware.  3AC is represented as a control flow graph (CFG) in CompCert.  Each instruction is a node in the graph and links to the instructions that follow it.  This CFG then describes how the computation should proceed, and is a good representation for performing optimisations on as well as local transformations.  However, one difference between LLVM IR and 3AC is that 3AC uses operations of the target architecture and performs architecture specific optimisations as well, which is not the case in LLVM IR where all the instructions are quite abstract.  This can be mitigated by making \compcert{} target a specific architecture such as x86\_32, where most operations translate quite well into hardware.  In addition to that, many optimisations that are also useful for HLS are performed in 3AC, which means that if it is supported as the input language, the HLS algorithm benefits from the same optimisations.  It is therefore a good candidate to be chosen as the input language to the HLS back end. The complete flow that \vericert{} takes is show in figure~\ref{fig:rtlbranch}.
 
 \begin{figure}
   \centering
@@ -109,11 +91,7 @@ main() {
 
 Taking the simple accumulator program shown in Figure~\ref{fig:accumulator_c}, we can describe the main translation that is performed in Vericert to go from the behavioural description into a hardware design.
 
-The first step of the translation is to use \compcert{} to transform the C code into the 3AC shown in Figure~\ref{fig:accumulator_rtl}, including many optimisations that are performed in the 3AC language.  This includes optimisations such as constant propagation and dead-code elimination.  Function inlining is also performed, and it is used as a way to support function calls instead of having to support the \texttt{Icall} 3AC instruction.  The duplication of the function bodies caused by inlining does affect the total area of the hardware, however it improves performance of the hardware.\YH{TODO: Not sure how to justify inlining, apart from that it is an easy way to support function calls} \JW{Perhaps you could say that this means you don't support recursion, but that this is in line with most existing HLS tools.}
-
-% - Explain why we chose 3AC (3AC) as the branching off point.
-
-3AC is represented as a control flow graph (CFG) in CompCert.  Each instruction is a node in the graph and links to the instructions that follow it.  This CFG then describes how the computation should proceed, and is a good representation for performing optimisations on as well as local transformations. \JW{This belongs in the previous subsection where you were justifying 3AC as the branching-off point.}
+The first step of the translation is to use \compcert{} to transform the C code into the 3AC shown in Figure~\ref{fig:accumulator_rtl}, including many optimisations that are performed in the 3AC language.  This includes optimisations such as constant propagation and dead-code elimination.  Function inlining is also performed, and it is used as a way to support function calls instead of having to support the \texttt{Icall} 3AC instruction.  The duplication of the function bodies caused by inlining does affect the total area of the hardware, however it improves the latency of the hardware.  In addition to that, inlining removes the possibility of supporting recursive function calls, however, this is a feature that isn't supported in most other HLS tools either.
 
 \subsubsection{Transformation to HTL}
 
@@ -131,27 +109,13 @@ The first translation performed in Vericert is from 3AC to a hardware translatio
   \caption{Accumulator hardware diagram.}\label{fig:accumulator_diagram}
 \end{figure*}
 
-The translation from RTL \JW{3AC, here and elsewhere} to HTL is quite straightforward, as each RTL instruction either matches up quite well to a hardware construct, or does not have to be handled by the translation, such as function calls.  At each instruction, the control flow is separated from the data computation and is then added to the control logic and data-flow map respectively.  For example, in state 16 in the RTL code in figure~\ref{fig:accumulator_rtl}, the register \texttt{x9} is initialised to one and then moves on \JW{this reads as `the register moves on' which is not what you intend} to state 15.  This is encoded in HTL by initialising a 32 bit \JW{32-bit} register \texttt{reg\_9} to one as well in the data-flow graph section, and also adding a state transition to the state 15 in the control logic transition.  Simple operator instructions are translated in a similar way.  For example, in state 5, the value in the array is added to the current value of the accumulated sum, which is simply translated to an addition of the equivalent registers in the HTL code.
-
-\JW{I think some of the following material belongs in a later section. In the current section, I think we should aim at the casual reader who is interested in understanding things like the overall structure of the tool, which IR language you branch off from, and what the key ideas/difficulties involved are. I think intimidating inference rules with loads of new symbols should be saved for a later section -- the kind of section that really is only going to be read by people who are actually interested in reproducing or building upon your work. So things like `One thing to note is that the comparison is signed ... by default all operators in Verilog are unsigned ... so we force the operators to be signed' very much belong in the current section; you just don't need a scary inference rule first.} The translation can be described by a relational specification, which can be described by the following relation, where \textit{fin}, \textit{rtrn}, $\sigma$ and \textit{stk} are the registers for the finished signal, return value, current state and stack respectively, $i$ is the RTL instruction being translated, and \textit{data} and \textit{control} are the data-flow and control logic map respectively.
-
-\begin{equation*}
-  \yhfunction{tr\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ i\ \textit{data }\ \textit{control}
-\end{equation*}
-
-\noindent An example of a rule describing the translation of an operation is the following:
-
-\begin{equation*}
-  \inferrule[Iop]{\yhfunction{tr\_op } \textit{op }\ \vec{a} = \yhconstant{OK } e}{\yhfunction{tr\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ (\yhconstant{Iop } \textit{op }\ \vec{a}\ d\ n)\ (d \Leftarrow e)\ (\sigma \Leftarrow n)}
-\end{equation*}
+The translation from 3AC to HTL is quite straightforward, as each 3AC instruction either matches up quite well to a hardware construct, or does not have to be handled by the translation, such as function calls.  At each instruction, the control flow is separated from the data computation and is then added to the control logic and data-flow map respectively.  For example, in state 16 in the 3AC code in figure~\ref{fig:accumulator_rtl}, the register \texttt{x9} is initialised to one, after which the control flow moves to state 15.  This is encoded in HTL by initialising a 32-bit register \texttt{reg\_9} to one as well in the data-flow graph section, and also adding a state transition to the state 15 in the control logic transition.  Simple operator instructions are translated in a similar way.  For example, in state 5, the value in the array is added to the current value of the accumulated sum, which is simply translated to an addition of the equivalent registers in the HTL code.
 
 In the accumulator C code, the for loop is translated to a branch statement in state 3, which can also be translated to equivalent HTL code by placing the comparison into the control logic part of the main function, and not performing any data operations in the data-flow section.  On the next clock cycle, the state will therefore transition to state 7 or state 2 depending on if \texttt{reg\_3} is less than 3 or not.  One thing to note is that in this case, the comparison is signed.  By default, all operators and registers in Verilog and HTL are unsigned, so to force an operation to handle the bits as signed, both operators have to be forced to signed.  In addition to that, Verilog resizes expressions to the largest needed size by default, which can affect the result of the computation.  This feature is also not supported by the Verilog semantics we used, and there would therefore be a mismatch between the Verilog semantics and the actual behaviour of Verilog according to the standard.  To bypass this issue braces are used to stop the Verilog simulator or synthesis tool from resizing anything inside the braces.  Instead, explicit resizing is used in the semantics and operations can only be performed on two registers that have the same size.
 
-Finally, this example also contains an array, which is stored on the stack for the main function.  Translating arrays means that load and store instructions need to be supported.  In hardware RAM is not as available as in software, and needs to be explicitly implemented by declaring a two dimensional array with specific properties, such as only being able to write and read from it once per clock cycle.  Therefore, to make memory as efficient as possible and support reads and writes in one clock cycle, a memory only addressable by word needs to be implemented, instead of the standard RAM which is addressable by byte.  This is because the clock cycles in devices such as FPGAs are normally an order of magnitude slower than CPUs, meaning taking four cycles to load an integer is not feasible.  However, this leads to various problems, the main one being that addresses need to be word-aligned, which is not the case in RTL or in C.  Therefore, addresses need to be divisible by four to also be valid addresses in HTL, which is shown in the load from the stack in state 6 of the RTL.  The equivalent HTL code is shown in the datapath branch where the same addition is done, however, the result is divided by 4.  To prove that these two loads have the same behaviour, we have to prove that the load was word-aligned and therefore could be divided by 4.\YH{Elaborate on our solution.}
-
 \subsection{Translation from HTL to Verilog}
 
-Finally, we have to translate the HTL code into proper Verilog and prove that it behaves the same as the RTL according to the Verilog semantics.  The Verilog output is modelled as a complete abstract syntax tree (AST) instead of being an abstract map over the instructions that are executed.  However, as all the instructions are already expressed in Verilog, only the maps need to be translated to valid Verilog, and correct declarations for all the variables in the program need to be added as well.
+Finally, we have to translate the HTL code into proper Verilog and prove that it behaves the same as the 3AC according to the Verilog semantics.  The Verilog output is modelled as a complete abstract syntax tree (AST) instead of being an abstract map over the instructions that are executed.  However, as all the instructions are already expressed in Verilog, only the maps need to be translated to valid Verilog, and correct declarations for all the variables in the program need to be added as well.
 
 This translation seems quite straightforward, however, proving that it is correct is not that simple, as all the implicit assumptions that were made in HTL need to be translated explicitly to Verilog and needs to have the same behaviour according to the semantics.  Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated.  In general, the structure is similar to the structure of the HTL code, however, the control and datapath maps have been translated to case statement which serve the same purpose.  The other main addition to the code is the initialisation of all the variables in the code to the correct bitwidth and the declaration of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.  The main subtle change that was added to the code is the reset signal which sets the state to the starting state correctly.  In HTL, this was described directly in the semantics, where the entrypoint is stored in the module interface of HTL.  However, in Verilog we also want to verify that the hardware will be placed into the right state when we power it up or reset the design, and it therefore has to be encoded directly in the Verilog code.
 
@@ -225,7 +189,7 @@ endmodule
 \end{minted}
     \caption{Verilog always block describing the control logic of the module.}\label{fig:accumulator_v_2}
   \end{subfigure}
-  \caption{Accumulator example using \vericert{} to translate the 3AC to a state machine expressed in Verilog. \JW{Subfigure captions aren't correct.}}\label{fig:accumulator_v}
+  \caption{Accumulator example using \vericert{} to translate the 3AC to a state machine expressed in Verilog.}\label{fig:accumulator_v}
 \end{figure}
 
 To check that the initial state of the Verilog is the same as the HTL code, we therefore have to run the module once, assuming the state is undefined and the reset is set to high.  We then have to compare the abstract starting state stored in the HTL module to the bitvector value we obtain from running the module for one clock cycle and prove that they are the same.  As the value for the state is undefined, the case statements will evaluate to the default state and therefore not perform any computations.
@@ -234,6 +198,26 @@ The translation from maps to case statements is done by turning each node of the
 
 Another problem with the representation of the state as an actual register is that we have to make sure that the state does not overflow.  Currently, the state register will always be 32 bits, meaning the maximum number of states can only be $2^{32} - 1$.  We therefore have to prove that the state value will never go over that value.  This means that during the translation we have to check for each state that it can fit into an integer.
 
+\subsection{Key Challenges}
+
+This subsection lists some key challenges that were encountered when implementing the translation from 3AC to HTL and subsequently Verilog.
+
+\subsubsection{Byte- and word-addressable memories}
+
+One big difference between C and Verilog is how memory can be represented.  In hardware, efficient RAMs are not as available as in software, and need to be explicitly implemented by declaring a two dimensional array with specific properties.  One main limitation is that RAMs often only allow one read and one write per clock cycle, for example if implementing single port RAM, which is the most common type of RAM.  To make loads and stores as efficient as possible, the RAM needs to be implemented as being word addressable, so that a load and store of an integer can be done in one clock cycle.
+
+However, the memory model that CompCert uses for it's intermediate languages~\cite{blazy05_formal_verif_memor_model_c} is byte-addressable.  It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware.  Any modifications of the bytes in the CompCert memory model have to also be shown to modify the word-addressable memory in the same way.  As only integer loads and stores are currently supported for the HLS backend, it follows that the addresses given to the loads and stores should be divisible by four.  If that is the case, then the translation from byte-addressed memory to word-addressed memory could be done by dividing the address by four and subtracting by the base address of the memory.
+
+\subsubsection{Reset signals}
+
+\subsubsection{Implementation of the \texttt{Oshrximm} instruction}
+
+%\JW{I wonder if Section 2 could benefit from a `Some Key Challenges' subsection, where you highlight several interesting bits of the translation process, each with their own paragraph heading. These could be something like:\begin{enumerate}\item Discrepancy between C and Verilog w.r.t. signedness \item Deciding between byte- and word-addressable memories \item Adding reset signals \item Implementing the Oshrximm instruction correctly \end{enumerate} For the causal reader, this would immediately signal two things: (1) you can skip this subsection on your initial pass, and (2) proving the HLS tool correct was a non-trivial undertaking.}
+
+% - Explain main differences between translating C to software and to hardware.
+
+%   + This can be done by going through the simple example.
+
 
 %%% Local Variables:
 %%% mode: latex
diff --git a/proof.tex b/proof.tex
index 483887b..355094c 100644
--- a/proof.tex
+++ b/proof.tex
@@ -87,6 +87,22 @@ Firstly, as the input representation is a map, all the keys of the map will be u
 
 However, the proof of uniqueness of the keys only works if the translation function is \emph{injective}, meaning that the function will result in distinct outputs for all possible inputs.  Otherwise, the proof of uniqueness of the keys for the input would not translate to a uniqueness of possible case expression matches in the output.  However, in our case the translation function is actually not injective, even though it might at first seem like it, because the state is stored as an integer, whereas the map is addressable by any positive number.  This means that if the positive number is greater than the maximum value that can be stored in the integer, the overflow would result in the wrong states being accessed.  To make the function injective, we therefore have to prove that the input positive number is never greater than $2^{32}-1$ so that the uniqueness property also holds for the output.
 
+\subsection{Model}
+
+%\JW{I think some of the following material belongs in a later section. In the current section, I think we should aim at the casual reader who is interested in understanding things like the overall structure of the tool, which IR language you branch off from, and what the key ideas/difficulties involved are. I think intimidating inference rules with loads of new symbols should be saved for a later section -- the kind of section that really is only going to be read by people who are actually interested in reproducing or building upon your work. So things like `One thing to note is that the comparison is signed ... by default all operators in Verilog are unsigned ... so we force the operators to be signed' very much belong in the current section; you just don't need a scary inference rule first.}
+
+The translation can be described by a relational specification, which can be described by the following relation, where \textit{fin}, \textit{rtrn}, $\sigma$ and \textit{stk} are the registers for the finished signal, return value, current state and stack respectively, $i$ is the 3AC instruction being translated, and \textit{data} and \textit{control} are the data-flow and control logic map respectively.
+
+\begin{equation*}
+  \yhfunction{tr\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ i\ \textit{data }\ \textit{control}
+\end{equation*}
+
+\noindent An example of a rule describing the translation of an operation is the following:
+
+\begin{equation*}
+  \inferrule[Iop]{\yhfunction{tr\_op } \textit{op }\ \vec{a} = \yhconstant{OK } e}{\yhfunction{tr\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ (\yhconstant{Iop } \textit{op }\ \vec{a}\ d\ n)\ (d \Leftarrow e)\ (\sigma \Leftarrow n)}
+\end{equation*}
+
 \subsection{Deterministic Semantics}
 
 Finally, to prove the backward simulation given the forward simulation, it has to be shown that if we generate hardware with a specific behaviour, that it is the only possible program with that behaviour.  This only has to be performed for the final intermediate language, which is Verilog, so that the backward simulation holds for the whole chain from Clight to Verilog.
diff --git a/references.bib b/references.bib
index 6e2babd..cd232b5 100644
--- a/references.bib
+++ b/references.bib
@@ -503,3 +503,28 @@
 	publisher = {Association for Computing Machinery},
 	series = {PLDI '11}
 }
+
+@inproceedings{blazy05_formal_verif_memor_model_c,
+	author = "Blazy, Sandrine and Leroy, Xavier",
+	title = "Formal Verification of a Memory Model for C-Like Imperative
+                  Languages",
+	tags = {verification},
+	booktitle = "Formal Methods and Software Engineering",
+	year = 2005,
+	pages = "280--299",
+	abstract = "This paper presents a formal verification with the Coq proof
+                  assistant of a memory model for C-like imperative
+                  languages. This model defines the memory layout and the
+                  operations that manage the memory. The model has been
+                  specified at two levels of abstraction and implemented as part
+                  of an ongoing certification in Coq of a moderately-optimising
+                  C compiler. Many properties of the memory have been verified
+                  in the specification. They facilitate the definition of
+                  precise formal semantics of C pointers. A certified OCaml code
+                  implementing the memory model has been automatically extracted
+                  from the specifications.",
+	address = "Berlin, Heidelberg",
+	editor = "Lau, Kung-Kiu and Banach, Richard",
+	isbn = "978-3-540-32250-4",
+	publisher = "Springer Berlin Heidelberg"
+}
author	Yann Herklotz <git@yannherklotz.com>	2020-11-15 18:17:36 +0000
committer	Yann Herklotz <git@yannherklotz.com>	2020-11-15 18:26:05 +0000
commit	c258a14a48bf5cc7ffe22f26edf03f55a92573d2 (patch)
tree	42d99a077e6b516e1a4e91891e6b1aaa747f1cc2
parent	d9e7c4e5483f64919f9636288e24e52e855bf05a (diff)
download	oopsla21_fvhls-c258a14a48bf5cc7ffe22f26edf03f55a92573d2.tar.gz oopsla21_fvhls-c258a14a48bf5cc7ffe22f26edf03f55a92573d2.zip