Add algorithm

author: Yann Herklotz <git@yannherklotz.com> 2020-11-19 20:52:02 +0000
committer: Yann Herklotz <git@yannherklotz.com> 2020-11-19 20:53:43 +0000
commit: f933f9a915bd30dcf4df849c9dd20e1f90ed152d (patch)
tree: fcac3c4bb08b089d93687150aef6676d72a73c64 /algorithm.tex
parent: 3a1e1da4d7bee90262ed664178e2ac3f077a028f (diff)
download: oopsla21_fvhls-f933f9a915bd30dcf4df849c9dd20e1f90ed152d.tar.gz
oopsla21_fvhls-f933f9a915bd30dcf4df849c9dd20e1f90ed152d.zip
1 files changed, 40 insertions, 17 deletions
diff --git a/algorithm.tex b/algorithm.tex
index ac885df..9d91ec6 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -65,7 +65,7 @@ The main work flow of \vericert{} is given in Figure~\ref{fig:rtlbranch}, which
 
 We select CompCert's three-address code (3AC)\footnote{This is known as register transfer language (RTL) in the \compcert{} literature. We use `3AC' is used in this paper instead to avoid confusion with register-transfer level (RTL), which is another name for the final hardware target of the HLS tool.} as the branching off point. If we branch off before this (at CminorSel or earlier), then CompCert has not had the opportunity to perform such optimisations as constant propagation and dead code elimination, which have been shown to be valuable not just in conventional compilation but also in HLS~\cite{cong+11}. And if we branch off after this (at LTL or later) then CompCert has already performed register allocation to reduce the number of registers and spill some variables to the stack; this transformation is not required in HLS because there are many more registers available, and these should be used instead of RAM whenever possible.
 
-3AC is also attractive because it is the closest intermediate language to LLVM IR, which is used by several existing HLS compilers. It has an unlimited number of pseudo-registers, and is represented as a control flow graph (CFG) where each instruction is a node with links to the instructions that can follow it. One difference between LLVM IR and 3AC is that 3AC includes operations that are specific to the chosen target architecture; we choose x86\_32 because each instruction maps well to hardware.
+3AC is also attractive because it is the closest intermediate language to LLVM IR, which is used by several existing HLS compilers. It has an unlimited number of pseudo-registers, and is represented as a control flow graph (CFG) where each instruction is a node with links to the instructions that can follow it. One difference between LLVM IR and 3AC is that 3AC includes operations that are specific to the chosen target architecture; we chose x86\_32 because each instruction maps well to hardware.
 
 \begin{figure}
   \centering
@@ -128,13 +128,22 @@ The first step of the translation is to use \compcert{} to transform the input C
 %\JW{I feel like this could use some sort of citation, but I'm not sure what. I guess this is all from "Hardware Design 101", right?}\YH{I think I found a good one actually, which goes over the basics.}
 %\JW{I think it would be worth having a sentence to explain how the C model of memory is translated to a hardware-centric model of memory. For instance, in C we have global variables/arrays, stack-allocated variables/arrays, and heap-allocated variables/arrays (anything else?). In Verilog we have registers and RAM blocks. So what's the correspondence between the two worlds? Globals and heap-allocated are not handled, stack-allocated variables become registers, and stack-allocated arrays become RAM blocks? Am I close?}\YH{Stack allocated variables become RAM as well, so that we can deal with addresses easily and take addresses of any variable.} \JW{I see, thanks. So, in short, the only registers in your hardware designs are those that store things like the current state, etc. You generate a fixed number of registers every time you synthesis -- you don't generate extra registers to store any of the program variables. Right?}
 
-The first translation performed in \vericert{} is from 3AC to a new hardware translation language (HTL). %, which is one step towards being completely translated to hardware described in Verilog. 
+The first translation performed in \vericert{} is from 3AC to a new hardware translation language (HTL). %, which is one step towards being completely translated to hardware described in Verilog.
 This involves going from a CFG representation of the computation to a finite state machine with datapath (FSMD) representation~\cite{hwang99_fsmd}. The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers. %This means that the state transitions can be translated into a simple finite state machine (FSM) where each state contains data operations that update the memory and registers. 
-An HTL program thus consists of two maps: a control map that describes how to calculate the next state from the current state, and a datapath map that describes how to update the registers and RAM given the current state. Figure~\ref{fig:accumulator_diagram} shows the resulting architecture of the FSMD. 
+An HTL program thus consists of two maps: a control map that describes how to calculate the next state from the current state, and a datapath map that describes how to update the registers and RAM given the current state. Figure~\ref{fig:accumulator_diagram} shows the resulting architecture of the FSMD.
 
 \begin{figure*}
   \centering
-  \includegraphics[scale=0.3,trim={10cm 8cm 5cm 5cm},clip=true]{data/accumulator_fsmd2.pdf}
+\definecolor{control}{HTML}{b3e2cd}
+\definecolor{data}{HTML}{fdcdac}
+\begin{tikzpicture}
+  \fill[control,fill opacity=1] (6.5,0) rectangle (12,5);
+  \fill[data,fill opacity=1] (0,0) rectangle (5.5,5);
+  \filldraw[white,rounded corners=10pt] (7,0.5) rectangle (11.5,2);
+  \node at (1,4.7) {\footnotesize Data Path};
+  \node at (7.5,4.7) {\footnotesize Control Logic};
+  \node at (8,1.8) {\tiny Next State FSM};
+\end{tikzpicture}
   \caption{The FSMD for our running example. \JW{Maybe replace `State' with `Current State'? And maybe `Calculate State' could be clearer as `Calculate Next State'?} \JW{Can state 15 (or should it be state 16??) have a dangling incoming arrow to indicate that it is the start state? And perhaps state 1 could have a double outline to indicate that it is an `accepting' state? Since there's space above the `Calculate State' box, I'd be mildly in favour of expanding that box a bit so that it included all 15 states explicitly (snaking back and forth).}\YH{If this is better I can mock up a tikz version of it maybe and fix the last bits then too.}}\label{fig:accumulator_diagram}
 \end{figure*}
 
@@ -165,28 +174,28 @@ module main(reset, clk, finish, return_val);
   output reg [31:0] return_val;
   output reg [0:0] finish;
   reg [31:0] reg_8, reg_4, state,
-             reg_6, reg_1, reg_9,
-             reg_5, reg_3, reg_7;
+             reg_6, reg_1, reg_7,
+             reg_5, reg_3;
   always @(posedge clk)
     case (state)
-      32'd15: reg_9 <= 32'd1;
-      32'd14: stack[32'd0] <= reg_9;
-      32'd13: reg_8 <= 32'd2;
-      32'd12: stack[32'd1] <= reg_8;
-      32'd11: reg_7 <= 32'd3;
+      32'd15: reg_8 <= 32'd1;
+      32'd14: stack[32'd0] <= reg_8;
+      32'd13: reg_7 <= 32'd2;
+      32'd12: stack[32'd1] <= reg_7;
+      32'd11: reg_6 <= 32'd3;
       32'd10: stack[32'd2] <= reg_7;
-      32'd9: reg_3 <= 32'd0;
+      32'd9: reg_2 <= 32'd0;
       32'd8: reg_1 <= 32'd0;
-      32'd7: reg_6 <= 32'd0;
-      32'd6: reg_5 <= stack[{{{reg_6 + 32'd0}
+      32'd7: reg_5 <= 32'd0;
+      32'd6: reg_4 <= stack[{{{reg_5 + 32'd0}
         + {reg_1 * 32'd4}} / 32'd4}];
-      32'd5: reg_3 <= {reg_3 + {reg_5 + 32'd0}};
+      32'd5: reg_2 <= {reg_2 + {reg_4 + 32'd0}};
       32'd4: reg_1 <= {reg_1 + 32'd1};
       32'd3: ;
-      32'd2: reg_4 <= reg_3;
+      32'd2: reg_3 <= reg_2;
       32'd1: begin
         finish = 1'd1;
-        return_val = reg_4;
+        return_val = reg_3;
       end
       default:;
     endcase
@@ -226,6 +235,20 @@ endmodule
   \caption{Accumulator example using \vericert{} to translate the 3AC to a state machine expressed in Verilog. \JW{If space permits, it would probably be preferable to have this code in a single column, as splitting a single module across two subfigures is a bit jarring.}\YH{I actually don't mind it for some reason, because it separates control flow and data path, but it's true it is weird.  The only problem is it's very long if I don't do that.}}\label{fig:accumulator_v}
 \end{figure}
 
+The translation from 3AC to HTL is straightforward, as each 3AC instruction either matches up quite well to a hardware construct, or does not have to be handled by the translation, such as function calls.
+%At each instruction, the control flow is separated from the data computation and is then added to the control logic and data-flow map respectively.
+%\JW{I suspect that you could safely chop that sentence.}
+For example, in state 15 in figure~\ref{fig:accumulator_rtl}, the register \texttt{x8} is initialised to 1, after which the control flow moves to state 14.  This is encoded in HTL by initialising a 32-bit register \texttt{reg\_8} to 1 in the data-flow section, and also adding a transition to the state 15 in the control logic section.  Simple operator instructions are translated in a similar way.  For example, in state 5, the value in the array is added to the current value of the accumulated sum, which is simply translated to an addition of the equivalent registers in the HTL code.
+
+\paragraph{Key challenge: signedness} Note that the comparison in state 3 is signed.  By default, all operators and registers in Verilog and HTL are unsigned, so to force an operation to handle the bits as signed, both operators have to be forced to signed.  In addition to that, Verilog resizes expressions to the largest needed size by default, which can affect the result of the computation.  This feature is also not supported by the Verilog semantics we used, and there would therefore be a mismatch between the Verilog semantics and the actual behaviour of Verilog according to the standard.  To bypass this issue braces are used to stop the Verilog simulator or synthesis tool from resizing anything inside the braces.  Instead, explicit resizing is used in the semantics and operations can only be performed on two registers that have the same size.
+
+\subsection{Translating HTL to Verilog}
+
+Finally, we have to translate the HTL code into proper Verilog and prove that it behaves the same as the 3AC according to the Verilog semantics.  Whereas HTL is a language that is specifically designed to represent the FSMDs we are interested in, Verilog is a general-purpose HDL.\@  So the challenge here is to translate our FSMD representation into a Verilog AST.  However, as all the instructions are already expressed in Verilog, only the maps need to be translated to valid Verilog, and correct declarations for all the variables in the program need to be added as well.
+
+This translation seems quite straightforward, however, proving that it is correct is not that simple, as all the implicit assumptions that were made in HTL need to be translated explicitly to Verilog statements and needs it needs to be shown that these explicit behaviours are equivalent to the assumptions made in the HTL semantics.
+Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated.  In general, the structure is similar to that of the HTL code, however, the control and datapath maps have been translated to case statements.  The other main addition to the code is the initialisation of all the variables in the code to the correct bitwidths and the declaration of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.
+
 \subsection{Optimisations}
 
 Although we would not claim that \vericert{} is a proper `optimising' HLS compiler yet, we have nonetheless implemented a few optimisations that aim to improve the quality of the hardware designs it produces.
author	Yann Herklotz <git@yannherklotz.com>	2020-11-19 20:52:02 +0000
committer	Yann Herklotz <git@yannherklotz.com>	2020-11-19 20:53:43 +0000
commit	f933f9a915bd30dcf4df849c9dd20e1f90ed152d (patch)
tree	fcac3c4bb08b089d93687150aef6676d72a73c64 /algorithm.tex
parent	3a1e1da4d7bee90262ed664178e2ac3f077a028f (diff)
download	oopsla21_fvhls-f933f9a915bd30dcf4df849c9dd20e1f90ed152d.tar.gz oopsla21_fvhls-f933f9a915bd30dcf4df849c9dd20e1f90ed152d.zip