Add more content to algorithm

author: Yann Herklotz <git@yannherklotz.com> 2020-07-01 17:01:27 +0100
committer: Yann Herklotz <git@yannherklotz.com> 2020-07-01 17:01:27 +0100
commit: beaa247399996201c572e6bf35d73df4a1b8aab7 (patch)
tree: 6bb2fef971aca73bae62d72cebb0ab7b5494f6c2
parent: 7710d20a16356636729babae6a069b9866475441 (diff)
download: oopsla21_fvhls-beaa247399996201c572e6bf35d73df4a1b8aab7.tar.gz
oopsla21_fvhls-beaa247399996201c572e6bf35d73df4a1b8aab7.zip
4 files changed, 86 insertions, 82 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 947183f..cae41ca 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -31,12 +31,10 @@ This section covers the main architecture of the HLS tool, and how the backend w
 
 CompCert is made up of 11 intermediate languages in between the Clight input and the assembly output.  These intermediate languages each serve a different purpose and contain various different optimisations.  When designing a new backend for CompCert, it is therefore crucial to know where to branch off and start the hardware generation.  Many of the optimisations that the CompCert backend performs are not necessary when generating custom hardware and not relying on a CPU anymore, such as register allocation or even scheduling.  It is therefore important to find the right intermediate language so that the HLS tool still benefits from many of the generic optimisations that CompCert performs, but does not receive the code transformations that are specific to CPU architectures.
 
-Existing HLS compilers usually use LLVM IR as an intermediate representation when performing HLS specific optimisations, as each instruction can be mapped quite well to hardware which performs the same behaviour.  CompCert's register transfer language (RTL) is the intermediate language that resembles LLVM IR the most, as it also has an infinite number of pseudo-registers and each instruction maps well to hardware. \JP{Perhaps this needs some further qualification? RTL uses the operations from the target architecture, and indeed performs architecture specific optimisations prior to RTL gen, so (for sake of example) switching from x86 RTL to RISC-V RTL could have a significant impact on performance.}\YH{Yes will definitely include those, just have to think about where.}  In addition to that, many optimisations that are also useful for HLS are performed in RTL, which means that if it is supported as the input language, the HLS algorithm benefits from the same optimisations.  It is therefore a good candidate to be chosen as the input language to the HLS backend. The complete flow that CoqUp takes is show in figure~\ref{fig:rtlbranch}.
+Existing HLS compilers usually use LLVM IR as an intermediate representation when performing HLS specific optimisations, as each instruction can be mapped quite well to hardware which performs the same behaviour.  CompCert's register transfer language (RTL) is the intermediate language that resembles LLVM IR the most, as it also has an infinite number of pseudo-registers and each instruction maps well to hardware. \JP{Perhaps this needs some further qualification? RTL uses the operations from the target architecture, and indeed performs architecture specific optimisations prior to RTL gen, so (for sake of example) switching from x86 RTL to RISC-V RTL could have a significant impact on performance.}\YH{Yes will definitely include those, just have to think about where.  I think this could be something that could be mentioned in the RTL section instead.}  In addition to that, many optimisations that are also useful for HLS are performed in RTL, which means that if it is supported as the input language, the HLS algorithm benefits from the same optimisations.  It is therefore a good candidate to be chosen as the input language to the HLS backend. The complete flow that CoqUp takes is show in figure~\ref{fig:rtlbranch}.
 
 %%TODO: Maybe add why LTL and the other smaller languages are not that well suited
 
-\subsection{Example}
-
 \begin{figure}
   \centering
   \begin{subfigure}[b]{0.4\linewidth}
@@ -45,7 +43,7 @@ Existing HLS compilers usually use LLVM IR as an intermediate representation whe
   \end{subfigure}%
   \begin{subfigure}[b]{0.6\linewidth}
     \inputminted[fontsize=\footnotesize]{c}{data/accumulator.rtl}
-    \caption{Accumulator C code.}\label{fig:accumulator_rtl}
+    \caption{Accumulator RTL code.}\label{fig:accumulator_rtl}
   \end{subfigure}
   \caption{Accumulator example using CompCert to translate from C to RTL.}\label{fig:accumulator_c_rtl}
 \end{figure}
@@ -54,22 +52,22 @@ Existing HLS compilers usually use LLVM IR as an intermediate representation whe
   \centering
   \begin{subfigure}[b]{0.5\linewidth}
     \inputminted[fontsize=\tiny]{systemverilog}{data/accumulator.htl}
-    \caption{Accumulator C code.}\label{fig:accumulator_htl}
+    \caption{Accumulator HTL code.}\label{fig:accumulator_htl}
   \end{subfigure}%
   \begin{subfigure}[b]{0.5\linewidth}
     \inputminted[fontsize=\tiny]{systemverilog}{data/accumulator.v}
-    \caption{Accumulator C code.}\label{fig:accumulator_v}
+    \caption{Accumulator Verilog code.}\label{fig:accumulator_v}
   \end{subfigure}
   \caption{Accumulator example using CompCert to translate from HTL to Verilog.\YH{I feel like these examples take up too much space, but don't really know of a different way to show off a complete example without the long code.}}\label{fig:accumulator_htl_v}
 \end{figure}
 
 To describe the translation, we start with an example of how to translate a simple accumulator example, which is shown in figure~\ref{fig:accumulator_c}.  Using this example, the different stages in the translation can be explained, together with how they were proven.
 
-The first step is to use the CompCert front end passes to convert the C code into RTL, which is the intermediate language that CompCert uses for most of its optimisations.  The translation is shown in figure~\ref{fig:accumulator_c_rtl}, where the RTL code is depicted in figure~\ref{fig:accumulator_rtl}.  The optimised RTL is then translated to a hardware transfer language (HTL), consisting of a finite state-machine representation of the RTL code.  This representation maps directly to hardware and it therefore transformed to Verilog afterwards.
+The first step is to use the CompCert front end passes to convert the C code into RTL, which is the intermediate language that CompCert uses for most of its optimisations.  The translation is shown in figure~\ref{fig:accumulator_c_rtl}, where the RTL code is depicted in figure~\ref{fig:accumulator_rtl}.  At the RTL stage, many of CompCert's normal optimisations passes, such as deadcode elimination or common subexpression elimination (CSE) can be applied to reduce the RTL as much as possible.  The optimised RTL is then translated to HTL, consisting of a finite state-machine representation of the RTL code.  This representation maps directly to hardware and it therefore transformed to Verilog afterwards.  The translation from RTL to HTL is the main step in converting the software representation of the code into a hardware representation of the code.  The following sections will go over the details of the transformation of each pass.
 
 \subsection{CompCert RTL}
 
-All CompCert intermediate language follow the similar structure below:
+RTL is the existing intermediate language in CompCert that was chosen to transform to hardware, as it is quite architecture independent compared to the lower level intermediate languages, and each instruction translates to a hardware construct.  All CompCert intermediate language follow the similar structure below:
 
 \begin{align*}
   \mathit{program} \quad ::= \{ &\mathbf{variables} : (\mathit{id} * \mathit{data}) \text{ list}, \\
@@ -80,28 +78,34 @@ All CompCert intermediate language follow the similar structure below:
 \noindent where function definitions can either be internal or external.  External functions are functions that are not defined in the current translation unit, and are therefore not part of the current translation.  The difference in between the CompCert intermediate languages is therefore how the internal function is defined, as that defines the structure of the language itself.
 
 %% Describe RTL
-The accumulator example in RTL function definitions are a sequence of instructions encoded in a control-flow graph, with each instruction linking to the next instruction that should be executed.
+RTL function definitions are a sequence of instructions encoded in a control-flow graph, with each instruction linking to the next instruction that should be executed.  RTL consists of ten separate instructions in total, which makes this a good candidate for high-level synthesis translation, as these instructions translate quite directly to a hardware equivalent.  This structure makes it ideal for performing optimisations on, as performing static analysis and transformations on this language is much simpler than doing the same optimisation for the C abstract syntax tree instead.  However, even though RTL is quite architecture independent and good for software optimisations, it does not model hardware well and is therefore not suitable for hardware optimisations.  We therefore need a new intermediate language that translates directly into hardware and which can also be used as an intermediate language for more hardware specific optimisations.
 
-%%TODO: Finish this section and describe the syntax and semantics of RTL.
+However, before this translation can be performed, all functions need to be inlined.  This means that recursive function calls cannot be supported, however, these are difficult to translate to hardware as they require dynamic allocation of the stack.  Dynamic allocation cannot be performed in hardware, because the memory size has to be set statically and then translated to a physical memory in the hardware.  Inlining is already an optimisation that is performed at the RTL stage in CompCert, however, to get RTL that is ready to be translated to hardware, this pass was modified so that is forced all functions to be inlined.  This eliminates all the function calls from the RTL, as long as there are no recursive function calls, which means that do not have to be handled by the translation to hardware.
 
 \subsection{HTL}
 
-RTL is first translated to an intermediate language called hardware transfer language (HTL), which is a finite state machine with datapath (FSMD) representation of the RTL code.  HTL, like all CompCert intermediate languages, has the same program structure as RTL, but internal functions now contain logic to control the order of execution, and a datapath that transforms the data in the registers.  This is represented by having two maps that link states to the control logic and to the current position in the datapath, which are both expressed using Verilog statements.  The syntax for HTL functions are the following:
+The first hardware specific intermediate language that is introduced is the hardware transfer language (HTL).  It is a self-contained finite state machine with datapath (FSMD) representation of the RTL code which maps directly to actual hardware.  HTL, like all CompCert intermediate languages, has the same program structure as RTL, but internal functions now contain logic to control the order of execution, and a datapath that transforms the data in the registers.  This is represented by having two maps that link states to the control logic and to the current position in the datapath, which are both expressed using Verilog statements.  The syntax for HTL functions are the following:
 
 \begin{align*}
   g \quad &::= \quad n \mapsto s\\
-  d_{r} \quad &::= \quad r \mapsto (io? * n)\\
-  d_{a} \quad &::= \quad r \mapsto (io? * n * n)\\
-  F \quad &::= \quad \big\{\ \texttt{params} : \vec{r}\\
+  d_{\rm r} \quad &::= \quad r \mapsto (io? * n)\\
+  d_{\rm a} \quad &::= \quad r \mapsto (io? * n * n)\\
+  M \quad &::= \quad \big\{\ \texttt{params} : r \text{ list}\\
                                      &\texttt{datapath} : g\\
                                      &\texttt{controllogic} : g\\
                                      &\texttt{entrypoint} : n\\
                                      &\texttt{st, stk, finish, return, start, reset, clk} : r\\
-                                     &\texttt{scldecls} : d_{r}\\
-                                     &\texttt{arrdecls} : d_{a}\ \big\}
+                                     &\texttt{scldecls} : d_{\rm r}\\
+                                     &\texttt{arrdecls} : d_{\rm a}\ \big\}
 \end{align*}
 
-\subsection{HLS Algorithm}
+\noindent where $d_{\rm a}$ and $d_{\rm r}$ are declaration lists for all the registers used in module $M$, and $g$ is map from states to Verilog statements.
+
+Going back to to the previous accumulator example, the HTL code that is generated is shown in figure~\ref{fig:accumulator_htl}.  The function consists of two main parts, the data-path and the control logic.  The data-path describes how the data registers should change and what computations should be performed in each cycle, whereas the control logic describes how the state changes should be performed.  These two blocks in the main function will execute in parallel, meaning at each state, the control logic will load the next state into the state register, and the data-path will compute a value and store it in the appropriate register.
+
+The translation from RTL to HTL is quite straightforward, as each RTL instruction either matches up quite well to a hardware construct, or does not have to be handled by the translation, such as function calls.  At each instruction, the control flow is separated from the data computation and is then added to the control logic and data-flow map respectively.  For example, in state 16 in the RTL code in figure~\ref{fig:accumulator_rtl}, the register \texttt{x9} is initialised to one and then moves on to state 15.  This is encoded in HTL by initialising a 32 bit register \texttt{reg\_9} to one as well in the data-flow graph section, and also adding a state transition to the state 15 in the controllogic transition.
+
+\subsection{Verilog}
 
 %%% Local Variables:
 %%% mode: latex
diff --git a/data/accumulator.htl b/data/accumulator.htl
index 99e3ccb..16770dc 100644
--- a/data/accumulator.htl
+++ b/data/accumulator.htl
@@ -1,41 +1,41 @@
 main() {
   datapath {
    16:	reg_9 <= 32'd1;
-   15:	reg_13[32'd0] <= reg_9;
+   15:	stack[32'd0] <= reg_9;
    14:	reg_8 <= 32'd2;
-   13:	reg_13[32'd1] <= reg_8;
+   13:	stack[32'd1] <= reg_8;
    12:	reg_7 <= 32'd3;
-   11:	reg_13[32'd2] <= reg_7;
+   11:	stack[32'd2] <= reg_7;
    10:	reg_3 <= 32'd0;
     9:	;
     8:	reg_1 <= 32'd0;
     7:	reg_6 <= 32'd0;
-    6:	reg_5 <= reg_13[{{{reg_6 + 32'd0}
+    6:	reg_5 <= stack[{{{reg_6 + 32'd0}
                         + {reg_1 * 32'd4}} / 32'd4}];
     5:	reg_3 <= {reg_3 + {reg_5 + 32'd0}};
     4:	reg_1 <= {reg_1 + 32'd1};
     3:	;
     2:	reg_4 <= reg_3;
-    1:	reg_11 <= 1'd1; reg_12 <= reg_4;
+    1:	finish <= 1'd1; return_val <= reg_4;
   }
 
   controllogic {
-   16:	reg_10 <= 32'd15;
-   15:	reg_10 <= 32'd14;
-   14:	reg_10 <= 32'd13;
-   13:	reg_10 <= 32'd12;
-   12:	reg_10 <= 32'd11;
-   11:	reg_10 <= 32'd10;
-   10:	reg_10 <= 32'd9;
-    9:	reg_10 <= 32'd8;
-    8:	reg_10 <= 32'd7;
-    7:	reg_10 <= 32'd6;
-    6:	reg_10 <= 32'd5;
-    5:	reg_10 <= 32'd4;
-    4:	reg_10 <= 32'd3;
-    3:	reg_10 <= ({$signed(reg_1) < $signed(32'd3)}
+   16:	state <= 32'd15;
+   15:	state <= 32'd14;
+   14:	state <= 32'd13;
+   13:	state <= 32'd12;
+   12:	state <= 32'd11;
+   11:	state <= 32'd10;
+   10:	state <= 32'd9;
+    9:	state <= 32'd8;
+    8:	state <= 32'd7;
+    7:	state <= 32'd6;
+    6:	state <= 32'd5;
+    5:	state <= 32'd4;
+    4:	state <= 32'd3;
+    3:	state <= ({$signed(reg_1) < $signed(32'd3)}
                         ? 32'd7 : 32'd2);
-    2:	reg_10 <= 32'd1;
+    2:	state <= 32'd1;
     1:	;
   }
 }
diff --git a/data/accumulator.v b/data/accumulator.v
index baabfe3..146176d 100644
--- a/data/accumulator.v
+++ b/data/accumulator.v
@@ -1,53 +1,53 @@
-module main(reg_14, reg_15, reg_16, reg_11, reg_12);
-  always @(posedge reg_16)
-    if ({reg_15 == 1'd1})
-      reg_10 <= 32'd16;
-    else
-      case (reg_10)
-        32'd16: reg_10 <= 32'd15;
-        32'd8: reg_10 <= 32'd7;
-        32'd4: reg_10 <= 32'd3;
-        32'd12: reg_10 <= 32'd11;
-        32'd2: reg_10 <= 32'd1;
-        32'd10: reg_10 <= 32'd9;
-        32'd6: reg_10 <= 32'd5;
-        32'd14: reg_10 <= 32'd13;
-        32'd1: ;
-        32'd9: reg_10 <= 32'd8;
-        32'd5: reg_10 <= 32'd4;
-        32'd13: reg_10 <= 32'd12;
-        32'd3: reg_10 <= ({$signed(reg_1) < $signed(32'd3)}
-                               ? 32'd7 : 32'd2);
-        32'd11: reg_10 <= 32'd10;
-        32'd7: reg_10 <= 32'd6;
-        32'd15: reg_10 <= 32'd14;
-        default:;
-      endcase
-  always @(posedge reg_16)
-    case (reg_10)
+module main(reset, clk, finish, return_val);
+  reg [31:0] stack [2:0];
+  input [0:0] clk, reset;
+  output reg [31:0] return_val;
+  output reg [0:0] finish;
+  reg [31:0] reg_8, reg_4, state, reg_6,
+             reg_1, reg_9, reg_5, reg_3, reg_7;
+  always @(posedge clk)
+    case (state)
       32'd16: reg_9 <= 32'd1;
-      32'd8: reg_1 <= 32'd0;
-      32'd4: reg_1 <= {reg_1 + 32'd1};
+      32'd15: stack[32'd0] <= reg_9;
+      32'd14: reg_8 <= 32'd2;
+      32'd13: stack[32'd1] <= reg_8;
       32'd12: reg_7 <= 32'd3;
-      32'd2: reg_4 <= reg_3;
+      32'd11: stack[32'd2] <= reg_7;
       32'd10: reg_3 <= 32'd0;
-      32'd6: reg_5 <= reg_13[{{{reg_6 + 32'd0}
-                             + {reg_1 * 32'd4}} / 32'd4}];
-      32'd14: reg_8 <= 32'd2;
-      32'd1: begin reg_11 = 1'd1; reg_12 = reg_4; end
       32'd9: ;
+      32'd8: reg_1 <= 32'd0;
+      32'd7: reg_6 <= 32'd0;
+      32'd6: reg_5 <= stack[{{{reg_6 + 32'd0}
+                           + {reg_1 * 32'd4}} / 32'd4}];
       32'd5: reg_3 <= {reg_3 + {reg_5 + 32'd0}};
-      32'd13: reg_13[32'd1] <= reg_8;
+      32'd4: reg_1 <= {reg_1 + 32'd1};
       32'd3: ;
-      32'd11: reg_13[32'd2] <= reg_7;
-      32'd7: reg_6 <= 32'd0;
-      32'd15: reg_13[32'd0] <= reg_9;
+      32'd2: reg_4 <= reg_3;
+      32'd1: begin finish = 1'd1; return_val = reg_4; end
       default:;
     endcase
-  reg [31:0] reg_13 [2:0];
-  input [0:0] reg_16, reg_14, reg_15;
-  reg [31:0] reg_8, reg_4, reg_10, reg_6,
-             reg_1, reg_9, reg_5, reg_3, reg_7;
-  output reg [31:0] reg_12;
-  output reg [0:0] reg_11;
+  always @(posedge clk)
+    if ({reset == 1'd1})
+      state <= 32'd16;
+    else
+      case (state)
+        32'd16: state <= 32'd15;
+        32'd15: state <= 32'd14;
+        32'd14: state <= 32'd13;
+        32'd13: state <= 32'd12;
+        32'd12: state <= 32'd11;
+        32'd11: state <= 32'd10;
+        32'd10: state <= 32'd9;
+        32'd9: state <= 32'd8;
+        32'd8: state <= 32'd7;
+        32'd7: state <= 32'd6;
+        32'd6: state <= 32'd5;
+        32'd5: state <= 32'd4;
+        32'd4: state <= 32'd3;
+        32'd3: state <= ({$signed(reg_1) < $signed(32'd3)}
+                               ? 32'd7 : 32'd2);
+        32'd2: state <= 32'd1;
+        32'd1: ;
+        default:;
+      endcase
 endmodule
diff --git a/main.tex b/main.tex
index 3a4c120..c07b93a 100644
--- a/main.tex
+++ b/main.tex
@@ -52,7 +52,7 @@
 \setminted{fontsize=\small}
 
 \newif\ifCOMMENTS
-\COMMENTStrue
+\COMMENTSfalse
 \newcommand{\Comment}[3]{\ifCOMMENTS\textcolor{#1}{{\bf [\![#2:} #3{\bf ]\!]}}\fi}
 \newcommand\JW[1]{\Comment{red!75!black}{JW}{#1}}
 \newcommand\YH[1]{\Comment{green!50!blue}{YH}{#1}}
author	Yann Herklotz <git@yannherklotz.com>	2020-07-01 17:01:27 +0100
committer	Yann Herklotz <git@yannherklotz.com>	2020-07-01 17:01:27 +0100
commit	beaa247399996201c572e6bf35d73df4a1b8aab7 (patch)
tree	6bb2fef971aca73bae62d72cebb0ab7b5494f6c2
parent	7710d20a16356636729babae6a069b9866475441 (diff)
download	oopsla21_fvhls-beaa247399996201c572e6bf35d73df4a1b8aab7.tar.gz oopsla21_fvhls-beaa247399996201c572e6bf35d73df4a1b8aab7.zip