algorithm.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131

\section{Turning CompCert into an HLS tool}

%% Should maybe go in the introduction instead.

\begin{figure}
  \centering
  \begin{tikzpicture}
    [language/.style={fill=white,rounded corners=2pt}]
    \fill[compcert,rounded corners=3pt] (-1,-1) rectangle (9,1.5);
    \fill[formalhls,rounded corners=3pt] (-1,-1.5) rectangle (9,-2.5);
    \node[language] at (0,0) (clight) {Clight};
    \node[language] at (2,0) (cminor) {C\#minor};
    \node[language] at (4,0) (rtl) {RTL};
    \node[language] at (6,0) (ltl) {LTL};
    \node[language] at (8,0) (ppc) {PPC};
    \node[language] at (4,-2) (dfgstmd) {HTL};
    \node[language] at (7,-2) (verilog) {Verilog};
    \node at (0,1) {CompCert};
    \node at (0,-2) {CoqUp};
    \draw[->] (clight) -- (cminor);
    \draw[->,dashed] (cminor) -- (rtl);
    \draw[->] (rtl) -- (ltl);
    \draw[->,dashed] (ltl) -- (ppc);
    \draw[->] (rtl) -- (dfgstmd);
    \draw[->] (dfgstmd) -- (verilog);
  \end{tikzpicture}
  \caption{Verilog backend branching off at the RTL stage.}\label{fig:rtlbranch}
\end{figure}

This section covers the main architecture of the HLS tool, and how the backend was added to CompCert.

CompCert is made up of 11 intermediate languages in between the Clight input and the assembly output.  These intermediate languages each serve a different purpose and contain various different optimisations.  When designing a new backend for CompCert, it is therefore crucial to know where to branch off and start the hardware generation.  Many of the optimisations that the CompCert backend performs are not necessary when generating custom hardware and not relying on a CPU anymore, such as register allocation or even scheduling.  It is therefore important to find the right intermediate language so that the HLS tool still benefits from many of the generic optimisations that CompCert performs, but does not receive the code transformations that are specific to CPU architectures.

Existing HLS compilers usually use LLVM IR as an intermediate representation when performing HLS specific optimisations, as each instruction can be mapped quite well to hardware which performs the same behaviour.  CompCert's register transfer language (RTL) is the intermediate language that resembles LLVM IR the most, as it also has an infinite number of pseudo-registers and each instruction maps well to hardware. \JP{Perhaps this needs some further qualification? RTL uses the operations from the target architecture, and indeed performs architecture specific optimisations prior to RTL gen, so (for sake of example) switching from x86 RTL to RISC-V RTL could have a significant impact on performance.}\YH{Yes will definitely include those, just have to think about where.  I think this could be something that could be mentioned in the RTL section instead.}  In addition to that, many optimisations that are also useful for HLS are performed in RTL, which means that if it is supported as the input language, the HLS algorithm benefits from the same optimisations.  It is therefore a good candidate to be chosen as the input language to the HLS backend. The complete flow that CoqUp takes is show in figure~\ref{fig:rtlbranch}.

%%TODO: Maybe add why LTL and the other smaller languages are not that well suited

\begin{figure}
  \centering
  \begin{subfigure}[b]{0.4\linewidth}
    \inputminted{c}{data/accumulator.c}
    \caption{Accumulator C code.}\label{fig:accumulator_c}
  \end{subfigure}%
  \begin{subfigure}[b]{0.6\linewidth}
    \inputminted[fontsize=\footnotesize]{c}{data/accumulator.rtl}
    \caption{Accumulator RTL code.}\label{fig:accumulator_rtl}
  \end{subfigure}
  \caption{Accumulator example using CompCert to translate from C to RTL.}\label{fig:accumulator_c_rtl}
\end{figure}

\begin{figure}
  \centering
  \begin{subfigure}[b]{0.5\linewidth}
    \inputminted[fontsize=\tiny]{systemverilog}{data/accumulator.htl}
    \caption{Accumulator HTL code.}\label{fig:accumulator_htl}
  \end{subfigure}%
  \begin{subfigure}[b]{0.5\linewidth}
    \inputminted[fontsize=\tiny]{systemverilog}{data/accumulator.v}
    \caption{Accumulator Verilog code.}\label{fig:accumulator_v}
  \end{subfigure}
  \caption{Accumulator example using CompCert to translate from HTL to Verilog.\YH{I feel like these examples take up too much space, but don't really know of a different way to show off a complete example without the long code.}}\label{fig:accumulator_htl_v}
\end{figure}

To describe the translation, we start with an example of how to translate a simple accumulator example, which is shown in figure~\ref{fig:accumulator_c}.  Using this example, the different stages in the translation can be explained, together with how they were proven.

The first step is to use the CompCert front end passes to convert the C code into RTL, which is the intermediate language that CompCert uses for most of its optimisations.  The translation is shown in figure~\ref{fig:accumulator_c_rtl}, where the RTL code is depicted in figure~\ref{fig:accumulator_rtl}.  At the RTL stage, many of CompCert's normal optimisations passes, such as deadcode elimination or common subexpression elimination (CSE) can be applied to reduce the RTL as much as possible.  The optimised RTL is then translated to HTL, consisting of a finite state-machine representation of the RTL code.  This representation maps directly to hardware and it therefore transformed to Verilog afterwards.  The translation from RTL to HTL is the main step in converting the software representation of the code into a hardware representation of the code.  The following sections will go over the details of the transformation of each pass.

\subsection{CompCert RTL}

RTL is the existing intermediate language in CompCert that was chosen to transform to hardware, as it is quite architecture independent compared to the lower level intermediate languages, and each instruction translates to a hardware construct.  All CompCert intermediate language follow the similar structure below:

\begin{align*}
  \mathit{program} \quad ::= \{ &\mathbf{variables} : (\mathit{id} * \mathit{data}) \text{ list}, \\
                 &\mathbf{functions} : (\mathit{id} * \mathit{function\_def}) \text{ list},\\
                 &\mathbf{main} : \mathit{id} \}
\end{align*}

\noindent where function definitions can either be internal or external.  External functions are functions that are not defined in the current translation unit, and are therefore not part of the current translation.  The difference in between the CompCert intermediate languages is therefore how the internal function is defined, as that defines the structure of the language itself.

%% Describe RTL
RTL function definitions are a sequence of instructions encoded in a control-flow graph, with each instruction linking to the next instruction that should be executed.  RTL consists of ten separate instructions in total, which makes this a good candidate for high-level synthesis translation, as these instructions translate quite directly to a hardware equivalent.  This structure makes it ideal for performing optimisations on, as performing static analysis and transformations on this language is much simpler than doing the same optimisation for the C abstract syntax tree instead.  However, even though RTL is quite architecture independent and good for software optimisations, it does not model hardware well and is therefore not suitable for hardware optimisations.  We therefore need a new intermediate language that translates directly into hardware and which can also be used as an intermediate language for more hardware specific optimisations.

However, before this translation can be performed, all functions need to be inlined.  This means that recursive function calls cannot be supported, however, these are difficult to translate to hardware as they require dynamic allocation of the stack.  Dynamic allocation cannot be performed in hardware, because the memory size has to be set statically and then translated to a physical memory in the hardware. \JP{I think we should explain this a little bit more; HW allocation is possible in principle.} Inlining is already an optimisation that is performed at the RTL stage in CompCert, however, to get RTL that is ready to be translated to hardware, this pass was modified so that is forced all functions to be inlined.  This eliminates all the function calls from the RTL, as long as there are no recursive function calls, which means that do not have to be handled by the translation to hardware.

\subsection{HTL}

The first hardware specific intermediate language that is introduced is the hardware transfer language (HTL).  It is a self-contained finite state machine with datapath (FSMD) representation of the RTL code which maps directly to actual hardware.  HTL, like all CompCert intermediate languages, has the same program structure as RTL, but internal functions now contain logic to control the order of execution, and a datapath that transforms the data in the registers.  This is represented by having two maps that link states to the control logic and to the current position in the datapath, which are both expressed using Verilog statements.  The syntax for HTL functions are the following:

\begin{align*}
  g \quad &::= \quad n \mapsto s\\
  d_{\rm r} \quad &::= \quad r \mapsto (io? * n)\\
  d_{\rm a} \quad &::= \quad r \mapsto (io? * n * n)\\
  M \quad &::= \quad \big\{\ \texttt{params} : r \text{ list}\\
                                     &\texttt{datapath} : g\\
                                     &\texttt{controllogic} : g\\
                                     &\texttt{entrypoint} : n\\
                                     &\texttt{st, stk, finish, return, start, reset, clk} : r\\
                                     &\texttt{scldecls} : d_{\rm r}\\
                                     &\texttt{arrdecls} : d_{\rm a}\ \big\}
\end{align*}

\noindent where $d_{\rm a}$ and $d_{\rm r}$ are declaration lists for all the registers used in module $M$, and $g$ is map from states to Verilog statements.

Going back to to the previous accumulator example, the HTL code that is generated is shown in figure~\ref{fig:accumulator_htl}.  The function consists of two main parts, the data-path and the control logic.  The data-path describes how the data registers should change and what computations should be performed in each cycle, whereas the control logic describes how the state changes should be performed.  These two blocks in the main function will execute in parallel, meaning at each state, the control logic will load the next state into the state register, and the data-path will compute a value and store it in the appropriate register.

The translation from RTL to HTL is quite straightforward, as each RTL instruction either matches up quite well to a hardware construct, or does not have to be handled by the translation, such as function calls.  At each instruction, the control flow is separated from the data computation and is then added to the control logic and data-flow map respectively.  For example, in state 16 in the RTL code in figure~\ref{fig:accumulator_rtl}, the register \texttt{x9} is initialised to one and then moves on to state 15.  This is encoded in HTL by initialising a 32 bit register \texttt{reg\_9} to one as well in the data-flow graph section, and also adding a state transition to the state 15 in the controllogic transition.  Simple operator instructions are translated in a similar way.  For example, in state 5, the value in the array is added to the current value of the accumulated sum, which is simply translated to an addition of the equivalent registers in the HTL code.

The translation can be described by a relational specification, which can be described by the following relation, where \textit{fin}, \textit{rtrn}, $\sigma$ and \textit{stk} are the registers for the finished signal, return value, current state and stack respectively, $i$ is the RTL instruction being translated, and \textit{data} and \textit{control} are the data-flow and control logic map respectively.

\begin{equation*}
  \yhfunction{tr\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ i\ \textit{data }\ \textit{control}
\end{equation*}

\noindent An example of a rule describing the translation of an operation is the following:

\begin{equation*}
  \inferrule[Iop]{\yhfunction{tr\_op } \textit{op }\ \vec{a} = \yhconstant{OK } e}{\yhfunction{tr\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ (\yhconstant{Iop } \textit{op }\ \vec{a}\ d\ n)\ (d \Leftarrow e)\ (\sigma \Leftarrow n)}
\end{equation*}

In the accumulator C code, the for loop is translated to a branch statement in state 3, which can also be translated to equivalent HTL code by placing the comparison into the control logic part of the main function, and not performing any data operations in the data-flow section.  On the next clock cycle, the state will therefore transition to state 7 or state 2 depending on if \texttt{reg\_3} is less than 3 or not.  One thing to note is that in this case, the comparison is signed.  By default, all operators and registers in Verilog and HTL are unsigned, so to force an operation to handle the bits as signed, both operators have to be forced to signed.  In addition to that, Verilog resizes expressions to the largest needed size by default, which can affect the result of the computation.  This feature is also not supported by the Verilog semantics we used, and there would therefore be a mismatch between the Verilog semantics and the actual behaviour of Verilog according to the standard.  To bypass this issue braces are used to stop the Verilog simulator or synthesis tool from resizing anything inside the braces.  Instead, explicit resizing is used in the semantics and operations can only be performed on two registers that have the same size.

Finally, this example also contains an array, which is stored on the stack for the main function.  Translating arrays means that load and store instructions need to be supported.  In hardware RAM is not as available as in software, and needs to be explicitly implemented by declaring a two dimensional array with specific properties, such as only being able to write and read from it once per clock cycle.  Therefore, to make memory as efficient as possible and support reads and writes in one clock cycle, a memory only addressable by word needs to be implemented, instead of the standard RAM which is addressable by byte.  This is because the clock cycles in devices such as FPGAs are normally an order of magnitude slower than CPUs, meaning taking four cycles to load an integer is not feasible.  However, this leads to various problems, the main one being that addresses need to be word-aligned, which is not the case in RTL or in C.  Therefore, addresses need to be divisible by four to also be valid addresses in HTL, which is shown in the load from the stack in state 6 of the RTL.  The equivalent HTL code is shown in the datapath branch where the same addition is done, however, the result is divided by 4.  To prove that these two loads have the same behaviour, we have to prove that the load was word-aligned and therefore could be divided by 4.\YH{Elaborate on our solution.}

\subsection{Verilog}

Finally, we have to prove the translation from Verilog to hardware correct

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: