archive/algorithm.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177

The translation from 3AC to HTL is straightforward, as each 3AC instruction either matches up quite well to a hardware construct, or does not have to be handled by the translation, such as function calls.
%At each instruction, the control flow is separated from the data computation and is then added to the control logic and data-flow map respectively.
%\JW{I suspect that you could safely chop that sentence.}
For example, in state 15 in figure~\ref{fig:accumulator_rtl}, the register \texttt{x8} is initialised to 1, after which the control flow moves to state 14.  This is encoded in HTL by initialising a 32-bit register \texttt{reg\_8} to 1 in the data-flow section, and also adding a transition to the state 15 in the control logic section.  Simple operator instructions are translated in a similar way.  For example, in state 5, the value in the array is added to the current value of the accumulated sum, which is simply translated to an addition of the equivalent registers in the HTL code.

\subsection{Translating HTL to Verilog}

Finally, we have to translate the HTL code into proper Verilog and prove that it behaves the same as the 3AC according to the Verilog semantics.  Whereas HTL is a language that is specifically designed to represent the FSMDs we are interested in, Verilog is a general-purpose HDL.\@  So the challenge here is to translate our FSMD representation into a Verilog AST.  However, as all the instructions are already expressed in Verilog, only the maps need to be translated to valid Verilog, and correct declarations for all the variables in the program need to be added as well.

This translation seems quite straightforward, however, proving that it is correct is not that simple, as all the implicit assumptions that were made in HTL need to be translated explicitly to Verilog statements and needs it needs to be shown that these explicit behaviours are equivalent to the assumptions made in the HTL semantics.
Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated.  In general, the structure is similar to that of the HTL code, however, the control- and data-path maps have been translated to case statements.  The other main addition to the code is the initialisation of all the variables in the code to the correct bitwidths and the declaration of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsubsection{Reset signals}

\YH{This section could maybe go into the proof section instead.}

\JW{Yeah I agree that the rest of this paragraph describes a detail of the proof and should be in the Proof section.}
Even though functions calls are not supported by \vericert{}, the initial function call that \compcert{} uses to enter the main function does need to be supported in the Verilog semantics, as explained further in Section~\ref{sec:verilog}.  The reset signal therefore has to be reasoned about correctly in the Semantics and in the initial function call to ensure that the initial state of the module is set correctly, as in 3AC, the entry point of the function is part of the function definition.

\paragraph{Key challenge: Adding reset signals}
In hardware, there is no notion of a stack that can be popped or pushed.  It is therefore necessary to add reset signals to the main module \JWcouldcut{that will be called} so that the proper state can be set at the start of the function.
The main subtle change that was added to the code is the reset signal which sets the state to the starting state correctly.  In HTL, this was described directly in the semantics, where the entry point is stored in the module interface of HTL.  However, in Verilog we also want to verify that the hardware will be placed into the right state when we power it up or reset the design, and it therefore has to be encoded directly in the Verilog code.
To check that the initial state of the Verilog is the same as the HTL code, we therefore have to run the module once, assuming the state is undefined and the reset is set to high.  We then have to compare the abstract starting state stored in the HTL module to the bitvector value we obtain from running the module for one clock cycle and prove that they are the same.  As the value for the state is undefined, the case statements will evaluate to the default state and therefore not perform any computations.

It is therefore important to find the right intermediate language \JW{You already said `crucial to know where to branch off' so this feels repetitive} so that the HLS tool still benefits from many of the generic optimisations that \compcert{} performs, but does not receive the code transformations that are specific to CPU architectures.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Turning CompCert into an HLS tool}

%% Should maybe go in the introduction instead.

\begin{figure}
  \centering
  \resizebox{0.47\textwidth}{!}{
  \begin{tikzpicture}
    [language/.style={fill=white,rounded corners=2pt}]
    \fill[compcert,rounded corners=3pt] (-1,-1) rectangle (9,1.5);
    \fill[formalhls,rounded corners=3pt] (-1,-1.5) rectangle (9,-2.5);
    \node[language] at (0,0) (clight) {Clight};
    \node[language] at (2,0) (cminor) {C\#minor};
    \node[language] at (4,0) (rtl) {RTL};
    \node[language] at (6,0) (ltl) {LTL};
    \node[language] at (8,0) (ppc) {PPC};
    \node[language] at (4,-2) (dfgstmd) {HTL};
    \node[language] at (7,-2) (verilog) {Verilog};
    \node at (0,1) {CompCert};
    \node at (0,-2) {Vericert};
    \draw[->] (clight) -- (cminor);
    \draw[->,dashed] (cminor) -- (rtl);
    \draw[->] (rtl) -- (ltl);
    \draw[->,dashed] (ltl) -- (ppc);
    \draw[->] (rtl) -- (dfgstmd);
    \draw[->] (dfgstmd) -- (verilog);
  \end{tikzpicture}}
  \caption{Verilog backend branching off at the RTL stage. \JW{It's a real nightmare that `RTL' is so ambiguous, meaning both software and hardware. I think we have to do something about it. One possibility: according to the CompCert webpages, software RTL is `also known as 3-address code', so we could replace (software) `RTL' with `3AC' (and give a footnote to explain why). Is it also possible to replace (hardware) `RTL' with just `Verilog'? I know Verilog could mean RTL or netlist, but since this paper doesn't talk much about RTL-to-netlist synthesis, perhaps that doesn't matter? Or perhaps RTL could be replaced with FSMD? Or even just `hardware design'?}\YH{I like that suggestion, I will actually replace RTL by 3AC then, and at least in this figure I did purposefully replace RTL by Verilog, and wanted to do it throughout the paper.}}%
  \label{fig:rtlbranch}
\end{figure}

This section covers the main architecture of the HLS tool, and the way in which the backend was added to CompCert.

CompCert is made up of 11 intermediate languages in between the Clight input and the assembly output.  These intermediate languages each serve a different purpose and contain various different optimisations.  When designing a new backend for CompCert, it is therefore crucial to know where to branch off and start the hardware generation.  Many of the optimisations that the CompCert backend performs are not necessary when generating custom hardware and not relying on a CPU anymore, such as register allocation or even scheduling.  It is therefore important to find the right intermediate language so that the HLS tool still benefits from many of the generic optimisations that CompCert performs, but does not receive the code transformations that are specific to CPU architectures.

Existing HLS compilers usually use LLVM IR as an intermediate representation when performing HLS specific optimisations, as each instruction can be mapped quite well to hardware which performs the same behaviour.  CompCert's register transfer language (RTL) is the intermediate language that resembles LLVM IR the most, as it also has an infinite number of pseudo-registers and each instruction maps well to hardware. \JP{Perhaps this needs some further qualification? RTL uses the operations from the target architecture, and indeed performs architecture specific optimisations prior to RTL gen, so (for sake of example) switching from x86 RTL to RISC-V RTL could have a significant impact on performance.}\YH{Yes will definitely include those, just have to think about where.  I think this could be something that could be mentioned in the RTL section instead.}  In addition to that, many optimisations that are also useful for HLS are performed in RTL, which means that if it is supported as the input language, the HLS algorithm benefits from the same optimisations.  It is therefore a good candidate to be chosen as the input language to the HLS backend. The complete flow that Vericert takes is show in figure~\ref{fig:rtlbranch}.

%%TODO: Maybe add why LTL and the other smaller languages are not that well suited

\begin{figure}
  \centering
  \begin{subfigure}[b]{0.5\linewidth}
    \inputminted{c}{data/accumulator.c}
    \caption{Accumulator C code.}\label{fig:accumulator_c}
  \end{subfigure}%
  \begin{subfigure}[b]{0.5\linewidth}
    \inputminted[fontsize=\footnotesize]{c}{data/accumulator.rtl}
    \caption{Accumulator RTL code.}\label{fig:accumulator_rtl}
  \end{subfigure}
  \caption{Accumulator example using CompCert to translate from C to RTL.}\label{fig:accumulator_c_rtl}
\end{figure}

\begin{figure}
  \centering
  \begin{subfigure}[b]{0.5\linewidth}
    \inputminted[fontsize=\tiny]{systemverilog}{data/accumulator1.v}
    \caption{Accumulator HTL code.}\label{fig:accumulator_htl}
  \end{subfigure}%
  \begin{subfigure}[b]{0.5\linewidth}
    \inputminted[fontsize=\tiny]{systemverilog}{data/accumulator2.v}
    \caption{Accumulator Verilog code.}\label{fig:accumulator_v}
  \end{subfigure}
  \caption{Accumulator example using CompCert to translate from HTL to Verilog.\YH{I feel like these examples take up too much space, but don't really know of a different way to show off a complete example without the long code.} \JW{Ok, what about drawing the FSMD in some sort of pictorial way? I think you tried drawing it as a schematic previously, but that was too big and clumsy. What about drawing the FSMD as a state machine, with sixteen states and labelled edges between them? Or might that be too abstract?}}\label{fig:accumulator_htl_v}
\end{figure}

%\begin{figure}
%  \centering
%  \includegraphics[width=0.47\textwidth]{data/accumulator_fsmd}
%  \caption{Accumulator example translated to a finite state-machine with datapath (FSMD).}\label{fig:accumulator_fsmd}
%\end{figure}

To describe the translation, we start with an example of how to translate a simple accumulator example, which is shown in figure~\ref{fig:accumulator_c}.  Using this example, the different stages of the translation can be explained, together with the way they were proven.

The first step is to use the CompCert front end passes to convert the C code into RTL, which is the intermediate language that CompCert uses for most of its optimisations.  The translation is shown in figure~\ref{fig:accumulator_c_rtl}, where the RTL code is depicted in figure~\ref{fig:accumulator_rtl}.  At the RTL stage, many of CompCert's normal optimisations passes, such as deadcode elimination or common subexpression elimination (CSE) can be applied to reduce the RTL as much as possible.  The optimised RTL is then translated to HTL, consisting of a finite state-machine representation of the RTL code.  This representation maps directly to hardware and it therefore transformed to Verilog afterwards.  The translation from RTL to HTL is the main step in converting the software representation of the code into a hardware representation of the code.  The following sections will go over the details of the transformation of each pass.

\subsection{CompCert RTL}

RTL is the existing intermediate language in CompCert that was chosen to transform to hardware, as it is quite architecture independent compared to the lower level intermediate languages, and each instruction translates to a hardware construct.  All CompCert intermediate language follow the similar structure below:

\begin{align*}
  \mathit{program} \quad ::= \{ &\mathbf{variables} : (\mathit{id} * \mathit{data}) \text{ list}, \\
                 &\mathbf{functions} : (\mathit{id} * \mathit{function\_def}) \text{ list},\\
                 &\mathbf{main} : \mathit{id} \}
\end{align*}

\noindent where function definitions can either be internal or external.  External functions are functions that are not defined in the current translation unit, and are therefore not part of the current translation.  The difference in between the CompCert intermediate languages is therefore how the internal function is defined, as that defines the structure of the language itself.

%% Describe RTL
RTL function definitions are a sequence of instructions encoded in a control-flow graph, with each instruction linking to the next instruction that should be executed.  RTL consists of ten separate instructions in total, which makes this a good candidate for high-level synthesis translation, as these instructions translate quite directly to a hardware equivalent.  This structure makes it ideal for performing optimisations on, as performing static analysis and transformations on this language is much simpler than doing the same optimisation for the C abstract syntax tree instead.  However, even though RTL is quite architecture independent and good for software optimisations, it does not model hardware well and is therefore not suitable for hardware optimisations.  We therefore need a new intermediate language that translates directly into hardware and which can also be used as an intermediate language for more hardware specific optimisations.

However, before this translation can be performed, all functions need to be inlined.  This means that recursive function calls cannot be supported, however, these are difficult to translate to hardware as they require dynamic allocation of the stack.  Dynamic allocation cannot be performed in hardware, because the memory size has to be set statically and then translated to a physical memory in the hardware. \JP{I think we should explain this a little bit more; HW allocation is possible in principle.} Inlining is already an optimisation that is performed at the RTL stage in CompCert, however, to get RTL that is ready to be translated to hardware, this pass was modified so that is forced all functions to be inlined.  This eliminates all the function calls from the RTL, as long as there are no recursive function calls, which means that do not have to be handled by the translation to hardware.

\subsection{HTL}

The first hardware specific intermediate language that is introduced is the hardware transfer language (HTL).  It is a self-contained finite state machine with datapath (FSMD) representation of the RTL code which maps directly to actual hardware.  HTL, like all CompCert intermediate languages, has the same program structure as RTL, but internal functions now contain logic to control the order of execution, and a datapath that transforms the data in the registers.  This is represented by having two maps that link states to the control logic and to the current position in the datapath, which are both expressed using Verilog statements.  The syntax for HTL functions are the following:

\begin{align*}
  g \quad &::= \quad n \mapsto s\\
  d_{\rm r} \quad &::= \quad r \mapsto (io? * n)\\
  d_{\rm a} \quad &::= \quad r \mapsto (io? * n * n)\\
  M \quad &::= \quad \big\{\ \texttt{params} : r \text{ list}\\
                                     &\texttt{datapath} : g\\
                                     &\texttt{controllogic} : g\\
                                     &\texttt{entrypoint} : n\\
                                     &\texttt{st, stk, finish, return, start, reset, clk} : r\\
                                     &\texttt{scldecls} : d_{\rm r}\\
                                     &\texttt{arrdecls} : d_{\rm a}\ \big\}
\end{align*}

\noindent where $d_{\rm a}$ and $d_{\rm r}$ are declaration lists for all the registers used in module $M$, and $g$ is map from states to Verilog statements.

Going back to to the previous accumulator example, the HTL code that is generated is shown in figure~\ref{fig:accumulator_htl}.  The function consists of two main parts, the data-path and the control logic.  The data-path describes how the data registers should change and what computations should be performed in each cycle, whereas the control logic describes how the state changes should be performed.  These two blocks in the main function will execute in parallel, meaning at each state, the control logic will load the next state into the state register, and the data-path will compute a value and store it in the appropriate register.

The translation from RTL to HTL is quite straightforward, as each RTL instruction either matches up quite well to a hardware construct, or does not have to be handled by the translation, such as function calls.  At each instruction, the control flow is separated from the data computation and is then added to the control logic and data-flow map respectively.  For example, in state 16 in the RTL code in figure~\ref{fig:accumulator_rtl}, the register \texttt{x9} is initialised to one and then moves on to state 15.  This is encoded in HTL by initialising a 32 bit register \texttt{reg\_9} to one as well in the data-flow graph section, and also adding a state transition to the state 15 in the controllogic transition.  Simple operator instructions are translated in a similar way.  For example, in state 5, the value in the array is added to the current value of the accumulated sum, which is simply translated to an addition of the equivalent registers in the HTL code.

The translation can be described by a relational specification, which can be described by the following relation, where \textit{fin}, \textit{rtrn}, $\sigma$ and \textit{stk} are the registers for the finished signal, return value, current state and stack respectively, $i$ is the RTL instruction being translated, and \textit{data} and \textit{control} are the data-flow and control logic map respectively.

\begin{equation*}
  \yhfunction{tr\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ i\ \textit{data }\ \textit{control}
\end{equation*}

\noindent An example of a rule describing the translation of an operation is the following:

\begin{equation*}
  \inferrule[Iop]{\yhfunction{tr\_op } \textit{op }\ \vec{a} = \yhconstant{OK } e}{\yhfunction{tr\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ (\yhconstant{Iop } \textit{op }\ \vec{a}\ d\ n)\ (d \Leftarrow e)\ (\sigma \Leftarrow n)}
\end{equation*}

In the accumulator C code, the for loop is translated to a branch statement in state 3, which can also be translated to equivalent HTL code by placing the comparison into the control logic part of the main function, and not performing any data operations in the data-flow section.  On the next clock cycle, the state will therefore transition to state 7 or state 2 depending on if \texttt{reg\_3} is less than 3 or not.  One thing to note is that in this case, the comparison is signed.  By default, all operators and registers in Verilog and HTL are unsigned, so to force an operation to handle the bits as signed, both operators have to be forced to signed.  In addition to that, Verilog resizes expressions to the largest needed size by default, which can affect the result of the computation.  This feature is also not supported by the Verilog semantics we used, and there would therefore be a mismatch between the Verilog semantics and the actual behaviour of Verilog according to the standard.  To bypass this issue braces are used to stop the Verilog simulator or synthesis tool from resizing anything inside the braces.  Instead, explicit resizing is used in the semantics and operations can only be performed on two registers that have the same size.

Finally, this example also contains an array, which is stored on the stack for the main function.  Translating arrays means that load and store instructions need to be supported.  In hardware RAM is not as available as in software, and needs to be explicitly implemented by declaring a two dimensional array with specific properties, such as only being able to write and read from it once per clock cycle.  Therefore, to make memory as efficient as possible and support reads and writes in one clock cycle, a memory only addressable by word needs to be implemented, instead of the standard RAM which is addressable by byte.  This is because the clock cycles in devices such as FPGAs are normally an order of magnitude slower than CPUs, meaning taking four cycles to load an integer is not feasible.  However, this leads to various problems, the main one being that addresses need to be word-aligned, which is not the case in RTL or in C.  Therefore, addresses need to be divisible by four to also be valid addresses in HTL, which is shown in the load from the stack in state 6 of the RTL.  The equivalent HTL code is shown in the datapath branch where the same addition is done, however, the result is divided by 4.  To prove that these two loads have the same behaviour, we have to prove that the load was word-aligned and therefore could be divided by 4.\YH{Elaborate on our solution.}

\subsection{Verilog}

Finally, we have to translate the HTL code into proper Verilog and prove that it behaves the same as the RTL according to the Verilog semantics.  The Verilog output is modelled as a complete abstract syntax tree (AST) instead of being an abstract map over the instructions that are executed.  However, as all the instructions are already expressed in Verilog, only the maps need to be translated to valid Verilog, and correct declarations for all the variables in the program need to be added as well.

This translation seems quite straightforward, however, proving that it is correct is not that simple, as all the implicit assumptions that were made in HTL need to be translated explicitly to Verilog and needs to have the same behaviour according to the semantics.  Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated.  In general, the structure is similar to the structure of the HTL code, however, the control and datapath maps have been translated to case statement which serve the same purpose.  The other main addition to the code is the initialisation of all the variables in the code to the correct bitwidth and the declaration of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.  The main subtle change that was added to the code is the reset signal which sets the state to the starting state correctly.  In HTL, this was described directly in the semantics, where the entrypoint is stored in the module interface of HTL.  However, in Verilog we also want to verify that the hardware will be placed into the right state when we power it up or reset the design, and it therefore has to be encoded directly in the Verilog code.

To check that the initial state of the Verilog is the same as the HTL code, we therefore have to run the module once, assuming the state is undefined and the reset is set to high.  We then have to compare the abstract starting state stored in the HTL module to the bitvector value we obtain from running the module for one clock cycle and prove that they are the same.  As the value for the state is undefined, the case statements will evaluate to the default state and therefore not perform any computations.

The translation from maps to case statements is done by turning each node of the tree into a case expression with the statments in each being the same.  The main difficulty for the proof is that a structure that can be directly accessed is transformed into an inductive structure where a certain number of constructors need to be called to get to the correct case.  The proof of the translation from maps to case statements follows by induction over the list of elements in the map and the fact that each key in this list will be unique.  In addition to that, the statement that is currently being evaluated is guaranteed by the correctness of the list of elements to be in that list.  The latter fact therefore eliminates the base case, as an empty list does not contain the element we know is in the list.  The other two cases follow by the fact that either the key is equal to the evaluated value of the case expression, or it isn't.  In the first case we can then evaluate the statement and get the state after the case expression, as the uniqueness of the key tells us that the key cannot show up in the list anymore.  In the other case we can just apply the inductive hypothesis and remove the current case from the case statement, as it did not match.

Another problem with the representation of the state as an actual register is that we have to make sure that the state does not overflow.  Currently, the state register will always be 32 bits, meaning the maximum number of states can only be $2^{32} - 1$.  We therefore have to prove that the state value will never go over that value.  This means that during the translation we have to check for each state that it can fit into an integer.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: