algorithm.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272

\section{Designing a verified HLS tool}
\label{sec:design}

This section covers the main architecture of the HLS tool, and the way in which the back end was added to \compcert{}.  This section will also cover an example of converting a simple C program into hardware, expressed in the Verilog language.

\paragraph{Choice of source language}
First of all, the choice of C for the input language of \vericert{} is simply because it is what most major HLS tools use~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because C is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}.
%Since a lot of existing code for HLS is written in C, supporting C as an input language, rather than a custom domain-specific language, means that \vericert{} is more practical. 
%An alternative was to support LLVM IR as an input language, however, to get a full work flow from a higher level language to hardware, a front end for that language to LLVM IR would also have to be verified. \JW{Maybe save LLVM for the `Choice of implementation language'?}
We considered Bluespec~\cite{nikhil04_blues_system_veril}, but decided that although it ``can be classed as a high-level language''~\cite{greaves_note}, it is too hardware-oriented to be used for traditional HLS.
We also considered using a language with built-in parallel constructs that map well to parallel hardware, such as occam~\cite{page91_compil_occam} or Spatial~\cite{spatial}, but found these languages too niche.
% However, this would not qualify as being HLS due to the manual parallelism that would have to be performed. \JW{I don't think the presence of parallelism stops it being proper HLS.} 
%\JP{I think I agree with Yann here, but it could be worded better. At any rate not many people have experience writing what is essentially syntactic sugar over a process calculus.} 
%\JW{I mean: there are plenty of software languages that involve parallel constructs. Anyway, perhaps we can just dismiss occam for being too obscure.}


\paragraph{Choice of target language}
Verilog~\cite{06_ieee_stand_veril_hardw_descr_languag} is an HDL that can be synthesised into logic cells which can be either placed onto a field-programmable gate array (FPGA) or turned into an application-specific integrated circuit (ASIC).  Verilog was chosen as the output language for \vericert{} because it is one of the most popular HDLs and there already exist a few formal semantics for it that could be used as a target~\cite{loow19_verif_compil_verif_proces, meredith10_veril}. Other possible targets could have been Bluespec, from which there exists a formally verified translation to circuits using K\^{o}ika~\cite{bourgeat20_essen_blues}.% but targeting this language would not be trivial as it is not meant to be targeted by an automatic tool, instead strives to a formally verified high-level hardware description language instead. 

%\JW{Can we mention one or two alternatives that we considered? Bluespec or Chisel or one of Adam Chlipala's languages, perhaps?}

\paragraph{Choice of implementation language}
We chose Coq as the implementation language because of its mature support for code extraction; that is, its ability to generate OCaml programs directly from the definitions used in the theorems. 
We note that other authors have had some success reasoning about the HLS process using other theorem provers such as Isabelle~\cite{ellis08}.
The framework that was chosen for the front end was \compcert{}, as it is a mature framework for simulation proofs about intermediate languages, and it already provides a validated C parser~\cite{jourdan12_valid_lr_parser}.
The Vellvm~\cite{zhao12_formal_llvm_inter_repres_verif_progr_trans} framework was also considered because several existing HLS tools are already LLVM-based, but additional work would be required to support a high-level language like C as input.
The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\cite{kiwi}, and LLHD~\cite{schuiki20_llhd} has been recently proposed as an intermediate language for hardware design, but neither are suitable for us because they lack formal semantics.

\begin{figure}
  \centering
  \resizebox{0.47\textwidth}{!}{
  \begin{tikzpicture}
    [language/.style={fill=white,rounded corners=3pt,minimum height=7mm},
    continuation/.style={}]
    \fill[compcert,rounded corners=3pt] (-1,-1) rectangle (9,1.5);
    \fill[formalhls,rounded corners=3pt] (-1,-1.5) rectangle (9,-2.5);
    \node[language] at (-0.3,0) (clight) {Clight};
    \node[continuation] at (1,0) (conta) {$\cdots$};
    \node[language] at (2.7,0) (cminor) {CminorSel};
    \node[language] at (4.7,0) (rtl) {3AC};
    \node[language] at (6.2,0) (ltl) {LTL};
    \node[language] at (8.4,0) (ppc) {PPC};
    \node[continuation] at (7.3,0) (contb) {$\cdots$};
    \node[language] at (4.7,-2) (dfgstmd) {HTL};
    \node[language] at (6.7,-2) (verilog) {Verilog};
    \node at (0,1) {\compcert{}};
    \node at (0,-2) {\vericert{}};
    \draw[->] (clight) -- (conta);
    \draw[->] (conta) -- (cminor);
    \draw[->] (cminor) -- (rtl);
    \draw[->] (rtl) -- (ltl);
    \draw[->] (ltl) -- (contb);
    \draw[->] (contb) -- (ppc);
    \draw[->] (rtl) -- (dfgstmd);
    \draw[->] (dfgstmd) -- (verilog);
  \end{tikzpicture}}
  \caption{Verilog back end to Compcert, branching off at the three address code (3AC), at which point the three address code is transformed into a state machine.  Finally, it is transformed to a hardware description of the state machine in Verilog.}%
  \label{fig:rtlbranch}
\end{figure}

\paragraph{Architecture of \vericert{}}
The main work flow of \vericert{} is given in Figure~\ref{fig:rtlbranch}, which shows the parts of the translation that are performed in \compcert{}, and which have been added.

\compcert{} is made up of 11 intermediate languages in between the Clight input and the assembly output, so the first thing we must decide is where best to branch off to our Verilog back end.

We select CompCert's three-address code (3AC)\footnote{This is known as register transfer language (RTL) in the \compcert{} literature. We use `3AC' is used in this paper instead to avoid confusion with register-transfer level (RTL), which is another name for the final hardware target of the HLS tool.} as the branching off point. If we branch off before this (at CminorSel or earlier), then CompCert has not had the opportunity to perform such optimisations as constant propagation and dead code elimination, which have been shown to be valuable not just in conventional compilation but also in HLS~\cite{cong+11}. And if we branch off after this (at LTL or later) then CompCert has already performed register allocation to reduce the number of registers and spill some variables to the stack; this transformation is not required in HLS because there are many more registers available, and these should be used instead of RAM whenever possible.

3AC is also attractive because it is the closest intermediate language to LLVM IR, which is used by several existing HLS compilers. It has an unlimited number of pseudo-registers, and is represented as a control flow graph (CFG) where each instruction is a node with links to the instructions that can follow it. One difference between LLVM IR and 3AC is that 3AC includes operations that are specific to the chosen target architecture; we choose x86\_32 because each instruction maps well to hardware.

\begin{figure}
  \centering
  \begin{subfigure}[b]{0.49\linewidth}
\begin{minted}{c}
int main() {
  int x[3] = {1, 2, 3};
  int sum = 0;
  for (int i = 0;
       i < 3;
       i++)
    sum += x[i];
  return sum;
}
\end{minted}
    \caption{Input C code.}\label{fig:accumulator_c}
  \end{subfigure}\hspace*{-4mm}
  \begin{subfigure}[b]{0.49\linewidth}
\begin{minted}[fontsize=\footnotesize]{c}
main() {
 15:  x8 = 1
 14:  int32[stack(0)] = x8
 13:  x7 = 2
 12:  int32[stack(4)] = x7
 11:  x6 = 3
 10:  int32[stack(8)] = x6
  9:  x2 = 0
  8:  x1 = 0
  7:  x5 = stack(0) (int)
  6:  x4 = int32[x5 + x1 * 4 + 0]
  5:  x2 = x2 + x4 + 0 (int)
  4:  x1 = x1 + 1 (int)
  3:  if (x1 <s 3) goto 7 else goto 2
  2:  x3 = x2
  1:  return x3
}
\end{minted}
    \caption{3AC produced by \compcert{}.}\label{fig:accumulator_rtl}
  \end{subfigure}
  \caption{Using \compcert{} to translate a simple program from C to three address code (3AC).}\label{fig:accumulator_c_rtl}
\end{figure}

\subsection{Translating C to Verilog, by example}

Using the simple program in Figure~\ref{fig:accumulator_c} as a worked example, this section describes how \vericert{} translates a behavioural description in C into a hardware design in Verilog.

\subsubsection{Translating C to 3AC}

The first step of the translation is to use \compcert{} to transform the input C code into the 3AC shown in Figure~\ref{fig:accumulator_rtl}. As part of this, \compcert{} performs such optimisations as constant propagation and dead-code elimination.  Function inlining is also performed, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency. Moreover, inlining excludes support for recursive function calls, but this feature isn't supported in most other HLS tools either~\cite{davidthomas_asap16}.

%\JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're  not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling.  With scheduling I think that's true, inlining actually becomes quite good.}

\subsubsection{Translating 3AC to HTL}

%   + TODO Explain the main mapping in a short simple way

%   + TODO Clarify connection between CFG and FSMD

%   + TODO Explain how memory is mapped
%\JW{I feel like this could use some sort of citation, but I'm not sure what. I guess this is all from "Hardware Design 101", right?}\YH{I think I found a good one actually, which goes over the basics.}
%\JW{I think it would be worth having a sentence to explain how the C model of memory is translated to a hardware-centric model of memory. For instance, in C we have global variables/arrays, stack-allocated variables/arrays, and heap-allocated variables/arrays (anything else?). In Verilog we have registers and RAM blocks. So what's the correspondence between the two worlds? Globals and heap-allocated are not handled, stack-allocated variables become registers, and stack-allocated arrays become RAM blocks? Am I close?}\YH{Stack allocated variables become RAM as well, so that we can deal with addresses easily and take addresses of any variable.} \JW{I see, thanks. So, in short, the only registers in your hardware designs are those that store things like the current state, etc. You generate a fixed number of registers every time you synthesis -- you don't generate extra registers to store any of the program variables. Right?}

The first translation performed in \vericert{} is from 3AC to a new hardware translation language (HTL). %, which is one step towards being completely translated to hardware described in Verilog. 
This involves going from a CFG representation of the computation to a finite state machine with datapath (FSMD) representation~\cite{hwang99_fsmd}. The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers. %This means that the state transitions can be translated into a simple finite state machine (FSM) where each state contains data operations that update the memory and registers. 
An HTL program thus consists of two maps: a control map that describes how to calculate the next state from the current state, and a datapath map that describes how to update the registers and RAM given the current state. Figure~\ref{fig:accumulator_diagram} shows the resulting architecture of the FSMD. 

\begin{figure*}
  \centering
  \includegraphics[scale=0.3,trim={10cm 8cm 5cm 5cm},clip=true]{data/accumulator_fsmd2.pdf}
  \caption{The FSMD for our running example. \JW{Maybe replace `State' with `Current State'? And maybe `Calculate State' could be clearer as `Calculate Next State'?} \JW{Can state 15 (or should it be state 16??) have a dangling incoming arrow to indicate that it is the start state? And perhaps state 1 could have a double outline to indicate that it is an `accepting' state? Since there's space above the `Calculate State' box, I'd be mildly in favour of expanding that box a bit so that it included all 15 states explicitly (snaking back and forth).}\YH{If this is better I can mock up a tikz version of it maybe and fix the last bits then too.}}\label{fig:accumulator_diagram}
\end{figure*}

\paragraph{Translating memory}
Hardware does not have the same memory model as C, so the memory model needs to be translated, as follows.  Global variables are not translated in \vericert{} at the moment. The stack of the main function becomes a block of RAM, as seen in Figure~\ref{fig:accumulator_diagram}. Program variables that have their address taken are stored in this RAM, as are any arrays or structs defined in the function. Variables that do not have their address taken are kept in registers.

\paragraph{Translating instructions}
Each 3AC instruction either corresponds to a hardware construct, or does not have to be handled by the translation, such as function calls.
For example, in state 16 in figure~\ref{fig:accumulator_rtl}, the register \texttt{x9} is initialised to 1, after which the control flow moves to state 15.  This is encoded in HTL by initialising a 32-bit register \texttt{reg\_9} to 1 in the data-flow section, and also adding a transition to the state 15 in the control logic section.  Simple operator instructions are translated in a similar way.  For example, in state 5, the value in the array is added to the current value of the accumulated sum, which is simply translated to an addition of the equivalent registers in the HTL code.

\paragraph{Key challenge: signedness} Note that the comparison in state 3 is signed. This is because C and Verilog handle signedness quite differently. By default, all operators and registers in Verilog (and HTL) are unsigned, so to force an operation to handle the bits as signed, both operators have to be forced to be signed.  In addition to that, Verilog resizes expressions to the largest needed size by default, which can affect the result of the computation.  This feature is also not supported by the Verilog semantics we adopted, and there would therefore be a mismatch between the Verilog semantics and the actual behaviour of Verilog according to the standard.  To bypass this issue, braces are used to stop the Verilog simulator or synthesis tool from resizing anything inside the braces.  Instead, explicit resizing is used in the semantics and operations can only be performed on two registers that have the same size.

\subsection{Translating HTL to Verilog}

Finally, we have to translate the HTL code into proper Verilog. % and prove that it behaves the same as the 3AC according to the Verilog semantics.
Whereas HTL is a language that is specifically designed to represent the FSMDs we are interested in, Verilog is a general-purpose HDL.\@  So the challenge here is to translate our FSMD representation into a Verilog AST.  However, as all the instructions are already expressed in Verilog, only the maps need to be translated to valid Verilog, and correct declarations for all the variables in the program need to be added as well.

This translation seems quite straightforward, but proving it correct is not that simple, as all the implicit assumptions that were made in HTL need to be translated explicitly to Verilog statements and it needs to be shown that these explicit behaviours are equivalent to the assumptions made in the HTL semantics.
Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated for our worked example.  In general, the structure is similar to that of the HTL code, but the control and datapath maps become Verilog case-statements.  The other main addition to the code is the initialisation of all the variables in the code to the correct bitwidths and the declaration of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.

\begin{figure}
  \centering
  \begin{subfigure}[b]{0.49\linewidth}
\begin{minted}[fontsize=\tiny]{verilog}
module main(reset, clk, finish, return_val);
  reg [31:0] stack [2:0];
  input [0:0] clk, reset;
  output reg [31:0] return_val;
  output reg [0:0] finish;
  reg [31:0] reg_8, reg_4, state,
             reg_6, reg_1, reg_9,
             reg_5, reg_3, reg_7;
  always @(posedge clk)
    case (state)
      32'd15: reg_9 <= 32'd1;
      32'd14: stack[32'd0] <= reg_9;
      32'd13: reg_8 <= 32'd2;
      32'd12: stack[32'd1] <= reg_8;
      32'd11: reg_7 <= 32'd3;
      32'd10: stack[32'd2] <= reg_7;
      32'd9: reg_3 <= 32'd0;
      32'd8: reg_1 <= 32'd0;
      32'd7: reg_6 <= 32'd0;
      32'd6: reg_5 <= stack[{{{reg_6 + 32'd0}
        + {reg_1 * 32'd4}} / 32'd4}];
      32'd5: reg_3 <= {reg_3 + {reg_5 + 32'd0}};
      32'd4: reg_1 <= {reg_1 + 32'd1};
      32'd3: ;
      32'd2: reg_4 <= reg_3;
      32'd1: begin
        finish = 1'd1;
        return_val = reg_4;
      end
      default:;
    endcase
\end{minted}
    \caption{Verilog always block describing the datapath of the module.}\label{fig:accumulator_v_1}
  \end{subfigure}\hfill%
  \begin{subfigure}[b]{0.49\linewidth}
\begin{minted}[fontsize=\tiny]{verilog}
  always @(posedge clk)
    if ({reset == 1'd1})
      state <= 32'd16;
    else
      case (state)
        32'd15: state <= 32'd14;
        32'd14: state <= 32'd13;
        32'd13: state <= 32'd12;
        32'd12: state <= 32'd11;
        32'd11: state <= 32'd10;
        32'd10: state <= 32'd9;
        32'd9: state <= 32'd8;
        32'd8: state <= 32'd7;
        32'd7: state <= 32'd6;
        32'd6: state <= 32'd5;
        32'd5: state <= 32'd4;
        32'd4: state <= 32'd3;
        32'd3: state <=
          ({$signed(reg_1) < $signed(32'd3)}
             ? 32'd7 : 32'd2);
        32'd2: state <= 32'd1;
        32'd1: ;
        default:;
      endcase
endmodule
\end{minted}
    \caption{Verilog always block describing the control logic of the module.}\label{fig:accumulator_v_2}
  \end{subfigure}
  \caption{Accumulator example using \vericert{} to translate the 3AC to a state machine expressed in Verilog. \JW{If space permits, it would probably be preferable to have this code in a single column, as splitting a single module across two subfigures is a bit jarring.}\YH{I actually don't mind it for some reason, because it separates control flow and data path, but it's true it is weird.  The only problem is it's very long if I don't do that.}}\label{fig:accumulator_v}
\end{figure}

\subsection{Optimisations}

Although we would not claim that \vericert{} is a proper `optimising' HLS compiler yet, we have nonetheless implemented a few optimisations that aim to improve the quality of the hardware designs it produces.

\subsubsection{Byte- and word-addressable memories}

One big difference between C and Verilog is how memory is represented.  In hardware, efficient RAMs are not as available as in software, and need to be explicitly implemented by declaring two-dimensional arrays with specific properties.  A major limitation is that RAMs often only allow one read and one write per clock cycle. So, to make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.
However, the memory model that \compcert{} uses for its intermediate languages is byte-addressable~\cite{blazy05_formal_verif_memor_model_c}.  It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware.  Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way.  Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four.  If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.

\subsubsection{Implementing the \texttt{Oshrximm} instruction}

% Mention that this optimisation is not performed sometimes (clang -03).

\vericert{} performs some optimisations at the level of the instructions that are generated, so that the hardware performs the instructions as quickly as possible and so that the maximum frequency at which the hardware can run is increased.  One of the main constructs that cripples performance of the generated hardware is the instantiation of divider circuits.  Since \vericert{} requires the result of a divide operation to be ready within a single clock cycle, the divide circuit needs to be entirely combinational.  This is inefficient in terms of area, but also in terms of latency, because it means that the maximum frequency of the hardware must be reduced dramatically so that the divide circuit has enough time to finish.

%\JP{Multi-cycle paths might be something worth exploring in future work: fairly error-prone/dangerous for hand-written code, but might be interesting in generated code.}\YH{Definitely is on the list for next things to look into, will make divide so much more efficient.}

%These small optimisations were found to be the most error prone, and guaranteeing that the new representation is equivalent to representation used in the \compcert{} semantics is difficult without proving this for all possible inputs.

But dividing by a constant can often be optimised to a more efficient operation, especially if the denominator is a power of two.  In \compcert{}, the \texttt{Oshrximm} instruction is a strength-reduced version of a signed divide operation that performs the following operation, which is transformed to our optimal representation on the right, where $\div$ stands for integer signed division and $>>$ stands for a logical right shift:

\begin{align*}
  &\forall x, y \in \mathbb{Z},\ \ 0 \leq y < 31,\ \ -2^{31} \leq x < 2^{31},\\
  &x \div 2^y =
  \begin{cases}
    \left\lfloor \frac{x}{2^y} \right\rfloor = x >> y,& \text{if } x \geq 0\\
    \left\lceil \frac{x}{2^y} \right\rceil = - \left\lfloor \frac{-x}{2^y} \right\rfloor = - ( - x >> y ),& \text{otherwise}.
  \end{cases}\\
\end{align*}

The \compcert{} semantics for the \texttt{Oshrximm} instruction express it's operation exactly as shown in the equation above, even though in hardware the computation that would be performed would be different.  In \vericert{}, if the same operation would be implemented using Verilog operators, it is not guaranteed to be optimised correctly by the synthesis tools that convert the Verilog into a circuit.  To guarantee an output that does not include divides, we therefore have to express it in Verilog using shifts, and then prove that this representation is equivalent to the divide representation used in the \compcert{} semantics.  This proof discovered quite a few bugs in our initial implementation of optimisations, which rounded to $-\infty$ instead of 0.

%\JW{I wonder if Section 2 could benefit from a `Some Key Challenges' subsection, where you highlight several interesting bits of the translation process, each with their own paragraph heading. These could be something like:\begin{enumerate}\item Discrepancy between C and Verilog w.r.t. signedness \item Deciding between byte- and word-addressable memories \item Adding reset signals \item Implementing the Oshrximm instruction correctly \end{enumerate} For the causal reader, this would immediately signal two things: (1) you can skip this subsection on your initial pass, and (2) proving the HLS tool correct was a non-trivial undertaking.}

% - Explain main differences between translating C to software and to hardware.

%   + This can be done by going through the simple example.


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% TeX-command-extra-options: "-shell-escape"
%%% End: