summaryrefslogtreecommitdiffstats
path: root/algorithm.tex
blob: c4c91a4c9b2d809f769fc91588c6508d76dd11a4 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
\section{Designing a verified HLS tool}
\label{sec:design}

This section describes the main architecture of the HLS tool, and the way in which the Verilog back end was added to \compcert{}.  This section also covers an example of converting a simple C program into hardware, expressed in the Verilog language.

\paragraph{Choice of source language}
C was chosen as the source language as it remains the most common source language amongst production-quality HLS tools~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because it is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}, lending a degree of practicality. 
The availability of \compcert{}~\cite{leroy09_formal_verif_realis_compil} also provides a solid basis for formally verified C compilation.
%Since a lot of existing code for HLS is written in C, supporting C as an input language, rather than a custom domain-specific language, means that \vericert{} is more practical. 
%An alternative was to support LLVM IR as an input language, however, to get a full work flow from a higher level language to hardware, a front end for that language to LLVM IR would also have to be verified. \JW{Maybe save LLVM for the `Choice of implementation language'?}
We considered Bluespec~\cite{nikhil04_blues_system_veril}, but decided that although it ``can be classed as a high-level language''~\cite{greaves_note}, it is too hardware-oriented to be suitable for traditional HLS.
We also considered using a language with built-in parallel constructs that map well to parallel hardware, such as occam~\cite{page91_compil_occam}, Spatial~\cite{spatial} or Scala~\cite{chisel}.
% However, this would not qualify as being HLS due to the manual parallelism that would have to be performed. \JW{I don't think the presence of parallelism stops it being proper HLS.}
%\JP{I think I agree with Yann here, but it could be worded better. At any rate not many people have experience writing what is essentially syntactic sugar over a process calculus.} 
%\JW{I mean: there are plenty of software languages that involve parallel constructs. Anyway, perhaps we can just dismiss occam for being too obscure.}


\paragraph{Choice of target language}
Verilog~\cite{06_ieee_stand_veril_hardw_descr_languag} is an HDL that can be synthesised into logic cells which can either be placed onto a field-programmable gate array (FPGA) or turned into an application-specific integrated circuit (ASIC).  Verilog was chosen as the output language for \vericert{} because it is one of the most popular HDLs and there already exist a few formal semantics for it that could be used as a target~\cite{loow19_verif_compil_verif_proces, meredith10_veril}.  Bluespec, previously ruled out as a source language, is another possible target and there exists a formally verified translation to circuits using K\^{o}ika~\cite{bourgeat20_essen_blues}. %\JP{This needs an extra comment maybe?}\YH{Maybe about bluespec not being an ideal target language because it's quite high-level?} % but targeting this language would not be trivial as it is not meant to be targeted by an automatic tool, instead strives to a formally verified high-level hardware description language instead. 

%\JW{Can we mention one or two alternatives that we considered? Bluespec or Chisel or one of Adam Chlipala's languages, perhaps?}

\paragraph{Choice of implementation language}
We chose Coq as the implementation language because of its mature support for code extraction; that is, its ability to generate OCaml programs directly from the definitions used in the theorems.
We note that other authors have had some success reasoning about the HLS process using other theorem provers such as Isabelle~\cite{ellis08}.
\compcert{}~\cite{leroy09_formal_verif_realis_compil} was chosen as the front end framework, as it is a mature framework for simulation proofs about intermediate languages, and it already provides a validated C parser~\cite{jourdan12_valid_lr_parser}.
The Vellvm framework~\cite{zhao12_formal_llvm_inter_repres_verif_progr_trans} was also considered because several existing HLS tools are already LLVM-based, but additional work would be required to support a high-level language like C as input.
The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\cite{kiwi}, and LLHD~\cite{schuiki20_llhd} has been recently proposed as an intermediate language for hardware design, but neither are suitable for us because they lack formal semantics.

\begin{figure}
  \centering
  % JW: resizebox is associated with puppies dying
  %\resizebox{0.85\textwidth}{!}{
  \begin{tikzpicture}
    [language/.style={fill=white,rounded corners=3pt,minimum height=7mm},
    continuation/.style={}]
    \fill[compcert,rounded corners=3pt] (-1,-0.5) rectangle (9,1.5);
    \fill[formalhls,rounded corners=3pt] (-1,-1) rectangle (9,-2);
    \node[language] at (-0.3,0) (clight) {Clight};
    \node[continuation] at (1,0) (conta) {$\cdots$};
    \node[language] at (2.7,0) (cminor) {CminorSel};
    \node[language] at (4.7,0) (rtl) {3AC};
    \node[language] at (6.2,0) (ltl) {LTL};
    \node[language] at (8.4,0) (ppc) {PPC};
    \node[continuation] at (7.3,0) (contb) {$\cdots$};
    \node[language] at (4.7,-1.5) (htl) {HTL};
    \node[language] at (6.7,-1.5) (verilog) {Verilog};
    \node at (0,1) {\bf\compcert{}};
    \node at (0,-1.5) {\bf\vericert{}};
    \node[align=center] at (3.3,-2.4) {\footnotesize RAM\\[-0.5em]\footnotesize insertion};
    \draw[->,thick] (clight) -- (conta);
    \draw[->,thick] (conta) -- (cminor);
    \draw[->,thick] (cminor) -- (rtl);
    \draw[->,thick] (rtl) -- (ltl);
    \draw[->,thick] (ltl) -- (contb);
    \draw[->,thick] (contb) -- (ppc);
    \draw[->,thick] (rtl) -- (htl);
    \draw[->,thick] (htl) -- (verilog);
    \draw[->,thick] (htl.west) to [out=180,in=150] (4,-2.2) to [out=330,in=270] (htl.south);
  \end{tikzpicture}%}
  \caption{\vericert{} as a Verilog back end to \compcert{}}%
  \label{fig:rtlbranch}
\end{figure}

\paragraph{Architecture of \vericert{}}
The main work flow of \vericert{} is given in Figure~\ref{fig:rtlbranch}, which shows those parts of the translation that are performed in \compcert{}, and those that have been added.

\def\numcompcertlanguages{ten}

\compcert{} translates Clight\footnote{A deterministic subset of C with pure expressions.} input into assembly output via a sequence of intermediate languages; we must decide which of these \numcompcertlanguages{} languages is the most suitable starting point for the HLS-specific translation stages.

We select CompCert's three-address code (3AC)\footnote{This is known as register transfer language (RTL) in the \compcert{} literature. `3AC' is used in this paper instead to avoid confusion with register-transfer level (RTL), which is another name for the final hardware target of the HLS tool.} as the starting point. Branching off \emph{before} this point (at CminorSel or earlier) denies \compcert{} the opportunity to perform optimisations such as constant propagation and dead code elimination, which, despite being designed for software compilers, have been found useful in HLS tools as well~\cite{cong+11}. And if we branch off \emph{after} this point (at LTL or later) then \compcert{} has already performed register allocation to reduce the number of registers and spill some variables to the stack; this transformation is not required in HLS because there are many more registers available, and these should be used instead of RAM whenever possible. %\JP{``\compcert{} performs register allocation during the translation to LTL, with some registers spilled onto the stack: this is unnecessary in HLS since as many registers as are required may be described in the output RTL.''} \JP{Maybe something about FPGAs being register-dense (so rarely a need to worry about the number of flops)?}

3AC is also attractive because it is the closest intermediate language to LLVM IR, which is used by several existing HLS compilers. %\JP{We already ruled out LLVM as a starting point, so this seems like it needs further qualification.}\YH{Well not because it's not a good starting point, but the ecosystem in Coq isn't as good.  I think it's still OK here to say that being similar to LLVM IR is an advantage?} 
It has an unlimited number of pseudo-registers, and is represented as a control flow graph (CFG) where each instruction is a node with links to the instructions that can follow it. One difference between LLVM IR and 3AC is that 3AC includes operations that are specific to the chosen target architecture; we chose to target the x86\_32 backend, because it generally produces relatively dense 3AC thanks to the availability of complex addressing modes.% reducing cycle counts in the absence of an effective scheduling approach.

\begin{figure}
  \centering
  \begin{subfigure}[b]{0.3\linewidth}
    \begin{subfigure}[t]{1\linewidth}
\begin{minted}[fontsize=\footnotesize,linenos,xleftmargin=20pt]{c}
int main() {
    int x[2] = {3, 6};
    int i = 1;
    return x[i];
}
\end{minted}
      \caption{Example C code passed to \vericert{}.}\label{fig:accumulator_c}
    \end{subfigure}\\\vspace{3em}
    \begin{subfigure}[b]{1\linewidth}
\begin{minted}[fontsize=\footnotesize,linenos,xleftmargin=20pt]{c}
main() {
    x5 = 3
    int32[stack(0)] = x5
    x4 = 6
    int32[stack(4)] = x4
    x1 = 1
    x3 = stack(0) (int)
    x2 = int32[x3 + x1
               * 4 + 0]
    return x2
}
\end{minted}
      \caption{3AC produced by the \compcert{} front-end without any optimisations.}\label{fig:accumulator_rtl}
    \end{subfigure}
  \end{subfigure}\hfill%
  \begin{subfigure}[b]{0.65\linewidth}
\begin{minted}[fontsize=\tiny,linenos,xleftmargin=20pt]{verilog}
module main(reset, clk, finish, return_val);
  input [0:0] reset, clk;
  output reg [0:0] finish = 0;
  output reg [31:0] return_val = 0;
  reg [31:0] reg_3 = 0, addr = 0, d_in = 0, reg_5 = 0, wr_en = 0;
  reg [0:0] en = 0, u_en = 0;
  reg [31:0] state = 0, reg_2 = 0, reg_4 = 0, d_out = 0, reg_1 = 0;
  reg [31:0] stack [1:0];
  // RAM interface
  always @(negedge clk)
    if ({u_en != en}) begin
      if (wr_en) stack[addr] <= d_in;
      else d_out <= stack[addr];
      en <= u_en;
    end
  // Data-path
  always @(posedge clk)
    case (state)
      32'd11: reg_2 <= d_out;
      32'd8: reg_5 <= 32'd3;
      32'd7: begin u_en <= ( ! u_en); wr_en <= 32'd1;
                   d_in <= reg_5; addr <= 32'd0; end
      32'd6: reg_4 <= 32'd6;
      32'd5: begin u_en <= ( ! u_en); wr_en <= 32'd1;
                   d_in <= reg_4; addr <= 32'd1; end
      32'd4: reg_1 <= 32'd1;
      32'd3: reg_3 <= 32'd0;
      32'd2: begin u_en <= ( ! u_en); wr_en <= 32'd0;
                   addr <= {{{reg_3 + 32'd0} + {reg_1 * 32'd4}} / 32'd4}; end
      32'd1: begin finish = 32'd1; return_val = reg_2; end
      default: ;
    endcase
  // Control logic
  always @(posedge clk)
    if ({reset == 32'd1}) state <= 32'd8;
    else case (state)
           32'd11: state <= 32'd1;        32'd4: state <= 32'd3;
           32'd8: state <= 32'd7;         32'd3: state <= 32'd2;
           32'd7: state <= 32'd6;         32'd2: state <= 32'd11;
           32'd6: state <= 32'd5;         32'd1: ;
           32'd5: state <= 32'd4;         default: ;
         endcase
endmodule
\end{minted}
\caption{Verilog produced by \vericert{}. It demonstrates the instantiation of the RAM (lines 9--15), the data-path (lines 16--32) and the control logic (lines 33--42).}\label{fig:accumulator_v}
\end{subfigure}
  \caption{Translating a simple program from C to Verilog.}\label{fig:accumulator_c_rtl}
\end{figure}

\subsection{Translating C to Verilog, by example}
Figure~\ref{fig:accumulator_c_rtl} illustrates the translation of a simple program that stores and retrieves values from an array.
In this section, we describe the stages of the \vericert{} translation, referring to this program as an example.

\subsubsection{Translating C to 3AC}

The first stage of the translation uses unmodified \compcert{} to transform the C input, shown in Figure~\ref{fig:accumulator_c}, into a 3AC intermediate representation, shown in Figure~\ref{fig:accumulator_rtl}.
As part of this translation, function inlining is performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency. Inlining precludes support for recursive function calls, but this feature is not supported in most other HLS tools either~\cite{davidthomas_asap16}.

%\JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're  not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling.  With scheduling I think that's true, inlining actually becomes quite good.}

\subsubsection{Translating 3AC to HTL}

%   + TODO Explain the main mapping in a short simple way

%   + TODO Clarify connection between CFG and FSMD

%   + TODO Explain how memory is mapped
%\JW{I feel like this could use some sort of citation, but I'm not sure what. I guess this is all from "Hardware Design 101", right?}\YH{I think I found a good one actually, which goes over the basics.}
%\JW{I think it would be worth having a sentence to explain how the C model of memory is translated to a hardware-centric model of memory. For instance, in C we have global variables/arrays, stack-allocated variables/arrays, and heap-allocated variables/arrays (anything else?). In Verilog we have registers and RAM blocks. So what's the correspondence between the two worlds? Globals and heap-allocated are not handled, stack-allocated variables become registers, and stack-allocated arrays become RAM blocks? Am I close?}\YH{Stack allocated variables become RAM as well, so that we can deal with addresses easily and take addresses of any variable.} \JW{I see, thanks. So, in short, the only registers in your hardware designs are those that store things like the current state, etc. You generate a fixed number of registers every time you synthesis -- you don't generate extra registers to store any of the program variables. Right?}

The next translation is from 3AC to a new hardware translation language (HTL). %, which is one step towards being completely translated to hardware described in Verilog.
This involves going from a CFG representation of the computation to a finite state machine with data-path (FSMD) representation~\cite{hwang99_fsmd}. The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers. %\JP{I've become less comfortable with this term, but it's personal preference so feel free to ignore. I think `generalised finite state machine' (i.e.\ thinking of the entire `data-path' as contributing to the overall state) is more accurate.}\YH{Hmm, yes, I mainly chose FSMD because there is quite a lot of literature around it.  I think for now I'll keep it but for the final draft we could maybe change it.}
%This means that the state transitions can be translated into a simple finite state machine (FSM) where each state contains data operations that update the memory and registers. 
Hence, an HTL program consists of two maps from states to Verilog statements: control logic and data-path maps that express state transitions and computations respectively.
Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The right-hand block is the control logic that computes the next state, while the left-hand block updates all the registers and RAM based on the current program state. 

\begin{figure*}
  \centering
\definecolor{control}{HTML}{b3e2cd}
\definecolor{data}{HTML}{fdcdac}
\scalebox{1.2}{%
\begin{tikzpicture}
  \fill[control,fill opacity=1] (6.5,0) rectangle (12,5);
  \fill[data,fill opacity=1] (0,0) rectangle (5.5,5);
  \node at (1,4.7) {\footnotesize Data-path};
  \node at (7.5,4.7) {\footnotesize Control Logic};

  \fill[white,rounded corners=10pt] (7,0.5) rectangle (11.5,2.2);
  \node at (8,2) {\tiny Next State FSM};
  \foreach \x in {8,...,2}
    {\pgfmathtruncatemacro{\y}{8-\x}%
      \node[draw,circle,inner sep=0,minimum size=10,scale=0.8] (s\x) at (7.5+\y/2,1.35) {\tiny \x};}
  \node[draw,circle,inner sep=0,minimum size=10,scale=0.8] (s1c) at (11,1.35) {\tiny 1};
  \node[draw,circle,inner sep=0,minimum size=13,scale=0.8] (s1) at (s1c) {};
  \foreach \x in {8,...,3}
    {\pgfmathtruncatemacro{\y}{\x-1}\draw[-{Latex[length=1mm,width=0.7mm]}] (s\x) -- (s\y);}
  \node[draw,circle,inner sep=0,minimum size=10,scale=0.8] (s11) at (10.5,0.9) {\tiny 11};
  \draw[-{Latex[length=1mm,width=0.7mm]}] (s2) -- (s11);
  \draw[-{Latex[length=1mm,width=0.7mm]}] (s11) -- (s1);
  \draw[-{Latex[length=1mm,width=0.7mm]}] (7.2,1.7) to [out=0,in=100] (s8);

  \node[draw,fill=white] (nextstate) at (9.25,3) {\tiny Current State};
  \draw[-{Latex[length=1mm,width=0.7mm]}] let \p1 = (nextstate) in
    (11.5,1.25) -| (11.75,\y1) -- (nextstate);
  \draw let \p1 = (nextstate) in (nextstate) -- (6,\y1) |- (6,1.5);
  \node[scale=0.4,rotate=60] at (7.5,0.75) {\texttt{clk}};
  \node[scale=0.4,rotate=60] at (7.7,0.75) {\texttt{rst}};
  \draw[-{Latex[length=1mm,width=0.7mm]}] (7.65,-0.5) -- (7.65,0.5);
  \draw[-{Latex[length=1mm,width=0.7mm]}] (7.45,-0.5) -- (7.45,0.5);

  \fill[white,rounded corners=10pt] (2,0.5) rectangle (5,3);
  \filldraw[fill=white] (0.25,0.5) rectangle (1.5,2.75);
  \node at (2.6,2.8) {\tiny Update};
  \node[align=center] at (0.875,2.4) {\tiny \texttt{RAM}};
  \node[scale=0.4] at (4.7,1.5) {\texttt{state}};
  \draw[-{Latex[length=1mm,width=0.7mm]}] (6,1.5) -- (5,1.5);
  \draw[-{Latex[length=1mm,width=0.7mm]}] (6,1.5) -- (7,1.5);
  \node[scale=0.4,rotate=60] at (4.1,0.9) {\texttt{finished}};
  \node[scale=0.4,rotate=60] at (3.9,0.95) {\texttt{return\_val}};
  \node[scale=0.4,rotate=60] at (2.5,0.75) {\texttt{clk}};
  \node[scale=0.4,rotate=60] at (2.7,0.75) {\texttt{rst}};

  \node[scale=0.4,right,inner sep=5pt] (ram1) at (2,2.1) {\texttt{u\_en}};
  \node[scale=0.4,right,inner sep=5pt] (ram2) at (2,1.9) {\texttt{wr\_en}};
  \node[scale=0.4,right,inner sep=5pt] (ram3) at (2,1.7) {\texttt{addr}};
  \node[scale=0.4,right,inner sep=5pt] (ram4) at (2,1.5) {\texttt{d\_in}};
  \node[scale=0.4,right,inner sep=5pt] (ram5) at (2,1.3) {\texttt{d\_out}};

  \node[scale=0.4,left,inner sep=5pt] (r1) at (1.5,2.1) {\texttt{u\_en}};
  \node[scale=0.4,left,inner sep=5pt] (r2) at (1.5,1.9) {\texttt{wr\_en}};
  \node[scale=0.4,left,inner sep=5pt] (r3) at (1.5,1.7) {\texttt{addr}};
  \node[scale=0.4,left,inner sep=5pt] (r4) at (1.5,1.5) {\texttt{d\_in}};
  \node[scale=0.4,left,inner sep=5pt] (r5) at (1.5,1.3) {\texttt{d\_out}};

  \draw[-{Latex[length=1mm,width=0.7mm]}] (ram1) -- (r1);
  \draw[-{Latex[length=1mm,width=0.7mm]}] (ram2) -- (r2);
  \draw[-{Latex[length=1mm,width=0.7mm]}] (ram3) -- (r3);
  \draw[-{Latex[length=1mm,width=0.7mm]}] (ram4) -- (r4);
  \draw[-{Latex[length=1mm,width=0.7mm]}] (r5) -- (ram5);

  \draw[-{Latex[length=1mm,width=0.7mm]}] (4,0.5) -- (4,-0.5);
  \draw[-{Latex[length=1mm,width=0.7mm]}] (3.75,0.5) -- (3.75,-0.5);
  \draw[-{Latex[length=1mm,width=0.7mm]}] (2.45,-0.5) -- (2.45,0.5);
  \draw[-{Latex[length=1mm,width=0.7mm]}] (2.65,-0.5) -- (2.65,0.5);

  \foreach \x in {0,...,1}
  {\draw (0.25,1-0.25*\x) -- (1.5,1-0.25*\x); \node at (0.875,0.88-0.25*\x) {\tiny \x};}

  %\node[scale=0.4] at (1.2,2.2) {\texttt{wr\_en}};
  %\node[scale=0.4] at (1.2,2) {\texttt{wr\_addr}};
  %\node[scale=0.4] at (1.2,1.8) {\texttt{wr\_data}};
  %\node[scale=0.4] at (1.2,1.4) {\texttt{r\_addr}};
  %\node[scale=0.4] at (1.2,1.2) {\texttt{r\_data}};
  %
  %\node[scale=0.4] at (2.3,2.2) {\texttt{wr\_en}};
  %\node[scale=0.4] at (2.3,2) {\texttt{wr\_addr}};
  %\node[scale=0.4] at (2.3,1.8) {\texttt{wr\_data}};
  %\node[scale=0.4] at (2.3,1.4) {\texttt{r\_addr}};
  %\node[scale=0.4] at (2.3,1.2) {\texttt{r\_data}};
  %
  %\draw[-{Latex[length=1mm,width=0.7mm]}] (2,2.2) -- (1.5,2.2);
  %\draw[-{Latex[length=1mm,width=0.7mm]}] (2,2) -- (1.5,2);
  %\draw[-{Latex[length=1mm,width=0.7mm]}] (2,1.8) -- (1.5,1.8);
  %\draw[-{Latex[length=1mm,width=0.7mm]}] (2,1.4) -- (1.5,1.4);
  %\draw[-{Latex[length=1mm,width=0.7mm]}] (1.5,1.2) -- (2,1.2);

  \filldraw[fill=white] (2.8,3.25) rectangle (4.2,4.75);
  \node at (3.5,4.55) {\tiny \texttt{Registers}};
  \draw[-{Latex[length=1mm,width=0.7mm]}] (2,2.4) -| (1.75,4) -- (2.8,4);
  \draw[-{Latex[length=1mm,width=0.7mm]}] (4.2,4) -- (5.25,4) |- (5,2.4);
  \draw[-{Latex[length=1mm,width=0.7mm]}] (5.25,2.4) -- (6.2,2.4) |- (7,1.8);

  \node[scale=0.4] at (3.5,4.2) {\texttt{reg\_1}};
  \node[scale=0.4] at (3.5,4) {\texttt{reg\_2}};
  \node[scale=0.4] at (3.5,3.8) {\texttt{reg\_3}};
  \node[scale=0.4] at (3.5,3.6) {\texttt{reg\_4}};
  \node[scale=0.4] at (3.5,3.4) {\texttt{reg\_5}};
\end{tikzpicture}}
  \caption{The FSMD for the example shown in Figure~\ref{fig:accumulator_c_rtl}, split into a data-path and control logic for the next state calculation.  The Update block takes the current state, current values of all registers and at most one value stored in the array, and calculates a new value that can either be stored back in the array or in a register.}\label{fig:accumulator_diagram}
\end{figure*}

%\JP{Does it? Verilog has neither physical registers nor RAMs, just language constructs which the synthesiser might implement with registers and RAMs. We should be clear whether we're talking about the HDL representation, or the synthesised result: in our case these can be very different since we don't target any specific architectural features of an FPGA fabric of ASIC process.}
\paragraph{Translating memory}
Typically, HLS-generated hardware consists of a sea of registers and RAM memories.
This memory view is very different to the C memory model, so we perform the following translation. 
Variables that do not have their address taken are kept in registers, which correspond to the registers in 3AC.
All address-taken variables, arrays, and structs are kept in RAM.
The stack of the main function becomes an unpacked array of 32-bit integers, which is translated to a RAM when the hardware description is passed through a synthesis tool.  In a first pass, \vericert{} represents the RAM by direct reads and writes to an array. This is followed by a RAM insertion pass which inserts a proper RAM interface, and reads and writes are performed through that interface instead.  This interface is shown on lines 9-15 in the Verilog code in Figure~\ref{fig:accumulator_v}.
Finally, global variables are not translated in \vericert{} at the moment.
A high-level overview of the architecture can be seen in Figure~\ref{fig:accumulator_diagram}. 

\paragraph{Translating instructions}

Each 3AC instruction either corresponds to a hardware construct, or does not have to be handled by the translation, such as function calls (because of inlining).
For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in HTL at state 8 in both the control logic and data-path. always-blocks, shown in Figure~\ref{fig:accumulator_v}.  Simple operator instructions are translated in a similar way.

C and Verilog handle signedness quite differently. By default, all operators and registers in Verilog (and HTL) are unsigned, so to force an operation to handle the bits as signed, both operators have to be forced to be signed. Moreover, Verilog implicitly resizes expressions to the largest needed size by default, which can affect the result of the computation.  This feature is not supported by the Verilog semantics we adopted, so to match the semantics to the behaviour of the simulator and synthesis tool, braces are placed around all expressions to inhibit implicit resizing.  Instead, explicit resizing is used in the semantics, and operations can only be performed on two registers that have the same size.

In addition to that, equality checks between \emph{unsigned} variables is actually not supported, because this requires supporting the comparison of pointers, which should only be performed between pointers with the same provenance.  In \vericert{} there is currently no way to determine the provenance of a pointer, and it therefore cannot model the semantics of unsigned comparison in \compcert{}. This is not a severe restriction in practice, however, because, as dynamic allocation is not supported either, equality comparison of pointers is rarely needed, and equality comparison of integers can still be performed by casting them both to signed integers.

Finally, the \texttt{mulhs} and \texttt{mulhu} instructions, which fetch the upper bits of a 32-bit multiplication, are not translated by \vericert{} either, because 64-bit numbers are not supported. These instructions are only generated to optimise divisions by a constant that is not a power of two, so turning off constant propagation will allow these programs to pass without error.

\subsubsection{Translating HTL to Verilog}

Finally, we have to translate the HTL code into proper Verilog. % and prove that it behaves the same as the 3AC according to the Verilog semantics.
The challenge here is to translate our FSMD representation into a Verilog AST.  However, as all the instructions in HTL are already expressed as Verilog statements, only the top level data-path and control logic maps need to be translated to valid Verilog. We also require declarations for all the variables in the program, as well as declarations of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.  In addition to translating the maps of Verilog statements, an always-block that will behave like the RAM also has to be created, which is only modelled abstractly at the HTL level.
Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated for our example.

Although this translation seems quite straight\-forward, proving that this translation is correct is complex.
All the implicit assumptions that were made in HTL need to be translated explicitly to Verilog statements and it needs to be shown that these explicit behaviours are equivalent to the assumptions made in the HTL semantics.  One main example of this is proving that specification of the RAM in HTL does indeed behave the same as its Verilog implementation.
We discuss these proofs in upcoming sections.

 %In general, the generated Verilog structure has similar to that of the HTL code.
%The key difference is that the control and datapath maps become Verilog case-statements. 
%Other additions are the initialisation of all the variables in the code to the correct bitwidths and the declaration of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.

\subsection{Optimisations}

Although we would not claim that \vericert{} is a proper `optimising' HLS compiler yet, we have nonetheless made several design choices that aim to improve the quality of the hardware designs it produces.

\subsubsection{Byte- and word-addressable memories}

One big difference between C and Verilog is how memory is represented.  Although Verilog arrays use similar syntax to C arrays, they must be treated quite differently. To make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.
However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}.  If a byte-addressable memory was used in the target hardware, which is closer to \compcert{}'s memory model, then a load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle.  It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware.  Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way.  Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four.  If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.

\subsubsection{Implementation of RAM interface}
The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.
Interestingly, the syntax for the RAM interface is quite strict, as the synthesis tool will pattern match on it and only work for a predefined set of interfaces.  Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.

Therefore, an extra compiler pass is added from HTL to HTL to extract all the direct accesses to the Verilog array and replace them by signals which access the RAM interface in a separate always-block. The translation is performed by going through all the instructions and replacing each load and store expression in turn.  Stores can simply be replaced by the necessary wires directly, however, because loads using the RAM interface take two clock cycles where a direct load from an array takes only one, this pass inserts an extra state after each load.

%\JW{I've called that negedge always-block the `RAM driver' in my proposed text above -- that feels like quite a nice a word for it to my mind -- what do you think?}\YH{Yes I quite like it!}
%Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs.  If, for example, arrays in Verilog are accessed immediately in the data-path, then the synthesis tool is not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers.  This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit.  The synthesis tools therefore provide code snippets that they know how to transform into various constructs, including snippets that will generate proper RAMs in the final hardware.  This process is called memory inference.  The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and logic.  We therefore have another pass from HTL to itself which performs the translation from this na\"ive use of arrays to a representation which always allows for memory inference.  This pass creates a separate always block to perform the loads and stores to the memory, and adds the necessary data, address and enable signals to communicate with that always-block from other always-blocks.  This always-block is shown between lines 10-15 in Figure~\ref{fig:accumulator_v}.

There are two interesting parts to the inserted RAM interface.  Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design which is triggered on the positive edge of the clock.  The main advantage is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they only take two clock cycles and one clock cycle instead, greatly improving their performance.  In addition to that, using the negative edge for the clock is supported by many synthesis tools, and does not affect the maximum frequency of the final design.

Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical.  To make the proof simpler, the goal is to create a RAM which disables itself after every use, so that firstly, the proof can assume that the RAM is disabled at the start and end of every clock cycle, and secondly so that only the state which contains the load and store need to be modified to ensure this property.  With a simple enable signal, it would not be possible to disable it in the RAM itself, because this would result in a register being driven twice from two different locations, once from the RAM interface and once from the data-path.  The only other solutions are to either insert extra states that disable the RAM accordingly, thereby eliminating the speed advantage of the negative edge triggered RAM, or to check the next state after a load and store and insert disables into that state.  The latter solution can quickly become complicated though, especially as this new state could contain another memory operation, in which case the disable signal should not be added to that state.  A solution to this problem is to create another enable signal that is controlled by the self-disabling RAM, which is always set to be equal to the enable signal set by the data-path.  The RAM is then considered enabled if the data-path enable and the RAM enable are different.

%We can instead generate a second enable signal that is set by the user, and the original enable signal is then updated by the RAM to be equal to the value that the user set.  This means that the RAM should be enabled whenever the two signals are different, and disabled otherwise.

\begin{figure}
  \centering
  \begin{subfigure}[b]{0.48\linewidth}
    \begin{tikztimingtable}[timing/d/background/.style={fill=white}]
      \small clk & 2L 3{6C} \\
      \small u\_en & 2D{u\_en} 18D{$\overline{\text{u\_en}}$}\\
      \small addr & 2U 18D{3} \\
      \small wr\_en & 2U 18L \\
      \small en & 8D{u\_en} 12D{$\overline{\text{u\_en}}$}\\
      \small d\_out & 8U 12D{0xDEADBEEF} \\
      \small r & 14U 6D{0xDEADBEEF} \\
      \extracode
      \node[help lines] at (2,2.25) {\tiny 1};
      \node[help lines] at (8,2.25) {\tiny 2};
      \node[help lines] at (14,2.25) {\tiny 3};
      \begin{pgfonlayer}{background}
        \vertlines[help lines]{2,8,14}
      \end{pgfonlayer}
    \end{tikztimingtable}
    \caption{Timing diagram for loads.  The \texttt{u\_en} signal is toggled which enables the RAM, then d\_out is set to be the value stored at the address in the RAM, which is finally assigned to the register \texttt{r}.}\label{fig:ram_load}
  \end{subfigure}\hfill%
  \begin{subfigure}[b]{0.48\linewidth}
    \begin{tikztimingtable}[timing/d/background/.style={fill=white}]
      \small clk & 2L 2{7C} \\
      \small u\_en & 2D{u\_en} 14D{$\overline{\text{u\_en}}$}\\
      \small addr & 2U 14D{3} \\
      \small wr\_en & 2U 14H \\
      \small d\_in & 2U 14D{0xDEADBEEF} \\
      \small en & 9D{u\_en} 7D{$\overline{\text{u\_en}}$}\\
      \small stack[addr] & 9U 7D{0xDEADBEEF} \\
      \extracode
      \node[help lines] at (2,2.25) {\tiny 1};
      \node[help lines] at (9,2.25) {\tiny 2};
      \begin{pgfonlayer}{background}
        \vertlines[help lines]{2,9}
      \end{pgfonlayer}
    \end{tikztimingtable}
    \caption{Timing diagram for stores.  The \texttt{u\_en} signal is toggled to enable the RAM, together with the address \texttt{addr} and the data to store \texttt{d\_in}.  On the negative edge the data is stored into the RAM.}\label{fig:ram_store}
  \end{subfigure}
  \caption{Timing diagrams showing the execution of loads and stores over multiple clock cycles.}\label{fig:ram_load_store}
\end{figure}

Figure~\ref{fig:ram_load} shows an example of how the waveforms in the RAM shown in Figure~\ref{fig:accumulator_v} behave when a value is loaded.  To initiate a load, the data-path enable signal \texttt{u\_en} flag is toggled, the address \texttt{addr} is set and the write enable \texttt{wr\_en} is set to low.  This all happens at the positive edge of the clock, in the time slice 1.  Then, on the next negative edge of the clock, at time slice 2, the \texttt{u\_en} is now different to the RAM enable \texttt{en}, so the RAM is enabled.  A load is then performed by assigning the data-out register to the value stored at the address in the RAM and the \texttt{en} is set to the same value as \texttt{u\_en} to disable the RAM again.  Finally, on the next positive edge of the clock, the value in \texttt{d\_out} is assigned to the destination register \texttt{r}.  An example of a store is shown in Figure~\ref{fig:ram_store}, which instead assigns the \texttt{d\_in} register with the value to be stored.  The store is then performed on the negative edge of the clock and is therefore complete by the next positive edge.

\subsubsection{Implementing the \texttt{Oshrximm} instruction}

% Mention that this optimisation is not performed sometimes (clang -03).

Many of the \compcert{} instructions map well to hardware, but \texttt{Oshrximm} (efficient signed division by a power of two using a logical shift) is expensive if implemented na\"ively. The problem is that in \compcert{} it is specified as a signed division:
\begin{equation*}
  \texttt{Oshrximm } x\ y = \text{round\_towards\_zero}\left(\frac{x}{2^{y}}\right)
\end{equation*}
(where $x, y \in \mathbb{Z}$, $0 \leq y < 31$, and $-2^{31} \leq x < 2^{31}$) and instantiating divider circuits in hardware is well-known to cripple performance. Moreover, since \vericert{} requires the result of a divide operation to be ready within a single clock cycle, the divide circuit needs to be entirely combinational. This is inefficient in terms of area, but also in terms of latency, because it means that the maximum frequency of the hardware must be reduced dramatically so that the divide circuit has enough time to finish.

%\JP{Multi-cycle paths might be something worth exploring in future work: fairly error-prone/dangerous for hand-written code, but might be interesting in generated code.}\YH{Definitely is on the list for next things to look into, will make divide so much more efficient.}

%These small optimisations were found to be the most error prone, and guaranteeing that the new representation is equivalent to representation used in the \compcert{} semantics is difficult without proving this for all possible inputs.

One might hope that the synthesis tool consuming our generated Verilog would convert the division to an efficient shift operation, but this is unlikely to happen with signed division which requires more than a single shift. However, the observation can be made that signed division can be implemented using shifts:
\begin{equation*}
  \text{round\_towards\_zero}\left(\frac{x}{2^{y}}\right) =
  \begin{cases}
    x \gg y & \text{if } x \geq 0 \\
    - ( - x \gg y ) & \text{otherwise}.
  \end{cases}\\
\end{equation*}
where $\gg$ stands for a logical right shift. %Once this equivalence about the shifts and division operator is proven correct, it can be used to implement the \texttt{Oshrximm} using the efficient shift version instead of how the \compcert{} semantics described it.
When proving this equivalence, we actually found a bug in our original implementation that was due to the fact that a na\"{i}ve shift rounds towards $-\infty$.

%The \compcert{} semantics for the \texttt{Oshrximm} instruction expresses its operation exactly as shown in the equation above, even though in hardware the computation that would be performed would be different.  In \vericert{}, if the same operation would be implemented using Verilog operators, it is not guaranteed to be optimised correctly by the synthesis tools that convert the Verilog into a circuit.  To guarantee an output that does not include divides, we therefore have to express it in Verilog using shifts, and then prove that this representation is equivalent to the divide representation used in the \compcert{} semantics.  While conducting the proof, we discovered quite a few bugs in our initial implementation of optimisations, which rounded to $-\infty$ instead of 0.



%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% TeX-command-extra-options: "-shell-escape"
%%% End: