summaryrefslogtreecommitdiffstats
path: root/evaluation.tex
blob: 96812104b393bccdb5bb5354435188274a346dd8 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
\section{Evaluation}
\label{sec:evaluation}

Our evaluation is designed to answer the following three research questions.
\begin{description}
\item[RQ1] How fast is the hardware generated by \vericert{}?
\item[RQ2] How area-efficient is the hardware generated by \vericert{}?
\item[RQ3] How long does \vericert{} take to produce hardware?
\end{description}

\subsection{Experimental Setup}
\label{sec:evaluation:setup}

\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 5.1 because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}.

\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the PolyBench/C benchmark suite (version 4.2.1)~\cite{polybench}, which consists of a collection of 30 numerical kernels. PolyBench/C is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, requiring floats, which we do not support. 
% Interestingly, we were also unable to evaluate \texttt{cholesky} on \legup{}, since it produce an error during its HLS compilation. 
%In summary, we evaluate 27 programs from the latest Polybench suite. 
We configured Polybench's parameters so that only integer types are used, since we do not support floats. We use Polybench's smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.

\vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including LegUp, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is overridden with our own implementation that uses repeated division and multiplications by 2. Where this change makes an appreciable difference to the performance results, we give the results for both benchmark sets.

\paragraph{Synthesis setup} The Verilog that is generated by \vericert{} or \legup{} is provided to Intel Quartus v16.0~\cite{quartus}, which synthesises it to a netlist, before placing-and-routing this netlist onto an Intel Arria 10 FPGA device that contains approximately 430000 LUTs.

\subsection{RQ1: How fast is \vericert{}-generated hardware?}

\pgfplotstableread[col sep=comma]{results/rel-time-div.csv}{\divtimingtable}
\pgfplotstableread[col sep=comma]{results/rel-size-div.csv}{\divslicetable}
\definecolor{vericertcol}{HTML}{66C2A5}
\definecolor{legupnooptcol}{HTML}{FC8D62}
\definecolor{legupnooptnochaincol}{HTML}{8DA0CB}
\begin{figure}\centering
  \begin{tikzpicture}
    \begin{groupplot}[
      group style={
        group name=my plots,
        group size=1 by 2,
        xlabels at=edge bottom,
        xticklabels at=edge bottom,
        vertical sep=0pt,
      },
      ymode=log,
      ybar=0pt,
      %ymode=log, %JW: normalised data ==> log scale
      %log ticks with fixed point,
      width=1\textwidth,
      height=0.4\textwidth,
      /pgf/bar width=3pt,
      legend pos=south east,
      log ticks with fixed point,
      xticklabels from table={\divtimingtable}{benchmark},
      legend style={nodes={scale=0.7, transform shape}},
      x tick label style={rotate=60,anchor=east,font=\footnotesize},
      legend columns=-1,
      xtick=data,
      enlarge x limits={abs=0.5},
      ylabel style={font=\footnotesize},
      ]

      \nextgroupplot[ymin=0.8,ylabel={Relative execution time to \legup{}}]
      \addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col sep=comma] from \divtimingtable;
      \addplot+[legupnooptcol] table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \divtimingtable;
      \addplot+[legupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \divtimingtable;
      \draw (axis cs:-1,1) -- (axis cs:26,1);

      \nextgroupplot[ymin=0.1,ylabel={Relative area to \legup{}}]
      \addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col sep=comma] from \divslicetable;
      \addplot+[legupnooptcol] table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \divslicetable;
      \addplot+[le on gupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \divslicetable;
      \draw (axis cs:-1,1) -- (axis cs:26,1);

      \legend{\vericert{},\legup{} w/o opt/chain,\legup{} w/o opt};
    \end{groupplot}
  \end{tikzpicture}
  \caption{Polybench with division enabled.}
\end{figure}

\pgfplotstableread[col sep=comma]{results/rel-time-nodiv.csv}{\nodivtimingtable}
\pgfplotstableread[col sep=comma]{results/rel-size-nodiv.csv}{\nodivslicetable}
\begin{figure}\centering
  \begin{tikzpicture}
    \begin{groupplot}[
      group style={
        group name=my plots,
        group size=1 by 2,
        xlabels at=edge bottom,
        xticklabels at=edge bottom,
        vertical sep=0pt,
      },
      ymode=log,
      ybar=0pt,
      ytick={0.5,1,2,4,8},
      width=1\textwidth,
      height=0.4\textwidth,
      /pgf/bar width=3pt,
      legend pos=south east,
      log ticks with fixed point,
      xticklabels from table={\nodivtimingtable}{benchmark},
      legend style={nodes={scale=0.7, transform shape}},
      x tick label style={rotate=60,anchor=east,font=\footnotesize},
      legend columns=-1,
      xtick=data,
      enlarge x limits={abs=0.5},
      ylabel style={font=\footnotesize},
      ymin=0.3,
      ]

      \nextgroupplot[ylabel={Relative execution time to \legup{}}]
      \addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col sep=comma] from \nodivtimingtable;
      \addplot+[legupnooptcol] table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \nodivtimingtable;
      \addplot+[legupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \nodivtimingtable;
      \draw (axis cs:-1,1) -- (axis cs:26,1);

      \nextgroupplot[ylabel={Relative area to \legup{}}]
      \addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col sep=comma] from \nodivslicetable;
      \addplot+[legupnooptcol] table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \nodivslicetable;
      \addplot+[legupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \nodivslicetable;
      \draw (axis cs:-1,1) -- (axis cs:26,1);

      \legend{\vericert{},\legup{} w/o opt/chain,\legup{} w/o opt};
    \end{groupplot}
  \end{tikzpicture}
  \caption{Polybench with division replaced by iterative division algorithm.}
\end{figure}

%
%\pgfplotstableread[col sep=comma]{results/exec-time.csv}{\nodivtimingtable}
%\begin{figure}\centering
%  \begin{tikzpicture}
%    \begin{semilogyaxis}[
%      ybar=0pt,
%      width=1\textwidth,
%      height=0.5\textwidth,
%      bar width=3pt,
%      ymin=0.1,
%      ymax=3,
%      log ticks with fixed point,
%      legend pos=south east,
%      xlabel={Polybench Benchmarks},
%      xticklabels from table={\nodivtimingtable}{benchmark},
%      ylabel={\vericert{} / \legup{} execution time ratio},
%      legend style={nodes={scale=0.7, transform shape}},
%      x tick label style={rotate=60,anchor=east,font=\footnotesize},
%      xtick=data,
%      enlarge x limits={abs=0.5},
%      ]
%
%      \addplot+ table [x expr=\coordindex,y=v no nc,col sep=comma] from \nodivtimingtable;
%      \addlegendentry{LegUp w/o opt w/o chain};
%      \addplot+ table [x expr=\coordindex,y=v no,col sep=comma] from \nodivtimingtable;
%      \addlegendentry{LegUp w/o opt};
%      \addplot+ table [x expr=\coordindex,y=v op,col sep=comma] from \nodivtimingtable;
%      \addlegendentry{LegUp};
%
%    \end{semilogyaxis}
%  \end{tikzpicture}
%\end{figure}
%
%\pgfplotstableread[col sep=comma]{results/slice-nodiv.csv}{\nodivslicetable}
%\begin{figure}\centering
%  \begin{tikzpicture}
%    \begin{semilogyaxis}[
%      ybar=0pt,
%      width=1\textwidth,
%      height=0.5\textwidth,
%      bar width=3pt,
%      ymin=0.1,
%      ymax=3,
%      log ticks with fixed point,
%      legend pos=south east,
%      xlabel={Polybench Benchmarks},
%      xticklabels from table={\nodivslicetable}{benchmark},
%      ylabel={\vericert{} / \legup{} execution time ratio},
%      legend style={nodes={scale=0.7, transform shape}},
%      x tick label style={rotate=60,anchor=east,font=\footnotesize},
%      xtick=data,
%      enlarge x limits={abs=0.5},
%      legend columns=-1,
%      ]
%
%      \addplot+ table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \nodivslicetable;
%      \addlegendentry{LegUp w/o opt w/o chain};
%      \addplot+ table [x expr=\coordindex,y=legup noopt,col sep=comma] from \nodivslicetable;
%      \addlegendentry{LegUp w/o opt};
%      \addplot+ table [x expr=\coordindex,y=legup,col sep=comma] from \nodivslicetable;
%      \addlegendentry{LegUp};
%
%    \end{semilogyaxis}
%  \end{tikzpicture}
%\end{figure}

%\begin{figure}\centering
%\begin{subfigure}[t]{0.48\textwidth}
%\definecolor{cyclecountcol}{HTML}{1b9e77}
%\begin{tikzpicture}
%\begin{axis}[
%    xmode=log,
%    ymode=log,
%    height=1\textwidth,
%    width=1\textwidth,
%    xlabel={\legup{} cycle count},
%    ylabel={\vericert{} cycle count},
%    xmin=1000,
%    xmax=10000000,
%    ymax=10000000,
%    ymin=1000,
%    %log ticks with fixed point,
%  ]
%
%\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.6,cyclecountcol]
%  table [x=legupcycles, y=vericertcycles, col sep=comma]
%  {results/poly.csv};
%
%\addplot[dotted, domain=1000:10000000]{x};
%%\addplot[dashed, domain=10:10000]{9.02*x};
%
%\end{axis}
%\end{tikzpicture}
%\caption{A comparison of the cycle count of hardware designs generated by \vericert{} and by \legup{}.}
%\label{fig:comparison_cycles}
%\end{subfigure}\hfill%
%\begin{subfigure}[t]{0.48\textwidth}
%\definecolor{polycol}{HTML}{e6ab02}
%\definecolor{polywocol}{HTML}{7570b3}
%\begin{tikzpicture}
%\begin{axis}[
%    xmode=log,
%    ymode=log,
%    height=1\textwidth,
%    width=1\textwidth,
%    xlabel={\legup{} execution time (ms)},
%    ylabel={\vericert{} execution time (ms)},
%    xmin=10,
%    xmax=1000000,
%    ymax=1000000,
%    ymin=10,
%    legend pos=south east,
%    %log ticks with fixed point,
%  ]
%
%\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.8, polycol]
%  table [x expr={\thisrow{legupcycles}/\thisrow{legupfreqMHz}}, y expr={\thisrow{vericertcycles}/\thisrow{vericertoldfreqMHz}}, col sep=comma]
%  {results/poly.csv};
%
%\addlegendentry{PolyBench}
%
%\addplot[draw=none, mark=o, fill opacity=0, polywocol]
%  table [x expr={\thisrow{legupcycles}/\thisrow{legupfreqMHz}}, y expr={\thisrow{vericertcycles}/\thisrow{vericertfreqMHz}}, col sep=comma]
%  {results/poly.csv};
%
%\addlegendentry{PolyBench w/o division}
%
%\addplot[dotted, domain=10:1000000]{x};
%%\addplot[dashed, domain=10:10000]{9.02*x + 442};
%
%\end{axis}
%\end{tikzpicture}
%\caption{A comparison of the execution time of hardware designs generated by \vericert{} and by \legup{}.}
%\label{fig:comparison_time}
%\end{subfigure}
%\end{figure}

%Firstly, before comparing any performance metrics, it is worth highlighting that any Verilog produced by \vericert{} is guaranteed to be \emph{correct}, whilst no such guarantee can be provided by \legup{}.
%This guarantee in itself provides a significant leap in terms of HLS reliability, compared to any other HLS tools available. 

Figure~\ref{fig:comparison_cycles} compares the cycle counts of our 27 programs executed by \vericert{} and \legup{} respectively. 
In most cases, we see that the data points are above the diagonal, which demonstrates that the \legup{}-generated hardware is faster than \vericert{}-generated Verilog.

On average, \legup{} designs are $4.5\times$ faster than \vericert{} designs.
This performance gap is mostly due to \legup{} optimisations such as scheduling and memory analysis, which are designed to extract parallelism from input programs.
%This gap does not represent the performance cost that comes with formally proving a HLS tool.  
%Instead, it is simply a gap between an unoptimised \vericert{} versus an optimised \legup{}.
It is notable that even without \vericert{} performing many optimisations, a few data points are close to the diagonal and even below it.
%We are very encouraged by these data points. 
As we improve \vericert{} by incorporating further  optimisations, this gap should reduce whilst preserving our correctness guarantees.

Cycle count is one factor in calculating execution times; the other is the clock frequency, which determines the duration of each of these cycles. Figure~\ref{fig:comparison_time} compares the execution times of \vericert{} and \legup{}. Across the original Polybench/C benchmarks, we see that \vericert{} designs are about \slowdownOrig$\times$ slower than \legup{} designs. This dramatic discrepancy in performance can be largely attributed to \vericert's na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 21MHz, while \legup{} managed about 247MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency becomes about 112MHz. This is better, but still substantially below \legup{}, which uses various additional optimisations and Intel-specific IP blocks. Across the modified Polybench/C benchmarks, we see that \vericert{} designs are about \slowdownDiv$\times$ slower than \legup{} designs.

\subsection{RQ2: How area-efficient is \vericert{}-generated hardware?}

%\begin{figure}
%\begin{subfigure}[t]{0.48\textwidth}
%\definecolor{resourceutilcol}{HTML}{e7298a}
%\begin{tikzpicture}
%\begin{axis}[
%    height=1\textwidth,
%    width=1\textwidth,
%    xlabel={\legup{} resource utilisation (\%)},
%    ylabel={\vericert{} resource utilisation (\%)},
%    xmin=0, ymin=0,
%    xmax=1, ymax=30,
%  ]
%
%\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.6,resourceutilcol]
%  table [x expr=(\thisrow{legupluts}/427200*100), y expr=(\thisrow{vericertluts}/427200*100), col sep=comma]
%  {results/poly.csv};
%
%%   \addplot[dashed, domain=0:1]{x};
%
%\end{axis}
%\end{tikzpicture}
%\caption{A comparison of the resource utilisation of designs generated by \vericert{} and by \legup{}.}
%\label{fig:comparison_area}
%\end{subfigure}\hfill%
%\begin{subfigure}[t]{0.48\textwidth}
%\definecolor{compiltimecol}{HTML}{66a61e}
%\begin{tikzpicture}
%\begin{axis}[
%    height=1\textwidth,
%    width=1\textwidth,
%    xlabel={\legup{} compilation time (s)},
%    ylabel={\vericert{} compilation time (s)},
%    yticklabel style={
%        /pgf/number format/fixed,
%        /pgf/number format/precision=2},
%    xmin=4.6,
%    xmax=5.1,
%    ymin=0.06,
%    ymax=0.20,
%  ]
%
%\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.6,compiltimecol]
%  table [x=legupcomptime, y=vericertcomptime, col sep=comma]
%  {results/poly.csv};
%
%  %\addplot[dashed, domain=4.5:5.1]{0.1273*x-0.5048};
%
%\end{axis}
%\end{tikzpicture}
%\caption{A comparison of compilation time for \vericert{} and for \legup{}}
%\label{fig:comparison_comptime}
%\end{subfigure}
%\end{figure}

Figure~\ref{fig:comparison_area} compares the resource utilisation of the Polybench programs generated by \vericert{} and \legup{}.
On average, we see that \vericert{} produces hardware that is about $21\times$ larger than \legup{}. \vericert{} designs are filling up to 30\% of a (large) FPGA chip, while \legup{} uses no more than 1\% of the chip.
The main reason for this is that RAM is not inferred automatically for the Verilog that is generated by \vericert{}; instead, large arrays of registers are synthesised.
Synthesis tools such as Quartus generally require array accesses to be in a specific form in order for RAM inference to activate.
\legup{}'s Verilog generation is tailored to enable RAM inference by Quartus, while \vericert{} generates more generic array accesses. This may make \vericert{} more portable across different FPGA synthesis tools and vendors.
%For a fair comparison, we chose Quartus for these experiments because LegUp supports Quartus efficiently. 
% Consequently, on average, \legup{} designs use $XX$ RAMs whereas \vericert{} use none. 
Enabling RAM inference is part of our future plans. 

% We see that \vericert{} designs use between 1\% and 30\% of the available logic on the FPGA, averaging at around 10\%, whereas LegUp designs all use less than 1\% of the FPGA, averaging at around 0.45\%. The main reason for this is mainly because RAM is not inferred automatically for the Verilog that is generated by \vericert{}.  Other synthesis tools can infer the RAM correctly for \vericert{} output, so this issue could be solved by either using a different synthesis tool and targeting a different FPGA, or by generating the correct template which allows Quartus to identify the RAM automatically.

\subsection{RQ3: How long does \vericert{} take to produce hardware?}

Figure~\ref{fig:comparison_comptime} compares the compilation times of \vericert{} and of \legup{}, with each data point corresponding to one of the PolyBench/C benchmarks. On average, \vericert{} compilation is about $47\times$ faster than \legup{} compilation. \vericert{} is much faster because it omits many of the time-consuming HLS optimisations performed by \legup{}, such as scheduling and memory analysis. This comparison also demonstrates that our fully verified approach does not add substantial overheads in compilation time, since we do not invoke verification for every compilation instance, unlike the approaches based on translation validation that we mentioned in Section~\ref{sec:intro}.

%\NR{Do we want to finish the section off with some highlights or a summary?}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% TeX-command-extra-options: "-shell-escape"
%%% End: