\section{Evaluation}

Our evaluation is designed to answer the following three research questions. \JW{How about adding one more RQ at the start called `How reliable is the hardware generated by \vericert{}, and how does this compare to existing HLS tools?' Then you can answer this by saying that you couldn't break your tool using the Csmith programs generated by Du et al., which did break all the other HLS tools.}
\begin{description}
\item[RQ1] How fast is the hardware generated by \vericert{}, and how does this compare to existing HLS tools?
\item[RQ2] How area-efficient is the hardware generated by \vericert{}, and how does this compare to existing HLS tools?
\item[RQ3] How long does \vericert{} take to produce hardware?
\end{description}

\subsection{Experimental Setup}

\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against LegUp 4.0 because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis+11}.

\paragraph{Choice of benchmarks.} We evaluate \vericert{} using the PolyBench/C benchmark suite\footnote{\url{http://web.cs.ucla.edu/~pouchet/software/polybench/}}. PolyBench/C is a modern benchmark suite that has previously been used to evaluate HLS tools~\cite{choi+18}. For completeness, we use the full set of 24 benchmarks. We set the benchmark parameters so that all datatypes are integers (since \vericert{} only supports integers) and all datasets are `small' (so that they fit into the small on-chip memories). A current limitation of \vericert{}, as discussed in Section~\ref{?}, is that it does not support addition or subtraction of integer literals that are not divisible by 4. To work around this, we lightly modified each benchmark so that any such literal is first stored in a variable before being added or subtracted. \JW{Any other notable changes to the benchmarks?}

\paragraph{Experimental setup.} In order to generate a hardware implementation, the Verilog produced by the HLS tool-under-test must be synthesised to a netlist; the resultant netlist can then be placed-and-routed for a particular FPGA device. We use Intel Quartus~\cite{quartus} for both of these tasks, and we target an Arria 10 \ref{?} FPGA.

\subsection{RQ1: How fast is \vericert{}-generated hardware?}

\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xmode=log,
    ymode=log,
    height=80mm,
    width=80mm,
    xlabel={LegUp (ms)},
    ylabel={\vericert{} (ms)},
    xmin=0.01,
    xmax=10,
    ymax=500,
    ymin=0.5,
    log ticks with fixed point,
  ]
  
\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3]
  table [x=legupwallclockMS, y=vericertwallclockMS, col sep=comma]
  {results/comparison.csv};
  
\addplot[dashed, domain=0.01:10]{50*x};
  
\end{axis}
\end{tikzpicture}
\caption{A comparison of the execution time of hardware designs generated by \vericert{} and by LegUp. The diagonal is at $y=50x$.}
\label{fig:comparison_time}
\end{figure}

Figure~\ref{fig:comparison_time} compares the execution time of hardware designs generated by \vericert{} and by LegUp, with each data point corresponding to one of the PolyBench/C benchmarks. It uses logarithmic scales so that the data points are spread out more evenly. The execution times are quite well correlated between the two tools, but LegUp generates hardware designs that execute roughly fifty times faster, as illustrated by the diagonal line at $y=50x$. There are two reasons for this. First, because \vericert{}-generated hardware has all of its operations serialised, it requires more clock cycles to perform its computation -- about 4$\times$ as many as LegUp-generated hardware. Second, \vericert{} generates hardware with a much lower clock frequency than LegUp (roughly 22 MHz compared to roughly 250 MHz) so each of those clock cycles takes about 11$\times$ longer. The main reason for the low clock frequency is \JW{could somebody else finish this bit off?}.

The difference in cycle counts shows the degree of parallelism that LegUp's scheduling and memory system can exploit. However, LegUp's Verilog generation is not guaranteed to be correct. Although the hardware LegUp generates has been tested to produce correct results for these benchmark programs, this provides no confidence in the correctness of the Verilog it generates for any other program. Our Coq proof mechanisation guarantees that the generated Verilog is correct for any input program that uses the subset of \compcert{} instructions that we have proven to be correct.

\subsection{RQ2: How area-efficient is \vericert{}-generated hardware?}

\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    height=80mm,
    width=80mm,
    xlabel={LegUp (\%)},
    ylabel={\vericert{} (\%)},
    xmin=0, ymin=0,
  ]
  
\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3]
  table [x=leguputilisation, y=vericertutilisation, col sep=comma]
  {results/comparison.csv};
  
\end{axis}
\end{tikzpicture}
\caption{A comparison of the resource utilisation of designs generated by \vericert{} and by LegUp}
\label{fig:comparison_area}
\end{figure}

Figure~\ref{fig:comparison_area} compares the size of the hardware designs generated by \vericert{} and by LegUp, with each data point corresponding to one of the PolyBench/C benchmarks. We see that \vericert{} designs use between 1\% and 30\% of the available logic on the FPGA, with an average of around 10\%, whereas LegUp designs all use less than 1\% of the FPGA, with an average of around 0.45\%. The main reason for this is \JW{blah blah}.

\subsection{RQ3: How long does \vericert{} take to produce hardware?}

\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    height=80mm,
    width=80mm,
    xlabel={LegUp (s)},
    ylabel={\vericert{} (s)},
  ]
  
\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3]
  table [x=legupcomptime, y=vericertcomptime, col sep=comma]
  {results/comparison.csv};
  
\end{axis}
\end{tikzpicture}
\caption{A comparison of compilation time for \vericert{} and for LegUp \JW{Numbers for \vericert{} not ready yet.} \JW{This graph might end up being cut -- we might just summarise the numbers in the paragraph.}}
\label{fig:comparison_comptime}
\end{figure}

Figure~\ref{fig:comparison_comptime} compares the compilation times of \vericert{} and of LegUp, with each data point corresponding to one of the PolyBench/C benchmarks. We see that \vericert{} compilation is about ten times faster than LegUp compilation. The simple reason for this is that \vericert{} omits many of the complicated and time-consuming optimisations that LegUp performs, such as scheduling multiple operations into clock cycles.

\begin{table}
  \begin{tabular}{lccccc}
    \toprule
    Bench & Cycles & MHz & LUTs & Reg & BRAMs\\
    \midrule
    adpcm & 30241 & 90.05 & 7719 & 12034 & 7\\
    aes & 8489 & 87.83 & 24413 & 23796 & 19\\
    gsm & 7190 & 119.25 & 6638 & 9201 & 3\\
    mips & 7754 & 98.95 & 5049 & 4185 & 0\\
    \bottomrule
  \end{tabular}
  \caption{CHStone programs synthesised by LegUp 5.1}
\end{table}

\begin{table}
  \begin{tabular}{lcccccc}
    \toprule
    Bench & Cycles & MHz & LUTs & Reg & BRAMs & DSPs\\
    \midrule
    adpcm & 121386 & 66.3 & 51626 & 42688 & 0 & 48\\
    aes & 41958 & 19.6 & 104017 & 94239 & 0 & 6\\
    gsm & 21994 & 66.1 & 45764 & 33675 & 0 & 8\\
    mips & 18482 & 78.43 & 10617 & 7690 & 0 & 0\\
    \bottomrule
  \end{tabular}
  \caption{CHStone programs synthesised by \vericert{}}
\end{table}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% TeX-command-extra-options: "-shell-escape"
%%% End: