\section{Evaluation}
\label{sec:evaluation}

Our evaluation is designed to answer the following three research questions.
\begin{description}
\item[RQ1] How fast is the hardware generated by \vericert{}, and how does this compare to existing HLS tools?
\item[RQ2] How area-efficient is the hardware generated by \vericert{}, and how does this compare to existing HLS tools?
\item[RQ3] How long does \vericert{} take to produce hardware?
\end{description}

\subsection{Experimental Setup}

\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against LegUp 4.0 because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}.

\paragraph{Choice of benchmarks.} We evaluate \vericert{} using the PolyBench/C benchmark suite~\cite{polybench}, a modern benchmark suite that has previously been used to evaluate HLS tools~\cite{choi+18}. For completeness, we use the full set of 24 benchmarks. We set the benchmark parameters so that all datatypes are integers (since \vericert{} only supports integers) and all datasets are `small' (so that they fit into the small on-chip memories). A current limitation of \vericert{}, as discussed in Section~\ref{?}, is that it does not support addition or subtraction of integer literals that are not divisible by 4. To work around this, we lightly modified each benchmark so that any literal that is not a multiple of 4 is first stored in a variable before being added or subtracted. \JW{Any other notable changes to the benchmarks?}
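For example, an addition involving the literal \texttt{3} is rewritten along the following lines (this is an illustrative sketch; the exact variable names used in our modified benchmarks differ):
\begin{verbatim}
  /* original: 3 is not divisible by 4 */
  x = y + 3;

  /* rewritten: the literal is first
     stored in a variable */
  int lit = 3;
  x = y + lit;
\end{verbatim}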

\paragraph{Experimental setup.} In order to generate a hardware implementation, the Verilog produced by the HLS tool under test must be synthesised to a netlist; the resulting netlist can then be placed and routed for a particular FPGA device. We use Intel Quartus~\cite{quartus} for both of these tasks, and we target an Arria 10 FPGA.

\subsection{RQ1: How fast is \vericert{}-generated hardware?}

\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xmode=log,
    ymode=log,
    height=80mm,
    width=80mm,
    xlabel={\legup{} cycle count},
    ylabel={\vericert{} cycle count},
    xmin=1000,
    xmax=10000000,
    ymax=10000000,
    ymin=1000,
    %log ticks with fixed point,
  ]
  
\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3]
  table [x=legupcycles, y=vericertcycles, col sep=comma]
  {results/poly.csv};

\addplot[dotted, domain=1000:10000000]{x};
%\addplot[dashed, domain=10:10000]{9.02*x};
  
\end{axis}
\end{tikzpicture}
\caption{A comparison of the cycle count of hardware designs generated by \vericert{} and by \legup{}.}
\label{fig:comparison_cycles}
\end{figure}

\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xmode=log,
    ymode=log,
    height=80mm,
    width=80mm,
    xlabel={\legup{} execution time (ms)},
    ylabel={\vericert{} execution time (ms)},
    xmin=10,
    xmax=100000,
    ymax=100000,
    ymin=10,
    %log ticks with fixed point,
  ]
  
\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3]
  table [x expr={\thisrow{legupcycles}/\thisrow{legupfreqMHz}}, y expr={\thisrow{vericertcycles}/\thisrow{vericertfreqMHz}}, col sep=comma]
  {results/poly.csv};

\addplot[dotted, domain=10:100000]{x};
%\addplot[dashed, domain=10:10000]{9.02*x};
  
\end{axis}
\end{tikzpicture}
\caption{A comparison of the execution time of hardware designs generated by \vericert{} and by \legup{}.}
\label{fig:comparison_time}
\end{figure}

Figure~\ref{fig:comparison_time} compares the execution time of hardware designs generated by \vericert{} and by LegUp, with each data point corresponding to one of the PolyBench/C benchmarks. It uses logarithmic scales so that the data points are spread out more evenly. The execution times are quite well correlated between the two tools, but LegUp generates hardware designs that execute roughly fifty times faster, as illustrated by the diagonal line at $y=50x$. There are two reasons for this. First, because \vericert{}-generated hardware has all of its operations serialised, it requires more clock cycles to perform its computation -- about 4$\times$ as many as LegUp-generated hardware. Second, \vericert{} generates hardware with a much lower clock frequency than LegUp (roughly 22~MHz compared to roughly 250~MHz), so each of those clock cycles takes about 11$\times$ longer. The main reason for the low clock frequency is the divider circuitry that \vericert{} introduces whenever a division is performed in the C source. At present, each division must complete within a single clock cycle, and its long combinational delay dominates the critical path, severely limiting the maximum clock frequency that can be reached. One possible solution would be to spread each division over multiple clock cycles.
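As a rough consistency check, combining the two factors above (writing $T$ for execution time and $f$ for clock frequency) reproduces the overall slowdown observed in Figure~\ref{fig:comparison_time}:
\[
\frac{T_{\mathrm{Vericert}}}{T_{\mathrm{LegUp}}}
  = \frac{\mathit{cycles}_{\mathrm{Vericert}}}{\mathit{cycles}_{\mathrm{LegUp}}}
    \times \frac{f_{\mathrm{LegUp}}}{f_{\mathrm{Vericert}}}
  \approx 4 \times \frac{250\,\mathrm{MHz}}{22\,\mathrm{MHz}}
  \approx 45,
\]
which is in line with the roughly fifty-fold difference in execution time.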

The difference in cycle counts shows the degree of parallelism that \legup{}'s scheduling and memory system can offer. However, LegUp's Verilog generation is not guaranteed to be correct. Although the hardware LegUp generates for these programs was tested and produced the correct results, such testing provides no confidence about the correctness of the Verilog it generates for other programs. In contrast, our Coq mechanisation guarantees that the Verilog generated by \vericert{} is correct for any input program, provided it uses only the subset of \compcert{} instructions that we have proven correct.

\subsection{RQ2: How area-efficient is \vericert{}-generated hardware?}

\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    height=80mm,
    width=80mm,
    xlabel={\legup{} resource utilisation (\%)},
    ylabel={\vericert{} resource utilisation (\%)},
    xmin=0, ymin=0,
    xmax=1, ymax=30,
  ]
  
\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3]
  table [x expr=(\thisrow{legupluts}/427200*100), y expr=(\thisrow{vericertluts}/427200*100), col sep=comma]
  {results/poly.csv};
  
  %\addplot[dashed, domain=0:1]{-1.415 *x+10.6};
  
\end{axis}
\end{tikzpicture}
\caption{A comparison of the resource utilisation of designs generated by \vericert{} and by \legup{}.}
\label{fig:comparison_area}
\end{figure}

Figure~\ref{fig:comparison_area} compares the size of the hardware designs generated by \vericert{} and by LegUp, with each data point corresponding to one of the PolyBench/C benchmarks. We see that \vericert{} designs use between 1\% and 30\% of the available logic on the FPGA, averaging around 10\%, whereas LegUp designs all use less than 1\% of the FPGA, averaging around 0.45\%. The main reason for this gap is that Quartus does not automatically infer RAM blocks from the Verilog generated by \vericert{}, so array storage is instead implemented using registers and logic, which consumes far more area. Other synthesis tools can infer the RAM correctly from \vericert{}'s output, so this issue could be addressed either by using a different synthesis tool and targeting a different FPGA, or by generating the Verilog template that allows Quartus to identify the RAM automatically.

\subsection{RQ3: How long does \vericert{} take to produce hardware?}

\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    height=80mm,
    width=80mm,
    xlabel={\legup{} compilation time (s)},
    ylabel={\vericert{} compilation time (s)},
    yticklabel style={
        /pgf/number format/fixed,
        /pgf/number format/precision=2},
    xmin=4.6,
    xmax=5.1,
    ymin=0.06,
    ymax=0.20,
  ]
  
\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3] 
  table [x=legupcomptime, y=vericertcomptime, col sep=comma]
  {results/poly.csv};

  %\addplot[dashed, domain=4.5:5.1]{0.1273*x-0.5048};

\end{axis}
\end{tikzpicture}
\caption{A comparison of compilation time for \vericert{} and for \legup{}.}
\label{fig:comparison_comptime}
\end{figure}

Figure~\ref{fig:comparison_comptime} compares the compilation times of \vericert{} and of LegUp, with each data point corresponding to one of the PolyBench/C benchmarks. We see that \vericert{} compilation is about ten times faster than LegUp compilation. The main reason for this is that \vericert{} omits many of the complicated and time-consuming optimisations that LegUp performs, such as scheduling multiple operations into each clock cycle.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% TeX-command-extra-options: "-shell-escape"
%%% End: