evaluation.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114

\section{Evaluation}

Our evaluation is designed to answer the following three research questions.
\begin{description}
\item[RQ1] How fast is the hardware generated by CoqUp, and how does this compare to existing HLS tools?
\item[RQ2] How area-efficient is the hardware generated by CoqUp, and how does this compare to existing HLS tools?
\item[RQ3] How long does CoqUp take to produce hardware?
\end{description}

\subsection{Experimental Setup}

\paragraph{Choice of HLS tool for comparison.} We compare CoqUp against LegUp 4.0 because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis+11}.

\paragraph{Choice of benchmarks.} We evaluate CoqUp using the PolyBench/C benchmark suite\footnote{\url{http://web.cs.ucla.edu/~pouchet/software/polybench/}}. PolyBench/C is a modern benchmark suite that has been previously used to evaluate HLS tools~\cite{choi+18}. For completeness, we use the full set of 24 benchmarks. We set the benchmark parameters so that all datatypes are integers (since CoqUp only supports integers) and all datasets are `small' (to fit into the small on-chip memories). A current limitation of CoqUp, as discussed in Section~\ref{?}, is that it does not support addition and subtraction operations involving integer literals not divisible by 4. To work around this, we lightly modified each benchmark program so that literals other than multiples of 4 are stored into variables before being added or subtracted. \JW{Any other notable changes to the benchmarks?}

\paragraph{Experimental setup.} In order to generate a hardware implementation, the Verilog produced by the HLS tool-under-test must be synthesised to a netlist; the resultant netlist can then be placed-and-routed for a particular FPGA device. We use Intel Quartus~\cite{quartus} for both of these tasks, and we target an Arria 10 \ref{?} FPGA.

\subsection{RQ1: How fast is CoqUp-generated hardware?}

\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xmode=log,
    ymode=log,
    height=80mm,
    width=80mm,
    xlabel={LegUp (ms)},
    ylabel={CoqUp (ms)},
    xmin=0.01,
    xmax=10,
    ymax=500,
    ymin=0.5,
    log ticks with fixed point,
  ]
  
\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3]
  table [x=legupwallclockMS, y=coqupwallclockMS, col sep=comma]
  {results/comparison.csv};
  
\addplot[dashed, domain=0.01:10]{50*x};
  
\end{axis}
\end{tikzpicture}
\caption{A comparison of the execution time of hardware designs generated by CoqUp and by LegUp. The diagonal is at $y=50x$.}
\label{fig:comparison_time}
\end{figure}

Figure~\ref{fig:comparison_time} compares the execution time of hardware designs generated by CoqUp and by LegUp, with each data point corresponding to one of the PolyBench/C benchmarks. It uses logarithmic scales so that the data points are spread out more evenly. The execution times are quite well correlated between the two tools, but LegUp generates hardware designs that execute roughly fifty times faster, as illustrated by the diagonal line at $y=50x$. There are two reasons for this. First, because CoqUp-generated hardware has all of its operations serialised, it requires more clock cycles to perform its computation -- about 4$\times$ as many as LegUp-generated hardware. Second, CoqUp generates hardware with a much lower clock frequency than LegUp (roughly 22 MHz compared to roughly 250 MHz) so each of those clock cycles takes about 11$\times$ longer. The main reason for the low clock frequency is \JW{could somebody else finish this bit off?}.

\subsection{RQ2: How area-efficient is CoqUp-generated hardware?}

Figure~\ref{fig:comparison_area} compares the size of the hardware designs generated by CoqUp and by LegUp, with each data point corresponding to one of the PolyBench/C benchmarks. We see that CoqUp designs use between 1\% and 30\% of the available logic on the FPGA, averaging at around 10\%, whereas LegUp designs all use less than 1\% of the FPGA, averaging at around 0.45\%. The main reason for this is \JW{blah blah}.

\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    height=80mm,
    width=80mm,
    xlabel={LegUp (\%)},
    ylabel={CoqUp (\%)},
    xmin=0, ymin=0,
  ]
  
\addplot[draw=none, mark=*, draw opacity=0, fill opacity=0.3]
  table [x=leguputilisation, y=coquputilisation, col sep=comma]
  {results/comparison.csv};
  
\end{axis}
\end{tikzpicture}
\caption{A comparison of the resource utilisation of designs generated by CoqUp and by LegUp}
\label{fig:comparison_area}
\end{figure}

\subsection{RQ3: How long does CoqUp take to produce hardware?}

\begin{itemize}
    \item Draw a scatter graph and talk about it.
\end{itemize}

\begin{table}
  \begin{tabular}{lcccccc}
    \toprule
    Benchmark & Cycles & Frequency & LUTs & Registers & BRAMs\\
    \midrule
    adpcm & 30241 &90.05 & 7719 & 12034 & 7\\
    aes & 8489 & 87.83 & 24413 & 23796 & 19 \\
    gsm & 7190 & 119.25 & 6638 & 9201 & 3 \\
    mips & 7754 & 98.95 & 5049 & 4185 & 0 \\
    \bottomrule
  \end{tabular}
  \caption{CHstone programs synthesised in LegUp 5.1}
\end{table}

\begin{table}
  \begin{tabular}{lccccccc}
    \toprule
    Benchmark & Cycles & Frequency & LUTs & Registers & BRAMs & DSPs\\
    \midrule
    adpcm & XXX  & 66.3 & 51626 & 42688 & 0 & 48 &\\
    aes & 41958 & 19.6 & 104017 & 94239 & 0 & 6 \\
    gsm & 21994 & 66.1 & 45764 & 33675 & 0 & 8\\
    mips & 18482 & 78.43 & 10617 & 7690 & 0 & 0\\
    \bottomrule
  \end{tabular}
  \caption{CHstone programs synthesised in CoqUp}
\end{table}

The difference in cycle counts shows the degree of  parallelism that LegUp's scheduling and memory system can offer. However, their Verilog generation is not guaranteed to be correct. Although the runtime LegUp outputs are tested to be correct for these programs, this does not provide any confidence on the correctness of Verilog generation of any other programs. Our Coq proof mechanisation guarantees that generated Verilog is correct for any input program that uses the subset of CompCert instructions that we have proven to be correct.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% TeX-command-extra-options: "-shell-escape"
%%% End: