evaluation.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174

\section{Evaluation}\label{sec:evaluation}

Our evaluation is designed to answer the following three research questions.
\begin{description}
\item[RQ1] How fast is the hardware generated by \vericert{}?
\item[RQ2] How area-efficient is the hardware generated by \vericert{}?
\item[RQ3] How quickly does \vericert{} translate the C into Verilog?
\end{description}

\subsection{Experimental Setup}
\label{sec:evaluation:setup}

\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 4.0, because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}.  We also compare against \legup{} with different optimisation levels in an effort to understand which optimisations have the biggest impact on the performance discrepancies between \legup{} and \vericert{}.  The baseline \legup{} version has all the default automatic optimisations turned on.  The benchmarks are also not manually optimised to run through \legup{} optimally, such as adding pragmas and other manual indications to add further more advanced optimisations.  \vericert{} is also compared with other optimisation levels of \legup{}.  First, we only turn off the LLVM optimisations in \legup{}, to eliminate all the optimisations that are common to standard software compilers, referred to as \legup{} w/o opt.  Secondly, we also compare against \legup{} with LLVM optimisations and operation chaining turned off, referred to as \legup{} w/o opt+chain. Operation chaining is an HLS-specific optimisation that combines data-dependent operations into one clock cycle, and therefore dramatically reduces the number of cycles, without necessarily decreasing the clock speed.

\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the \polybench{} benchmark suite (version 4.2.1)~\cite{polybench}, which is a collection of 30 numerical kernels. \polybench{} is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since it has affine loop bounds, making it attractive for streaming computation on FPGA architectures.
We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, requiring floats, which we do not support. 
% Interestingly, we were also unable to evaluate \texttt{cholesky} on \legup{}, since it produce an error during its HLS compilation. 
%In summary, we evaluate 27 programs from the latest Polybench suite. 
We configured \polybench{}'s parameters so that only integer types are used.  We use \polybench{}'s smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.

\vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including LegUp, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. Implementing pipelined operators such as the divide and modulus operator can be solved by scheduling the instructions so that these can execute in parallel, which is the main optimisation that needs to be added to \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is replaced with our own implementation that uses repeated division and multiplications by 2.  Figure~\ref{fig:polybench-div} shows the results of comparing \vericert{} with optimised LegUp 4.0 on the \polybench{} benchmarks, where divisions have been left intact.  Figure~\ref{fig:polybench-nodiv} performs the comparison where the division/modulo operations have been replaced by the iterative algorithm.

\paragraph{Synthesis setup} The Verilog that is generated by \vericert{} or \legup{} is provided to Xilinx Vivado v2017.1~\cite{xilinx_vivad_desig_suite}, which synthesises it to a netlist, before placing-and-routing this netlist onto a Xilinx XC7Z020 FPGA device that contains approximately 85000 LUTs.

\subsection{RQ1: How fast is \vericert{}-generated hardware?}

\pgfplotstableread[col sep=comma]{results/rel-time-div.csv}{\divtimingtable}
\pgfplotstableread[col sep=comma]{results/rel-size-div.csv}{\divslicetable}
\definecolor{vericertcol}{HTML}{66C2A5}
\definecolor{legupnooptcol}{HTML}{FC8D62}
\definecolor{legupnooptnochaincol}{HTML}{8DA0CB}
\newcommand\backgroundbar[2][5]{\draw[draw=none, fill=black!#1] (axis cs:#2*2+0.5,0.1) rectangle (axis cs:1+#2*2+0.5,300);}

\begin{figure}\centering
  \begin{tikzpicture}
  
    \begin{groupplot}[
      group style={
        group name=my plots,
        group size=1 by 2,
        xlabels at=edge bottom,
        xticklabels at=edge bottom,
        vertical sep=5pt,
      },
      ymode=log,
      ybar=0pt,
      width=1\textwidth,
      height=0.4\textwidth,
      /pgf/bar width=3pt,
      legend pos=south east,
      log ticks with fixed point,
      xticklabels from table={\divtimingtable}{benchmark},
      legend style={nodes={scale=0.7, transform shape}},
      x tick label style={rotate=90,anchor=east,font=\footnotesize},
      legend columns=-1,
      xtick=data,
      enlarge x limits={abs=0.5},
      ylabel style={font=\footnotesize},
      xtick style={draw=none},
      ]

      \nextgroupplot[ymin=0.8,ymax=300,ylabel={Execution time relative to \legup{}}]
      \pgfplotsinvokeforeach{0,...,12}{%
        \backgroundbar{#1}}
      \backgroundbar[10]{13}
      \addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col sep=comma] from \divtimingtable;
      \addplot+[legupnooptcol] table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \divtimingtable;
      \addplot+[legupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \divtimingtable;
      \draw (axis cs:-1,1) -- (axis cs:28,1);
      % JW: redraw axis border which has been partially covered by the grey bars
      \draw (axis cs:-0.5,0.8) rectangle (axis cs:27.5,300);
     
      \nextgroupplot[ymin=0.3,ymax=10,ylabel={Area relative to \legup{}}]
      \pgfplotsinvokeforeach{0,...,12}{%
        \backgroundbar{#1}}
      \backgroundbar[10]{13}
      \addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col sep=comma] from \divslicetable;
      \addplot+[legupnooptcol] table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \divslicetable;
      \addplot+[legupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \divslicetable;
      \draw (axis cs:-1,1) -- (axis cs:28,1);
      % JW: redraw axis border which has been partially covered by the grey bars
      \draw (axis cs:-0.5,0.3) rectangle (axis cs:27.5,10);

      \legend{\vericert{},\legup{} w/o opt+chain,\legup{} w/o opt};
    \end{groupplot}
  \end{tikzpicture}
  \caption{\polybench{} with division/modulo operations enabled.  The top graph shows the execution time of \vericert{}, \legup{} without LLVM optimisations and without operation chaining and \legup{} without front end LLVM optimisations relative to optimised \legup{}. The bottom graph shows the area relative to \legup{}.}\label{fig:polybench-div}
\end{figure}

\pgfplotstableread[col sep=comma]{results/rel-time-nodiv.csv}{\nodivtimingtable}
\pgfplotstableread[col sep=comma]{results/rel-size-nodiv.csv}{\nodivslicetable}
\begin{figure}\centering
  \begin{tikzpicture}
    \begin{groupplot}[
      group style={
        group name=my plots,
        group size=1 by 2,
        xlabels at=edge bottom,
        xticklabels at=edge bottom,
        vertical sep=5pt,
      },
      ymode=log,
      ybar=0pt,
      ytick={0.5,1,2,4,8},
      width=1\textwidth,
      height=0.4\textwidth,
      /pgf/bar width=3pt,
      legend pos=south east,
      log ticks with fixed point,
      xticklabels from table={\nodivtimingtable}{benchmark},
      legend style={nodes={scale=0.7, transform shape}},
      x tick label style={rotate=90,anchor=east,font=\footnotesize},
      legend columns=-1,
      xtick=data,
      enlarge x limits={abs=0.5},
      ylabel style={font=\footnotesize},
      ymin=0.3,
      xtick style={draw=none},
      ]

      \nextgroupplot[ymin=0.3,ymax=10,ylabel={Execution time relative to \legup{}}]
      \pgfplotsinvokeforeach{0,...,12}{%
        \backgroundbar{#1}}
      \backgroundbar[10]{13}
      \addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col sep=comma] from \nodivtimingtable;
      \addplot+[legupnooptcol] table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \nodivtimingtable;
      \addplot+[legupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \nodivtimingtable;
      \draw (axis cs:-1,1) -- (axis cs:28,1);
      \draw (axis cs:-0.5,0.3) rectangle (axis cs:27.5,10);

      \nextgroupplot[ymin=0.3,ymax=4,ylabel={Area relative to \legup{}}]
      \pgfplotsinvokeforeach{0,...,12}{%
        \backgroundbar{#1}}
      \backgroundbar[10]{13}
      \addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col sep=comma] from \nodivslicetable;
      \addplot+[legupnooptcol] table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \nodivslicetable;
      \addplot+[legupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \nodivslicetable;
      \draw (axis cs:-1,1) -- (axis cs:28,1);
      \draw (axis cs:-0.5,0.3) rectangle (axis cs:27.5,4);

      \legend{\vericert{},\legup{} w/o opt+chain,\legup{} w/o opt};
    \end{groupplot}
  \end{tikzpicture}
  \caption{\polybench{} with division/modulo operations replaced by an iterative algorithm.  The top graph shows the execution time of \vericert{}, \legup{} without LLVM optimisations and without operation chaining and \legup{} without front end LLVM optimisations relative to optimised \legup{}. The bottom graph shows the area relative to \legup{}.}\label{fig:polybench-nodiv}
\end{figure}

Firstly, before comparing any performance metrics, it is worth highlighting that any Verilog produced by \vericert{} is guaranteed to be \emph{correct}, whilst no such guarantee can be provided by \legup{}.
This guarantee in itself provides a significant leap in terms of HLS reliability, compared to any other HLS tools available.

The top graphs of Figure~\ref{fig:polybench-div} and Figure~\ref{fig:polybench-nodiv} compare the execution time of the 27 programs executed by \vericert{} and the different optimisation levels of \legup{}.  Each graph uses optimised \legup{} as the baseline.  When division/modulo operations are present \legup{} designs execute around 27$\times$ faster than \vericert{} designs.  However, when division/modulo operations are replaced by the iterative algorithm, \legup{} designs are only 2$\times$ faster than \vericert{} designs.  However, the benchmarks with division/modulo replaced show that \vericert{} actually achieves the same execution speed as \legup{} without LLVM optimisations and without operation chaining, which is encouraging, and shows that the hardware generation is following the right steps.  The execution time is calculated by multiplying the maximum frequency that the FPGA can run at with this design, by the number of clock cycles that are needed to complete the execution.  We can therefore analyse each separately.

First, looking at the difference in clock cycles, \vericert{} produces designs that have around 4.5$\times$ as many clock cycles as \legup{} designs in both cases, when division/modulo operations are enabled as well as when they are replaced.  This performance gap can be explained in part by LLVM optimisations, which seem to account for a 2$\times$ decrease in clock cycles, as well as operation chaining, which decreases the clock cycles by another 2$\times$.  The rest of the speed-up is mostly due to \legup{} optimisations such as scheduling and memory analysis, which are designed to extract parallelism from input programs.
This gap does not represent the performance cost that comes with formally proving a HLS tool.
Instead, it is simply a gap between an unoptimised \vericert{} versus an optimised \legup{}.
As we improve \vericert{} by incorporating further optimisations, this gap should reduce whilst preserving the correctness guarantees.

Secondly, looking at the maximum clock frequency that each design can achieve, \legup{} designs achieve 8.2$\times$ the maximum clock frequency of \vericert{} when division/modulo operations are present.  This is in great contrast to the maximum clock frequency that \vericert{} can achieve when no divide/modulo operations are present, where \vericert{} generates designs that are actually 2$\times$ better than the frequency achieved by \legup{} designs.  The dramatic discrepancy in performance for the former case can be largely attributed to \vericert{}'s na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 13MHz, while \legup{} managed about 111MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency becomes about 220MHz.  This improvement in frequency can be explained by the fact that \legup{} uses a memory controller to manage multiple RAMs using one interface, which is not needed in \vericert{} as a single RAM is used for the memory.

Looking at a few benchmarks in particular in Figure~\ref{fig:polybench-nodiv} for some interesting cases.  For the \texttt{trmm} benchmark, \vericert{} produces hardware that executes with the same cycle count as \legup{}, and manages to create hardware that achieves twice the frequency compared to \legup{}, thereby actually producing a design that executes twice as fast as \legup{}.  Another interesting benchmark is \texttt{doitgen}, where \vericert{} is comparable to \legup{} without LLVM optimisations, however, LLVM optimisations seem to have a large effect on the cycle count.

\subsection{RQ2: How area-efficient is \vericert{}-generated hardware?}

The bottom graphs in both Figure~\ref{fig:polybench-div} and Figure~\ref{fig:polybench-nodiv} compare the resource utilisation of the \polybench{} programs generated by \vericert{} and \legup{} at various optimisation levels.
By looking at the median, when division/modulo operations are enabled, we see that \vericert{} produces hardware that is about the same size as optimised \legup{}, whereas the unoptimised versions of \legup{} actually produce slightly smaller hardware.  This is because optimisations can often increase the size of the hardware to make it faster.  Especially in Figure~\ref{fig:polybench-div}, there are a few benchmarks where the size of the \legup{} design is much smaller than that produced by \vericert{}.  This can mostly be explained because of resource sharing in LegUp.  Division/modulo operations need large circuits, and it is therefore usual to only have one circuit per design.  As \vericert{} uses the na\"ive implementation of division/modulo, there will be multiple circuits present in the design, which blows up the size.  Looking at Figure~\ref{fig:polybench-nodiv}, one can see that without division, the size of \vericert{} designs are almost always around the same size as \legup{} designs, never being more than 2$\times$ larger, and sometimes even being smaller.  The similarity in area also shows that RAM is correctly being inferred by the synthesis tool, and is therefore not implemented as registers.

\subsection{RQ3: How quickly does \vericert{} translate the C into Verilog?}

\legup{} takes around 10$\times$ as long as \vericert{} to perform the translation from C into Verilog, showing at least that verification does not have a substantial effect on the run-time of the HLS tool.  However, this is a meaningless victory, as a lot of the extra time that \legup{} uses is on performing computationally heavy optimisations such as loop pipelining and scheduling.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% TeX-command-extra-options: "-shell-escape"
%%% End: