
\section{Evaluation}
\label{sec:evaluation}

Our evaluation is designed to answer the following three research questions.
\begin{description}
\item[RQ1] How fast is the hardware generated by \vericert{}?
\item[RQ2] How area-efficient is the hardware generated by \vericert{}?
\item[RQ3] How long does \vericert{} take to produce hardware?
\end{description}

\subsection{Experimental Setup}
\label{sec:evaluation:setup}

\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 4.0 because it is open-source, and hence easily accessible, yet still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}.  We also compare against \legup{} at two reduced optimisation levels.  First, we turn off only the LLVM optimisations in \legup{}, to eliminate the optimisations that are common to standard software compilers.  Second, we turn off operation chaining in addition to the LLVM optimisations.

\paragraph{Choice and preparation of benchmarks.} We evaluate \vericert{} using the \polybench{} benchmark suite (version 4.2.1)~\cite{polybench}, which consists of a collection of 30 numerical kernels. \polybench{} is popular in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since its kernels have affine loop bounds, making them attractive for streaming computation on FPGA architectures.
We were able to use 27 of the 30 programs; three had to be discarded (\texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}) because they involve square roots, which require floating-point arithmetic that we do not support.
% Interestingly, we were also unable to evaluate \texttt{cholesky} on \legup{}, since it produce an error during its HLS compilation. 
%In summary, we evaluate 27 programs from the latest Polybench suite. 
We configured \polybench{}'s parameters so that only integer types are used, since we do not support floats. We use \polybench{}'s smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.

\vericert{} implements C division and modulo operations using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, which causes substantial penalties in clock frequency. Other HLS tools, including \legup{}, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. Supporting pipelined operators such as division and modulo requires scheduling instructions so that they can execute in parallel, which is the main optimisation that still needs to be added to \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is replaced with our own implementation that uses repeated divisions and multiplications by 2.  Figure~\ref{fig:polybench-div} shows the results of comparing \vericert{} with optimised \legup{} 4.0 on the \polybench{} benchmarks, where divisions have been left intact.  Figure~\ref{fig:polybench-nodiv} performs the comparison where the division/modulo operations have been replaced by the iterative algorithm.
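
As an illustration only, the following sketch shows one way such a replacement could be written using nothing but shifts (that is, multiplications and divisions by 2), comparisons and subtractions.  The function name \texttt{udiv\_shift}, the restriction to unsigned 32-bit operands and the requirement of a non-zero divisor are assumptions made for exposition; the routine actually used in the benchmarks may differ.
\begin{verbatim}
#include <stdint.h>

/* Hypothetical helper (not taken from the benchmark sources):
   unsigned division by shift-and-subtract.  Returns the quotient
   n / d and stores the remainder n % d in *rem.
   Assumption: d is non-zero. */
uint32_t udiv_shift(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint64_t r = 0;  /* 64 bits so that r << 1 cannot overflow */

    for (int i = 31; i >= 0; i--) {
        /* Multiply the partial remainder by 2 and bring down
           bit i of the dividend. */
        r = (r << 1) | ((n >> i) & 1u);
        if (r >= d) {        /* divisor fits: set quotient bit i */
            r -= d;
            q |= (1u << i);
        }
    }
    *rem = (uint32_t)r;
    return q;
}
\end{verbatim}
For instance, \texttt{udiv\_shift(17, 5, \&rem)} returns 3 and stores 2 in \texttt{rem}.  A loop of this shape contains no division or modulo operator, so no single-cycle divider has to be synthesised, at the price of additional clock cycles per division.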

\paragraph{Synthesis setup.} The Verilog that is generated by \vericert{} or \legup{} is provided to Xilinx Vivado v2017.1~\cite{xilinx_vivad_desig_suite}, which synthesises it to a netlist before placing and routing this netlist onto a Xilinx XC7Z020 FPGA device that contains approximately 85000 LUTs.
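
When reading Figures~\ref{fig:polybench-div} and~\ref{fig:polybench-nodiv}, it may help to make the plotted ratio explicit.  As an assumption (the definition is not spelled out above), suppose that the execution time $T$ of a design is its cycle count $C$ divided by the clock frequency $f$ achieved after place-and-route; the symbols $T$, $C$ and $f$, and the subscripts below, are introduced here purely for illustration.  The execution time of a configuration $X$ relative to the optimised reference design would then be
\[
  \frac{T_{X}}{T_{\mathrm{ref}}}
  \;=\; \frac{C_{X} / f_{X}}{C_{\mathrm{ref}} / f_{\mathrm{ref}}}
  \;=\; \frac{C_{X}}{C_{\mathrm{ref}}} \cdot \frac{f_{\mathrm{ref}}}{f_{X}}.
\]
Under this reading, a lower clock frequency $f_{X}$ inflates the ratio even when the cycle counts are similar, which is one way in which a single-cycle divider can hurt execution time.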

\subsection{RQ1: How fast is \vericert{}-generated hardware?}

\pgfplotstableread[col sep=comma]{results/rel-time-div.csv}{\divtimingtable}
\pgfplotstableread[col sep=comma]{results/rel-size-div.csv}{\divslicetable}
\definecolor{vericertcol}{HTML}{66C2A5}
\definecolor{legupnooptcol}{HTML}{FC8D62}
\definecolor{legupnooptnochaincol}{HTML}{8DA0CB}
\newcommand\backgroundbar[2][5]{\draw[draw=none, fill=black!#1] (axis cs:#2*2+0.5,0.1) rectangle (axis cs:1+#2*2+0.5,300);}

\begin{figure}\centering
  \begin{tikzpicture}
  
    \begin{groupplot}[
      group style={
        group name=my plots,
        group size=1 by 2,
        xlabels at=edge bottom,
        xticklabels at=edge bottom,
        vertical sep=5pt,
      },
      ymode=log,
      ybar=0pt,
      width=1\textwidth,
      height=0.4\textwidth,
      /pgf/bar width=3pt,
      legend pos=south east,
      log ticks with fixed point,
      xticklabels from table={\divtimingtable}{benchmark},
      legend style={nodes={scale=0.7, transform shape}},
      x tick label style={rotate=90,anchor=east,font=\footnotesize},
      legend columns=-1,
      xtick=data,
      enlarge x limits={abs=0.5},
      ylabel style={font=\footnotesize},
      xtick style={draw=none},
      ]

      \nextgroupplot[ymin=0.8,ymax=300,ylabel={Execution time relative to \legup{}}]
      \pgfplotsinvokeforeach{0,...,12}{%
        \backgroundbar{#1}}
      \backgroundbar[10]{13}
      \addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col sep=comma] from \divtimingtable;
      \addplot+[legupnooptcol] table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \divtimingtable;
      \addplot+[legupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \divtimingtable;
      \draw (axis cs:-1,1) -- (axis cs:28,1);
      % JW: redraw axis border which has been partially covered by the grey bars
      \draw (axis cs:-0.5,0.8) rectangle (axis cs:27.5,300);
     
      \nextgroupplot[ymin=0.3,ymax=10,ylabel={Area relative to \legup{}}]
      \pgfplotsinvokeforeach{0,...,12}{%
        \backgroundbar{#1}}
      \backgroundbar[10]{13}
      \addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col sep=comma] from \divslicetable;
      \addplot+[legupnooptcol] table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \divslicetable;
      \addplot+[legupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \divslicetable;
      \draw (axis cs:-1,1) -- (axis cs:28,1);
      % JW: redraw axis border which has been partially covered by the grey bars
      \draw (axis cs:-0.5,0.3) rectangle (axis cs:27.5,10);

      \legend{\vericert{},\legup{} w/o opt+chain,\legup{} w/o opt};
    \end{groupplot}
  \end{tikzpicture}
  \caption{\polybench{} with division/modulo operations enabled.  The top graph shows the execution time of \vericert{}, of \legup{} without LLVM optimisations and without operation chaining, and of \legup{} without front-end LLVM optimisations, all relative to optimised \legup{}. The bottom graph shows the corresponding area relative to optimised \legup{}.}\label{fig:polybench-div}
\end{figure}

\pgfplotstableread[col sep=comma]{results/rel-time-nodiv.csv}{\nodivtimingtable}
\pgfplotstableread[col sep=comma]{results/rel-size-nodiv.csv}{\nodivslicetable}
\begin{figure}\centering
  \begin{tikzpicture}
    \begin{groupplot}[
      group style={
        group name=my plots,
        group size=1 by 2,
        xlabels at=edge bottom,
        xticklabels at=edge bottom,
        vertical sep=5pt,
      },
      ymode=log,
      ybar=0pt,
      ytick={0.5,1,2,4,8},
      width=1\textwidth,
      height=0.4\textwidth,
      /pgf/bar width=3pt,
      legend pos=south east,
      log ticks with fixed point,
      xticklabels from table={\nodivtimingtable}{benchmark},
      legend style={nodes={scale=0.7, transform shape}},
      x tick label style={rotate=90,anchor=east,font=\footnotesize},
      legend columns=-1,
      xtick=data,
      enlarge x limits={abs=0.5},
      ylabel style={font=\footnotesize},
      ymin=0.3,
      xtick style={draw=none},
      ]

      \nextgroupplot[ymin=0.3,ymax=10,ylabel={Execution time relative to \legup{}}]
      \pgfplotsinvokeforeach{0,...,12}{%
        \backgroundbar{#1}}
      \backgroundbar[10]{13}
      \addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col sep=comma] from \nodivtimingtable;
      \addplot+[legupnooptcol] table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \nodivtimingtable;
      \addplot+[legupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \nodivtimingtable;
      \draw (axis cs:-1,1) -- (axis cs:28,1);
      \draw (axis cs:-0.5,0.3) rectangle (axis cs:27.5,10);

      \nextgroupplot[ymin=0.3,ymax=4,ylabel={Area relative to \legup{}}]
      \pgfplotsinvokeforeach{0,...,12}{%
        \backgroundbar{#1}}
      \backgroundbar[10]{13}
      \addplot+[vericertcol] table [x expr=\coordindex,y=vericert,col sep=comma] from \nodivslicetable;
      \addplot+[legupnooptcol] table [x expr=\coordindex,y=legup noopt nochain,col sep=comma] from \nodivslicetable;
      \addplot+[legupnooptnochaincol] table [x expr=\coordindex,y=legup noopt,col sep=comma] from \nodivslicetable;
      \draw (axis cs:-1,1) -- (axis cs:28,1);
      \draw (axis cs:-0.5,0.3) rectangle (axis cs:27.5,4);

      \legend{\vericert{},\legup{} w/o opt+chain,\legup{} w/o opt};
    \end{groupplot}
  \end{tikzpicture}
  \caption{\polybench{} with division/modulo operations replaced by an iterative algorithm.  The top graph shows the execution time of \vericert{}, of \legup{} without LLVM optimisations and without operation chaining, and of \legup{} without front-end LLVM optimisations, all relative to optimised \legup{}. The bottom graph shows the corresponding area relative to optimised \legup{}.}\label{fig:polybench-nodiv}
\end{figure}

\subsection{RQ2: How area-efficient is \vericert{}-generated hardware?}

Figure~\ref{fig:comparison_area} compares the resource utilisation of the \polybench{} programs generated by \vericert{} and \legup{}.
On average, \vericert{} produces hardware that is about $21\times$ larger than that produced by \legup{}. \vericert{} designs fill up to 30\% of a (large) FPGA chip, while \legup{} designs use no more than 1\% of the chip.
The main reason for this is that RAM is not inferred automatically for the Verilog that is generated by \vericert{}; instead, large arrays of registers are synthesised.
Synthesis tools such as Quartus generally require array accesses to be in a specific form in order for RAM inference to activate.
\legup{}'s Verilog generation is tailored to enable RAM inference by Quartus, while \vericert{} generates more generic array accesses. This may make \vericert{} more portable across different FPGA synthesis tools and vendors.
%For a fair comparison, we chose Quartus for these experiments because LegUp supports Quartus efficiently. 
% Consequently, on average, \legup{} designs use $XX$ RAMs whereas \vericert{} use none. 
Enabling RAM inference is part of our future plans. 

% We see that \vericert{} designs use between 1\% and 30\% of the available logic on the FPGA, averaging at around 10\%, whereas LegUp designs all use less than 1\% of the FPGA, averaging at around 0.45\%. The main reason for this is mainly because RAM is not inferred automatically for the Verilog that is generated by \vericert{}.  Other synthesis tools can infer the RAM correctly for \vericert{} output, so this issue could be solved by either using a different synthesis tool and targeting a different FPGA, or by generating the correct template which allows Quartus to identify the RAM automatically.

\subsection{RQ3: How long does \vericert{} take to produce hardware?}

Figure~\ref{fig:comparison_comptime} compares the compilation times of \vericert{} and of \legup{}, with each data point corresponding to one of the \polybench{} benchmarks. On average, \vericert{} compilation is about $47\times$ faster than \legup{} compilation. \vericert{} is much faster because it omits many of the time-consuming HLS optimisations performed by \legup{}, such as scheduling and memory analysis. This comparison also demonstrates that our fully verified approach does not add substantial overheads in compilation time, since we do not invoke verification for every compilation instance, unlike the approaches based on translation validation that we mentioned in Section~\ref{sec:intro}.

%\NR{Do we want to finish the section off with some highlights or a summary?}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% TeX-command-extra-options: "-shell-escape"
%%% End: