-rw-r--r--  evaluation.tex  52
-rw-r--r--  references.bib  35
-rw-r--r--  related.tex      2
3 files changed, 79 insertions, 10 deletions
diff --git a/evaluation.tex b/evaluation.tex
index 9e000b0..c9d5919 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -10,11 +10,27 @@ Our evaluation is designed to answer the following three research questions.
\subsection{Experimental Setup}
-\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against LegUp 4.0 because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}.
+\paragraph{Choice of HLS tool for comparison.} We compare \vericert{} against \legup{} 5.1 because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis11_legup}.
-\paragraph{Choice of benchmarks.} We evaluate \vericert{} using the PolyBench/C benchmark suite~\cite{polybench}. PolyBench/C is a modern benchmark suite that has been previously used to evaluate HLS tools~\cite{choi+18}. For completeness, we use the full set of 24 benchmarks. We set the benchmark parameters so that all datatypes are integers (since \vericert{} only supports integers) and all datasets are `small' (to fit into the small on-chip memories). A current limitation of \vericert{}, as discussed in Section~\ref{?}, is that it does not support addition and subtraction operations involving integer literals not divisible by 4. To work around this, we lightly modified each benchmark program so that literals other than multiples of 4 are stored into variables before being added or subtracted. \JW{Any other notable changes to the benchmarks?}
+\paragraph{Choice of benchmarks.} We evaluate \vericert{} using the PolyBench/C benchmark suite~\cite{polybench}, a collection of well-known numerical kernels. PolyBench/C is widely used in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since its kernels have affine loop bounds, which makes them attractive for regular and streaming computation on FPGA architectures.
+We chose PolyBench/C 4.2.1 for our experiments, which consists of 30 programs.
+Of these, three programs (\texttt{correlation}, \texttt{gramschmidt} and \texttt{deriche}) use square root functions, which require \texttt{float}s, so we were unable to evaluate them.
+We were also unable to evaluate \texttt{cholesky} on \legup{}, since it produced an error during HLS compilation.
+In summary, we evaluate 26 programs from the latest PolyBench/C suite.
-\paragraph{Experimental setup.} In order to generate a hardware implementation, the Verilog produced by the HLS tool-under-test must be synthesised to a netlist; the resultant netlist can then be placed-and-routed for a particular FPGA device. We use Intel Quartus~\cite{quartus} for both of these tasks, and we target an Arria 10 FPGA.
+\paragraph{Configuring PolyBench for experimentation.}
+We configured PolyBench's metadata and lightly modified its source code to suit our purposes.
+First, we restrict PolyBench to generate only integer data types, since \vericert{} does not currently support floats or doubles.
+Second, we use PolyBench's smallest data set size for each program, to ensure that the data fits within the on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
+Furthermore, the C divide and modulo operators translate directly to the built-in Verilog divide and modulo operators.
+Unfortunately, the built-in operators are implemented as single-cycle operations, incurring large penalties in latency and area.
+To work around this issue, we replaced them with a C implementation of divide and modulo, which compiles to multi-cycle operations on the FPGA, drastically reducing the latency penalty.
+Finally, we initialise the input arrays and check the output arrays of all programs entirely on-chip.
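+
+To give a flavour of this workaround, the listing below is a minimal sketch of such a C division routine, based on classic restoring division; it is illustrative only, since our exact implementation may differ, and the name \texttt{udivmod} is hypothetical.
+\begin{verbatim}
+#include <stdint.h>
+
+/* Restoring division: one quotient bit per loop iteration, so the
+   synthesised circuit is multi-cycle rather than the single-cycle
+   built-in Verilog '/' and '%'. Assumes d != 0. */
+uint32_t udivmod(uint32_t n, uint32_t d, uint32_t *rem) {
+  uint64_t r = 0;   /* wide remainder avoids overflow when d > 2^31 */
+  uint32_t q = 0;
+  for (int i = 31; i >= 0; i--) {
+    r = (r << 1) | ((n >> i) & 1); /* bring down next dividend bit */
+    if (r >= d) {                  /* trial subtraction succeeds */
+      r -= d;
+      q |= UINT32_C(1) << i;       /* set quotient bit i */
+    }
+  }
+  *rem = (uint32_t)r;  /* remainder, giving modulo for free */
+  return q;
+}
+\end{verbatim}
+Because the quotient is computed one bit per iteration, the work is spread across many short clock cycles instead of one long combinational path, which is what recovers the clock frequency.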
+
+% For completeness, we use the full set of 24 benchmarks. We set the benchmark parameters so that all datatypes are integers (since \vericert{} only supports integers) and all datasets are `small' (to fit into the small on-chip memories). A current limitation of \vericert{}, as discussed in Section~\ref{?}, is that it does not support addition and subtraction operations involving integer literals not divisible by 4. To work around this, we lightly modified each benchmark program so that literals other than multiples of 4 are stored into variables before being added or subtracted. \JW{Any other notable changes to the benchmarks?}
+
+\paragraph{Synthesis setup.} The Verilog generated by \vericert{} and \legup{} is provided to Intel Quartus v16.0~\cite{quartus}, which synthesises it into a netlist and then places and routes this netlist onto an Arria 10 FPGA device consisting of approximately 430000 LUTs.
\subsection{RQ1: How fast is \vericert{}-generated hardware?}
@@ -43,7 +59,7 @@ Our evaluation is designed to answer the following three research questions.
\end{axis}
\end{tikzpicture}
-\caption{A comparison of the cycle count of hardware designs generated by \vericert{} and by \legup{}.}
+\caption{A comparison of the cycle count of hardware designs generated by \vericert{} and by \legup{}, where the diagonal represents $y=x$.}
\label{fig:comparison_cycles}
\end{figure}
@@ -68,7 +84,7 @@ Our evaluation is designed to answer the following three research questions.
{results/poly.csv};
\addplot[dotted, domain=10:100000]{x};
-%\addplot[dashed, domain=10:10000]{9.02*x};
+%\addplot[dashed, domain=10:10000]{9.02*x + 442};
\end{axis}
\end{tikzpicture}
@@ -76,9 +92,22 @@ Our evaluation is designed to answer the following three research questions.
\label{fig:comparison_time}
\end{figure}
-Figure~\ref{fig:comparison_time} compares the execution time of hardware designs generated by \vericert{} and by LegUp, with each data point corresponding to one of the PolyBench/C benchmarks. It uses logarithmic scales so that the data points are spread out more evenly. The execution times are quite well correlated between the two tools, but LegUp generates hardware designs that execute roughly fifty times faster, as illustrated by the diagonal line at $y=50x$. There are two reasons for this. First, because \vericert{}-generated hardware has all of its operations serialised, it requires more clock cycles to perform its computation -- about 4$\times$ as many as LegUp-generated hardware. Second, \vericert{} generates hardware with a much lower clock frequency than LegUp (roughly 22 MHz compared to roughly 250 MHz) so each of those clock cycles takes about 11$\times$ longer. The main reason for the low clock frequency is the divide circuits that are introduced by \vericert{} when a divide operation is performed in C. Currently, these have to be performed in one cycle, which heavily influences the maximum clock speed that can be reached, as the maximum delay introduced by the divide is quite large. One possible solution to this is to separate the divide operation into multiple clock cycles.
+Firstly, before comparing any performance metrics, it is worth highlighting that any Verilog produced by \vericert{} is guaranteed to be \emph{correct}, whilst no such guarantee is provided by \legup{}.
+This guarantee in itself is a significant leap in the reliability of HLS, compared to other available HLS tools.
-The difference in cycle counts shows the degree of parallelism that \legup{}'s scheduling and memory system can offer. However, their Verilog generation is not guaranteed to be correct. Although the runtime LegUp outputs are tested to be correct for these programs, this does not provide any confidence on the correctness of Verilog generation of any other programs. Our Coq proof mechanisation guarantees that generated Verilog is correct for any input program that uses the subset of \compcert{} instructions that we have proven to be correct.
+Figure~\ref{fig:comparison_cycles} compares the cycle counts of the hardware designs generated by \vericert{} and \legup{} for each of our 26 programs.
+In most cases, the data points lie above the diagonal, indicating that \legup{}-generated hardware is faster than \vericert{}-generated hardware.
+This performance gap is mostly due to \legup{} optimisations such as scheduling and memory analysis, which are designed to exploit parallelism in the input programs.
+On average, \legup{} designs are $4\times$ faster than \vericert{} designs.
+This gap does not represent an inherent performance cost of formally verifying an HLS tool;
+it is simply the gap between an unoptimised \vericert{} and an optimised \legup{}.
+Indeed, even without any optimisations, a few data points lie close to, or even below, the diagonal, meaning that \vericert{} can already be competitive with \legup{}.
+We are very encouraged by these data points.
+As we extend \vericert{} with further HLS optimisations in a formally verified manner, we expect this gap to narrow whilst preserving our correctness guarantees.
+
+Cycle count is only one factor in execution latency; the other is the clock frequency, which determines how long each cycle takes. Figure~\ref{fig:comparison_time} compares the execution times of \vericert{} and \legup{} designs. On average, \vericert{} designs are $9\times$ slower than \legup{} designs. As mentioned earlier, we modified the PolyBench programs to use C-based divides and modulos, avoiding the built-in Verilog operators, since Quartus implements these as single-cycle operations, which drastically lowers the attainable clock frequency. When using the built-in Verilog operators, \vericert{}'s average clock frequency was 21~MHz, compared to \legup{}'s 246~MHz. By moving to the C-based implementation, our average clock frequency is now 113~MHz, reducing the frequency gap from approximately $12\times$ to $2\times$. The remaining gap exists because \legup{} uses various optimisations and Intel-specific IP blocks, which would require further engineering effort and testing to adopt on our side.
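+
+These two factors compose multiplicatively; as a rough consistency check using the averages quoted above (writing $T$ for execution time, $C$ for cycle count and $f$ for clock frequency):
+\[
+T = \frac{C}{f},
+\qquad
+\frac{T_{\mathrm{Vericert}}}{T_{\mathrm{LegUp}}}
+= \frac{C_{\mathrm{Vericert}}}{C_{\mathrm{LegUp}}}
+  \cdot \frac{f_{\mathrm{LegUp}}}{f_{\mathrm{Vericert}}}
+\approx 4 \times \frac{246~\mathrm{MHz}}{113~\mathrm{MHz}}
+\approx 9.
+\]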
+
+% The difference in cycle counts shows the degree of parallelism that \legup{}'s scheduling and memory system can offer. However, their Verilog generation is not guaranteed to be correct. Although the runtime LegUp outputs are tested to be correct for these programs, this does not provide any confidence on the correctness of Verilog generation of any other programs. Our Coq proof mechanisation guarantees that generated Verilog is correct for any input program that uses the subset of \compcert{} instructions that we have proven to be correct.
\subsection{RQ2: How area-efficient is \vericert{}-generated hardware?}
@@ -97,7 +126,7 @@ The difference in cycle counts shows the degree of parallelism that \legup{}'s
table [x expr=(\thisrow{legupluts}/427200*100), y expr=(\thisrow{vericertluts}/427200*100), col sep=comma]
{results/poly.csv};
- %\addplot[dashed, domain=0:1]{-1.415 *x+10.6};
+ \addplot[dashed, domain=0:1]{x};
\end{axis}
\end{tikzpicture}
@@ -105,7 +134,12 @@ The difference in cycle counts shows the degree of parallelism that \legup{}'s
\label{fig:comparison_area}
\end{figure}
-Figure~\ref{fig:comparison_area} compares the size of the hardware designs generated by \vericert{} and by LegUp, with each data point corresponding to one of the PolyBench/C benchmarks. We see that \vericert{} designs use between 1\% and 30\% of the available logic on the FPGA, averaging at around 10\%, whereas LegUp designs all use less than 1\% of the FPGA, averaging at around 0.45\%. The main reason for this is mainly because RAM is not inferred automatically for the Verilog that is generated by \vericert{}. Other synthesis tools can infer the RAM correctly for \vericert{} output, so this issue could be solved by either using a different synthesis tool and targeting a different FPGA, or by generating the correct template which allows Quartus to identify the RAM automatically.
+Figure~\ref{fig:comparison_area} compares the resource utilisation of the hardware generated by \vericert{} and \legup{} for the PolyBench programs.
+On average, \vericert{} produces hardware that is $21\times$ larger than \legup{} for the same program.
+\vericert{} designs fill up to 30\% of a large FPGA chip, which indicates that unoptimised Verilog generation can be costly.
+The key reason for this behaviour is the absence of RAM inference, which we explain below.
+
+\vericert{} designs use between 1\% and 30\% of the available logic on the FPGA, averaging around 10\%, whereas \legup{} designs all use less than 1\%, averaging around 0.45\%. Quartus does not automatically infer RAMs from the Verilog that \vericert{} generates, so the benchmark arrays are instead implemented in registers and logic, which inflates the area considerably. Other synthesis tools can infer the RAMs correctly from \vericert{}'s output, so this issue could be solved either by using a different synthesis tool and targeting a different FPGA, or by generating a template that allows Quartus to identify the RAMs automatically.
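+
+For illustration, the listing below sketches the standard Verilog idiom that Quartus recognises and maps onto on-chip block RAM; it is a generic inference template (the module name, widths and depth are illustrative), not \vericert{}'s actual output.
+\begin{verbatim}
+// Register array accessed only inside a clocked block:
+// the pattern synthesis tools infer as a block RAM.
+module inferred_ram(input             clk,
+                    input             we,
+                    input      [7:0]  waddr, raddr,
+                    input      [31:0] wdata,
+                    output reg [31:0] rdata);
+  reg [31:0] mem [0:255];
+  always @(posedge clk) begin
+    if (we) mem[waddr] <= wdata;  // synchronous write
+    rdata <= mem[raddr];          // synchronous read enables inference
+  end
+endmodule
+\end{verbatim}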
\subsection{RQ3: How long does \vericert{} take to produce hardware?}
diff --git a/references.bib b/references.bib
index 747d0b3..cb8c8d8 100644
--- a/references.bib
+++ b/references.bib
@@ -755,3 +755,38 @@ year = {2020},
isbn = "978-3-642-35705-3",
publisher = "Springer Berlin Heidelberg",
}
+
+
+@inproceedings{poly_hls_zuo2013,
+ title={Improving polyhedral code generation for high-level synthesis},
+ author={Zuo, Wei and Li, Peng and Chen, Deming and Pouchet, Louis-No{\"e}l and Zhong, Shunan and Cong, Jason},
+  booktitle={2013 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS)},
+ pages={1--10},
+ year={2013},
+ organization={IEEE}
+}
+
+@inproceedings{poly_hls_zhao2017,
+  title={{COMBA}: A comprehensive model-based analysis framework for high level synthesis of real applications},
+ author={Zhao, Jieru and Feng, Liang and Sinha, Sharad and Zhang, Wei and Liang, Yun and He, Bingsheng},
+ booktitle={2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)},
+ pages={430--437},
+ year={2017},
+ organization={IEEE}
+}
+
+@inproceedings{poly_hls_pouchet2013polyhedral,
+ title={Polyhedral-based data reuse optimization for configurable computing},
+  author={Pouchet, Louis-No{\"e}l and Zhang, Peng and Sadayappan, Ponnuswamy and Cong, Jason},
+  booktitle={Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays},
+ pages={29--38},
+ year={2013}
+}
+
+@misc{quartus,
+ author = {Intel},
+ title = {{Intel® Quartus® Prime Software Suite}},
+ url = {https://intel.ly/3fpUNhv},
+ urldate = {2020-07-20},
+ year = 2020,
+} \ No newline at end of file
diff --git a/related.tex b/related.tex
index 5cdee3a..7b519e7 100644
--- a/related.tex
+++ b/related.tex
@@ -44,7 +44,7 @@
A summary of the related works can be found in Figure~\ref{fig:related_venn}, which is represented as a Venn diagram. The categories chosen for the Venn diagram are whether the tool is usable and available, whether it takes a high-level software language as input, whether it has a correctness proof, and finally whether that proof is mechanised. The goal of \vericert{} is to cover all of these categories.
-Work is being done to prove the equivalence between the generated hardware and the original behavioural description in C. An example of a tool that implements this is Mentor's Catapult~\cite{mentor20_catap_high_level_synth}, which tries to match the states in the three adress code (3AC) description to states in the original C code after an unverified translation. This technique is called translation validation~\cite{pnueli98_trans}, whereby the translation that the HLS tool performed is proven to have been correct for that input, by showing that they behave in the same way for all possible inputs. Using translation validation is quite effective for proving complex optimisations such as scheduling~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth} or code motion~\cite{banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops}, however, the validation has to be run every time the high-level synthesis is performed. In addition to that, the proofs are often not mechanised or directly related to the actual implementation, meaning the verifying algorithm might be wrong and could give false positives or false negatives.
+Work is being done to prove the equivalence between the generated hardware and the original behavioural description in C. An example of a tool that implements this is Mentor's Catapult~\cite{mentor20_catap_high_level_synth}, which tries to match the states in the three address code (3AC) description to states in the original C code after an unverified translation. This technique is called translation validation~\cite{pnueli98_trans}, whereby the translation that the HLS tool performed is proven to have been correct for that particular input, by showing that the source and the result behave in the same way for all possible inputs. Translation validation is quite effective for proving complex optimisations such as scheduling~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth} or code motion~\cite{banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops}; however, the validation has to be run every time the high-level synthesis is performed. In addition, the proofs are often neither mechanised nor directly related to the actual implementation, meaning the verifying algorithm might be wrong and could give false positives or false negatives.
\JW{Some papers to cite somewhere:
\begin{itemize}