author    n.ramanathan14 <n.ramanathan14@imperial.ac.uk>  2020-11-19 22:06:35 +0000
committer overleaf <overleaf@localhost>  2020-11-19 22:23:51 +0000
commit    cfdd7c67242a9211dc572070c33e7af758ed5745 (patch)
tree      ae09379f23050fc5cd2f13cbaaaf09ab06a73e43
parent    3ff788662bdfa67f6a44358544cd4c7e4f263c7c (diff)
Update on Overleaf.
-rw-r--r--  evaluation.tex     13
-rw-r--r--  results/coqup.csv   3
-rw-r--r--  results/poly.csv    3
3 files changed, 10 insertions, 9 deletions
diff --git a/evaluation.tex b/evaluation.tex
index 5970de1..cd0a278 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -14,7 +14,7 @@ Our evaluation is designed to answer the following three research questions.
\paragraph{Choice of benchmarks.} We evaluate \vericert{} using the PolyBench/C benchmark suite~\cite{polybench}, consisting of a collection of well-known numerical kernels. PolyBench/C is widely used in the HLS context~\cite{choi+18,poly_hls_pouchet2013polyhedral,poly_hls_zhao2017,poly_hls_zuo2013}, since its kernels have affine loop bounds, making them attractive for regular and streaming computation on FPGA architectures.
For our experiments we chose PolyBench 4.2.1, which consists of 30 programs.
-Out of these 30 programs, three programs utilise square root functions: \texttt{co-relation}, \texttt{gramschmidt} and \texttt{deriche}.
+Out of these 30 programs, three utilise square-root functions: \texttt{correlation},~\texttt{gramschmidt} and~\texttt{deriche}.
Hence, we were unable to evaluate these three programs, since they inherently require \texttt{float}s.
% Interestingly, we were also unable to evaluate \texttt{cholesky} on \legup{}, since it produces an error during its HLS compilation.
In summary, we evaluate 27 programs from the latest PolyBench suite.
@@ -93,19 +93,19 @@ In addition, we initialise the input arrays and check the output arrays of all prog
\end{figure}
Firstly, before comparing any performance metrics, it is worth highlighting that any Verilog produced by \vericert{} is guaranteed to be \emph{correct}, whilst no such guarantee can be provided by \legup{}.
-This guarantee in itself provides a significant leap in terms of reliability of HLS, compared to any other HLS tools available.
+This guarantee in itself provides a significant leap in terms of HLS reliability, compared to any other available HLS tool.
Figure~\ref{fig:comparison_cycles} compares the cycle counts of the 27 programs as compiled by \vericert{} and \legup{} respectively.
In most cases, we see that the data points are above the diagonal, demonstrating that the \legup{}-generated hardware is faster than the \vericert{}-generated hardware.
This performance gap is mostly due to \legup{} optimisations such as scheduling and memory analysis, which are designed to exploit parallelism from input programs.
-On average, \legup{} designs are $4\times$ faster than \vericert{} designs on Polybench programs.
+On average, \legup{} designs are $4.5\times$ faster than \vericert{} designs on Polybench programs.
This gap does not represent the performance cost that comes with formally proving an HLS tool.
Instead, it is simply the gap between an unoptimised \vericert{} and an optimised \legup{}.
In fact, even without any optimisations, a few data points are close to the diagonal, and some even below it, which means \vericert{} can be competitive with \legup{}.
We are very encouraged by these data points.
As we extend \vericert{} to incorporate other HLS optimisations in a formally verified manner, this gap should narrow whilst preserving our correctness guarantees.
-Cycle count is one factor in calculating execution latency. The other factor is the generated clock frequency, which tells us how fast a cycle executes. Figure~\ref{fig:comparison_time} compared the execution times of \vericert{} and \legup{}. On average, \vericert{} are $9\times$ slower than \legup{}. As mentioned earlier, we modify the Polybench programs to utilise C-based divides and modulos. We had to avoid using the built-in Verilog divides and modulos, since Quartus interprets them as single-cycle operations. This interpretation affects clock frequency drastically. On average, when using the built-in Verilog approach, \vericert{}'s clock frequency was 21MHz, compared to \legup{}'s clock frequency of 246MHz. By moving to the C-based approach, our average clock frequency is now 113MHz. Hence, we reduce the frequency gap from approximately $12\times$ to $2\times$. This gap still exists because \legup{} uses various optimisation tactics and Intel-specific IPs, which requires further engineering effort and testing from our side.
+Cycle count is one factor in calculating execution latency. The other factor is the generated clock frequency, which tells us how fast a cycle executes. Figure~\ref{fig:comparison_time} compares the execution times of \vericert{} and \legup{}. On average, \vericert{} designs are $10\times$ slower than \legup{} designs. As mentioned earlier, we modified the PolyBench programs to utilise C-based divisions and modulos. We had to avoid the built-in Verilog division and modulo operators, since Quartus interprets them as single-cycle operations (sketched below), which drastically reduces the clock frequency. On average, when using the built-in Verilog operators, \vericert{}'s clock frequency was 21MHz, compared to \legup{}'s 247MHz. By moving to the C-based approach, our average clock frequency is now 112MHz. Hence, we reduce the frequency gap from approximately $12\times$ to $2\times$. This gap still exists because \legup{} uses various optimisation tactics and Intel-specific IPs, which would require further engineering effort and testing on our side.
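As a rough illustration of why the built-in operators are problematic, the following sketch (illustrative Verilog, not actual \vericert{} or \legup{} output) shows a divider written with the bare operators; Quartus synthesises it as a single-cycle combinational circuit whose long critical path limits the clock frequency of the entire design, whereas a C-level software division can be scheduled over many cycles.

// Illustrative sketch only, not generated tool output.
// Quartus maps the bare '/' and '%' operators to purely
// combinational dividers that must settle within one clock
// cycle; their critical path limits the whole design's clock.
module div_single_cycle(
  input  [31:0] a,
  input  [31:0] b,
  output [31:0] q,
  output [31:0] r
);
  assign q = a / b;  // single-cycle combinational divide
  assign r = a % b;  // single-cycle combinational modulo
endmodule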
% The difference in cycle counts shows the degree of parallelism that \legup{}'s scheduling and memory system can offer. However, their Verilog generation is not guaranteed to be correct. Although \legup{}'s outputs are tested to be correct at runtime for these programs, this does not provide any confidence in the correctness of its Verilog generation for other programs. Our Coq proof mechanisation guarantees that the generated Verilog is correct for any input program that uses the subset of \compcert{} instructions that we have proven to be correct.
@@ -136,12 +136,12 @@ Cycle count is one factor in calculating execution latency. The other factor is
Figure~\ref{fig:comparison_area} compares the resource utilisation of Polybench programs generated by \vericert{} and \legup{}.
On average, \vericert{} produces hardware that is $21\times$ larger than \legup{}'s for the same PolyBench programs.
-\vericert{} designs are also filling up to 30\% of a large FPGA chip, which indicates that unoptimised Verilog generation can costly.
+\vericert{} designs are also filling up to 30\% of a large FPGA chip, which indicates that unoptimised Verilog generation can be costly.
The key reason for this behaviour is the absence of RAM inference for the \vericert{} designs.
RAMs are dedicated hardware blocks on the FPGA that are specifically designed for synthesising arrays.
Synthesis tools like Quartus generally require array accesses to follow certain templates to enable RAM inference, as sketched below.
\legup{}'s Verilog generation is tailored to enable RAM inference by Quartus.
-\vericert{}'s array accesses in Verilog are more generic, allowing us to target different FPGA synthesis tools and vendors, but underfits Quartus requirements.
+\vericert{}'s array accesses in Verilog are more generic, allowing us to target different FPGA synthesis tools and vendors, but they do not fit Quartus's RAM inference templates.
For a fair comparison, we chose Quartus for these experiments because \legup{} supports Quartus well.
% Consequently, on average, \legup{} designs use $XX$ RAMs whereas \vericert{} use none.
Improving RAM inference is part of our future plans.
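For reference, the sketch below shows the kind of template Quartus recognises (an illustrative synchronous single-port memory, not \vericert{}'s actual generated code); arrays accessed only inside a single clocked block in this style are mapped onto dedicated RAM blocks, whereas more generic accesses fall back to registers and logic.

// A minimal sketch of a Quartus-inferable RAM template; the
// module and signal names are illustrative, not Vericert output.
// Because every access to 'mem' happens inside one clocked
// block, the synthesis tool can map it onto a dedicated RAM.
module inferred_ram(
  input             clk,
  input             we,
  input      [9:0]  addr,
  input      [31:0] din,
  output reg [31:0] dout
);
  reg [31:0] mem [0:1023];   // 1K x 32-bit memory
  always @(posedge clk) begin
    if (we)
      mem[addr] <= din;      // synchronous write
    dout <= mem[addr];       // synchronous read
  end
endmodule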
@@ -180,7 +180,6 @@ Improving RAM inference is part of our future plans.
Figure~\ref{fig:comparison_comptime} compares the compilation times of \vericert{} and \legup{}, with each data point corresponding to one of the PolyBench/C benchmarks. On average, \vericert{} compilation is about $47\times$ faster than \legup{} compilation. \vericert{} is much faster because it omits the many complex, time-consuming HLS optimisations performed by \legup{}, such as scheduling and memory analysis. This comparison also shows that our approach does not add any significant overhead to compilation time, since we do not invoke verification for every compilation instance.
-
\NR{Do we want to finish the section off with some highlights or a summary?}
%%% Local Variables:
diff --git a/results/coqup.csv b/results/coqup.csv
index 063a667..9b5f075 100644
--- a/results/coqup.csv
+++ b/results/coqup.csv
@@ -24,4 +24,5 @@ jacobi-2d,385212,115.1,30426,59844,0,4
seidel-2d,979334,128.88,26988,53293,0,2
floyd-warshall,5038446,96.62,60609,119811,0,2
nussinov,882120,73.22,60238,119719,0,0
-cholesky,2347850,0.100,90.88,53992,106180,0,8
\ No newline at end of file
+cholesky,2347850,90.88,53992,106180,0,8
+adi,1713702,0.160,110.44,31207,60639,0,36
\ No newline at end of file
diff --git a/results/poly.csv b/results/poly.csv
index 1aaf90b..3e979cf 100644
--- a/results/poly.csv
+++ b/results/poly.csv
@@ -24,4 +24,5 @@ jacobi-2d,84609,240.79,2347,3185,0,2,4.81,357100,113.53,30393,59782,0,4,0.079
seidel-2d,345294,232.4,2128,3337,0,2,4.68,875758,127.99,26948,53133,0,2,0.091
floyd-warshall,1238764,235.52,1869,2367,0,2,4.71,4762766,109.4,59859,118101,0,2,0.094
nussinov,216467,273.07,1078,1431,0,2,4.79,837958,90.73,60663,119303,0,0,0.080
-cholesky,148443,273.6,1528,3419,0,15,4.89,2347850,90.88,53992,106180,0,8,0.100
\ No newline at end of file
+cholesky,148443,273.6,1528,3419,0,15,4.89,2347850,90.88,53992,106180,0,8,0.100
+adi,90644,256.28,1111,2391,0,0,5.32,1713702,110.44,31207,60639,0,36,0.160
\ No newline at end of file