authorJohn Wickerson <j.wickerson@imperial.ac.uk>2022-03-30 19:17:32 +0100
committerJohn Wickerson <j.wickerson@imperial.ac.uk>2022-03-30 19:17:32 +0100
commit5d5be8703af7cd05112133d51bf9270fc161b1e1 (patch)
tree46585146b49ca2e6cddd0c07a2a03a5323056be8
parent63db6a1712e93dc293eaa730f0eb092566c06271 (diff)
bit o' trimming
-rw-r--r--verified_resource_sharing.tex23
1 files changed, 14 insertions, 9 deletions
diff --git a/verified_resource_sharing.tex b/verified_resource_sharing.tex
index 21357b7..43fc16e 100644
--- a/verified_resource_sharing.tex
+++ b/verified_resource_sharing.tex
@@ -71,7 +71,7 @@ escapeinside=||,
\newcommand\vericertfun{Vericert-Fun}
\newcommand\bambu{Bambu}
-\title{Resource Sharing for Verified High-Level Synthesis \\ (Work in Progress)}
+\title{Resource Sharing for Verified High-Level Synthesis}
\author{
\IEEEauthorblockN{Michalis Pardalos}
\IEEEauthorblockA{Imperial College London, UK \\ Email: mpardalos@gmail.com}
@@ -84,14 +84,20 @@ escapeinside=||,
}
+\IEEEoverridecommandlockouts
+\IEEEpubid{\makebox[\columnwidth]{978-1-6654-8332-2/22/\$31.00~\copyright2022 IEEE \hfill}
+\hspace{\columnsep}\makebox[\columnwidth]{ }}
+
\begin{document}
\maketitle
+\IEEEpubidadjcol
+
\begin{abstract}
High-level synthesis (HLS) is playing an ever-increasing role in hardware design, but concerns have been raised about its reliability. Seeking to address these, Herklotz et al. have developed an HLS compiler called \vericert{} that has been mechanically proven (using the Coq proof assistant) to output Verilog designs that are behaviourally equivalent to the input C program. Unfortunately, \vericert{} cannot compete performance-wise with established HLS tools, and a major reason for this is \vericert{}'s complete lack of support for \emph{resource sharing}.
-This paper introduces \vericertfun: \vericert{} extended with function-level resource sharing. Where original \vericert{} creates one block of hardware per function \emph{call}, \vericertfun{} creates one block of hardware per function \emph{definition}. To enable this, we extend \vericert{}'s intermediate language HTL with the ability to represent \emph{multiple} state machines, and we implement function calls by transferring control between these state machines. We are working to extend \vericert{}'s formal correctness proof to cover the translation from C into this extended HTL and thence to Verilog. Benchmarking on the PolyBench/C suite indicates that \vericertfun{} generates hardware with about 0.8$\times$ the resource usage of \vericert{}'s on average, with minimal impact on execution time.
+This paper introduces \vericertfun: \vericert{} extended with function-level resource sharing. Where original \vericert{} creates one block of hardware per function \emph{call}, \vericertfun{} creates one block of hardware per function \emph{definition}. To enable this, we extend \vericert{}'s intermediate language HTL with the ability to represent \emph{multiple} state machines, and we implement function calls by transferring control between these state machines. We are working to extend \vericert{}'s correctness proof to cover the translation from C into this extended HTL and thence to Verilog. Benchmarking on the PolyBench/C suite indicates that \vericertfun{} generates hardware with about 0.8$\times$ the resource usage of \vericert{}'s on average, with minimal impact on execution time.
\end{abstract}
\section{Introduction}
@@ -110,8 +116,9 @@ The entire \vericertfun{} development is fully open-source~\cite{vericertfun-git
\paragraph{The Coq proof assistant} \vericert{} is implemented using Coq~\cite{coq}, which means it consists of a collection of functions that define the compilation process, together with the proof of a theorem stating that those definitions constitute a correct HLS tool. Coq mechanically checks this proof using a formal mathematical calculus, and then automatically translates the function definitions into OCaml code that can be compiled and executed.
Engineering a software system within a proof assistant like Coq is widely held to be the gold standard for correctness. Recent years have shown that it is feasible to design substantial pieces of software in this way, such as database systems~\cite{malecha+10}, web servers~\cite{chlipala15}, and operating system kernels~\cite{gu+16}. Coq has also been successfully deployed in the hardware design process, both in academia~\cite{braibant+13, bourgeat+20} and in industry~\cite{silveroak}. It has even been applied specifically to HLS: Faissole et al.~\cite{faissole+19} used it to verify that HLS optimisations respect dependencies in the source code.
-\paragraph{The CompCert verified C compiler} Among the most celebrated applications of Coq is the CompCert project~\cite{compcert}. CompCert is a lightly optimising C compiler, with backend support for the Arm, x86, PowerPC, and Kalray VLIW architectures~\cite{six+20}, that has been implemented and proven correct in Coq. CompCert accepts most of the C99 language, and generally performs comparably to GCC at \texttt{-O1}. It transforms its input through a series of ten intermediate languages before generating the final output. This design ensures that each individual pass remains relatively simple and hence feasible to prove correct. The correctness proof of the entire compiler is formed by composing the correctness proofs of each of its passes.
-That the Csmith compiler testing tool~\cite{csmith} has found hundreds of bugs in GCC and LLVM but none in (the verified parts of) CompCert is a testament to the reliability of this development approach. That said, we cannot guarantee that \vericertfun{} designs will never go wrong, because of fallibilities in components not covered by the theorem, including the computer checking the proofs, the synthesis toolchain~\cite{verismith}, and the FPGA itself.
+\paragraph{The CompCert verified C compiler} Among the most celebrated applications of Coq is the CompCert project~\cite{compcert}. CompCert is a lightly optimising C compiler, with backend support for the Arm, x86, PowerPC, and Kalray VLIW architectures~\cite{six+20}, that has been implemented and proven correct in Coq. CompCert accepts most of the C99 language, and generally performs comparably to GCC at \texttt{-O1}. It transforms its input through a series of ten intermediate languages before generating the final output. %This design ensures that each individual pass remains relatively simple and hence feasible to prove correct.
+The correctness proof of the entire compiler is formed by composing the correctness proofs of each of its passes.
+That the Csmith compiler testing tool~\cite{csmith} has found hundreds of bugs in GCC and LLVM but none in (the verified parts of) CompCert is a testament to the reliability of this development approach. That said, we cannot guarantee that \vericertfun{} designs will never go wrong, because of fallibilities in components not covered by the theorem, such as the synthesis toolchain~\cite{verismith} and the FPGA itself.
\paragraph{The \vericert{} verified HLS tool}
@@ -423,11 +430,9 @@ To give a rough idea of the scale and complexity of our work: the implementation
\section{Performance evaluation}
-\JWchanged{We now compare the quality of the hardware generated by \vericertfun{} against that generated by \vericert{}. We use the open-source (but unverified) \bambu{} tool~\cite{pilato13_bambu, ferrandi2021_bambu} as a baseline. We run \bambu{} (version 0.9.6) in the {\tt BAMBU\_AREA} configuration, which optimises for area ahead of latency, but do not provide any additional pragmas to control the HLS process.}
-
+\JWchanged{We now compare the quality of the hardware generated by \vericertfun{} against that of \vericert{}. We use the open-source (but unverified) \bambu{} tool~\cite{pilato13_bambu, ferrandi2021_bambu} as a baseline. We run \bambu{} (version 0.9.6) in the {\tt BAMBU\_AREA} configuration, which optimises for area ahead of latency, but do not provide any additional pragmas to control the HLS process.}
Following \citet{vericert}, we use the PolyBench/C benchmark suite~\cite{polybench} with division and modulo replaced with iterative software implementations because \vericert{} does not handle them efficiently. That is, \lstinline{a/b} and \lstinline{c%d}
are textually replaced with \lstinline{div(a,b)} and \lstinline{mod(c,d)}. These \lstinline{div} and \lstinline{mod} functions are the only function calls that are not inlined.
-
We used the Icarus Verilog simulator to determine the cycle counts of the generated designs. We used Xilinx Vivado 2017.1, targeting a Xilinx 7-series FPGA (XC7K70T), to determine area usage \JWchanged{(measured in slices)} and fmax.
%Figure~\ref{fig:results} summarises our results. The x-axis shows the impact of resource sharing on
@@ -437,7 +442,7 @@ We used the Icarus Verilog simulator to determine the cycle counts of the genera
%introduces an extra state per function call. The impact on fmax is similarly minimal, ranging
%between a 1.5\% increase and a 3.1\% decrease (0.2\% decrease on average).
-Figure~\ref{fig:results} compares the hardware generated by \vericertfun{} with that of \vericert{}, using Bambu as a baseline. The top graph compares the area usage. We observe a consistent and substantial reduction in area usage across the benchmark programs, with \vericert{} consistently using more area than Bambu (1.5$\times$ on average) and \vericertfun{} requiring less area than \vericert{} (0.8$\times$ on average), but still more than \bambu{} (1.2$\times$ on average). As expected, the benchmarks with several function calls (mvt, 2mm, 3mm, ludcmp) enjoy the biggest area savings, while those with only one function call (heat-3d, nussinov) require slightly more area because of the extra circuitry required.
+Figure~\ref{fig:results} compares the hardware generated by \vericertfun{} with that of \vericert{}, using Bambu as a baseline. The top graph compares the area usage. We observe a substantial reduction in area usage across the benchmark programs, with \vericert{} consistently using more area than Bambu (1.5$\times$ on average) and \vericertfun{} requiring less area than \vericert{} (0.8$\times$ on average), but still more than \bambu{} (1.2$\times$ on average). As expected, the benchmarks with several function calls (mvt, 2mm, 3mm, ludcmp) enjoy the biggest area savings, while those with only one function call (heat-3d, nussinov) require slightly more area because of the extra circuitry required.
The bottom graph compares the execution time. We observe that \vericertfun{} generates hardware that is slightly (about 4\%) slower than \vericert{}'s, which can be attributed to the latency overhead of performing a function call. Hardware from \vericert{} and \vericertfun{} is significantly slower than \bambu{}'s, which we attribute to \vericert{} employing far fewer optimisations than the unverified \bambu{} tool.
\pgfplotstableread[col sep=comma]{data/time-ratio.csv}{\divtimingtable}
@@ -503,7 +508,7 @@ The bottom graph compares the execution time. We observe that \vericertfun{} gen
\section{Future work}
-Our immediate priority i to complete \vericertfun's correctness proof. In the medium term, we intend to improve our implementation of resource sharing by dropping the requirement to inline functions that access pointers; we anticipate that this will lead to further area savings \JWchanged{and also allow \vericertfun{} to be evaluated on benchmarks with more interesting call graphs}. We would also like \vericertfun{} to generate designs with one Verilog module per C function, as this is more idiomatic than cramming all the state machines into a single module; we did not do this yet because it requires extending the Verilog semantics to handle multiple modules. \JWchanged{It would also be interesting to implement \emph{selective} inlining of functions~\cite{huang+15}, either guided by heuristics or by programmer-supplied pragmas. It is worth noting that having proven inlining correct in general, the amount of inlining performed can be adjusted without affecting the correctness proof.} In the longer term, we would like to combine resource sharing with operation scheduling, i.e. resource-constrained scheduling~\cite{sdc}.
+Our immediate priority is to complete \vericertfun's correctness proof. In the medium term, we intend to improve our implementation of resource sharing by dropping the requirement to inline functions that access pointers; we anticipate that this will lead to further area savings \JWchanged{and also allow \vericertfun{} to be evaluated on benchmarks with more interesting call graphs}. We would also like \vericertfun{} to generate designs with one Verilog module per C function, as this is more idiomatic than cramming all the state machines into a single module; we have not done this yet because it requires extending the Verilog semantics to handle multiple modules. \JWchanged{It would also be interesting to implement \emph{selective} inlining of functions~\cite{huang+15}, guided either by heuristics or by programmer-supplied pragmas. It is worth noting that, having proven inlining correct in general, we can adjust the amount of inlining performed without affecting the correctness proof.} In the longer term, we would like to combine resource sharing with operation scheduling, i.e.\ resource-constrained scheduling~\cite{sdc}.
\bibliographystyle{ACM-Reference-Format}