author    Yann Herklotz <ymh15@ic.ac.uk>  2020-11-21 10:17:05 +0000
committer overleaf <overleaf@localhost>   2020-11-21 10:24:36 +0000
commit    efba8f4e36ecff409eab2d2ed107593d479b754b (patch)
tree      9fb57ed4684eb7a877d14165a83af9077430dd33
parent    15aab23c1e89e9369b631eb3cefc7fc708d6da36 (diff)
Update on Overleaf.
-rw-r--r--  algorithm.tex     66
-rw-r--r--  evaluation.tex     4
-rw-r--r--  introduction.tex  21
-rw-r--r--  main.tex           2
-rw-r--r--  proof.tex          7
5 files changed, 54 insertions, 46 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 1d79eee..3fc7b2f 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -1,29 +1,29 @@
\section{Designing a verified HLS tool}
\label{sec:design}
-This section describes the main architecture of the HLS tool, and the way in which the Verilog back end \JP{back-end? backend?}\YH{I actually always used back end} was added to \compcert{}. This section will also cover an example of converting a simple C program into hardware, expressed in the Verilog language.
+This section describes the main architecture of the HLS tool, and the way in which the Verilog back end was added to \compcert{}. This section will also cover an example of converting a simple C program into hardware, expressed in the Verilog language.
\paragraph{Choice of source language}
-C was chosen as the source language as it remains the most common source language amongst production-quality HLS tools~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because it is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}, lending a degree of practicality. The availability of \compcert{}~\cite{??} also provides a solid basis for formally verified C compilation.
+C was chosen as the source language as it remains the most common source language amongst production-quality HLS tools~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because it is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}, lending a degree of practicality. %The availability of \compcert{}~\cite{??} also provides a solid basis for formally verified C compilation.
%Since a lot of existing code for HLS is written in C, supporting C as an input language, rather than a custom domain-specific language, means that \vericert{} is more practical.
%An alternative was to support LLVM IR as an input language, however, to get a full work flow from a higher level language to hardware, a front end for that language to LLVM IR would also have to be verified. \JW{Maybe save LLVM for the `Choice of implementation language'?}
-We considered Bluespec~\cite{nikhil04_blues_system_veril}, but decided that although it ``can be classed as a high-level language''~\cite{greaves_note}, it is too hardware-oriented to be used for traditional HLS.
-We also considered using a language with built-in parallel constructs that map well to parallel hardware, such as occam~\cite{page91_compil_occam}, Spatial~\cite{spatial} or Scala~\cite{chisel}, but we found these languages too niche.
+We considered Bluespec~\cite{nikhil04_blues_system_veril}, but decided that although it ``can be classed as a high-level language''~\cite{greaves_note}, it is too hardware-oriented to be suitable for traditional HLS.
+We also considered using a language with built-in parallel constructs that map well to parallel hardware, such as occam~\cite{page91_compil_occam}, Spatial~\cite{spatial} or Scala~\cite{chisel}, but found these languages too niche.
% However, this would not qualify as being HLS due to the manual parallelism that would have to be performed. \JW{I don't think the presence of parallelism stops it being proper HLS.}
%\JP{I think I agree with Yann here, but it could be worded better. At any rate not many people have experience writing what is essentially syntactic sugar over a process calculus.}
%\JW{I mean: there are plenty of software languages that involve parallel constructs. Anyway, perhaps we can just dismiss occam for being too obscure.}
\paragraph{Choice of target language}
-Verilog~\cite{06_ieee_stand_veril_hardw_descr_languag} is an HDL that can be synthesised into logic cells which can be either placed onto a field-programmable gate array (FPGA) or turned into an application-specific integrated circuit (ASIC). Verilog was chosen as the output language for \vericert{} because it is one of the most popular HDLs and there already exist a few formal semantics for it that could be used as a target~\cite{loow19_verif_compil_verif_proces, meredith10_veril}. Bluespec, previously ruled out as a source language, is another possible target and there exists a formally verified translation to circuits using K\^{o}ika~\cite{bourgeat20_essen_blues}. \JP{This needs an extra comment maybe?}\YH{Maybe about bluespec not being an ideal target language because it's quite high-level?} % but targeting this language would not be trivial as it is not meant to be targeted by an automatic tool, instead strives to a formally verified high-level hardware description language instead.
+Verilog~\cite{06_ieee_stand_veril_hardw_descr_languag} is an HDL that can be synthesised into logic cells which can be either placed onto a field-programmable gate array (FPGA) or turned into an application-specific integrated circuit (ASIC). Verilog was chosen as the output language for \vericert{} because it is one of the most popular HDLs and there already exist a few formal semantics for it that could be used as a target~\cite{loow19_verif_compil_verif_proces, meredith10_veril}. Bluespec, previously ruled out as a source language, is another possible target and there exists a formally verified translation to circuits using K\^{o}ika~\cite{bourgeat20_essen_blues}. %\JP{This needs an extra comment maybe?}\YH{Maybe about bluespec not being an ideal target language because it's quite high-level?} % but targeting this language would not be trivial as it is not meant to be targeted by an automatic tool, instead strives to a formally verified high-level hardware description language instead.
%\JW{Can we mention one or two alternatives that we considered? Bluespec or Chisel or one of Adam Chlipala's languages, perhaps?}
\paragraph{Choice of implementation language}
We chose Coq as the implementation language because of its mature support for code extraction; that is, its ability to generate OCaml programs directly from the definitions used in the theorems.
We note that other authors have had some success reasoning about the HLS process using other theorem provers such as Isabelle~\cite{ellis08}.
-\compcert{} was chosen as the front end framework\JP{But careful if we've already mentioned this before. Also needs a citation?}\YH{Citation for compcert?}, as it is a mature framework for simulation proofs about intermediate languages, and it already provides a validated C parser~\cite{jourdan12_valid_lr_parser}.
-The Vellvm~\cite{zhao12_formal_llvm_inter_repres_verif_progr_trans} framework was also considered because several existing HLS tools are already LLVM-based, but additional work would be required to support a high-level language like C as input.
+\compcert{}~\cite{leroy09_formal_verif_realis_compil} was chosen as the front end framework, as it is a mature framework for simulation proofs about intermediate languages, and it already provides a validated C parser~\cite{jourdan12_valid_lr_parser}.
+The Vellvm framework~\cite{zhao12_formal_llvm_inter_repres_verif_progr_trans} was also considered because several existing HLS tools are already LLVM-based, but additional work would be required to support a high-level language like C as input.
The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\cite{kiwi}, and LLHD~\cite{schuiki20_llhd} has been recently proposed as an intermediate language for hardware design, but neither are suitable for us because they lack formal semantics.
\begin{figure}
@@ -54,18 +54,21 @@ The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\c
\draw[->] (rtl) -- (dfgstmd);
\draw[->] (dfgstmd) -- (verilog);
\end{tikzpicture}}
- \caption{Verilog back end to Compcert, branching off at the three address code (3AC), at which point the three address code is transformed into a state machine. Finally, it is transformed to a hardware description of the state machine in Verilog.}%
+ \caption{Verilog back end to \compcert{}, branching off at the three-address code (3AC), which is transformed into a state machine and finally into a Verilog hardware description of that state machine.}%
\label{fig:rtlbranch}
\end{figure}
\paragraph{Architecture of \vericert{}}
The main workflow of \vericert{} is given in Figure~\ref{fig:rtlbranch}, which shows those parts of the translation that are performed in \compcert{}, and those that have been added.
-\compcert{} translates Clight\footnote{A deterministic subset of C with pure expressions.} input into assembly output via a sequence of intermediate languages; we must decide which of these eleven languages \JP{I can only count 9 in total (including CompCert C and Asm) now that I look at the docs; I remember having this issue before...}\YH{I thought we discussed it and found 11, but I count 10 now....: Clight, C\#minor, Cminor, CminorSel, RTL, LTL, LTLin, Linear, Mach, PPC (ASM)} is the most suitable starting point to begin the HLS-specific translation stages. \JP{Should we say what Clight is here? (i.e. deterministic C with pure expressions)} \JW{Yeah maybe a footnote?}\YH{Added}
+\def\numcompcertlanguages{ten}
-We select CompCert's three-address code (3AC)\footnote{This is known as register transfer language (RTL) in the \compcert{} literature. `3AC' is used in this paper instead to avoid confusion with register-transfer level (RTL), which is another name for the final hardware target of the HLS tool.} as the starting point. Branching off before this point (at CminorSel or earlier) denies \compcert{} the opportunity to perform optimisations such as constant propagation and dead code elimination, which have been found useful in HLS tools as well as software compilers~\cite{cong+11}. And if we branch off after this point (at LTL or later) then \compcert{} has already performed register allocation to reduce the number of registers and spill some variables to the stack; this transformation is not required in HLS because there are many more registers available, and these should be used instead of RAM whenever possible. \JP{``\compcert{} performs register allocation during the translation to LTL, with some registers spilled onto the stack: this is unnecessary in HLS since as many registers as are required may be described in the output RTL.''} \JP{Maybe something about FPGAs being register-dense (so rarely a need to worry about the number of flops)?}
+\compcert{} translates Clight\footnote{A deterministic subset of C with pure expressions.} input into assembly output via a sequence of intermediate languages; we must decide which of these \numcompcertlanguages{} languages is the most suitable starting point for the HLS-specific translation stages.
-3AC is also attractive because it is the closest intermediate language to LLVM IR, which is used by several existing HLS compilers \JP{We already ruled out LLVM as a starting point, so this seems like it needs further qualification.}\YH{Well not because it's not a good starting point, but the ecosystem in Coq isn't as good. I think it's still OK here to say that being similar to LLVM IR is an advantage?}. It has an unlimited number of pseudo-registers, and is represented as a control flow graph (CFG) where each instruction is a node with links to the instructions that can follow it. One difference between LLVM IR and 3AC is that 3AC includes operations that are specific to the chosen target architecture; we chose to target the x86\_32 backend, because it generally produces relatively dense 3AC thanks to the availability of more complex addressing modes, reducing cycle counts when the HLS tool uses a na\"{i}ve scheduling approach.
+We select \compcert{}'s three-address code (3AC)\footnote{This is known as register transfer language (RTL) in the \compcert{} literature. `3AC' is used in this paper instead to avoid confusion with register-transfer level (RTL), which is another name for the final hardware target of the HLS tool.} as the starting point. Branching off before this point (at CminorSel or earlier) denies \compcert{} the opportunity to perform optimisations such as constant propagation and dead code elimination, which have been found useful in HLS tools as well as software compilers~\cite{cong+11}. If we branch off after this point (at LTL or later), then \compcert{} has already performed register allocation to reduce the number of registers and spill some variables to the stack; this transformation is not required in HLS because there are many more registers available, and these should be used instead of RAM whenever possible. %\JP{``\compcert{} performs register allocation during the translation to LTL, with some registers spilled onto the stack: this is unnecessary in HLS since as many registers as are required may be described in the output RTL.''} \JP{Maybe something about FPGAs being register-dense (so rarely a need to worry about the number of flops)?}
+
+3AC is also attractive because it is the closest intermediate language to LLVM IR, which is used by several existing HLS compilers. %\JP{We already ruled out LLVM as a starting point, so this seems like it needs further qualification.}\YH{Well not because it's not a good starting point, but the ecosystem in Coq isn't as good. I think it's still OK here to say that being similar to LLVM IR is an advantage?}
+It has an unlimited number of pseudo-registers, and is represented as a control flow graph (CFG) where each instruction is a node with links to the instructions that can follow it. One difference between LLVM IR and 3AC is that 3AC includes operations that are specific to the chosen target architecture; we chose to target the x86\_32 backend, because it generally produces relatively dense 3AC thanks to the availability of complex addressing modes.% reducing cycle counts in the absence of an effective scheduling approach.
\begin{figure}
\centering
@@ -178,13 +181,13 @@ endmodule
\end{figure}
\subsection{Translating C to Verilog, by example}
-Figure~\ref{fig:accumulator_c_rtl} illustrates the translation of a program that sums the elements of an array.
+Figure~\ref{fig:accumulator_c_rtl} illustrates the translation of a simple program that sums the elements of an array.
In this section, we describe the stages of the \vericert{} translation, referring to this program as an example.
\subsubsection{Translating C to 3AC}
The first stage of the translation uses unmodified \compcert{} to transform the C input, shown in Figure~\ref{fig:accumulator_c}, into a 3AC intermediate representation, shown in Figure~\ref{fig:accumulator_rtl}.
-As part of this translation, function inlining is also performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction. Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency. Moreover, inlining excludes support for recursive function calls, but this feature isn't supported in most other HLS tools either~\cite{davidthomas_asap16}.
+As part of this translation, function inlining is also performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction. Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency. Inlining precludes support for recursive function calls, but this feature is not supported by most other HLS tools either~\cite{davidthomas_asap16}.
%\JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling. With scheduling I think that's true, inlining actually becomes quite good.}
@@ -199,15 +202,16 @@ As part of this translation, function inlining is also performed on all function
%\JW{I think it would be worth having a sentence to explain how the C model of memory is translated to a hardware-centric model of memory. For instance, in C we have global variables/arrays, stack-allocated variables/arrays, and heap-allocated variables/arrays (anything else?). In Verilog we have registers and RAM blocks. So what's the correspondence between the two worlds? Globals and heap-allocated are not handled, stack-allocated variables become registers, and stack-allocated arrays become RAM blocks? Am I close?}\YH{Stack allocated variables become RAM as well, so that we can deal with addresses easily and take addresses of any variable.} \JW{I see, thanks. So, in short, the only registers in your hardware designs are those that store things like the current state, etc. You generate a fixed number of registers every time you synthesis -- you don't generate extra registers to store any of the program variables. Right?}
The next translation is from 3AC to a new hardware translation language (HTL). %, which is one step towards being completely translated to hardware described in Verilog.
-This involves going from a CFG representation of the computation to a finite state machine with datapath (FSMD) representation~\cite{hwang99_fsmd}. The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers. \JP{I've become less comfortable with this term, but it's personal preference so feel free to ignore. I think `generalised finite state machine' (i.e.\ thinking of the entire `datapath' as contributing to the overall state) is more accurate.}\YH{Hmm, yes, I mainly chose FSMD because there is quite a lot of literature around it. I think for now I'll keep it but for the final draft we could maybe change it.}
+This involves going from a CFG representation of the computation to a finite state machine with datapath (FSMD) representation~\cite{hwang99_fsmd}. The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers. %\JP{I've become less comfortable with this term, but it's personal preference so feel free to ignore. I think `generalised finite state machine' (i.e.\ thinking of the entire `datapath' as contributing to the overall state) is more accurate.}\YH{Hmm, yes, I mainly chose FSMD because there is quite a lot of literature around it. I think for now I'll keep it but for the final draft we could maybe change it.}
%This means that the state transitions can be translated into a simple finite state machine (FSM) where each state contains data operations that update the memory and registers.
-Hence, an HTL program consists of two maps: control and datapath maps that express state transitions and computations respectively.
-Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The right block is the control logic that computes the next state whereas the left block updates all the registers and RAM based on the current program state.
+Hence, an HTL program consists of two maps of Verilog statements: a control map that expresses the state transitions, and a datapath map that expresses the computations.
+Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The right-hand block is the control logic that computes the next state, while the left-hand block updates all the registers and RAM based on the current program state.
\begin{figure*}
\centering
\definecolor{control}{HTML}{b3e2cd}
\definecolor{data}{HTML}{fdcdac}
+\scalebox{1.2}{%
\begin{tikzpicture}
\fill[control,fill opacity=1] (6.5,0) rectangle (12,5);
\fill[data,fill opacity=1] (0,0) rectangle (5.5,5);
@@ -288,7 +292,7 @@ Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The
\node[scale=0.4] at (3.5,3.8) {\texttt{reg\_3}};
\node[scale=0.4] at (3.5,3.6) {$\cdots$};
\node[scale=0.4] at (3.5,3.4) {\texttt{reg\_8}};
-\end{tikzpicture}
+\end{tikzpicture}}
\caption{The FSMD for the example shown in Figure~\ref{fig:accumulator_c_rtl}, split into a data path and control logic for the next state calculation. The Update block takes the current state, current values of all registers and at most one value stored in the array, and calculates a new value that can either be stored back in the array or in a register.}\label{fig:accumulator_diagram}
\end{figure*}
@@ -297,29 +301,32 @@ Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The
Typically, HLS-generated hardware consists of a sea of registers and RAM memories.
This memory view is very different to the C memory model, so we perform the following translation.
Variables that do not have their address taken are kept in registers, which correspond to the registers in 3AC.
-All address-taken variables, arrays or structs are kept in RAM.
-The stack of the main function becomes an unpacked array of 32-bit integers, which may be translated to a RAM when the hardware description is passed through a synthesis tool. A high-level overview of the architecture can be seen in Figure~\ref{fig:accumulator_diagram}.
+All address-taken variables, arrays, and structs are kept in RAM.
+The stack of the main function becomes an unpacked array of 32-bit integers, which may be translated to a RAM when the hardware description is passed through a synthesis tool.
Finally, global variables are not translated in \vericert{} at the moment.
+A high-level overview of the architecture can be seen in Figure~\ref{fig:accumulator_diagram}.
\paragraph{Translating instructions}
-Each 3AC instruction either corresponds to a hardware construct, or does not have to be handled by the translation, such as function calls, as a call instruction will never be generated due to inlining all calls.
-For example, state 15 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x8} being initialised to 1, after which the control flow moves to state 14. This initialisation is also encoded in HTL at state 15 in both the control and datapath always blocks, as shown in Figure~\ref{fig:accumulator_v}. Simple operator instructions are translated in a similar way. For example, in state 5, the value of the array element is added to the current sum value, which is simply translated to an addition of the equivalent registers in the HTL code.
+Each 3AC instruction either corresponds to a hardware construct or does not have to be handled by the translation at all; function calls, for example, never have to be handled, because all calls are inlined.
+For example, state 15 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x8} being initialised to 1, after which the control flow moves to state 14. This initialisation is also encoded in HTL at state 15 in both the control and datapath always-blocks, as shown in Figure~\ref{fig:accumulator_v}. Simple operator instructions are translated in a similar way. For example, in state 5, the value of the array element is added to the current sum value, which is simply translated to an addition of the equivalent registers in the HTL code.
+
+Note that the comparison in state 3 is signed. C and Verilog handle signedness quite differently; by default, all operators and registers in Verilog (and HTL) are unsigned, so to force an operation to handle the bits as signed, both operands have to be explicitly marked as signed. In addition, Verilog implicitly resizes expressions to the largest needed size by default, which can affect the result of the computation. This feature is not supported by the Verilog semantics we adopted, so to match the semantics to the behaviour of the simulator and synthesis tool, braces are placed around all expressions, as this inhibits implicit resizing. Instead, explicit resizing is used in the semantics, and operations can only be performed on two registers that have the same size.
-\paragraph{Key challenge: signedness} Note that the comparison in state 3 is signed. C and Verilog handle signedness quite differently, by default, all operators and registers in Verilog (and HTL) are unsigned, so to force an operation to handle the bits as signed, both operators have to be forced to be signed. In addition to that, Verilog implicitly resizes expressions to the largest needed size by default, which can affect the result of the computation. This feature is not supported by the Verilog semantics we adopted, and therefore, to match the semantics to the behaviour of the simulator and synthesis tool, braces are placed around all expressions as this hinders implicit resizing. Instead, explicit resizing is used in the semantics and operations can only be performed on two registers that
-have the same size. %\JP{I feel like this is an issue of syntax? We embed a simplified syntax in Coq which elides braces etc. There isn't really a semantic mismatch (which sounds really scary and wrong).}
+In addition, equality comparisons between \emph{unsigned} variables are not supported, because this would require supporting the comparison of pointers, which should only be performed between pointers with the same provenance. \vericert{} currently has no way to determine the provenance of a pointer, and therefore cannot model the semantics of unsigned comparison in \compcert{}. As dynamic allocation is not supported either, the comparison of pointers is rarely needed, and integers can be cast to signed integers for the translation to succeed.
\subsubsection{Translating HTL to Verilog}
-Finally, we have to translate the HTL code into proper Verilog, % and prove that it behaves the same as the 3AC according to the Verilog semantics.
-The challenge here is to translate our FSMD representation into a Verilog AST. However, as all the instructions in HTL are already expressed as Verilog statements, only the top level data path and control logic maps need to be translated to valid Verilog. To generate correct Verilog, declarations for all the variables in the program need to be added as well.
+Finally, we have to translate the HTL code into proper Verilog. % and prove that it behaves the same as the 3AC according to the Verilog semantics.
+The challenge here is to translate our FSMD representation into a Verilog AST. However, as all the instructions in HTL are already expressed as Verilog statements, only the top-level datapath and control logic maps need to be translated to valid Verilog. We also require declarations for all the variables in the program, as well as declarations of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.
+Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated for our example.
Although this translation seems quite straightforward, proving that this translation is correct is complex.
All the implicit assumptions that were made in HTL need to be translated explicitly to Verilog statements and it needs to be shown that these explicit behaviours are equivalent to the assumptions made in the HTL semantics.
We discuss these proofs in upcoming sections.
-Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated for our example. In general, the generated Verilog structure has similar to that of the HTL code.
-The key difference is that the control and datapath maps become Verilog case-statements.
-Other additions are the initialisation of all the variables in the code to the correct bitwidths and the declaration of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.
+ %In general, the generated Verilog structure has similar to that of the HTL code.
+%The key difference is that the control and datapath maps become Verilog case-statements.
+%Other additions are the initialisation of all the variables in the code to the correct bitwidths and the declaration of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.
\subsection{Optimisations}
@@ -327,7 +334,7 @@ Although we would not claim that \vericert{} is a proper `optimising' HLS compil
\subsubsection{Byte- and word-addressable memories}
-One big difference between C and Verilog is how memory is represented. Although Verilog arrays might seem to mirror their C counterparts directly, they must be treated quite differently. To reduce the design area and avoid issues meeting timing, it is beneficial if Verilog arrays can be synthesised as RAMs, but this imposes various constraints on how Verilog arrays are used. RAMs often only allow one read and one write operation per clock cycle: to make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.
+One big difference between C and Verilog is how memory is represented. Although Verilog arrays might seem to mirror their C counterparts directly, they must be treated quite differently. To reduce the design area and avoid timing issues, it is beneficial if Verilog arrays can be synthesised as RAMs, but this imposes various constraints on how Verilog arrays are used; for instance, RAMs often only allow one read and one write operation per clock cycle. To make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.
However, the memory model that \compcert{} uses for its intermediate languages is byte-address\-able~\cite{blazy05_formal_verif_memor_model_c}. It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware. Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way. Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four. If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
\subsubsection{Implementing the \texttt{Oshrximm} instruction}
@@ -345,7 +352,6 @@ Many of the \compcert{} instructions map well to hardware, but \texttt{Oshrximm}
%These small optimisations were found to be the most error prone, and guaranteeing that the new representation is equivalent to representation used in the \compcert{} semantics is difficult without proving this for all possible inputs.
One might hope that the synthesis tool consuming our generated Verilog would convert the division to an efficient shift operation, but this is unlikely to happen with signed division, which requires more than a single shift. However, signed division by a power of two can be implemented using shifts:
-
\begin{equation*}
\text{round\_towards\_zero}\left(\frac{x}{2^{y}}\right) =
\begin{cases}
diff --git a/evaluation.tex b/evaluation.tex
index ef4883e..59fc7f9 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -139,8 +139,8 @@ Cycle count is one factor in calculating execution times; the other is the clock
\end{figure}
Figure~\ref{fig:comparison_area} compares the resource utilisation of the Polybench programs generated by \vericert{} and \legup{}.
-On average, we see that \vericert{} produces hardware that is about $21\times$ larger than \legup{}. \vericert{} designs are filling up to 30\% of a (large) FPGA chip, while $\legup$ uses no more than 1\% of the chip.
-The main reason for this is mainly because RAM is not inferred automatically for the Verilog that is generated by \vericert{}; instead, large arrays of registers are synthesised instead.
+On average, we see that \vericert{} produces hardware that is about $21\times$ larger than \legup{}. \vericert{} designs are filling up to 30\% of a (large) FPGA chip, while \legup{} uses no more than 1\% of the chip.
+The main reason for this is that RAM is not inferred automatically for the Verilog that is generated by \vericert{}; instead, large arrays of registers are synthesised.
Synthesis tools such as Quartus generally require array accesses to be in a specific form in order for RAM inference to activate.
\legup{}'s Verilog generation is tailored to enable RAM inference by Quartus, while \vericert{} generates more generic array accesses. This may make \vericert{} more portable across different FPGA synthesis tools and vendors.
%For a fair comparison, we chose Quartus for these experiments because LegUp supports Quartus efficiently.
diff --git a/introduction.tex b/introduction.tex
index af148ac..9e88564 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -11,11 +11,11 @@
As latency, throughput and energy efficiency become increasingly important, custom hardware accelerators are being designed for numerous applications.
Alas, designing these accelerators can be a tedious and error-prone process using a hardware description language (HDL) such as Verilog.
-An attractive alternative is high-level synthesis (HLS), in which hardware designs are automatically compiled from software written in a language like C.
-Modern HLS tools such as \legup{}~\cite{canis11_legup}, Vivado HLS~\cite{xilinx20_vivad_high_synth}, Intel i++~\cite{intel_hls}, and Bambu HLS~\cite{bambu_hls} promise designs with comparable performance and energy-efficiency to hand-written HDL designs~\cite{homsirikamol+14, silexicahlshdl, 7818341}, while offering the convenient abstractions and rich ecosystems of software development.
+An attractive alternative is \emph{high-level synthesis} (HLS), in which hardware designs are automatically compiled from software written in a high-level language like C.
+Modern HLS tools such as \legup{}~\cite{canis11_legup}, Vivado HLS~\cite{xilinx20_vivad_high_synth}, Intel i++~\cite{intel_hls}, and Bambu HLS~\cite{bambu_hls} promise designs with comparable performance and energy-efficiency to those hand-written in HDL~\cite{homsirikamol+14, silexicahlshdl, 7818341}, while offering the convenient abstractions and rich ecosystems of software development.
But existing HLS tools cannot always guarantee that the hardware designs they produce are equivalent to the software they were given, and this undermines any reasoning conducted at the software level.
-Indeed, there are reasons to doubt that HLS tools actually \emph{do} always preserve equivalence.
+Indeed, there are reasons to doubt that HLS tools actually \emph{do} always preserve equivalence.
%Some of these reasons are general: HLS tools are large pieces of software, they perform a series of complex analyses and transformations, and the HDL output they produce has superficial syntactic similarities to a software language but a subtly different semantics.
%Hence, the premise of this work is: Can we trust these compilers to translate high-level languages like C/C++ to HDL correctly?
%Other reasons are more specific:
@@ -25,24 +25,25 @@ And more recently, \citet{du_fuzzin_high_level_synth_tools} fuzz-tested three co
\paragraph{Existing workarounds}
-Aware of the reliability shortcomings of HLS tools, hardware designers routinely check the generated hardware for functional correctness. This is commonly done by simulating the design against a large test-bench; however, the guarantees are only as good as the test-bench, meaning that if all the inputs were not covered exhaustively, there is a risk that bugs remain.
+Aware of the reliability shortcomings of HLS tools, hardware designers routinely check the generated hardware for functional correctness. This is commonly done by simulating the design against a large test-bench. But unless the test-bench covers all inputs exhaustively, which is often infeasible, there is a risk that bugs remain.
-An alternative is to use \emph{translation validation} to formally prove the input and output equivalent~\cite{pnueli98_trans}. Translation validation has been successfully applied to several HLS optimisations~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth,banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops}.
-Yet translation validation must be repeated every time the compiler is invoked, and it is an expensive task, especially for large designs.
-For example, the translation validation for Catapult C~\cite{mentor20_catap_high_level_synth, slec_whitepaper} may require several rounds of expert modifications to the input C program before validation succeeds. And even when it succeeds, translation validation does not provide watertight guarantees unless the validator itself has been mechanically proven correct, which is seldom the case.
+An alternative is to use \emph{translation validation} to prove the input and output equivalent~\cite{pnueli98_trans}. Translation validation has been successfully applied to several HLS optimisations~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth,banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops}.
+But translation validation must be repeated every time the compiler is invoked, and it is an expensive task, especially for large designs.
+For example, the translation validation for Catapult C~\cite{mentor20_catap_high_level_synth} may require several rounds of expert `adjustments'~\cite[p.~3]{slec_whitepaper} to the input C program before validation succeeds. And even when it succeeds, translation validation does not provide watertight guarantees unless the validator itself has been mechanically proven correct, which is seldom the case.
Our position is that none of the above workarounds are necessary if the HLS tool can simply be trusted to work correctly.
\paragraph{Our solution}
-We design a new HLS tool in the Coq theorem prover~\cite{bertot04_inter_theor_provin_progr_devel} and prove that it any output it produces always has the same behaviour as its input. Our tool, which is called \vericert{},\ifANONYMOUS\footnote{Tool name has been changed for blind review.}\fi{} is automatically extracted to an OCaml program from Coq, which ensures that the object of the proof is the same as the implementation of the tool. \vericert{} is built by extending the \compcert{} verified C compiler~\cite{leroy09_formal_verif_realis_compil} with a new hardware-specific intermediate language and a Verilog back end. It supports all C constructs except for function pointers, recursive function calls, non-integer types and global variables.
+We have designed a new HLS tool in the Coq theorem prover~\cite{bertot04_inter_theor_provin_progr_devel} and proved that any output it produces always has the same behaviour as its input. Our tool, called \vericert{},\ifANONYMOUS\footnote{Tool name has been changed for blind review.}\fi{} is automatically extracted to an OCaml program from Coq, which ensures that the object of the proof is the same as the implementation of the tool. \vericert{} is built by extending the \compcert{} verified C compiler~\cite{leroy09_formal_verif_realis_compil} with a new hardware-specific intermediate language and a Verilog back end. It supports all C constructs except for case statements, function pointers, recursive function calls, integers larger than 32 bits, floats, and global variables.
\paragraph{Contributions and Outline}
The contributions of this paper are as follows:
\begin{itemize}
\item We present \vericert{}, the first mechanically verified HLS tool that compiles C to Verilog. In Section~\ref{sec:design}, we describe the design of \vericert{}.
- \item We prove \vericert{} correct w.r.t. an existing semantics for Verilog due to \citet{loow19_proof_trans_veril_devel_hol}. In Section~\ref{sec:verilog}, we describe how we lightly extended this semantics to make it suitable as an HLS target. Then, in Section~\ref{sec:proof}, we describe the proof.
- \item We evaluate \vericert{} on the Polybench/C benchmark suite~\cite{polybench}, and compare the performance of our generated hardware against and the well-known but unverified \legup{} HLS tool~\cite{canis11_legup}. In Section~\ref{sec:evaluation}, we show that \vericert{}-generated Polybench hardware is \slowdownOrig$\times$ slower and \areaIncr$\times$ larger than, HLS-optimised but unverified, \legup{} hardware. We intend to bridge this performance gap in the future by introducing HLS optimisations of our own, such as scheduling and memory analysis.
+ \item We state the correctness theorem of \vericert{} with respect to an existing semantics for Verilog due to \citet{loow19_proof_trans_veril_devel_hol}. In Section~\ref{sec:verilog}, we describe how we lightly extended this semantics to make it suitable as an HLS target.
+  \item In Section~\ref{sec:proof}, we describe how we proved this theorem. The proof follows standard \compcert{} techniques -- forward simulations, intermediate specifications, and determinism results -- but we encountered several challenges peculiar to our hardware-oriented setting. These include handling discrepancies between byte- and word-addressable memories, the differing treatment of unsigned comparisons in C and Verilog, and correctly mapping \compcert{}'s memory model onto a finite Verilog array.
+ \item In Section~\ref{sec:evaluation}, we evaluate \vericert{} on the Polybench/C benchmark suite~\cite{polybench}, and compare the performance of our generated hardware against an existing, unverified HLS tool called \legup{}~\cite{canis11_legup}. We show that \vericert{} generates hardware that is \slowdownOrig$\times$ slower (\slowdownDiv$\times$ slower in the absence of division) and \areaIncr$\times$ larger than that generated by \legup{}. We intend to bridge this performance gap in the future by introducing (and verifying) HLS optimisations of our own, such as scheduling and memory analysis.
\end{itemize}
\vericert{} is fully open source and available online.
diff --git a/main.tex b/main.tex
index 06667b5..aab349a 100644
--- a/main.tex
+++ b/main.tex
@@ -158,7 +158,7 @@
\begin{abstract}
High-level synthesis (HLS), which refers to the automatic compilation of software into hardware, is rapidly gaining popularity. In a world increasingly reliant on application-specific hardware accelerators, HLS promises hardware designs of comparable performance and energy efficiency to those coded by hand in a hardware description language like Verilog, while maintaining the convenience and the rich ecosystem of software development. However, current HLS tools cannot always guarantee that the hardware designs they produce are equivalent to the software they were given, thus undermining any reasoning conducted at the software level. Worse, there is mounting evidence that existing HLS tools are quite unreliable, sometimes generating wrong hardware or crashing when given valid inputs.
- To address this problem, we present the first HLS tool that is mechanically verified to preserve the behaviour of its input software. Our tool, called \vericert{}, extends the \compcert{} verified C compiler with a new hardware-esque intermediate language and a Verilog back end, and has been proven correct in Coq. \vericert{} supports all C constructs except for function pointers, recursive function calls, non-integer types and global variables. We evaluate \vericert{} on the PolyBench/C benchmark suite. We can generate Polybench hardware that is guaranteed to be translated correctly from C to Verilog. Our generated hardware is around one order of magnitude slower and larger than hardware generated by an existing, optimising (but unverified) HLS tool.
+ To address this problem, we present the first HLS tool that is mechanically verified to preserve the behaviour of its input software. Our tool, called \vericert{}, extends the \compcert{} verified C compiler with a new hardware-oriented intermediate language and a Verilog back end, and has been proven correct in Coq. \vericert{} supports all C constructs except for case statements, function pointers, recursive function calls, integers larger than 32 bits, floats, and global variables. An evaluation on the PolyBench/C benchmark suite indicates that \vericert{} generates hardware that is around an order of magnitude slower and larger than hardware generated by an existing, optimising (but unverified) HLS tool.
\end{abstract}
%% 2012 ACM Computing Classification System (CSS) concepts
diff --git a/proof.tex b/proof.tex
index 4e4c721..8ff212b 100644
--- a/proof.tex
+++ b/proof.tex
@@ -125,8 +125,9 @@ The HTL-to-Verilog simulation is conceptually simple, as the only transformation
%The proof of the translation from maps to case-statements follows by induction over the list of elements in the map and the fact that each key will be unique. In addition to that, the statement that is currently being evaluated is guaranteed by the correctness of the list of elements to be in that list. The latter fact therefore eliminates the base case, as an empty list does not contain the element we know is in the list. The other two cases follow by the fact that either the key is equal to the evaluated value of the case expression, or it isn't. In the first case we can then evaluate the statement and get the state after the case expression, as the uniqueness of the key tells us that the key cannot show up in the list anymore. In the other case we can just apply the inductive hypothesis and remove the current case from the case statement, as it did not match.
\end{proof}
-One problem with our representation of the state as an actual register is that we have to make sure that the state does not overflow. Currently, the state register always has 32 bits, meaning the maximum number of states can only be $2^{32} - 1$. \JW{Why not $2^{32}$?}\YH{OK, mainly because states use positive numbers which start at 1, I guess I could make the mapping by subtracting 1, but currently don't do that.} % We therefore have to prove that the state value will never go over that value.
-This means that during the translation we have to check that each state can fit into an integer. \vericert{} will error out if there are more than $2^{32}$ instructions to be translated, thus satisfying the correctness theorem vacuously.
+One problem with our representation of the state as an actual register is that we have to make sure that the state does not overflow. Currently, the state register always has 32 bits, meaning the maximum number of states supported is $2^{32}$.
+%This means that during the translation we have to check that each state can fit into an integer.
+\vericert{} will error out if there are more than this many nodes in the 3AC, thus satisfying the correctness theorem vacuously.
\subsection{Deterministic Semantics}\label{sec:proof:deterministic}
@@ -142,7 +143,7 @@ The final lemma we need is that the Verilog we generate is deterministic. This r
\end{lemma}
\begin{proof}[Proof sketch]
- The Verilog semantics that are used are deterministic, as the order of operation of all the constructs is defined, and there is therefore only one way to evaluate the module and only one possible behaviour. This was proven correct for the small-step semantics shown in Figure~\ref{fig:inferrence_module}.
+  The Verilog semantics is deterministic because the order of evaluation of all the constructs is defined, and there is therefore only one way to evaluate the module and hence only one possible behaviour. This was proven for the small-step semantics shown in Figure~\ref{fig:inferrence_module}.
\end{proof}
%\subsection{Coq Mechanisation}