summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
-rw-r--r--algorithm.tex4
-rw-r--r--introduction.tex2
-rw-r--r--main.tex2
-rw-r--r--proof.tex2
-rw-r--r--verilog.tex2
5 files changed, 6 insertions, 6 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 4d07424..ba44613 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -162,7 +162,7 @@ In this section, we describe the stages of the \vericert{} translation, referrin
\subsubsection{Translating C to 3AC}
The first stage of the translation uses unmodified \compcert{} to transform the C input, shown in Figure~\ref{fig:accumulator_c}, into a 3AC intermediate representation, shown in Figure~\ref{fig:accumulator_rtl}.
-As part of this translation, function inlining is performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction. Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency\YH{TODO: Add sentence on why inlining is often the better choice}. Inlining precludes support for recursive function calls, but this feature isn't supported in most other HLS tools either~\cite{davidthomas_asap16}.
+As part of this translation, function inlining is performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction. Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency\YH{TODO: Add sentence on why inlining is often the better choice}. Inlining precludes support for recursive function calls, but this feature is not supported in most other HLS tools either~\cite{davidthomas_asap16}.
%\JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling. With scheduling I think that's true, inlining actually becomes quite good.}
@@ -332,7 +332,7 @@ One big difference between C and Verilog is how memory is represented. Although
However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}. If a byte-addressable memory was used in the target hardware, which is closer to \compcert{}'s memory model, then a load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle. It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware. Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way. Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four. If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
\subsubsection{Implementation of RAM templates}
-The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path doesn't access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM driver (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate driver like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.
+The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM driver (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate driver like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.
Therefore, an extra compiler pass is added from HTL to HTL to extracts all the direct accesses to the Verilog array implementing memory and replaces them by signals which access the memory in a separate always-block. This ensures that the synthesis tool correctly identifies the array as being a RAM, so that it is not implemented by logic directly. The translation is performed by going through all the instructions and replacing each load and store expression one after another. Stores can simply be replaced by the necessary wires directly, however, loads using the RAM block take two clock cycles instead of a direct load from an array which only takes one clock cycles. This pass therefore creates a extra state which is inserted after each load.
diff --git a/introduction.tex b/introduction.tex
index 146b33c..c96e795 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -12,7 +12,7 @@
As latency, throughput and energy efficiency become increasingly important, custom hardware accelerators are being designed for numerous applications.
Alas, designing these accelerators can be a tedious and error-prone process using a hardware description language (HDL) such as Verilog.
An attractive alternative is \emph{high-level synthesis} (HLS), in which hardware designs are automatically compiled from software written in a high-level language like C.
-Modern HLS tools such as \legup{}~\cite{canis11_legup}, Vivado HLS~\cite{xilinx20_vivad_high_synth}, Intel i++~\cite{intel_hls}, and Bambu HLS~\cite{bambu_hls} promise designs with comparable performance and energy-efficiency to those hand-written in HDL~\cite{homsirikamol+14, silexicahlshdl, 7818341}, while offering the convenient abstractions and rich ecosystems of software development.
+Modern HLS tools such as \legup{}~\cite{canis11_legup}, Vivado HLS~\cite{xilinx20_vivad_high_synth}, Intel i++~\cite{intel_hls}, and Bambu HLS~\cite{bambu_hls} promise designs with comparable performance and energy-efficiency to those hand-written in an HDL~\cite{homsirikamol+14, silexicahlshdl, 7818341}, while offering the convenient abstractions and rich ecosystems of software development.
But existing HLS tools cannot always guarantee that the hardware designs they produce are equivalent to the software they were given, and this undermines any reasoning conducted at the software level.
Indeed, there are reasons to doubt that HLS tools actually \emph{do} always preserve equivalence.
diff --git a/main.tex b/main.tex
index eb4f0c6..d7583fc 100644
--- a/main.tex
+++ b/main.tex
@@ -141,7 +141,7 @@
\begin{abstract}
High-level synthesis (HLS), which refers to the automatic compilation of software into hardware, is rapidly gaining popularity. In a world increasingly reliant on application-specific hardware accelerators, HLS promises hardware designs of comparable performance and energy efficiency to those coded by hand in a hardware description language such as Verilog, while maintaining the convenience and the rich ecosystem of software development. However, current HLS tools cannot always guarantee that the hardware designs they produce are equivalent to the software they were given, thus undermining any reasoning conducted at the software level. Worse, there is mounting evidence that existing HLS tools are quite unreliable, sometimes generating wrong hardware or crashing when given valid inputs.
- To address this problem, we present the first HLS tool that is mechanically verified to preserve the behaviour of its input software. Our tool, called \vericert{}, extends the \compcert{} verified C compiler with a new hardware-oriented intermediate language and a Verilog back end, and has been proven correct in Coq. \vericert{} supports most C constructs, including all integer operations, function calls, local arrays, structs and unions and general control-flow statements. An evaluation on the PolyBench/C benchmark suite indicates that \vericert{} generates hardware that is around an order of magnitude slower (only around 2$\times$ slower in the absence of division) and about the same size as hardware generated by an existing, optimising (but unverified) HLS tool.
+ To address this problem, we present the first HLS tool that is mechanically verified to preserve the behaviour of its input software. Our tool, called \vericert{}, extends the \compcert{} verified C compiler with a new hardware-oriented intermediate language and a Verilog back end, and has been proven correct in Coq. \vericert{} supports most C constructs, including all integer operations, function calls, local arrays, structs, unions and general control-flow statements. An evaluation on the PolyBench/C benchmark suite indicates that \vericert{} generates hardware that is around an order of magnitude slower (only around 2$\times$ slower in the absence of division) and about the same size as hardware generated by an existing, optimising (but unverified) HLS tool.
\end{abstract}
%% 2012 ACM Computing Classification System (CSS) concepts
diff --git a/proof.tex b/proof.tex
index 6782262..66dc5f0 100644
--- a/proof.tex
+++ b/proof.tex
@@ -216,7 +216,7 @@ The final lemma we need is that the Verilog we generate is deterministic. This r
The lines of code for the implementation and proof of \vericert{} can be found in Table~\ref{tab:proof_statistics}. Overall, it took about 1.5 person-years to build \vericert{} -- about three person-months on implementation and 15 person-months on proofs. The largest proof is the correctness proof for the HTL generation, which required equivalence proofs between all integer operations supported by \compcert{} and those supported in hardware. From the 3069 lines of proof code in the HTL generation, 1189 are for the correctness proof of just the load and store instructions. These were tedious to prove correct because of the substantial difference between the memory models used, and the need to prove properties such as stores outside of the allocated memory being undefined, so that a finite array could be used. In addition to that, since pointers in HTL and Verilog are represented as integers, whereas there is a separate pointer value in the \compcert{} semantics, it was painful to reason about them and many new theorems had to be proven about integers and pointers in \vericert{}. In addition to that, the second largest proof of the correct RAM generation includes many proofs about the extensional equality of array operations, such as merging arrays with different assignments. As the negative edge implies two merges take place every clock cycle, the proofs about the equality of the arrays becomes more tedious as well.
-Looking at the trusted base of \vericert{}, the Verilog semantics are 431 lines of code. This, together with the Clight semantics from \compcert{}, are the only parts of the compiler that need to be trusted. Compared to the 1721 lines of the implementation that are written in Coq, which are the verified parts of the synthesis tool, this is larger than the 431 lines of Verilog semantics specification, even if the Clight semantics are added. In addition to that, reading semantics specifications is simpler than trying to understand algorithms, meaning the trusted base has been successfully reduced.
+Looking at the trusted base of \vericert{}, the Verilog semantics are 431 lines of code. This, together with the Clight semantics from \compcert{}, are the only parts of the compiler that need to be trusted. Compared to the 1721 lines of the implementation that are written in Coq, which are the verified parts of the HLS tool, this is larger than the 431 lines of Verilog semantics specification, even if the Clight semantics are added. In addition to that, reading semantics specifications is simpler than trying to understand algorithms, meaning the trusted base has been successfully reduced.
%\JW{Can we include a comment about the size of the trusted base, in case we get that reviewer again?}
diff --git a/verilog.tex b/verilog.tex
index 368fd8d..8f03de3 100644
--- a/verilog.tex
+++ b/verilog.tex
@@ -62,7 +62,7 @@ The main execution of the module $\downarrow_{\text{module}}$ is split into $\do
\paragraph{Adding declarations} Explicit support for declaring inputs, outputs and internal variables was added to the semantics to make sure that the generated Verilog also contains the correct declarations. This adds some guarantees to the generated Verilog and ensures that it synthesises and simulates correctly.
-\paragraph{Removing support for external inputs to modules} Support for receiving external inputs was removed from the semantics for simplicity, as these are not needed for an HLS target. The main module in Verilog models the main function in C, and since the inputs to a C function shouldn't change during its execution, there is no need for external inputs for Verilog modules.
+\paragraph{Removing support for external inputs to modules} Support for receiving external inputs was removed from the semantics for simplicity, as these are not needed for an HLS target. The main module in Verilog models the main function in C, and since the inputs to a C function should not change during its execution, there is no need for external inputs for Verilog modules.
\paragraph{Simplifying representation of bitvectors} Finally, we use 32-bit integers to represent bitvectors rather than arrays of Booleans. This is because \vericert{} (currently) only supports types represented by 32 bits.