author    Yann Herklotz <git@yannherklotz.com>  2021-09-11 20:21:43 +0100
committer Yann Herklotz <git@yannherklotz.com>  2021-09-11 20:21:43 +0100
commit    28256833310c6fe14280c246a096f9dd45d20abd (patch)
tree      27a9f5b7c1ec4be71bd5428f9956fae9725fedf2
parent    c7dc8c7f5140166b3a3424f3db83fbe5604084cb (diff)
Fix some small grammar mistakes
-rw-r--r--  algorithm.tex    | 8
-rw-r--r--  evaluation.tex   | 2
-rw-r--r--  introduction.tex | 2
-rw-r--r--  proof.tex        | 6
4 files changed, 9 insertions, 9 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 41870a8..72a53aa 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -375,7 +375,7 @@ A high-level overview of the architecture and of the RAM interface can be seen i
Most 3AC instructions correspond to hardware constructs.
%Each 3AC instruction either corresponds to a hardware construct or does not have to be handled by the translation, such as function calls (because of inlining). \JW{Are function calls the only 3AC instruction that we ignore? (And I guess return statements too for the same reason.)}\YH{Actually, return instructions are translated (because you can return from main whenever), so call instructions (Icall, Ibuiltin and Itailcall) are the only functions that are not handled.}
% JW: Thanks; please check proposed new text.
-For example, line 2 in Fig.~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in the Verilog generated from HTL at state 8 in both the control logic and data-path always-blocks, shown at lines 33 and 16 respectively in Fig.~\ref{fig:accumulator_v}. Simple operator instructions are translated in a similar way. For example, the add instruction is just translated to the built-in add operator, similarly for the multiply operator. All 32-bit instructions can be translated in this way, but some special instructions require extra care. One such is the \texttt{Oshrximm} instruction, which is discussed further in Section~\ref{sec:algorithm:optimisation:oshrximm}. Another is the \texttt{Oshldimm} instruction, which is a left rotate instruction that has no Verilog equivalent and therefore has to be implemented in terms of other operations and proven to be equivalent.
+For example, line 2 in Fig.~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in the Verilog generated from HTL at state 8 in both the control logic and data-path always-blocks, shown at lines 33 and 16 respectively in Fig.~\ref{fig:accumulator_v}. Simple operator instructions are translated in a similar way. For example, the add instruction is just translated to the built-in add operator, and similarly for the multiply operator. All 32-bit instructions can be translated in this way, but some special instructions require extra care. One such instruction is \texttt{Oshrximm}, which is discussed further in Section~\ref{sec:algorithm:optimisation:oshrximm}. Another is \texttt{Oshldimm}, a left rotate instruction that has no Verilog equivalent and therefore has to be implemented in terms of other operations and proven to be equivalent.
% In addition to any non-32-bit operations, the remaining
The only 32-bit instructions that we do not translate are case-statements (\texttt{Ijumptable}) and those instructions related to function calls (\texttt{Icall}, \texttt{Ibuiltin}, and \texttt{Itailcall}), because we enable inlining by default.
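Since \texttt{Oshldimm} has no direct Verilog equivalent, a left rotate has to be built from the shift and or operators that Verilog does provide. A minimal sketch of such an encoding (our illustration; not necessarily the exact code \vericert{} emits):

  // Rotate x left by a constant N, expressed with existing operators.
  // Illustrative only; Vericert's actual encoding may differ.
  module rotl #(parameter N = 3) (input  wire [31:0] x,
                                  output wire [31:0] y);
    assign y = (x << N) | (x >> (32 - N));
  endmodule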
@@ -386,7 +386,7 @@ The challenge here is to translate our FSMD representation into a Verilog AST.
Fig.~\ref{fig:accumulator_v} shows the final Verilog output that is generated for our example.
Although this translation seems quite straight\-forward, proving it correct is complex.
-All the implicit assumptions that were made in HTL need to be translated explicitly to Verilog statements and it needs to be shown that these explicit behaviours are equivalent to the assumptions made in the HTL semantics. One main example of this is proving that specification of the RAM in HTL does indeed behave the same as its Verilog implementation.
+All the implicit assumptions that were made in HTL need to be translated explicitly to Verilog statements, and it needs to be shown that these explicit behaviours are equivalent to the assumptions made in the HTL semantics. One main example of this is proving that the specification of the RAM in HTL does indeed behave in the same way as its Verilog implementation.
We discuss these proofs in upcoming sections.
%In general, the generated Verilog structure has similar to that of the HTL code.
@@ -403,7 +403,7 @@ One big difference between C and Verilog is how memory is represented. Although
However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}. If a byte-addressable memory was used in the target hardware, which is closer to \compcert{}'s memory model, then a load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle. It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware. Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way. Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores will be multiples of four. Translating from byte-addressed memory to word-addressed memory can then be done by dividing the address by four.
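Given that supported loads and stores are word-aligned, the byte-to-word address translation described above amounts to dropping the two least-significant address bits. A minimal sketch (our illustration):

  // Map a byte address from CompCert's memory model onto a word address
  // for the hardware RAM; division by four is a two-bit right shift.
  module addr_map (input  wire [31:0] byte_addr,
                   output wire [31:0] word_addr);
    assign word_addr = byte_addr >> 2;  // equivalently {2'b0, byte_addr[31:2]}
  endmodule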
\subsubsection{Implementation of RAM Interface}\label{sec:algorithm:optimisation:ram}
-The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Fig.~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Fig.~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.\footnote{Interestingly, the Verilog code shown for the RAM interface must not be modified, because the synthesis tool will only generate a RAM when the code matches a small set of specific patterns.}
+The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Fig.~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Fig.~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.\footnote{Interestingly, the Verilog code shown for the RAM interface must not be modified, because the synthesis tool will only generate a RAM when the code matches a small set of specific patterns.}
%\JW{I tweaked this slightly in an attempt to clarify; please check.} %\NR{Bring forward this sentence to help with flow.}
%\JW{I think the following sentence could be cut as we've said this kind of thing a couple of times already.} Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.
@@ -417,7 +417,7 @@ There are two interesting parts to the inserted RAM interface. Firstly, the mem
% JW: thanks!
Using the negative edge of the clock is widely supported by synthesis tools, and does not affect the maximum frequency of the final design.
-Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical in hardware designs. Enable signals are normally manually controlled and inserted into the appropriate states, by using a check like the following in the RAM:\@ \texttt{en == 1}. This means that the RAM only turns on when the enable signal is set. However, to make the proof simpler and to not have to reason about possible side effects introduced by the RAM being enabled but not used, a RAM which disables itself after every use would be ideal. One method for implementing this would be to insert an extra state after each load or store that disables the RAM, but this extra state would eliminate the speed advantage of the negative-edge-triggered RAM. Another method would be to determine the next state after each load or store and disable the RAM in that state, but this could quickly become complicated, especially in the case where the next state also contains a memory operation, and hence the disable signal should not be added. The method we ultimately chose was to have the RAM become enabled not when the enable signal is high, but when it \emph{toggles} its value. This can be arranged by keeping track of the old value of the enable signal in \texttt{en} and comparing it to the current value \texttt{u\_en} set by the data-path. When the values are different, the RAM gets enabled, and then \texttt{en} is set to the value of \texttt{u\_en}. This ensures that the RAM will always be disabled straight after it was used, without having to insert or modify any other states.
+Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical in hardware designs. Enable signals are normally manually controlled and inserted into the appropriate states by using a check such as \texttt{en == 1} in the RAM. This means that the RAM only turns on when the enable signal is set. However, to make the proof simpler and avoid reasoning about possible side effects introduced by the RAM being enabled but not used, a RAM which disables itself after every use would be ideal. One method for implementing this would be to insert an extra state after each load or store that disables the RAM, but this extra state would eliminate the speed advantage of the negative-edge-triggered RAM. Another method would be to determine the next state after each load or store and disable the RAM in that state, but this could quickly become complicated, especially in the case where the next state also contains a memory operation, and hence the disable signal should not be added. The method we ultimately chose was to have the RAM become enabled not when the enable signal is high, but when it \emph{toggles} its value. This can be arranged by keeping track of the old value of the enable signal in \texttt{en} and comparing it to the current value \texttt{u\_en} set by the data-path. When the values are different, the RAM gets enabled, and then \texttt{en} is set to the value of \texttt{u\_en}. This ensures that the RAM will always be disabled straight after it was used, without having to insert or modify any other states.
%We can instead generate a second enable signal that is set by the user, and the original enable signal is then updated by the RAM to be equal to the value that the user set. This means that the RAM should be enabled whenever the two signals are different, and disabled otherwise.
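The toggle-based enable described above can be sketched as follows (signal names follow the paper where they appear; widths and the write port are our assumptions, and, per the footnote, the real interface must match the synthesis tool's RAM patterns exactly):

  // Sketch of the negative-edge RAM interface with a toggled enable.
  // The data-path flips u_en to request an access; the interface fires
  // once on the next falling edge, then copies u_en into en so that it
  // disables itself until the next toggle.
  module ram_iface (input  wire        clk,
                    input  wire        u_en,    // toggled by the data-path
                    input  wire        wr_en,   // 1 = store, 0 = load
                    input  wire [7:0]  addr,
                    input  wire [31:0] d_in,
                    output reg  [31:0] d_out);
    reg        en;                // last acknowledged value of u_en
    reg [31:0] ram [0:255];       // size chosen arbitrarily for illustration
    always @(negedge clk)
      if (en != u_en) begin       // a toggle means a pending request
        if (wr_en) ram[addr] <= d_in;       // store
        else       d_out    <= ram[addr];   // load
        en <= u_en;               // acknowledge: self-disable
      end
  endmodule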
diff --git a/evaluation.tex b/evaluation.tex
index 0147bdc..bd10ae3 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -155,7 +155,7 @@ We configured \polybench{}'s parameters so that only integer types are used. We
Firstly, before comparing any performance metrics, it is worth highlighting that any Verilog produced by \vericert{} is guaranteed to be \emph{correct}, whilst no such guarantee can be provided by \legup{}.
This guarantee in itself represents a significant leap in HLS reliability compared to any other available HLS tool.
-The top graphs of Fig.~\ref{fig:polybench-div} and Fig.~\ref{fig:polybench-nodiv} compare the execution time of the 27 programs executed by \vericert{} and the different optimisation levels of \legup{}. Each graph uses optimised \legup{} as the baseline. When division/modulo operations are present \legup{} designs execute around 27$\times$ faster than \vericert{} designs. However, when division/modulo operations are replaced by the iterative algorithm, \legup{} designs are only 2$\times$ faster than \vericert{} designs. However, the benchmarks with division/modulo replaced show that \vericert{} actually achieves the same execution speed as \legup{} without LLVM optimisations and without operation chaining, which is encouraging, and shows that the hardware generation is following the right steps. The execution time is calculated by multiplying the maximum frequency that the FPGA can run at with this design, by the number of clock cycles that are needed to complete the execution. We can therefore analyse each separately.
+The top graphs of Fig.~\ref{fig:polybench-div} and Fig.~\ref{fig:polybench-nodiv} compare the execution time of the 27 programs executed by \vericert{} and the different optimisation levels of \legup{}. Each graph uses optimised \legup{} as the baseline. When division/modulo operations are present, \legup{} designs execute around 27$\times$ faster than \vericert{} designs. However, when division/modulo operations are replaced by the iterative algorithm, \legup{} designs are only 2$\times$ faster than \vericert{} designs. The benchmarks with division/modulo replaced show that \vericert{} actually achieves the same execution speed as \legup{} without LLVM optimisations and without operation chaining, which is encouraging, and shows that the hardware generation is following the right steps. The execution time is calculated by dividing the number of clock cycles needed to complete the execution by the maximum clock frequency at which the design can run on the FPGA. We can therefore analyse each factor separately.
First, looking at the difference in clock cycles, \vericert{} produces designs that have around 4.5$\times$ as many clock cycles as \legup{} designs in both cases, when division/modulo operations are enabled as well as when they are replaced. This performance gap can be explained in part by LLVM optimisations, which seem to account for a 2$\times$ decrease in clock cycles, as well as operation chaining, which decreases the clock cycles by another 2$\times$. The rest of the speed-up is mostly due to \legup{} optimisations such as scheduling and memory analysis, which are designed to extract parallelism from input programs.
This gap does not represent the performance cost that comes with formally proving an HLS tool.
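Spelling out the execution-time calculation (a worked illustration with made-up numbers, not benchmark data):

  % execution time = cycle count / achievable clock frequency
  \[ t_{\mathrm{exec}} = \frac{n_{\mathrm{cycles}}}{f_{\mathrm{max}}}, \qquad
     \text{e.g.}\quad \frac{10^{6}~\text{cycles}}{100~\mathrm{MHz}} = 10~\mathrm{ms}. \]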
diff --git a/introduction.tex b/introduction.tex
index cafdd07..45bb1a6 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -43,7 +43,7 @@ We have designed a new HLS tool in the Coq theorem prover and proved that any ou
The contributions of this paper are as follows:
\begin{itemize}
- \item We present \vericert{}, the first mechanically verified HLS tool that compiles C to Verilog. In Section~\ref{sec:design}, we describe the design of \vericert{}, including a few optimisations related to memory accesses and division.
+ \item We present \vericert{}, the first mechanically verified HLS tool that compiles C to Verilog. In Section~\ref{sec:design}, we describe the design of \vericert{}, including certain optimisations related to memory accesses and division.
\item We state the correctness theorem of \vericert{} with respect to an existing semantics for Verilog due to \citet{loow19_proof_trans_veril_devel_hol}. In Section~\ref{sec:verilog}, we describe how we extended this semantics to make it suitable as an HLS target. We also describe how the Verilog semantics is integrated into CompCert's language execution model and its framework for performing simulation proofs. A mapping of CompCert's infinite memory model onto a finite Verilog array is also described.
\item In Section~\ref{sec:proof}, we describe how we proved the correctness theorem. The proof follows standard \compcert{} techniques -- forward simulations, intermediate specifications, and determinism results -- but we encountered several challenges peculiar to our hardware-oriented setting. %\NR{`specific' is better than `peculiar'?} %JW: I think this is a nice use of peculiar. Note that it means `distinctive' here, not `weird' -- the third meaning at https://www.dictionary.com/browse/peculiar
These include handling discrepancies between the byte-addressed memory assumed by the input software and the word-addressed memory that we implement in the output hardware, different handling of unsigned comparisons between C and Verilog, and carefully implementing memory reads and writes so that these behave properly as a RAM in hardware.
diff --git a/proof.tex b/proof.tex
index 944649e..2f78380 100644
--- a/proof.tex
+++ b/proof.tex
@@ -33,7 +33,7 @@ The main correctness theorem is analogous to that stated in \compcert{}~\cite{le
\end{equation*}
\end{theorem}
-Why is this correctness theorem also the right one for HLS? It could be argued that hardware inherently runs forever and therefore does not produce a definitive final result. This would mean that the \compcert{} correctness theorem would likely not help with proving that the hardware is actually working correctly, as the behaviour would always be divergent. However, in practice, HLS does not normally produce the top-level of the design that needs to connect to other components, therefore needing to run forever. Rather, HLS often produces smaller components that take an input, execute, and then terminate with an answer. To start the execution of the hardware and to signal to the HLS component that the inputs are ready, the $\mathit{rst}$ signal is set and unset. Then, once the result is ready, the $\mathit{fin}$ signal is set and the result value is placed in $\mathit{ret}$. These signals are also present in the semantics of execution shown in Fig.~\ref{fig:inference_module}. The correctness theorem therefore also uses these signals, and the proof shows that once the $\mathit{fin}$ flag is set, the value in $\mathit{ret}$ is correct according to the semantics of Verilog and Clight. Note that the compiler is allowed to fail and not produce any output; the correctness theorem only applies when the translation succeeds.
+Why is this correctness theorem also the right one for HLS? It could be argued that hardware inherently runs forever and therefore does not produce a definitive final result. This would mean that the \compcert{} correctness theorem would probably be unhelpful for proving hardware correctness, as the behaviour would always be divergent. However, in practice, HLS does not normally produce the top level of a design, which would need to connect to other components and therefore run forever. Rather, HLS often produces smaller components that take an input, execute, and then terminate with an answer. To start the execution of the hardware and to signal to the HLS component that the inputs are ready, the $\mathit{rst}$ signal is set and unset. Then, once the result is ready, the $\mathit{fin}$ signal is set and the result value is placed in $\mathit{ret}$. These signals are also present in the semantics of execution shown in Fig.~\ref{fig:inference_module}. The correctness theorem therefore also uses these signals, and the proof shows that once the $\mathit{fin}$ flag is set, the value in $\mathit{ret}$ is correct according to the semantics of Verilog and Clight. Note that the compiler is allowed to fail and not produce any output; the correctness theorem only applies when the translation succeeds.
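The handshake described above can be sketched from the point of view of a surrounding test harness (a minimal sketch; the port names of the generated module are our assumption):

  // Driving the rst/fin/ret protocol of a generated design.
  module tb;
    reg clk = 0, rst = 0;
    wire fin;
    wire [31:0] ret;
    always #5 clk = ~clk;                      // free-running clock
    main dut (.clk(clk), .rst(rst), .fin(fin), .ret(ret));  // hypothetical ports
    initial begin
      rst = 1; @(posedge clk);                 // set rst: inputs are ready
      rst = 0;                                 // unset rst: execution starts
      wait (fin);                              // fin is raised when done
      $display("ret = %0d", ret);              // ret holds the final value
      $finish;
    end
  endmodule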
How can we prove this theorem? First, note that the theorem is a `backwards simulation' result (every target behaviour must also be a source behaviour), following the terminology used in the \compcert{} literature~\cite{leroy09_formal_verif_realis_compil}. The reverse direction (every source behaviour must also be a target behaviour) is not demanded because compilers are permitted to resolve any non-determinism present in their source programs. However, since Clight programs are all deterministic, as are the Verilog programs in the fragment we consider, we can actually reformulate the correctness theorem above as a forwards simulation result (following standard \compcert{} practice), which makes it easier to prove.
@@ -42,7 +42,7 @@ The forward simulation from 3AC to HTL is stated in Lemma~\ref{lemma:htl} (Secti
\subsection{Forward Simulation from 3AC to HTL}\label{sec:proof:3ac_htl}
-As HTL is quite far removed from 3AC, this first translation is the most involved and therefore requires a larger proof, because the translation from 3AC instructions to Verilog statements needs to be proven correct in this step. In addition to that, the semantics of HTL are also quite different to the 3AC semantics, as instead of defining small-step semantics for each construct in Verilog, the semantics are instead defined over one clock cycle and mirror the semantics defined for Verilog. Lemma~\ref{lemma:htl} shows the result that needs to be proven in this subsection.
+As HTL is quite far removed from 3AC, this first translation is the most involved and therefore requires a larger proof, because the translation from 3AC instructions to Verilog statements needs to be proven correct in this step. In addition to that, the semantics of HTL are also quite different to the 3AC semantics. Instead of defining small-step semantics for each construct in Verilog, the semantics are defined over one clock cycle and mirror the semantics defined for Verilog. Lemma~\ref{lemma:htl} shows the result that needs to be proven in this subsection.
\begin{lemma}[Forward simulation from 3AC to HTL]\label{lemma:htl}
We write \texttt{tr\_htl} for the translation from 3AC to HTL.
@@ -237,7 +237,7 @@ The final lemma we need is that the Verilog semantics is deterministic. This res
\end{tabular}
\end{table}
-The lines of code for the implementation and proof of \vericert{} can be found in Table~\ref{tab:proof_statistics}. Overall, it took about 1.5 person-years to build \vericert{} -- about three person-months on implementation and 15 person-months on proofs. The largest proof is the correctness proof for the HTL generation, which required equivalence proofs between all integer operations supported by \compcert{} and those supported in hardware. From the 3069 lines of proof code in the HTL generation, 1189 are for the correctness proof of just the load and store instructions. These were tedious to prove correct because of the substantial difference between the memory models used, and the need to prove properties such as stores outside of the allocated memory being undefined, so that a finite array could be used. In addition to that, since pointers in HTL and Verilog are represented as integers, whereas there is a separate `pointer' value in the \compcert{} semantics, it was painful to reason about them and many new theorems had to be proven about integers and pointers in \vericert{}. Moreover, the second-largest proof of the correct RAM generation includes many proofs about the extensional equality of array operations, such as merging arrays with different assignments. As the negative edge implies two merges take place every clock cycle, the proofs about the equality of the contents in memory and in the merged arrays become more tedious as well.
+The lines of code for the implementation and proof of \vericert{} can be found in Table~\ref{tab:proof_statistics}. Overall, it took about 1.5 person-years to build \vericert{} -- about three person-months on implementation and 15 person-months on proofs. The largest proof is the correctness proof for the HTL generation, which required equivalence proofs between all integer operations supported by \compcert{} and those supported in hardware. Of the 3069 lines of proof code in the HTL generation, 1189 are for the correctness proof of just the load and store instructions. These were tedious to prove correct because of the substantial difference between the memory models used, and the need to prove properties such as stores outside of the allocated memory being undefined, so that a finite array could be used. In addition, since pointers in HTL and Verilog are represented as integers, instead of as a separate `pointer' type as in the \compcert{} semantics, it was painful to reason about them, and many new theorems had to be proven in \vericert{} to show that the conversion from pointers to integers is correct. Moreover, the second-largest proof, of the correct RAM generation, includes many proofs about the extensional equality of array operations, such as merging arrays with different assignments. As the negative edge implies that two merges take place every clock cycle, the proofs about the equality of the contents in memory and in the merged arrays become more tedious as well.
Looking at the trusted computing base of \vericert{}, the Verilog semantics comprises 431 lines of code. This semantics, together with the Clight semantics from \compcert{}, is the only part of the compiler that needs to be trusted. Even with the Clight semantics included, this specification is much smaller than the 1721 lines of Coq implementation, which form the verified part of the HLS tool. Moreover, understanding the semantics specification is simpler than trying to understand the translation algorithm, meaning the trusted base has been successfully reduced.