summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorYann Herklotz <git@yannherklotz.com>2021-08-05 11:12:51 +0200
committerYann Herklotz <git@yannherklotz.com>2021-08-05 11:12:51 +0200
commitcae2909811563106d5469c2825f7e001be2b1325 (patch)
treee7e5a4e682661f6b6137ac13719c051debe555bb
parent915879bab3498b817f26a5389898330c12b98417 (diff)
downloadoopsla21_fvhls-cae2909811563106d5469c2825f7e001be2b1325.tar.gz
oopsla21_fvhls-cae2909811563106d5469c2825f7e001be2b1325.zip
Address most notes
-rw-r--r--algorithm.tex7
-rw-r--r--proof.tex12
-rw-r--r--verilog.tex12
3 files changed, 17 insertions, 14 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 737031b..34358cf 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -1,5 +1,4 @@
-\section{Designing a verified HLS tool}
-\label{sec:design}
+\section{Designing a verified HLS tool}\label{sec:design}
This section describes the main architecture of the HLS tool, and the way in which the Verilog back end was added to \compcert{}. This section also covers an example of converting a simple C program into hardware, expressed in the Verilog language.
@@ -397,7 +396,7 @@ Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is al
\caption{Timing diagrams showing the execution of loads and stores over multiple clock cycles.}\label{fig:ram_load_store}
\end{figure}
-\JW{The following paragraph could probably be cut, as the same explanation is already in the Figure 4 caption, and replaced with something like ``Figure~\ref{fig:ram_load_store} gives an example of how the RAM interface behaves when values are loaded and stored.''} Figure~\ref{fig:ram_load} shows an example of how the waveforms in the RAM in Figure~\ref{fig:accumulator_v} behave when a value is loaded. To initiate a load, the data-path enable signal \texttt{u\_en} flag is toggled, the address \texttt{addr} is set and the write enable \texttt{wr\_en} is set to low. This all happens at the positive edge of the clock, at time slice 1. Then, on the next negative edge of the clock, at time slice 2, the \texttt{u\_en} is now different from the RAM enable \texttt{en}, so the RAM is enabled. A load is then performed by assigning the \texttt{d\_out} register to the value stored at the address in the RAM and the \texttt{en} is set to the same value as \texttt{u\_en} to disable the RAM again. Finally, on the next positive edge of the clock, the value in \texttt{d\_out} is assigned to the destination register \texttt{r}. An example of a store is shown in Figure~\ref{fig:ram_store}. The \texttt{d\_in} register is assigned the value to be stored. The store is then performed on the negative edge of the clock and is therefore complete by the next positive edge.
+\JW{The following paragraph could probably be cut, as the same explanation is already in the Figure 4 caption, and replaced with something like ``Figure~\ref{fig:ram_load_store} gives an example of how the RAM interface behaves when values are loaded and stored.''}\YH{Ah ok, yes sure, I just had it there to explain the figure in case some readers are unfamiliar with timing diagrams, but it's true that' it's already in the caption.} Figure~\ref{fig:ram_load} shows an example of how the waveforms in the RAM in Figure~\ref{fig:accumulator_v} behave when a value is loaded. To initiate a load, the data-path enable signal \texttt{u\_en} flag is toggled, the address \texttt{addr} is set and the write enable \texttt{wr\_en} is set to low. This all happens at the positive edge of the clock, at time slice 1. Then, on the next negative edge of the clock, at time slice 2, the \texttt{u\_en} is now different from the RAM enable \texttt{en}, so the RAM is enabled. A load is then performed by assigning the \texttt{d\_out} register to the value stored at the address in the RAM and the \texttt{en} is set to the same value as \texttt{u\_en} to disable the RAM again. Finally, on the next positive edge of the clock, the value in \texttt{d\_out} is assigned to the destination register \texttt{r}. An example of a store is shown in Figure~\ref{fig:ram_store}. The \texttt{d\_in} register is assigned the value to be stored. The store is then performed on the negative edge of the clock and is therefore complete by the next positive edge.
\subsubsection{Implementing the \texttt{Oshrximm} instruction}\label{sec:algorithm:optimisation:oshrximm}
@@ -424,7 +423,7 @@ One might hope that the synthesis tool consuming our generated Verilog would con
where $\gg$ stands for a logical right shift. %Once this equivalence about the shifts and division operator is proven correct, it can be used to implement the \texttt{Oshrximm} using the efficient shift version instead of how the \compcert{} semantics described it.
When proving this equivalence, we actually found a bug in our original implementation that was due to the fact that a na\"{i}ve shift rounds towards $-\infty$.
-\JW{I don't really understand the following paragraph.} \compcert{} eventually performs a translation from this representation into assembly code which uses shifts to implement the division, however, the specification of the instruction itself still uses division instead of shifts, meaning the proof of the translation cannot be reused. In \vericert{}, the equivalence of the representation in terms of divisions and shifts is proven over the integers and the specification, thereby making it simpler to prove the correctness of the Verilog implementation in terms of shifts.
+\JW{I don't really understand the following paragraph.}\YH{So my intention with this was to make this section more meaningful, as one of the reviewers mentioned that because compcert already did this, this section is not needed. But I wanted to explain that because we do the reasoning of equivalence between shifts and division at the Integer level, our proof is more general than the language specific proofs that \compcert{} has in it's back ends, as we fix the specification instead of the implementation. I'll try and reword this.} \compcert{} eventually performs a translation from this representation into assembly code which uses shifts to implement the division, however, the specification of the instruction itself still uses division instead of shifts, meaning the proof of the translation cannot be reused. In \vericert{}, the equivalence of the representation in terms of divisions and shifts is proven over the integers and the specification, thereby making it simpler to prove the correctness of the Verilog implementation in terms of shifts.
%The \compcert{} semantics for the \texttt{Oshrximm} instruction expresses its operation exactly as shown in the equation above, even though in hardware the computation that would be performed would be different. In \vericert{}, if the same operation would be implemented using Verilog operators, it is not guaranteed to be optimised correctly by the synthesis tools that convert the Verilog into a circuit. To guarantee an output that does not include divides, we therefore have to express it in Verilog using shifts, and then prove that this representation is equivalent to the divide representation used in the \compcert{} semantics. While conducting the proof, we discovered quite a few bugs in our initial implementation of optimisations, which rounded to $-\infty$ instead of 0.
diff --git a/proof.tex b/proof.tex
index b0962de..e3546ab 100644
--- a/proof.tex
+++ b/proof.tex
@@ -7,16 +7,18 @@ Now that the Verilog semantics have been adapted to the CompCert model, we are i
The proof of correctness of the Verilog back end is quite different from the usual proofs performed in CompCert, mainly because of the difference in Verilog semantics compared to the standard CompCert intermediate languages and because of the translation of the memory model.
\begin{itemize}
-\item First, because the memory model in our Verilog semantics is finite and concrete, but the CompCert memory model is more abstract and infinite with additional bounds, the equivalence of both these models needs to be proven. Moreover, our memory is word-addressed for efficiency reasons, whereas CompCert's memory is byte-addressed. \JW{This point has been made a couple of times already by now. I think it's ok to say it again briefly here, but I'd be tempted to acknowledge that it's repetitive by prepending it with something like ``As already mentioned in Section blah,'' }
+\item As already mentioned in Section~\ref{sec:verilog:memory}, because the memory model in our Verilog semantics is finite and concrete, but the CompCert memory model is more abstract and infinite with additional bounds, the equivalence of both these models needs to be proven. Moreover, our memory is word-addressed for efficiency reasons, whereas CompCert's memory is byte-addressed.
+%\JW{This point has been made a couple of times already by now. I think it's ok to say it again briefly here, but I'd be tempted to acknowledge that it's repetitive by prepending it with something like ``As already mentioned in Section blah,'' }
\item Second, the Verilog semantics operates quite differently to the usual intermediate languages in CompCert. All the CompCert intermediate languages use a map from control-flow nodes to instructions. An instruction can therefore be selected using an abstract program pointer. Meanwhile, in the Verilog semantics the whole design is executed at every clock cycle, because hardware is inherently parallel. The program pointer is part of the design as well, not just part of an abstract state. This makes the semantics of Verilog simpler, but comparing it to the semantics of 3AC becomes more challenging, as one has to map the abstract notion of the state to concrete values in registers.
\end{itemize}
-Together, these differences mean that translating 3AC directly to Verilog is infeasible, as the differences in the semantics are too large. Instead, a new intermediate language needs to be introduced, called HTL, which bridges the gap in the semantics between the two languages. HTL still consists of maps, like many of the other CompCert languages, however, each state corresponds to a Verilog statement. \JW{This is good text, but the problem is that it reads like you're introducing HTL here for the first time. In fact, the reader has already encountered HTL in Section 2. So this needs acknowledging.}
+Together, these differences mean that translating 3AC directly to Verilog is infeasible, as the differences in the semantics are too large. Instead, HTL, which was introduced in Section~\ref{sec:design}, bridges the gap in the semantics between the two languages. HTL still consists of maps, like many of the other CompCert languages, however, each state corresponds to a Verilog statement.
+%\JW{This is good text, but the problem is that it reads like you're introducing HTL here for the first time. In fact, the reader has already encountered HTL in Section 2. So this needs acknowledging.}\YH{Added that.}
\subsection{Formulating the correctness theorem}
-The main correctness theorem is analogous to that stated in \compcert{}~\cite{leroy09_formal_verif_realis_compil}: for all Clight source programs $C$, if the translation to the target Verilog code succeeds, and $C$ has safe observable behaviour $B$ when executed, then the target Verilog code will have the same behaviour $B$. Here, a `safe' execution is one that either converges or diverges, but does not ``go wrong''. If the program does admit some wrong behaviour (like undefined behaviour in C), the correctness theorem does not apply. A behaviour, then, is either a final state (in the case of convergence) or divergence. In \compcert{}, a behaviour is also associated with a trace of I/O events, but since external function calls are not supported in \vericert{}, this trace will always be empty. This correctness theorem is also appropriate for HLS \JW{Perhaps it would be worth first explaining why somebody might think this correctness theorem might \emph{not} be appropriate for HLS. At the moment, it feels like you're giving the answer without saying the question. Is it to do with the fact that hardware tends to run forever?}, as HLS is often used as a part of a larger hardware design that is connected together using a hardware description language like Verilog. This means that HLS designs are normally triggered multiple times and results are returned each time when the computation terminates, which is the property that the correctness theorem states. Note that the compiler is allowed to fail and not produce any output; the correctness theorem only applies when the translation succeeds.
+The main correctness theorem is analogous to that stated in \compcert{}~\cite{leroy09_formal_verif_realis_compil}: for all Clight source programs $C$, if the translation to the target Verilog code succeeds, and $C$ has safe observable behaviour $B$ when executed, then the target Verilog code will have the same behaviour $B$. Here, a `safe' execution is one that either converges or diverges, but does not ``go wrong''. If the program does admit some wrong behaviour (like undefined behaviour in C), the correctness theorem does not apply. A behaviour, then, is either a final state (in the case of convergence) or divergence. In \compcert{}, a behaviour is also associated with a trace of I/O events, but since external function calls are not supported in \vericert{}, this trace will always be empty. This correctness theorem is also appropriate for HLS \JW{Perhaps it would be worth first explaining why somebody might think this correctness theorem might \emph{not} be appropriate for HLS. At the moment, it feels like you're giving the answer without saying the question. Is it to do with the fact that hardware tends to run forever?}\YH{Yes definitely, will add that.}, as HLS is often used as a part of a larger hardware design that is connected together using a hardware description language like Verilog. This means that HLS designs are normally triggered multiple times and results are returned each time when the computation terminates, which is the property that the correctness theorem states. Note that the compiler is allowed to fail and not produce any output; the correctness theorem only applies when the translation succeeds.
%The following `backwards simulation' theorem describes the correctness theorem, where $\Downarrow$ stands for simulation and execution respectively.
@@ -51,12 +53,12 @@ We prove this lemma by first establishing a specification of the translation fun
%To simplify the proof, instead of using the translation algorithm as an assumption, as was done in Lemma~\ref{lemma:htl}, a specification of the translation can be constructed instead which contains all the properties that are needed to prove the correctness. For example, for the translation from 3AC to HTL,
The specification for the translation of 3AC instructions into HTL data-path and control logic can be defined by the following predicate:
\begin{equation*}
- \yhfunction{spec\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ i\ \textit{data }\ \textit{control}
+ \yhfunction{spec\_instr } \textit{fin ret }\ \sigma\ \textit{stk }\ i\ \textit{data }\ \textit{control}
\end{equation*}
\noindent Here, the \textit{control} and \textit{data} parameters are the statements that the current 3AC instruction $i$ should translate to. The other parameters are the special registers defined in Section~\ref{sec:verilog:integrating}. An example of a rule describing the translation of an arithmetic/logical operation from 3AC is the following:
\begin{equation*}
- \inferrule[Iop]{\yhfunction{tr\_op } \textit{op }\ \vec{a} = \yhconstant{OK } e}{\yhfunction{spec\_instr } \textit{fin rtrn }\ \sigma\ \textit{stk }\ (\yhconstant{Iop } \textit{op }\ \vec{a}\ d\ n)\ (d\ \yhkeyword{<=}\ e)\ (\sigma\ \yhkeyword{<=}\ n)}
+ \inferrule[Iop]{\yhfunction{tr\_op } \textit{op }\ \vec{a} = \yhconstant{OK } e}{\yhfunction{spec\_instr } \textit{fin ret }\ \sigma\ \textit{stk }\ (\yhconstant{Iop } \textit{op }\ \vec{a}\ d\ n)\ (d\ \yhkeyword{<=}\ e)\ (\sigma\ \yhkeyword{<=}\ n)}
\end{equation*}
\noindent Assuming that the translation of the operator \textit{op} with operands $\vec{a}$ is successful and results in expression $e$, the rule describes how the destination register $d$ is updated to $e$ via a non-blocking assignment in the data path, and how the program counter $\sigma$ is updated to point to the next CFG node $n$ via another non-blocking assignment in the control logic.
diff --git a/verilog.tex b/verilog.tex
index 21fadfd..791b9c1 100644
--- a/verilog.tex
+++ b/verilog.tex
@@ -78,7 +78,7 @@ The main execution of the module $\downarrow_{\text{module}}$ is split into $\do
%
\inferrule[Call]{ }{\yhconstant{Callstate } \textit{sf }\ m\ \vec{r} \longrightarrow \yhconstant{State } \textit{sf }\ m\ n\ ((\yhfunction{init\_params}\ \vec{r}\ a)[\sigma \mapsto n, \textit{fin} \mapsto 0, \textit{rst} \mapsto 0])\ \epsilon}\\
%
- \inferrule[Return]{ }{\yhconstant{Returnstate } (\yhconstant{Stackframe } r\ m\ \textit{pc }\ \Gamma_r\ \Gamma_a :: \textit{sf})\ v \longrightarrow \yhconstant{State } \textit{sf }\ m\ \textit{pc }\ (\Gamma_{r} [ \textit{st} \mapsto \textit{pc}, r \mapsto v ])\ \Gamma_{a}}
+ \inferrule[Return]{ }{\yhconstant{Returnstate } (\yhconstant{Stackframe } r\ m\ \textit{pc }\ \Gamma_r\ \Gamma_a :: \textit{sf})\ v \longrightarrow \yhconstant{State } \textit{sf }\ m\ \textit{pc }\ (\Gamma_{r} [ \sigma \mapsto \textit{pc}, r \mapsto v ])\ \Gamma_{a}}
\end{gather*}
\end{minipage}
\caption{Top-level small-step semantics for Verilog modules in \compcert{}'s computational framework.}%
@@ -99,8 +99,10 @@ To support this computational model, we extend the Verilog module we generate wi
\begin{description}
\item[program counter] The program counter can be modelled using a register that keeps track of the state, denoted as $\sigma$.
\item[function entry point] When a function is called, the entry point denotes the first instruction that will be executed. This can be modelled using a reset signal that sets the state accordingly, denoted as \textit{rst}.
- \item[return value] The return value can be modelled by setting a finished flag to 1 when the result is ready, and putting the result into a 32-bit output register. These are denoted as \textit{fin} and \textit{rtrn} respectively. \JW{Is there a mismatch between `ret' in the figure and `rtrn' in the text?}
- \item[stack] The function stack can be modelled as a RAM block, which is implemented using an array in the module, and denoted as \textit{stk}. \JW{Is there a mismatch between `st' in the figure and `stk' in the text?}
+ \item[return value] The return value can be modelled by setting a finished flag to 1 when the result is ready, and putting the result into a 32-bit output register. These are denoted as \textit{fin} and \textit{ret} respectively.
+%\JW{Is there a mismatch between `ret' in the figure and `rtrn' in the text?}
+ \item[stack] The function stack can be modelled as a RAM block, which is implemented using an array in the module, and denoted as \textit{stk}.
+\JW{Is there a mismatch between `st' in the figure and `stk' in the text?}\YH{It was actually between $\Gamma_{a}$ and \textit{stk}. The \textit{st} should have been $\sigma$.}
\end{description}
Figure~\ref{fig:inference_module} shows the inference rules for moving between the computational states. The first, \textsc{Step}, is the normal rule of execution. It defines one step in the \texttt{State} state, assuming that the module is not being reset, that the finish state has not been reached yet, that the current and next state are $v$ and $v'$, and that the module runs from state $\Gamma$ to $\Gamma'$ using the \textsc{Step} rule. The \textsc{Finish} rule returns the final value of running the module and is applied when the \textit{fin} register is set; the return value is then taken from the \textit{ret} register.
@@ -117,7 +119,7 @@ Therefore, in addition to the rules shown in Figure~\ref{fig:inference_module},
\subsection{Memory Model}\label{sec:verilog:memory}
-The Verilog semantics do not define a memory model for Verilog, as this is not needed for a hardware description language. There is no preexisting architecture that Verilog will produce; it can describe any memory layout that is needed. Instead of having specific semantics for memory, the semantics only needs to support the language features that can produce these different memory layouts, these being Verilog arrays. We therefore define semantics for updating Verilog arrays using blocking and nonblocking assignment. We then have to prove that the C memory model that \compcert{} uses matches with the interpretation of arrays used in Verilog. The \compcert{} memory model is infinite, whereas our representation of arrays in Verilog is inherently finite. There have already been various efforts to define a finite memory model for all compiler passes in \compcert{}, such as Comp\-Cert\-S~\cite{besson18_compc}, Comp\-Cert\-ELF~\cite{wang20_compc} and Comp\-Cert\-TSO~\cite{sevcik13_compc}, however, we define the translation from \compcert{}'s standard infinite memory model to finitely sized arrays that can be represented in Verilog, leaving the compiler passes intact. \JW{I'm not quite sure I understand. Let me check: Are you saying that previous work has shown how all the existing CompCert passes can be adapted from an infinite to a finite memory model, but what we're doing is leaving the default (infinite) memory model for the CompCert front end, and just converting from an infinite memory model to a finite memory model when we go from 3AC to HTL?}
+The Verilog semantics do not define a memory model for Verilog, as this is not needed for a hardware description language. There is no preexisting architecture that Verilog will produce; it can describe any memory layout that is needed. Instead of having specific semantics for memory, the semantics only needs to support the language features that can produce these different memory layouts, these being Verilog arrays. We therefore define semantics for updating Verilog arrays using blocking and nonblocking assignment. We then have to prove that the C memory model that \compcert{} uses matches with the interpretation of arrays used in Verilog. The \compcert{} memory model is infinite, whereas our representation of arrays in Verilog is inherently finite. There have already been various efforts to define a finite memory model for all compiler passes in \compcert{}, such as Comp\-Cert\-S~\cite{besson18_compc}, Comp\-Cert\-ELF~\cite{wang20_compc} and Comp\-Cert\-TSO~\cite{sevcik13_compc}, however, we define the translation from \compcert{}'s standard infinite memory model to finitely sized arrays that can be represented in Verilog, leaving the compiler passes intact. \JW{I'm not quite sure I understand. Let me check: Are you saying that previous work has shown how all the existing CompCert passes can be adapted from an infinite to a finite memory model, but what we're doing is leaving the default (infinite) memory model for the CompCert front end, and just converting from an infinite memory model to a finite memory model when we go from 3AC to HTL?}\YH{Yes exactly, most papers changed the whole memory model to thread through properties that were then needed in the back end, but we currently don't need to do that. I need to double check though for CompCertELF, it doesn't actually seem to be the case. Will edit this section later.}
\begin{figure}
\centering
@@ -193,7 +195,7 @@ The Verilog semantics do not define a memory model for Verilog, as this is not n
%\JW{It's not completely clear what the relationship is between your work and those works. The use of `only' suggests that you've re-done a subset of work that has already been done -- is that the right impression?}\YH{Hopefully that's more clear.}
-This translation is represented in Figure~\ref{fig:memory_model_transl}. \compcert{} defines a map from blocks to maps from memory addresses to memory contents. Each block represents an area in memory; for example, a block can represent a global variable or a stack for a function. \JW{Maybe mention here that block 0 is always the stack for the main function.} \JW{So the stack frame for a function called by main would be in a different block, is that the idea? Seems unusual not to have a single stack.} \JW{Would global variables normally be put in blocks 1, 2, etc.?} Meanwhile, our Verilog semantics defines two finite arrays of optional values, one for the blocking assignments map $\Gamma_{\rm a}$ and one for the nonblocking assignments map $\Delta_{\rm a}$. \JW{It's a slight shame that `block' is used in two different senses in the preceding two sentences. I guess that can't be helped.} The optional values are present to ensure correct merging of the two association maps at the end of the clock cycle. During our translation we only convert block 0 to a Verilog memory, and ensure that it is the only block that is present. This means that the block necessarily represents the stack of the main function. The invariant that then has to hold in the proofs is that block 0 should be equivalent to the merged representation of the $\Gamma_{\rm a}$ and $\Delta_{\rm a}$ maps.
+This translation is represented in Figure~\ref{fig:memory_model_transl}. \compcert{} defines a map from blocks to maps from memory addresses to memory contents. Each block represents an area in memory; for example, a block can represent a global variable or a stack for a function. As there are no global variables, the main stack's block number can be assumed to always be 0. \JW{So the stack frame for a function called by main would be in a different block, is that the idea? Seems unusual not to have a single stack.}\YH{Yeah exactly, it makes it much easier to reason about though, because everything is nicely isolated. This is exactly what CompCertELF and CompCertS try and solve though.} \JW{Would global variables normally be put in blocks 1, 2, etc.?}\YH{Yes, although it may also be possible that they could be numbered 0, 1, 2, 3, 4, pushing the block of the stack higher.} Meanwhile, our Verilog semantics defines two finite arrays of optional values, one for the blocking assignments map $\Gamma_{\rm a}$ and one for the nonblocking assignments map $\Delta_{\rm a}$. \JW{It's a slight shame that `block' is used in two different senses in the preceding two sentences. I guess that can't be helped.}\YH{Ah that's true, I hadn't even noticed. Yeah I think it would be good to keep the name ``block'' for CompCert's blocks.} The optional values are present to ensure correct merging of the two association maps at the end of the clock cycle. During our translation we only convert block 0 to a Verilog memory, and ensure that it is the only block that is present. This means that the block necessarily represents the stack of the main function. The invariant that then has to hold in the proofs is that block 0 should be equivalent to the merged representation of the $\Gamma_{\rm a}$ and $\Delta_{\rm a}$ maps.
%However, in practice, assigning and reading from an array directly in the state machine will not produce a memory in the final hardware design, as the synthesis tool cannot identify the array as having the necessary properties that a RAM needs, even though this is the most natural formulation of memory. Even though theoretically the memory will only be read from once per clock cycle, the synthesis tool cannot ensure that this is true, and will instead create a register for each memory location. This increases the size of the circuit dramatically, as the RAM on the FPGA chip will not be reused. Instead, the synthesis tool expects a specific interface that ensures these properties, and will then transform the interface into a proper RAM during synthesis. Therefore, a translation has to be performed from the naive use of memory in the state machine, to a proper use of a memory interface.