6 files changed, 34 insertions, 15 deletions
diff --git a/algorithm.tex b/algorithm.tex
index ba00bff..e67ce5c 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -399,7 +399,8 @@ One big difference between C and Verilog is how memory is represented.  Although
 However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}.  If a byte-addressable memory was used in the target hardware, which is closer to \compcert{}'s memory model, then a load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle.  It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware.  Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way.  Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores will be multiples of four.  If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
 
 \subsubsection{Implementation of RAM interface}
-The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance.  To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.\footnote{Interestingly, the Verilog code shown for the RAM interface must not be modified, because the synthesis tool will only generate a RAM when the code matches a small set of specific patterns. \JW{I tweaked this slightly in an attempt to clarify; please check.}} %\NR{Bring forward this sentence to help with flow.}
+The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance.  To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.\footnote{Interestingly, the Verilog code shown for the RAM interface must not be modified, because the synthesis tool will only generate a RAM when the code matches a small set of specific patterns.}
+%\JW{I tweaked this slightly in an attempt to clarify; please check.} %\NR{Bring forward this sentence to help with flow.}
 
 %\JW{I think the following sentence could be cut as we've said this kind of thing a couple of times already.} Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.
 
@@ -483,10 +484,12 @@ One might hope that the synthesis tool consuming our generated Verilog would con
   \end{cases}\\
 \end{equation*}
 where $\gg$ stands for a logical right shift. %Once this equivalence about the shifts and division operator is proven correct, it can be used to implement the \texttt{Oshrximm} using the efficient shift version instead of how the \compcert{} semantics described it.
-When proving this equivalence, we actually found a bug in our original implementation that was due to the fact that a na\"{i}ve shift rounds towards $-\infty$. \NR{What do you mean byy naive shift here?}
+When proving this equivalence, we actually found a bug in our original implementation that was due to the fact that a na\"{i}ve shift implementation which rounds towards $-\infty$.
+%\NR{What do you mean byy naive shift here?}\YH{Just a normal shift, instead of being adapted for division.}
 
 %\JW{I don't really understand the following paragraph.}\YH{So my intention with this was to make this section more meaningful, as one of the reviewers mentioned that because compcert already did this, this section is not needed.  But I wanted to explain that because we do the reasoning of equivalence between shifts and division at the Integer level, our proof is more general than the language specific proofs that \compcert{} has in it's back ends, as we fix the specification instead of the implementation.  I'll try and reword this.}\NR{I am not following this paragraph below too.} 
-\compcert{} eventually performs a translation from this representation into assembly code which uses shifts to implement the division, however, the specification of the instruction itself still uses division instead of shifts, meaning the proof of the translation cannot be reused.  In \vericert{}, the equivalence of the representation in terms of divisions and shifts is proven over the integers and the specification, thereby making it simpler to prove the correctness of the Verilog implementation in terms of shifts. \JW{I wonder, can I file this under `things that you've improved in CompCert generally as part of your efforts on Vericert''?}
+\compcert{} eventually performs a translation from this representation into assembly code which uses shifts to implement the division, however, the specification of the instruction itself still uses division instead of shifts, meaning the proof of the translation cannot be reused.  In \vericert{}, the equivalence of the representation in terms of divisions and shifts is proven over the integers and the specification, thereby making it simpler to prove the correctness of the Verilog implementation in terms of shifts.
+%\JW{I wonder, can I file this under `things that you've improved in CompCert generally as part of your efforts on Vericert''?}\YH{I think so, although I have not tried to adapt existing passes to use this.  But I do think that this would make the proofs much easier.}
 
 %The \compcert{} semantics for the \texttt{Oshrximm} instruction expresses its operation exactly as shown in the equation above, even though in hardware the computation that would be performed would be different.  In \vericert{}, if the same operation would be implemented using Verilog operators, it is not guaranteed to be optimised correctly by the synthesis tools that convert the Verilog into a circuit.  To guarantee an output that does not include divides, we therefore have to express it in Verilog using shifts, and then prove that this representation is equivalent to the divide representation used in the \compcert{} semantics.  While conducting the proof, we discovered quite a few bugs in our initial implementation of optimisations, which rounded to $-\infty$ instead of 0.
 
diff --git a/evaluation.tex b/evaluation.tex
index e1719e3..0f18db9 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -1,10 +1,11 @@
 \section{Evaluation}\label{sec:evaluation}
 
-Our evaluation is designed to answer the following three research questions.
+Our evaluation is designed to answer the following four research questions.
 \begin{description}
-\item[RQ1] How fast is the hardware generated by \vericert{}?
-\item[RQ2] How area-efficient is the hardware generated by \vericert{}?
-\item[RQ3] How quickly does \vericert{} translate the C into Verilog?
+  \item[RQ1] How fast is the hardware generated by \vericert{}?
+  \item[RQ2] How area-efficient is the hardware generated by \vericert{}?
+  \item[RQ3] How quickly does \vericert{} translate the C into Verilog?
+  \item[RQ4] How effective is the correctness theorem in \vericert{}?
 \end{description}
 
 \subsection{Experimental Setup}
@@ -174,6 +175,11 @@ By looking at the median, when division/modulo operations are enabled, we see th
 
 \legup{} takes around 10$\times$ as long as \vericert{} to perform the translation from C into Verilog, showing at least that verification does not have a substantial effect on the run-time of the HLS tool.  However, this is a meaningless victory, as a lot of the extra time that \legup{} uses is on performing computationally heavy optimisations such as loop pipelining and scheduling.
 
+\subsection{RQ4: How effective is the correctness theorem in \vericert{}?}
+
+
+To check that the Verilog designs generated by \vericert{} are actually correct, and that the correctness theorem is indeed effective, \vericert{} was fuzzed using Csmith~\cite{yang11_findin_under_bugs_c_compil}, with a similar configuration used by \citet{du21_fuzzin_high_level_synth_tools} when other HLS tools were fuzzed, including \legup{}.
+
 %%% Local Variables:
 %%% mode: latex
 %%% TeX-master: "main"
diff --git a/introduction.tex b/introduction.tex
index 0269640..4573f63 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -44,7 +44,7 @@ The contributions of this paper are as follows:
 
 \begin{itemize}
   \item We present \vericert{}, the first mechanically verified HLS tool that compiles C to Verilog. In Section~\ref{sec:design}, we describe the design of \vericert{}, including a few optimisations related to memory accesses and division.
-  \item We state the correctness theorem of \vericert{} with respect to an existing semantics for Verilog due to \citet{loow19_proof_trans_veril_devel_hol}. In Section~\ref{sec:verilog}, we describe how we extended this semantics to make it suitable as an HLS target.  We also describe how the Verilog semantics is integrated into CompCert's language semantics model \JW{or `execution model'?} and its framework for performing simulation proofs. A mapping of CompCert's infinite memory model onto a finite Verilog array is also described.
+  \item We state the correctness theorem of \vericert{} with respect to an existing semantics for Verilog due to \citet{loow19_proof_trans_veril_devel_hol}. In Section~\ref{sec:verilog}, we describe how we extended this semantics to make it suitable as an HLS target.  We also describe how the Verilog semantics is integrated into CompCert's language execution model and its framework for performing simulation proofs. A mapping of CompCert's infinite memory model onto a finite Verilog array is also described.
   \item In Section~\ref{sec:proof}, we describe how we proved the correctness theorem. The proof follows standard \compcert{} techniques -- forward simulations, intermediate specifications, and determinism results -- but we encountered several challenges peculiar to our hardware-oriented setting. %\NR{`specific' is better than `peculiar'?} %JW: I think this is a nice use of peculiar. Note that it means `distinctive' here, not `weird' -- the third meaning at https://www.dictionary.com/browse/peculiar 
   These include handling discrepancies between the byte-addressed memory assumed by the input software and the word-addressed memory that we implement in the output hardware,  different handling of unsigned comparisons between C and Verilog, and carefully implementing memory reads and writes so that these behave properly as a RAM in hardware.
   %\JW{Not sure `rearranging' is quite the right word. Sounds like you're rearranging independent reads/writes w.r.t. each other. Maybe change `correctly rearranging' to `carefully implementing'?}
diff --git a/limitations.tex b/limitations.tex
index 0896e9a..b218c11 100644
--- a/limitations.tex
+++ b/limitations.tex
@@ -1,12 +1,16 @@
 \section{Limitations and Future Work}
 
-There are various limitations in \vericert{} compared to other HLS tools due to the fact that our main focus was on formally verifying the translation from 3AC to Verilog. In this section, we outline the current limitations of our tool and suggest how they can be overcome in future work.
+There are various limitations in \vericert{} compared to other HLS tools due to the fact that our main focus was on formally verifying the translation from 3AC to Verilog. In this section, we outline the current limitations of our tool and suggest how they can be overcome in future work, first describing limitations to the generated hardware, and then describing the limitations of the software input that \vericert{} accepts.
 
-\NR{You have very different types of limitations. I wonder if grouped them into software and hardware limitations respectively might simplify this section. Just a suggestion.}
+%\NR{You have very different types of limitations. I wonder if grouped them into software and hardware limitations respectively might simplify this section. Just a suggestion.}
+
+\subsection{Limitations and improvements to the hardware}
+
+This section describes the current limitations and possible improvements that could be done to the generated hardware.
 
 \paragraph{Lack of instruction-level parallelism}
 
-\JW{I'm hesitant about this paragraph. Yes, we have to write something about how future-proof Vericert is, as this is required by the reviewers and promised in our list of changes. But I worry that if we place too much emphasis on how straightforward it will be to add scheduling, then we'll make it harder for ourselves to publish a paper about adding scheduling (``But you said in your OOPSLA 2021 paper that this was all straightforward'' etc.) In what follows, I've tried to strike a balance -- see what you think.}
+%\JW{I'm hesitant about this paragraph. Yes, we have to write something about how future-proof Vericert is, as this is required by the reviewers and promised in our list of changes. But I worry that if we place too much emphasis on how straightforward it will be to add scheduling, then we'll make it harder for ourselves to publish a paper about adding scheduling (``But you said in your OOPSLA 2021 paper that this was all straightforward'' etc.) In what follows, I've tried to strike a balance -- see what you think.}\YH{Yeah I think I like the paragraph and the balance in it, I think it still explains that existing passes should not have to change.  I think reviewer 2 wanted a detailed explanation of this, but I do agree that it should not be needed in this paper.}
 The main limitation of \vericert{} is that it does not perform instruction scheduling, which means that instructions cannot be gathered into the same state and executed in parallel. 
 %This limitation is present by design so that the focus could be made on the initial proof of correctness of the translation of instructions with a sequential schedule.
 However, each language in \vericert{} was designed with scheduling in mind, so that these should not have to change fundamentally when it is implemented in the future. 
@@ -17,7 +21,7 @@ For instance, our HTL language already allows arbitrary Verilog to appear in eac
 
 %However, the design of the intermediate languages in \vericert{} take this optimisation into account and are designed to support scheduling in the future.
 
-%\NR{It is best to explain why we didn't focus on scheduling with a positive/future work spin. For example, ``It was more intuitive to handle one instruction per cycle at the initial stage of our project as we want to focus our efforts on correctness, which has been the main weakness of HLS, rather than performance. However, we were very much aware during the design stage that in order for our compiler to be able to perform better, supporting of scheduling was inevitable. Hence, we intentionally left room for support of scheduling. In essence, instead of supporting one instruction per cycle, we must be able to support a list of instructions per cycle. To do so, we envision extensions to our work in several ways.'' We might not even need to specify the details of how. You can keep the tricks in your sleeves for the next publication. :)}\YH{Well one of the only things the reviewers wanted us to add was a realistic implementation of scheduling, so I think we need to at least keep that.}.
+%\NRIt is best to explain why we didn't focus on scheduling with a positive/future work spin. For example, ``It was more intuitive to handle one instruction per cycle at the initial stage of our project as we want to focus our efforts on correctness, which has been the main weakness of HLS, rather than performance. However, we were very much aware during the design stage that in order for our compiler to be able to perform better, supporting of scheduling was inevitable. Hence, we intentionally left room for support of scheduling. In essence, instead of supporting one instruction per cycle, we must be able to support a list of instructions per cycle. To do so, we envision extensions to our work in several ways.'' We might not even need to specify the details of how. You can keep the tricks in your sleeves for the next publication. \YH{Well one of the only things the reviewers wanted us to add was a realistic implementation of scheduling, so I think we need to at least keep that.}.
 
 %To simplify the proof of the scheduling algorithm, and to minimise the changes necessary for the current translation from 3AC to HTL, a new language must be introduced, called 3ACPar, which would be equivalent to 3AC but instead of consisting of a map from program counter to instruction, it would consist of a map from program counter to list of instructions, which can all be executed in parallel.  A new proof for the scheduling algorithm would have to be written for the translation from 3AC to 3ACPar, for which a verified translation validation approach might be appropriate, however, the translation form 3ACPar to HTL would not change conceptually, except for the fact that multiple instructions can now be translated into the same state.  This small difference means that most of the proof can be reused without any changes, as the translation of instructions was the main body of the proof and did not change.
 
@@ -28,6 +32,10 @@ The introduction of pipelined operators to \vericert{}, especially for division,
 %JW I've chopped the following sentence because it felt like it was going into too much detail. 
 %However, 3ACPar would have to be modified to also describe such instructions so that these can be placed optimally using the external scheduling algorithm.
 
+\subsection{Limitations and improvements to the software input}
+
+This section describes the limitations and possible improvements to the software input accepted by \vericert{}.
+
 \paragraph{Limitations with I/O}
 
 \vericert{} is currently limited in terms of I/O because the main function cannot accept any arguments if the Clight program is to be well-formed. However, just like \compcert{}, \vericert{} can actually compile main functions that have arbitrary arguments and will handle the inputs appropriately. Still, the main correctness theorem in \compcert{} assumes that the main function does not have any arguments, so it may be possible that unexpected behaviour is introduced. Moreover, external function calls that produce traces have not been implemented yet, but once they have, they could enable the C program to read and write values on a bus that is shared with various other components in the hardware design.
diff --git a/related.tex b/related.tex
index 352970d..040c5c4 100644
--- a/related.tex
+++ b/related.tex
@@ -42,7 +42,7 @@
   \caption{Summary of related work}\label{fig:related_euler}
 \end{figure}
 
-A summary of the related works can be found in Figure~\ref{fig:related_euler}, which is represented as an Euler diagram.  The categories chosen for the Euler diagram are: whether the tool is usable \JWcouldcut{and available}, whether it takes a high-level software language as input, whether it has a correctness proof, and finally whether that proof is mechanised.  The goal of \vericert{} is to cover all of these categories.
+A summary of the related works can be found in Figure~\ref{fig:related_euler}, which is represented as an Euler diagram.  The categories chosen for the Euler diagram are: whether the tool is usable, whether it takes a high-level software language as input, whether it has a correctness proof, and finally whether that proof is mechanised.  The goal of \vericert{} is to cover all of these categories.
 
 Most practical HLS tools~\citep{canis11_legup,xilinx20_vivad_high_synth,intel20_sdk_openc_applic,nigam20_predic_accel_desig_time_sensit_affin_types} fit into the category of usable tools that take high-level inputs.  On the other end of the spectrum, there are tools such as BEDROC~\citep{chapman92_verif_bedroc} for which there is no practical tool, and even though it is described as high-level synthesis, it more closely resembles today's logic synthesis tools.
 
diff --git a/verilog.tex b/verilog.tex
index 39530d0..1a1ac2f 100644
--- a/verilog.tex
+++ b/verilog.tex
@@ -119,7 +119,9 @@ Therefore, in addition to the rules shown in Figure~\ref{fig:inference_module},
 
 \subsection{Memory Model}\label{sec:verilog:memory}
 
-The Verilog semantics do not define a memory model for Verilog, as this is not needed for a hardware description language.  There is no preexisting architecture that Verilog will produce; it can describe any memory layout that is needed.  Instead of having specific semantics for memory, the semantics only needs to support the language features that can produce these different memory layouts, these being Verilog arrays.  We therefore define semantics for updating Verilog arrays using blocking and nonblocking assignment.  We then have to prove that the C memory model that \compcert{} uses matches with the interpretation of arrays used in Verilog.  The \compcert{} memory model is infinite, whereas our representation of arrays in Verilog is inherently finite.  There have already been various efforts to define a finite memory model for \JW{individual?} compiler passes in \compcert{}, such as Comp\-Cert\-S~\cite{besson18_compc}, Comp\-Cert\-ELF~\cite{wang20_compc} and Comp\-Cert\-TSO~\cite{sevcik13_compc}, however, we define the translation from \compcert{}'s standard infinite memory model to finite arrays that can be represented in Verilog, leaving the compiler passes intact. \JW{I'm not quite sure I understand. Let me check: Are you saying that previous work has shown how all the existing CompCert passes can be adapted from an infinite to a finite memory model, but what we're doing is leaving the default (infinite) memory model for the CompCert front end, and just converting from an infinite memory model to a finite memory model when we go from 3AC to HTL?}\YH{Yes exactly, most papers changed the whole memory model to thread through properties that were then needed in the back end, but we currently don't need to do that.  I need to double check though for CompCertELF, it doesn't actually seem to be the case.  Will edit this section later.}
+The Verilog semantics do not define a memory model for Verilog, as this is not needed for a hardware description language.  There is no preexisting architecture that Verilog will produce; it can describe any memory layout that is needed.  Instead of having specific semantics for memory, the semantics only needs to support the language features that can produce these different memory layouts, these being Verilog arrays.  We therefore define semantics for updating Verilog arrays using blocking and nonblocking assignment.  We then have to prove that the C memory model that \compcert{} uses matches with the interpretation of arrays used in Verilog.  The \compcert{} memory model is infinite, whereas our representation of arrays in Verilog is inherently finite.  There have already been efforts to define a general finite memory model for all compiler passes in \compcert{}, such as Comp\-Cert\-S~\cite{besson18_compc}, or to translate to a more concrete finite memory model such as in Comp\-Cert\-ELF~\cite{wang20_compc} and Comp\-Cert\-TSO~\cite{sevcik13_compc}, however, we define the translation from \compcert{}'s standard infinite memory model to finite arrays that can be represented in Verilog, leaving the compiler passes intact.  There is therefore no more memory model in Verilog, and all the interactions to memory are encoded in the hardware itself.
+
+%\JW{I'm not quite sure I understand. Let me check: Are you saying that previous work has shown how all the existing CompCert passes can be adapted from an infinite to a finite memory model, but what we're doing is leaving the default (infinite) memory model for the CompCert front end, and just converting from an infinite memory model to a finite memory model when we go from 3AC to HTL?}\YH{Yes exactly, most papers changed the whole memory model to thread through properties that were then needed in the back end, but we currently don't need to do that.  I need to double check though for CompCertELF, it doesn't actually seem to be the case.  Will edit this section later.}
 
 \begin{figure}
   \centering
@@ -195,7 +197,7 @@ The Verilog semantics do not define a memory model for Verilog, as this is not n
 
 %\JW{It's not completely clear what the relationship is between your work and those works. The use of `only' suggests that you've re-done a subset of work that has already been done -- is that the right impression?}\YH{Hopefully that's more clear.}
 
-This translation is represented in Figure~\ref{fig:memory_model_transl}.  \compcert{} defines a map from blocks to maps from memory addresses to memory contents.  Each block represents an area in memory; for example, a block can represent a global variable or a stack for a function. As there are no global variables, the main stack can be assumed to be block 0, \JW{and this is the only block we translate}.
+This translation is represented in Figure~\ref{fig:memory_model_transl}.  \compcert{} defines a map from blocks to maps from memory addresses to memory contents.  Each block represents an area in memory; for example, a block can represent a global variable or a stack for a function. As there are no global variables, the main stack can be assumed to be block 0, and this is the only block we translate.
 %\JW{So the stack frame for a function called by main would be in a different block, is that the idea? Seems unusual not to have a single stack.}
 %\YH{Yeah exactly, it makes it much easier to reason about though, because everything is nicely isolated.  This is exactly what CompCertELF and CompCertS try and solve though.} 
 %\JW{Would global variables normally be put in blocks 1, 2, etc.?}