Fix many of the comments

author: Yann Herklotz <git@yannherklotz.com> 2021-08-12 11:28:30 +0200
committer: Yann Herklotz <git@yannherklotz.com> 2021-08-12 11:28:30 +0200
commit: bb505be23ef07e2b3e47e268513a6f4f1b33f66b (patch)
tree: e82a6b934c43ddffce8b6b19cb09911c01589d33 /algorithm.tex
parent: 221aa79714add6689aaa64522b6d6d8b0d2bea46 (diff)
download: oopsla21_fvhls-bb505be23ef07e2b3e47e268513a6f4f1b33f66b.tar.gz
oopsla21_fvhls-bb505be23ef07e2b3e47e268513a6f4f1b33f66b.zip
1 files changed, 6 insertions, 3 deletions
diff --git a/algorithm.tex b/algorithm.tex
index ba00bff..e67ce5c 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -399,7 +399,8 @@ One big difference between C and Verilog is how memory is represented.  Although
 However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}.  If a byte-addressable memory was used in the target hardware, which is closer to \compcert{}'s memory model, then a load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle.  It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware.  Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way.  Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores will be multiples of four.  If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
 
 \subsubsection{Implementation of RAM interface}
-The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance.  To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.\footnote{Interestingly, the Verilog code shown for the RAM interface must not be modified, because the synthesis tool will only generate a RAM when the code matches a small set of specific patterns. \JW{I tweaked this slightly in an attempt to clarify; please check.}} %\NR{Bring forward this sentence to help with flow.}
+The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance.  To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.\footnote{Interestingly, the Verilog code shown for the RAM interface must not be modified, because the synthesis tool will only generate a RAM when the code matches a small set of specific patterns.}
+%\JW{I tweaked this slightly in an attempt to clarify; please check.} %\NR{Bring forward this sentence to help with flow.}
 
 %\JW{I think the following sentence could be cut as we've said this kind of thing a couple of times already.} Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.
 
@@ -483,10 +484,12 @@ One might hope that the synthesis tool consuming our generated Verilog would con
   \end{cases}\\
 \end{equation*}
 where $\gg$ stands for a logical right shift. %Once this equivalence about the shifts and division operator is proven correct, it can be used to implement the \texttt{Oshrximm} using the efficient shift version instead of how the \compcert{} semantics described it.
-When proving this equivalence, we actually found a bug in our original implementation that was due to the fact that a na\"{i}ve shift rounds towards $-\infty$. \NR{What do you mean byy naive shift here?}
+When proving this equivalence, we actually found a bug in our original implementation that was due to the fact that a na\"{i}ve shift implementation which rounds towards $-\infty$.
+%\NR{What do you mean byy naive shift here?}\YH{Just a normal shift, instead of being adapted for division.}
 
 %\JW{I don't really understand the following paragraph.}\YH{So my intention with this was to make this section more meaningful, as one of the reviewers mentioned that because compcert already did this, this section is not needed.  But I wanted to explain that because we do the reasoning of equivalence between shifts and division at the Integer level, our proof is more general than the language specific proofs that \compcert{} has in it's back ends, as we fix the specification instead of the implementation.  I'll try and reword this.}\NR{I am not following this paragraph below too.} 
-\compcert{} eventually performs a translation from this representation into assembly code which uses shifts to implement the division, however, the specification of the instruction itself still uses division instead of shifts, meaning the proof of the translation cannot be reused.  In \vericert{}, the equivalence of the representation in terms of divisions and shifts is proven over the integers and the specification, thereby making it simpler to prove the correctness of the Verilog implementation in terms of shifts. \JW{I wonder, can I file this under `things that you've improved in CompCert generally as part of your efforts on Vericert''?}
+\compcert{} eventually performs a translation from this representation into assembly code which uses shifts to implement the division, however, the specification of the instruction itself still uses division instead of shifts, meaning the proof of the translation cannot be reused.  In \vericert{}, the equivalence of the representation in terms of divisions and shifts is proven over the integers and the specification, thereby making it simpler to prove the correctness of the Verilog implementation in terms of shifts.
+%\JW{I wonder, can I file this under `things that you've improved in CompCert generally as part of your efforts on Vericert''?}\YH{I think so, although I have not tried to adapt existing passes to use this.  But I do think that this would make the proofs much easier.}
 
 %The \compcert{} semantics for the \texttt{Oshrximm} instruction expresses its operation exactly as shown in the equation above, even though in hardware the computation that would be performed would be different.  In \vericert{}, if the same operation would be implemented using Verilog operators, it is not guaranteed to be optimised correctly by the synthesis tools that convert the Verilog into a circuit.  To guarantee an output that does not include divides, we therefore have to express it in Verilog using shifts, and then prove that this representation is equivalent to the divide representation used in the \compcert{} semantics.  While conducting the proof, we discovered quite a few bugs in our initial implementation of optimisations, which rounded to $-\infty$ instead of 0.
author	Yann Herklotz <git@yannherklotz.com>	2021-08-12 11:28:30 +0200
committer	Yann Herklotz <git@yannherklotz.com>	2021-08-12 11:28:30 +0200
commit	bb505be23ef07e2b3e47e268513a6f4f1b33f66b (patch)
tree	e82a6b934c43ddffce8b6b19cb09911c01589d33 /algorithm.tex
parent	221aa79714add6689aaa64522b6d6d8b0d2bea46 (diff)
download	oopsla21_fvhls-bb505be23ef07e2b3e47e268513a6f4f1b33f66b.tar.gz oopsla21_fvhls-bb505be23ef07e2b3e47e268513a6f4f1b33f66b.zip