Finish fixing algorithm section

author: Yann Herklotz <git@yannherklotz.com> 2021-04-14 13:33:42 +0100
committer: Yann Herklotz <git@yannherklotz.com> 2021-04-14 14:23:18 +0100
commit: c8926c52d9077fb65b0310b38248abdc49f4335b (patch)
tree: d271219b382aa1a9e9b5e5416501e399a1bc3b5e
parent: 8c90b19b5f6e9a4a6e0a2b5835fca961828ef10e (diff)
download: oopsla21_fvhls-c8926c52d9077fb65b0310b38248abdc49f4335b.tar.gz
oopsla21_fvhls-c8926c52d9077fb65b0310b38248abdc49f4335b.zip
2 files changed, 13 insertions, 7 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 14d1496..3ea6778 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -159,8 +159,8 @@ In this section, we describe the stages of the \vericert{} translation, referrin
 
 \subsubsection{Translating C to 3AC}
 
-The first stage of the translation uses unmodified \compcert{} to transform the C input into a 3AC intermediate representation, shown in Figure~\ref{fig:accumulator_rtl}.
-As part of this translation, function inlining is also performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency\YH{TODO: Add sentence on why inlining is often the better choice}. Inlining precludes support for recursive function calls, but this feature isn't supported in most other HLS tools either~\cite{davidthomas_asap16}.
+The first stage of the translation uses unmodified \compcert{} to transform the C input, shown in Figure~\ref{fig:accumulator_c}, into a 3AC intermediate representation, shown in Figure~\ref{fig:accumulator_rtl}.
+As part of this translation, function inlining is performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency\YH{TODO: Add sentence on why inlining is often the better choice}. Inlining precludes support for recursive function calls, but this feature isn't supported in most other HLS tools either~\cite{davidthomas_asap16}.
 
 %\JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're  not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling.  With scheduling I think that's true, inlining actually becomes quite good.}
 
@@ -298,11 +298,13 @@ A high-level overview of the architecture can be seen in Figure~\ref{fig:accumul
 \paragraph{Translating instructions}
 
 Each 3AC instruction either corresponds to a hardware construct, or does not have to be handled by the translation, such as function calls (because of inlining).
-For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x2} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in HTL at state 4 in both the control- and data-path always-blocks, shown in Figure~\ref{fig:accumulator_v}.  Simple operator instructions are translated in a similar way.  For example, in state 5, the value of the array element is added to the current sum value, which is simply translated to an addition of the equivalent registers in the HTL code.
+For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in HTL at state 8 in both the control- and data-path always-blocks, shown in Figure~\ref{fig:accumulator_v}.  Simple operator instructions are translated in a similar way.
 
-Note that the comparison in state 3 is signed. C and Verilog handle signedness quite differently; by default, all operators and registers in Verilog (and HTL) are unsigned, so to force an operation to handle the bits as signed, both operators have to be forced to be signed.  In addition to that, Verilog implicitly resizes expressions to the largest needed size by default, which can affect the result of the computation.  This feature is not supported by the Verilog semantics we adopted, so to match the semantics to the behaviour of the simulator and synthesis tool, braces are placed around all expressions as this inhibits implicit resizing.  Instead, explicit resizing is used in the semantics and operations can only be performed on two registers that have the same size.  
+C and Verilog handle signedness quite differently; by default, all operators and registers in Verilog (and HTL) are unsigned, so to force an operation to handle the bits as signed, both operators have to be forced to be signed.  In addition to that, Verilog implicitly resizes expressions to the largest needed size by default, which can affect the result of the computation.  This feature is not supported by the Verilog semantics we adopted, so to match the semantics to the behaviour of the simulator and synthesis tool, braces are placed around all expressions as this inhibits implicit resizing.  Instead, explicit resizing is used in the semantics and operations can only be performed on two registers that have the same size.
 
-In addition to that, equality between \emph{unsigned} variables are actually not supported, because this requires supporting the comparison of pointers, which should only be performed between pointers with the same provenance.  In \vericert{} there is currently no way to determine the provenance of a pointer, and it therefore cannot model the semantics of unsigned comparison in \compcert{}.  As dynamic allocation is not supported either, comparison of pointers is rarely needed, and for the comparison of integers, these can be cast to signed integers during the comparison for the translation to succeed.
+In addition to that, equality between \emph{unsigned} variables are actually not supported, because this requires supporting the comparison of pointers, which should only be performed between pointers with the same provenance.  In \vericert{} there is currently no way to determine the provenance of a pointer, and it therefore cannot model the semantics of unsigned comparison in \compcert{}.  As dynamic allocation is not supported either, equality comparison of pointers is rarely needed, and for the equality comparison of integers, these can be cast to signed integers during the equality check for the translation to succeed.
+
+Finally, the \texttt{mulhs} and \texttt{mulhu} instructions are not translated by \vericert{} either.  These instructions fetch the upper bits of a 32-bit multiply.  However, 64-bit number representations are currently not supported in the generated hardware, so this operation cannot currently be performed.  These instructions are only generated to optimise divides by a constant number that is not a power of two, so turning off constant propagation will allow these programs to pass without error.
 
 \subsubsection{Translating HTL to Verilog}
 
@@ -311,7 +313,7 @@ The challenge here is to translate our FSMD representation into a Verilog AST.
 Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated for our example.
 
 Although this translation seems quite straight\-forward, proving that this translation is correct is complex.
-All the implicit assumptions that were made in HTL need to be translated explicitly to Verilog statements and it needs to be shown that these explicit behaviours are equivalent to the assumptions made in the HTL semantics.
+All the implicit assumptions that were made in HTL need to be translated explicitly to Verilog statements and it needs to be shown that these explicit behaviours are equivalent to the assumptions made in the HTL semantics.  One main example of this is proving that specification of the RAM in HTL does indeed behave the same as its Verilog implementation.
 We discuss these proofs in upcoming sections.
 
  %In general, the generated Verilog structure has similar to that of the HTL code.
@@ -327,6 +329,10 @@ Although we would not claim that \vericert{} is a proper `optimising' HLS compil
 One big difference between C and Verilog is how memory is represented.  Although Verilog arrays might seem to mirror their C counterparts directly, they must be treated quite differently. To reduce the design area and avoid timing issues, it is beneficial if Verilog arrays can be synthesised as RAMs, but this imposes various constraints on how Verilog arrays are used; for instance, RAMs often only allow one read and one write operation per clock cycle. To make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.
 However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}.  It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware.  Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way.  Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four.  If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
 
+\subsubsection{Implementation of RAM templates}
+
+Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs.  Synthesis tools provide code snippets that they know how to transform into various constructs, including proper RAMs in the final hardware.  If, instead of using these templates, the array was accessed immediately in the data-path, then the synthesis tool would not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers.  This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit.  The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and signalling.  Another pass from HTL to HTL then performs the translation from this na\"ive representation to using a proper RAM template, adding the necessary data, address and enable signals to make the synthesis tool infer that block as a proper RAM.  The RAM template that is used has some interesting additions to make the proof of this pass easier, which is further explained in Section~\ref{sec:verilog:memory}.
+
 \subsubsection{Implementing the \texttt{Oshrximm} instruction}
 
 % Mention that this optimisation is not performed sometimes (clang -03).
diff --git a/verilog.tex b/verilog.tex
index ad9af9f..e77aa97 100644
--- a/verilog.tex
+++ b/verilog.tex
@@ -110,7 +110,7 @@ Therefore, in addition to the rules shown in Figure~\ref{fig:inferrence_module},
 
 \noindent where the initial state is the \texttt{Callstate} with an empty stack frame and no arguments for the \texttt{main} function of program $P$, where this \texttt{main} function needs to be in the current translation unit.  The final state results in the program output of value $n$ when reaching a \texttt{Returnstate} with an empty stack frame.
 
-\subsection{Memory Model}
+\subsection{Memory Model}\label{sec:verilog:memory}
 
 The Verilog semantics do not define a memory model for Verilog, as this is not needed for a hardware description language.  There is no preexisting architecture that Verilog will produce, it can describe any memory layout that is needed.  Instead of having specific semantics for memory, the semantics only need to support the language features that can produce these different memory layouts, these being Verilog arrays.  We therefore define semantics for updating Verilog arrays using blocking and nonblocking assignment.  We then have to prove that the C memory model that \compcert{} uses matches with the interpretation of arrays that are used in Verilog.  The \compcert{} memory model is infinite, whereas our representation of arrays in Verilog is inherently finite.  There have already been various efforts to define a finite memory model for \compcert{}, such as CompCertS~\cite{besson18_compc}, CompCertELF~\cite{wang20_compc} and CompCertTSO~\cite{sevcik13_compc}, however, we only define the translation from \compcert{}'s standard infinite memory model to finitely sized arrays that can be represented in Verilog.
author	Yann Herklotz <git@yannherklotz.com>	2021-04-14 13:33:42 +0100
committer	Yann Herklotz <git@yannherklotz.com>	2021-04-14 14:23:18 +0100
commit	c8926c52d9077fb65b0310b38248abdc49f4335b (patch)
tree	d271219b382aa1a9e9b5e5416501e399a1bc3b5e
parent	8c90b19b5f6e9a4a6e0a2b5835fca961828ef10e (diff)
download	oopsla21_fvhls-c8926c52d9077fb65b0310b38248abdc49f4335b.tar.gz oopsla21_fvhls-c8926c52d9077fb65b0310b38248abdc49f4335b.zip