Add some fixes

author: Yann Herklotz <git@yannherklotz.com> 2021-04-16 17:10:19 +0100
committer: Yann Herklotz <git@yannherklotz.com> 2021-04-16 17:11:02 +0100
commit: efa1e45a13c073ab6bec67eec9f195a4dc8c7039 (patch)
tree: 71051a866e6799d0de98a9863fc1c907a007e42a /algorithm.tex
parent: 8746f4b67cc269a116a2497935edeec243a0bb7a (diff)
download: oopsla21_fvhls-efa1e45a13c073ab6bec67eec9f195a4dc8c7039.tar.gz
oopsla21_fvhls-efa1e45a13c073ab6bec67eec9f195a4dc8c7039.zip
1 files changed, 14 insertions, 17 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 6dadef4..2c7cea2 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -150,7 +150,7 @@ module main(reset, clk, finish, return_val);
          endcase
 endmodule
 \end{minted}
-\caption{Verilog produced by \vericert{}. It demonstrates the instantiation of the RAM (lines 9--15), \JW{Sorry about the magic numbers here.} the data-path (lines 16--32) and the control logic (lines 33--42).}\label{fig:accumulator_v}
+\caption{Verilog produced by \vericert{}. It demonstrates the instantiation of the RAM (lines 9--15), the data-path (lines 16--32) and the control logic (lines 33--42).}\label{fig:accumulator_v}
 \end{subfigure}
   \caption{Translating a simple program from C to Verilog.}\label{fig:accumulator_c_rtl}
 \end{figure}
@@ -179,7 +179,7 @@ As part of this translation, function inlining is performed on all functions, wh
 The next translation is from 3AC to a new hardware translation language (HTL). %, which is one step towards being completely translated to hardware described in Verilog.
 This involves going from a CFG representation of the computation to a finite state machine with data-path (FSMD) representation~\cite{hwang99_fsmd}. The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers. %\JP{I've become less comfortable with this term, but it's personal preference so feel free to ignore. I think `generalised finite state machine' (i.e.\ thinking of the entire `data-path' as contributing to the overall state) is more accurate.}\YH{Hmm, yes, I mainly chose FSMD because there is quite a lot of literature around it.  I think for now I'll keep it but for the final draft we could maybe change it.}
 %This means that the state transitions can be translated into a simple finite state machine (FSM) where each state contains data operations that update the memory and registers. 
-Hence, an HTL program consists of two maps from states to Verilog statements: control- and data-path maps that express state transitions and computations respectively.
+Hence, an HTL program consists of two maps from states to Verilog statements: control logic and data-path maps that express state transitions and computations respectively.
 Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The right-hand block is the control logic that computes the next state, while the left-hand block updates all the registers and RAM based on the current program state. 
 
 \begin{figure*}
@@ -300,19 +300,13 @@ A high-level overview of the architecture can be seen in Figure~\ref{fig:accumul
 \paragraph{Translating instructions}
 
 Each 3AC instruction either corresponds to a hardware construct, or does not have to be handled by the translation, such as function calls (because of inlining).
-For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in HTL at state 8 in both the control- and data-path \JW{There's an inconsistency throughout this section between `control logic' and `control-path'.} always-blocks, shown in Figure~\ref{fig:accumulator_v}.  Simple operator instructions are translated in a similar way.
+For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in HTL at state 8 in both the control logic and data-path always-blocks, shown in Figure~\ref{fig:accumulator_v}.  Simple operator instructions are translated in a similar way.
 
 C and Verilog handle signedness quite differently. By default, all operators and registers in Verilog (and HTL) are unsigned, so to force an operation to handle the bits as signed, both operators have to be forced to be signed. Moreover, Verilog implicitly resizes expressions to the largest needed size by default, which can affect the result of the computation.  This feature is not supported by the Verilog semantics we adopted, so to match the semantics to the behaviour of the simulator and synthesis tool, braces are placed around all expressions to inhibit implicit resizing.  Instead, explicit resizing is used in the semantics, and operations can only be performed on two registers that have the same size.
 
-In addition to that, equality between \emph{unsigned} variables is actually not supported, because this requires supporting the comparison of pointers, which should only be performed between pointers with the same provenance.  In \vericert{} there is currently no way to determine the provenance of a pointer, and it therefore cannot model the semantics of unsigned comparison in \compcert{}. \JW{Perhaps begin this sentence with `This is not a severe restriction in practice, however, because...' and then rejig the rest of the sentence accordingly? That would make the sentiment clear to the casual reader.} As dynamic allocation is not supported either, equality comparison of pointers is rarely needed, and for the equality comparison of integers, these can be cast to signed integers during the equality check for the translation to succeed.
+In addition to that, equality between \emph{unsigned} variables is actually not supported, because this requires supporting the comparison of pointers, which should only be performed between pointers with the same provenance.  In \vericert{} there is currently no way to determine the provenance of a pointer, and it therefore cannot model the semantics of unsigned comparison in \compcert{}. This is not a severe restriction in practice, however, because, as dynamic allocation is not supported either, equality comparison of pointers is rarely needed, and equality comparison of integers can still be performed by casting them both to signed integers.
 
-Finally, the \texttt{mulhs} and \texttt{mulhu} instructions, which fetch the upper bits of a 32-bit multiplication, are not translated by \vericert{} either. 
-\JW{maybe add `because 64-bit numbers are not supported.' and then delete the next sentence?}
-However, 64-bit number representations are currently not supported in the generated hardware, so this operation cannot currently be performed.  These instructions are only generated to optimise divides \JW{divisions?} by a constant \JWcouldcut{number} that is not a power of two, so turning off constant propagation will allow these programs to pass without error.
-
-\subsubsection{RAM insertion}
-\JW{I think this belongs in the `optimisations' subsection?}
-This pass goes from HTL \JWcouldcut{back} to HTL and extracts all the direct accesses to the Verilog array implementing memory and replaces them by signals which access the memory in a separate always-block.  This ensures that the synthesis tool correctly identifies the array as being a RAM, so that it is not implemented by logic directly.  The translation is performed by going through all the instructions and replacing each load and store expression one after another.  Stores can simply be replaced by the necessary wires directly, however, loads using the RAM block take two clock cycles instead of a direct load from an array which only takes one clock cycles.  This pass therefore creates a extra state which is inserted after each load.
+Finally, the \texttt{mulhs} and \texttt{mulhu} instructions, which fetch the upper bits of a 32-bit multiplication, are not translated by \vericert{} either, because 64-bit numbers are not supported. These instructions are only generated to optimise divisions by a constant that is not a power of two, so turning off constant propagation will allow these programs to pass without error.
 
 \subsubsection{Translating HTL to Verilog}
 
@@ -334,14 +328,17 @@ Although we would not claim that \vericert{} is a proper `optimising' HLS compil
 
 \subsubsection{Byte- and word-addressable memories}
 
-One big difference between C and Verilog is how memory is represented.  Although Verilog arrays might seem to mirror their C counterparts directly, \JW{Or: `Although Verilog arrays use similar syntax to C arrays'} they must be treated quite differently. To reduce the design area and avoid timing issues, it is beneficial if Verilog arrays can be synthesised as RAMs, \JWcouldcut{but this imposes various constraints on how Verilog arrays are used; for instance, RAMs often only allow one read and one write operation per clock cycle.} \JW{I think that bit can be chopped from here because that's the topic of the next paragraph.} To make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle. \JW{I suggest spelling out explicitly that if you had byte-addressable RAMs you'd need to take four clock cycles to read a single word.}
-However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}.  It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware.  Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way.  Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four.  If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
+One big difference between C and Verilog is how memory is represented.  Although Verilog arrays use similar syntax to C arrays, they must be treated quite differently. To make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.
+However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}.  If a byte-addressable memory were used in the target hardware, which would be closer to \compcert{}'s memory model, then each load or store would instead take four clock cycles, because a RAM can only perform one read and one write per clock cycle.  It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware.  Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way.  Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four.  If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
+
+The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly in the data-path (i.e., within the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path doesn't access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM driver (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate driver like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.
+
+Therefore, an extra compiler pass is added from HTL to HTL that extracts all the direct accesses to the Verilog array implementing memory and replaces them by signals which access the memory in a separate always-block.  This ensures that the synthesis tool correctly identifies the array as being a RAM, so that it is not implemented by logic directly.  The translation is performed by going through all the instructions and replacing each load and store expression one after another.  Stores can simply be replaced by the necessary wires directly. However, loads using the RAM block take two clock cycles, whereas a direct load from an array takes only one clock cycle.  This pass therefore creates an extra state which is inserted after each load.
 
-\subsubsection{Implementation of RAM templates}
-\JW{The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path doesn't access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM driver (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate driver like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.} \JW{I've called that negedge always-block the `RAM driver' in my proposed text above -- that feels like quite a nice a word for it to my mind -- what do you think?}
-Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs.  If, for example, arrays in Verilog are accessed immediately in the data-path, then the synthesis tool is not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers.  This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit.  The synthesis tools therefore provide code snippets that they know how to transform into various constructs, including snippets that will generate proper RAMs in the final hardware.  This process is called memory inference.  The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and logic.  We therefore have another pass from HTL to itself which performs the translation from this na\"ive use of arrays to a representation which always allows for memory inference.  This pass creates a separate always block to perform the loads and stores to the memory, and adds the necessary data, address and enable signals to communicate with that always-block from other always-blocks.  This always-block is shown between lines 10-15 in Figure~\ref{fig:accumulator_v}.
+%\JW{I've called that negedge always-block the `RAM driver' in my proposed text above -- that feels like quite a nice a word for it to my mind -- what do you think?}\YH{Yes I quite like it!}
+%Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs.  If, for example, arrays in Verilog are accessed immediately in the data-path, then the synthesis tool is not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers.  This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit.  The synthesis tools therefore provide code snippets that they know how to transform into various constructs, including snippets that will generate proper RAMs in the final hardware.  This process is called memory inference.  The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and logic.  We therefore have another pass from HTL to itself which performs the translation from this na\"ive use of arrays to a representation which always allows for memory inference.  This pass creates a separate always block to perform the loads and stores to the memory, and adds the necessary data, address and enable signals to communicate with that always-block from other always-blocks.  This always-block is shown between lines 10-15 in Figure~\ref{fig:accumulator_v}.
 
-There are two interesting parts to this RAM template.  Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design which is triggered on the positive edge of the clock.  The main advantage is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they only take two clock cycles and one clock cycle instead, greatly improving their performance.  In addition to that, using the negative edge for the clock is supported by many synthesis tools, it therefore does not affect the maximum frequency of the final design.
+There are two interesting parts to the inserted RAM template.  Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design, which is triggered on the positive edge of the clock.  The main advantage is that loads and stores take only two clock cycles and one clock cycle respectively, instead of three and two, greatly improving their performance.  In addition, using the negative edge of the clock is supported by many synthesis tools, so it does not affect the maximum frequency of the final design.
 
 Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical.  To make the proof simpler, the goal is to create a RAM which disables itself after every use, so that firstly, the proof can assume that the RAM is disabled at the start and end of every clock cycle, and secondly, only the state which contains the load or store needs to be modified to ensure this property.  Using a simple enable signal, it would not be possible to disable it in the RAM itself, because this would result in a register being driven from two different locations: it would have to be enabled from the data-path and disabled from the RAM.  The only other solutions are to either insert extra states that disable the RAM accordingly, thereby eliminating the speed advantage of the negative edge triggered RAM, or to check the next state after a load or store and insert disables into that state.  The latter solution can quickly become complicated though, especially as this new state could contain another memory operation, in which case the disable signal should not be added to that state.  We instead generate a second enable signal (\texttt{u\_en}) that is set by the data-path, and the original enable signal (\texttt{en}) is then updated by the self-disabling RAM to be equal to the value that the data-path set.  The RAM is therefore considered enabled whenever the two signals are different, and disabled otherwise.
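
The self-disabling enable scheme described in the last paragraph of the diff can be illustrated with a small behavioural model. This is only a sketch in Python, not the Verilog that \vericert{} generates: the class name `SelfDisablingRam`, its methods, and the signal names other than `en` and `u_en` (`wr_en`, `addr`, `d_in`, `d_out`) are assumptions made for illustration. The data-path side toggles `u_en` to request an access, and the RAM driver, modelled as a call on the falling clock edge, performs the access and then copies `u_en` into `en`, which disables the RAM again.

```python
class SelfDisablingRam:
    """Behavioural sketch of a RAM that disables itself after every access.

    Only `en` and `u_en` are named in the text; every other name here is a
    hypothetical choice for this illustration.
    """

    def __init__(self, size):
        self.mem = [0] * size
        self.en = 0      # enable bit owned by the RAM driver
        self.u_en = 0    # enable bit owned by the data-path
        self.wr_en = 0   # 1 for a store, 0 for a load
        self.addr = 0
        self.d_in = 0
        self.d_out = 0

    def enabled(self):
        # The RAM is active exactly when the two enable bits differ.
        return self.en != self.u_en

    def request_store(self, addr, value):
        # Data-path side: set up the access, then toggle u_en.
        self.addr, self.d_in, self.wr_en = addr, value, 1
        self.u_en ^= 1

    def request_load(self, addr):
        # Data-path side: set up the access, then toggle u_en.
        self.addr, self.wr_en = addr, 0
        self.u_en ^= 1

    def negedge(self):
        # RAM driver side, modelled as one falling clock edge.
        if self.enabled():
            if self.wr_en:
                self.mem[self.addr] = self.d_in
            else:
                self.d_out = self.mem[self.addr]
            # Copying u_en into en disables the RAM until the next toggle.
            self.en = self.u_en


ram = SelfDisablingRam(4)
ram.request_store(2, 42)
ram.negedge()            # performs the store and disables the RAM
ram.negedge()            # nothing happens: the RAM is already disabled
ram.request_load(2)
ram.negedge()            # performs the load into d_out
```

In this model each enable bit has a single writer: the data-path only ever writes `u_en` and the driver only ever writes `en`. That is the property the paragraph relies on to avoid driving one register from two locations.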
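
The byte-to-word address translation mentioned in the subsection on byte- and word-addressable memories (divide the byte address by four, given that it is a multiple of four) can be sketched as follows. This is a minimal illustration in Python, not code from \vericert{}; the names `WORD_BYTES` and `word_index` are hypothetical.

```python
WORD_BYTES = 4  # assumption: 32-bit words, matching the integer loads and stores in the text


def word_index(byte_addr):
    """Map a byte address to an index into a word-addressable memory.

    The translation is only defined for aligned addresses, so unaligned
    addresses are rejected rather than silently rounded down.
    """
    if byte_addr % WORD_BYTES != 0:
        raise ValueError(f"unaligned byte address: {byte_addr}")
    return byte_addr // WORD_BYTES


# A word-addressable memory modelled as a list of 32-bit values.
word_mem = [0] * 8
word_mem[word_index(12)] = 0xDEADBEEF   # a store to byte address 12 lands in word 3
```

Rejecting unaligned addresses, instead of truncating them, mirrors the precondition stated in the text that addresses given to loads and stores should be multiples of four.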