AddFix more things

author: Yann Herklotz <git@yannherklotz.com> 2021-04-16 20:53:57 +0100
committer: Yann Herklotz <git@yannherklotz.com> 2021-04-16 20:53:57 +0100
commit: 1479bb42c2877c29376549d768a97676e1b96841 (patch)
tree: 42dedad20c175e8c9340316a6abde823c213b44b /algorithm.tex
parent: 7d8150af139d30058a6be3b962f252505fd45d9b (diff)
download: oopsla21_fvhls-1479bb42c2877c29376549d768a97676e1b96841.tar.gz
oopsla21_fvhls-1479bb42c2877c29376549d768a97676e1b96841.zip
1 files changed, 11 insertions, 9 deletions
diff --git a/algorithm.tex b/algorithm.tex
index ba44613..f8514b8 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -114,7 +114,7 @@ module main(reset, clk, finish, return_val);
   reg [0:0] en = 0, u_en = 0;
   reg [31:0] state = 0, reg_2 = 0, reg_4 = 0, d_out = 0, reg_1 = 0;
   reg [31:0] stack [1:0];
-  // RAM template
+  // RAM interface
   always @(negedge clk)
     if ({u_en != en}) begin
       if (wr_en) stack[addr] <= d_in;
@@ -162,7 +162,7 @@ In this section, we describe the stages of the \vericert{} translation, referrin
 \subsubsection{Translating C to 3AC}
 
 The first stage of the translation uses unmodified \compcert{} to transform the C input, shown in Figure~\ref{fig:accumulator_c}, into a 3AC intermediate representation, shown in Figure~\ref{fig:accumulator_rtl}.
-As part of this translation, function inlining is performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency\YH{TODO: Add sentence on why inlining is often the better choice}. Inlining precludes support for recursive function calls, but this feature is not supported in most other HLS tools either~\cite{davidthomas_asap16}.
+As part of this translation, function inlining is performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency. Inlining precludes support for recursive function calls, but this feature is not supported in most other HLS tools either~\cite{davidthomas_asap16}.
 
 %\JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're  not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling.  With scheduling I think that's true, inlining actually becomes quite good.}
 
@@ -293,7 +293,7 @@ Typically, HLS-generated hardware consists of a sea of registers and RAM memorie
 This memory view is very different to the C memory model, so we perform the following translation. 
 Variables that do not have their address taken are kept in registers, which correspond to the registers in 3AC.
 All address-taken variables, arrays, and structs are kept in RAM.
-The stack of the main function becomes an unpacked array of 32-bit integers, which is translated to a RAM when the hardware description is passed through a synthesis tool.  In this pass, loads and stores are translated to direct accesses to the Verilog array representing memory.
+The stack of the main function becomes an unpacked array of 32-bit integers, which is translated to a RAM when the hardware description is passed through a synthesis tool.  In a first pass, RAM is represented by direct reads and writes to an array, however, in a RAM insertion pass a proper RAM interface is added, and reads and writes are performed through that interface.
 Finally, global variables are not translated in \vericert{} at the moment.
 A high-level overview of the architecture can be seen in Figure~\ref{fig:accumulator_diagram}. 
 
@@ -304,7 +304,7 @@ For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register
 
 C and Verilog handle signedness quite differently. By default, all operators and registers in Verilog (and HTL) are unsigned, so to force an operation to handle the bits as signed, both operators have to be forced to be signed. Moreover, Verilog implicitly resizes expressions to the largest needed size by default, which can affect the result of the computation.  This feature is not supported by the Verilog semantics we adopted, so to match the semantics to the behaviour of the simulator and synthesis tool, braces are placed around all expressions to inhibit implicit resizing.  Instead, explicit resizing is used in the semantics, and operations can only be performed on two registers that have the same size.
 
-In addition to that, equality between \emph{unsigned} variables is actually not supported, because this requires supporting the comparison of pointers, which should only be performed between pointers with the same provenance.  In \vericert{} there is currently no way to determine the provenance of a pointer, and it therefore cannot model the semantics of unsigned comparison in \compcert{}. This is not a severe restriction in practice, however, because, as dynamic allocation is not supported either, equality comparison of pointers is rarely needed, and equality comparison of integers can still be performed by casting them both to signed integers.
+In addition to that, equality checks between \emph{unsigned} variables is actually not supported, because this requires supporting the comparison of pointers, which should only be performed between pointers with the same provenance.  In \vericert{} there is currently no way to determine the provenance of a pointer, and it therefore cannot model the semantics of unsigned comparison in \compcert{}. This is not a severe restriction in practice, however, because, as dynamic allocation is not supported either, equality comparison of pointers is rarely needed, and equality comparison of integers can still be performed by casting them both to signed integers.
 
 Finally, the \texttt{mulhs} and \texttt{mulhu} instructions, which fetch the upper bits of a 32-bit multiplication, are not translated by \vericert{} either, because 64-bit numbers are not supported. These instructions are only generated to optimise divisions by a constant that is not a power of two, so turning off constant propagation will allow these programs to pass without error.
 
@@ -331,17 +331,19 @@ Although we would not claim that \vericert{} is a proper `optimising' HLS compil
 One big difference between C and Verilog is how memory is represented.  Although Verilog arrays use similar syntax to C arrays, they must be treated quite differently. To make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.
 However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}.  If a byte-addressable memory was used in the target hardware, which is closer to \compcert{}'s memory model, then a load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle.  It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware.  Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way.  Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four.  If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
 
-\subsubsection{Implementation of RAM templates}
-The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM driver (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate driver like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.
+\subsubsection{Implementation of RAM interface}
+The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.  Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.
 
-Therefore, an extra compiler pass is added from HTL to HTL to extracts all the direct accesses to the Verilog array implementing memory and replaces them by signals which access the memory in a separate always-block.  This ensures that the synthesis tool correctly identifies the array as being a RAM, so that it is not implemented by logic directly.  The translation is performed by going through all the instructions and replacing each load and store expression one after another.  Stores can simply be replaced by the necessary wires directly, however, loads using the RAM block take two clock cycles instead of a direct load from an array which only takes one clock cycles.  This pass therefore creates a extra state which is inserted after each load.
+Therefore, an extra compiler pass is added from HTL to HTL to extracts all the direct accesses to the Verilog array and replaces them by signals which access the RAM interface in a separate always-block. The translation is performed by going through all the instructions and replacing each load and store expression one after another.  Stores can simply be replaced by the necessary wires directly, however, loads using the RAM block take two clock cycles instead of a direct load from an array which only takes one clock cycles.  This pass therefore creates a extra state which is inserted after each load.
 
 %\JW{I've called that negedge always-block the `RAM driver' in my proposed text above -- that feels like quite a nice a word for it to my mind -- what do you think?}\YH{Yes I quite like it!}
 %Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs.  If, for example, arrays in Verilog are accessed immediately in the data-path, then the synthesis tool is not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers.  This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit.  The synthesis tools therefore provide code snippets that they know how to transform into various constructs, including snippets that will generate proper RAMs in the final hardware.  This process is called memory inference.  The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and logic.  We therefore have another pass from HTL to itself which performs the translation from this na\"ive use of arrays to a representation which always allows for memory inference.  This pass creates a separate always block to perform the loads and stores to the memory, and adds the necessary data, address and enable signals to communicate with that always-block from other always-blocks.  This always-block is shown between lines 10-15 in Figure~\ref{fig:accumulator_v}.
 
-There are two interesting parts to the inserted RAM template.  Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design which is triggered on the positive edge of the clock.  The main advantage is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they only take two clock cycles and one clock cycle instead, greatly improving their performance.  In addition to that, using the negative edge for the clock is supported by many synthesis tools, it therefore does not affect the maximum frequency of the final design.
+There are two interesting parts to the inserted RAM interface.  Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design which is triggered on the positive edge of the clock.  The main advantage is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they only take two clock cycles and one clock cycle instead, greatly improving their performance.  In addition to that, using the negative edge for the clock is supported by many synthesis tools, it therefore does not affect the maximum frequency of the final design.
 
-Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical.  To make the proof simpler, the goal is to create a RAM which disables itself after every use, so that firstly, the proof can assume that the RAM is disabled at the start and end of every clock cycle, and secondly so that only the state which contains the load and store need to be modified to ensure this property.  Using a simple enable signal, it would not be possible to disable it in the RAM itself, because this would result in a register being driven twice from two different locations.  It has to be enabled from the data-path and disabled from the RAM.  The only other solutions are to either insert extra states that disable the RAM accordingly, thereby eliminating the speed advantage of the negative edge triggered RAM, or to check the next state after a load and store and insert disables into that state.  The latter solution can quickly become complicated though, especially as this new state could contain another memory operation, in which case the disable signal should not be added to that state.  We can instead generate a second enable signal that is set by the user, and the original enable signal is then updated by the RAM to be equal to the value that the user set.  This means that the RAM should be enabled whenever the two signals are different, and disabled otherwise.  A solution to this problem is to create another enable signal that is controlled by the self-disabling RAM, which is always set to be equal to the enable signal set by the data-path.  The RAM is then considered enabled if the data-path enable and the RAM enable are different.
+Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical.  To make the proof simpler, the goal is to create a RAM which disables itself after every use, so that firstly, the proof can assume that the RAM is disabled at the start and end of every clock cycle, and secondly so that only the state which contains the load and store need to be modified to ensure this property.  With a simple enable signal, it would not be possible to disable it in the RAM itself, because this would result in a register being driven twice from two different locations, once from the RAM interface and once from the data-path.  The only other solutions are to either insert extra states that disable the RAM accordingly, thereby eliminating the speed advantage of the negative edge triggered RAM, or to check the next state after a load and store and insert disables into that state.  The latter solution can quickly become complicated though, especially as this new state could contain another memory operation, in which case the disable signal should not be added to that state.  A solution to this problem is to create another enable signal that is controlled by the self-disabling RAM, which is always set to be equal to the enable signal set by the data-path.  The RAM is then considered enabled if the data-path enable and the RAM enable are different.
+
+%We can instead generate a second enable signal that is set by the user, and the original enable signal is then updated by the RAM to be equal to the value that the user set.  This means that the RAM should be enabled whenever the two signals are different, and disabled otherwise.
 
 \begin{figure}
   \centering
author	Yann Herklotz <git@yannherklotz.com>	2021-04-16 20:53:57 +0100
committer	Yann Herklotz <git@yannherklotz.com>	2021-04-16 20:53:57 +0100
commit	1479bb42c2877c29376549d768a97676e1b96841 (patch)
tree	42dedad20c175e8c9340316a6abde823c213b44b /algorithm.tex
parent	7d8150af139d30058a6be3b962f252505fd45d9b (diff)
download	oopsla21_fvhls-1479bb42c2877c29376549d768a97676e1b96841.tar.gz oopsla21_fvhls-1479bb42c2877c29376549d768a97676e1b96841.zip