Add changes

author: Yann Herklotz <git@yannherklotz.com> 2021-04-16 20:26:34 +0100
committer: Yann Herklotz <git@yannherklotz.com> 2021-04-16 20:26:40 +0100
commit: a65bdc9ee527e66cf07dd0c4dea21ad342b141b6 (patch)
tree: 66e79d79449eb9f11a700747b2a0414af614f31d /algorithm.tex
parent: 7c43b3885f3b9248170974bc4c2b8a85457c6b40 (diff)
download: oopsla21_fvhls-a65bdc9ee527e66cf07dd0c4dea21ad342b141b6.tar.gz
oopsla21_fvhls-a65bdc9ee527e66cf07dd0c4dea21ad342b141b6.zip
1 files changed, 2 insertions, 2 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 4d07424..ba44613 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -162,7 +162,7 @@ In this section, we describe the stages of the \vericert{} translation, referrin
 \subsubsection{Translating C to 3AC}
 
 The first stage of the translation uses unmodified \compcert{} to transform the C input, shown in Figure~\ref{fig:accumulator_c}, into a 3AC intermediate representation, shown in Figure~\ref{fig:accumulator_rtl}.
-As part of this translation, function inlining is performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency\YH{TODO: Add sentence on why inlining is often the better choice}. Inlining precludes support for recursive function calls, but this feature isn't supported in most other HLS tools either~\cite{davidthomas_asap16}.
+As part of this translation, function inlining is performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency\YH{TODO: Add sentence on why inlining is often the better choice}. Inlining precludes support for recursive function calls, but this feature is not supported in most other HLS tools either~\cite{davidthomas_asap16}.
 
 %\JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're  not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling.  With scheduling I think that's true, inlining actually becomes quite good.}
 
@@ -332,7 +332,7 @@ One big difference between C and Verilog is how memory is represented.  Although
 However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}.  If a byte-addressable memory was used in the target hardware, which is closer to \compcert{}'s memory model, then a load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle.  It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware.  Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way.  Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four.  If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
 
 \subsubsection{Implementation of RAM templates}
-The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path doesn't access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM driver (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate driver like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.
+The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM driver (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate driver like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.
 
 Therefore, an extra compiler pass is added from HTL to HTL to extracts all the direct accesses to the Verilog array implementing memory and replaces them by signals which access the memory in a separate always-block.  This ensures that the synthesis tool correctly identifies the array as being a RAM, so that it is not implemented by logic directly.  The translation is performed by going through all the instructions and replacing each load and store expression one after another.  Stores can simply be replaced by the necessary wires directly, however, loads using the RAM block take two clock cycles instead of a direct load from an array which only takes one clock cycles.  This pass therefore creates a extra state which is inserted after each load.
author	Yann Herklotz <git@yannherklotz.com>	2021-04-16 20:26:34 +0100
committer	Yann Herklotz <git@yannherklotz.com>	2021-04-16 20:26:40 +0100
commit	a65bdc9ee527e66cf07dd0c4dea21ad342b141b6 (patch)
tree	66e79d79449eb9f11a700747b2a0414af614f31d /algorithm.tex
parent	7c43b3885f3b9248170974bc4c2b8a85457c6b40 (diff)
download	oopsla21_fvhls-a65bdc9ee527e66cf07dd0c4dea21ad342b141b6.tar.gz oopsla21_fvhls-a65bdc9ee527e66cf07dd0c4dea21ad342b141b6.zip