Update on Overleaf.

author: Nadesh Ramanathan <n.ramanathan@ieee.org> 2021-08-08 18:14:00 +0000
committer: node <node@git-bridge-prod-0> 2021-08-08 18:28:44 +0000
commit: 3aa90b4b91627bc851f79b47164280541f0c363c (patch)
tree: 2bb7808d9f4f9c9e64f731264aff97627418d207 /algorithm.tex
parent: aeb8620c1f530d5a43302ea4333fa6abdc951a25 (diff)
download: oopsla21_fvhls-3aa90b4b91627bc851f79b47164280541f0c363c.tar.gz
oopsla21_fvhls-3aa90b4b91627bc851f79b47164280541f0c363c.zip
1 files changed, 12 insertions, 12 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 222849d..fa90090 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -157,7 +157,7 @@ endmodule
 \end{minted}
 \caption{Verilog produced by \vericert{}. It demonstrates the instantiation of the RAM (lines 9--15), the data-path (lines 16--32) and the control logic (lines 33--42).}\label{fig:accumulator_v}
 \end{subfigure}
-  \caption{Translating a simple program from C to Verilog.}\label{fig:accumulator_c_rtl}
+  \caption{Translating a simple program from C to Verilog.}\label{fig:accumulator_c_rtl} \NR{Is the default in line 41 of (c) supposed to be empty?}
 \end{figure}
 
 \subsection{Translating C to Verilog, by example}
@@ -167,7 +167,7 @@ In this section, we describe the stages of the \vericert{} translation, referrin
 \subsubsection{Translating C to 3AC}
 
 The first stage of the translation uses unmodified \compcert{} to transform the C input, shown in Figure~\ref{fig:accumulator_c}, into a 3AC intermediate representation, shown in Figure~\ref{fig:accumulator_rtl}.
-As part of this translation, function inlining is performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency. Inlining precludes support for recursive function calls, but this feature is not supported in most other HLS tools either~\cite{davidthomas_asap16}.
+As part of this translation, function inlining is performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency \NR{Perhaps we can flag that it is a common HLS optimisation to inline function and cite the function inlining paper?}. Inlining precludes support for recursive function calls, but this feature is not supported in most other HLS tools either~\cite{davidthomas_asap16}.
 
 %\JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're  not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling.  With scheduling I think that's true, inlining actually becomes quite good.}
 
@@ -299,10 +299,10 @@ Compared to plain Verilog, HTL is simpler to manipulate and analyse, thereby mak
 %\JP{Does it? Verilog has neither physical registers nor RAMs, just language constructs which the synthesiser might implement with registers and RAMs. We should be clear whether we're talking about the HDL representation, or the synthesised result: in our case these can be very different since we don't target any specific architectural features of an FPGA fabric of ASIC process.}
 \paragraph{Translating memory}
 Typically, HLS-generated hardware consists of a sea of registers and RAMs.
-This memory view is very different from the C memory model, so we perform the following translation. 
+This memory view is very different from the C memory model, so we perform the following translation. \NR{Is this a good spot to say memory is `infinite' in software but finite in hardware?}
 Variables that do not have their address taken are kept in registers, which correspond to the registers in 3AC.
 All address-taken variables, arrays, and structs are kept in RAM.
-The stack of the main function becomes an unpacked array of 32-bit integers representing the RAM block.  Any loads and stores are temporarily translated to direct accesses to this array, where each address has its offset removed and is divided by four.  In a separate HTL-to-HTL conversion, these direct accesses are then translated to proper loads and stores that use a RAM interface to communicate with the RAM.  This pass inserts a RAM block with the interface around the unpacked array.  Without this interface and without the RAM block, the synthesis tool processing the Verilog hardware description would not identify the array as a RAM, and would instead implement it using many registers.  This interface is shown on lines 9--15 in the Verilog code in Figure~\ref{fig:accumulator_v}.
+The stack of the main function becomes an unpacked array of 32-bit integers representing the RAM block.  Any loads and stores are temporarily translated to direct accesses to this array, where each address has its offset removed and is divided by four.  In a separate HTL-to-HTL conversion, these direct accesses are then translated to proper loads and stores that use a RAM interface to communicate with the RAM, \NR{as seen in Line 21, 24 and 28 of Figure~\ref{fig:accumulator_c_rtl}}.  This pass inserts a RAM block with the interface around the unpacked array.  Without this interface and without the RAM block, the synthesis tool processing the Verilog hardware description would not identify the array as a RAM, and would instead implement it using many registers.  This interface is shown on lines 9--15 in the Verilog code in Figure~\ref{fig:accumulator_v}.
 A high-level overview of the architecture and of the RAM interface can be seen in Figure~\ref{fig:accumulator_diagram}.
 
 \paragraph{Translating instructions}
@@ -310,12 +310,12 @@ A high-level overview of the architecture and of the RAM interface can be seen i
 Most 3AC instructions correspond to hardware constructs.
 %Each 3AC instruction either corresponds to a hardware construct or does not have to be handled by the translation, such as function calls (because of inlining). \JW{Are function calls the only 3AC instruction that we ignore? (And I guess return statements too for the same reason.)}\YH{Actually, return instructions are translated (because you can return from main whenever), so call instructions (Icall, Ibuiltin and Itailcall) are the only functions that are not handled.}
 % JW: Thanks; please check proposed new text.
-For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in the Verilog generated from HTL at state 8 in both the control logic and data-path always-blocks, shown in Figure~\ref{fig:accumulator_v}.  Simple operator instructions are translated in a similar way.  For example, the add instruction is just translated to the built-in add operator, similarly for the multiply operator.  All 32-bit instructions can be translated in this way, but some special instructions require extra care. One such is the \texttt{Oshrximm} instruction, which is discussed further in Section~\ref{sec:algorithm:optimisation:oshrximm}. Another is the \texttt{Oshldimm} instruction, which is a left rotate instruction that has no Verilog equivalent and therefore has to be implemented in terms of other operations and proven to be equivalent. The only 32-bit instructions that we do not translate are those related to function calls (\texttt{Icall}, \texttt{Ibuiltin}, and \texttt{Itailcall}), because of inlining.
+For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in the Verilog generated from HTL at state 8 in both the control logic and data-path always-blocks, shown in Figure~\ref{fig:accumulator_v}\NR{Perhaps point out the line numbers?}.  Simple operator instructions are translated in a similar way.  For example, the add instruction is just translated to the built-in add operator, similarly for the multiply operator.  All 32-bit instructions can be translated in this way, but some special instructions require extra care. One such is the \texttt{Oshrximm} instruction, which is discussed further in Section~\ref{sec:algorithm:optimisation:oshrximm}. Another is the \texttt{Oshldimm} instruction, which is a left rotate instruction that has no Verilog equivalent and therefore has to be implemented in terms of other operations and proven to be equivalent\NR{This is interesting, how many operators are directly translatable and how many were not? Nice point to mention.}. The only 32-bit instructions that we do not translate are those related to function calls (\texttt{Icall}, \texttt{Ibuiltin}, and \texttt{Itailcall}), because \NR{ \sout{of} we enable} inlining \NR{by default}.
 
 \subsubsection{Translating HTL to Verilog}
 
 Finally, we have to translate the HTL code into proper Verilog. % and prove that it behaves the same as the 3AC according to the Verilog semantics.
-The challenge here is to translate our FSMD representation into a Verilog AST.  However, as all the instructions in HTL are already expressed as Verilog statements, only the top-level data-path and control logic maps need to be translated to valid Verilog. We also require declarations for all the variables in the program, as well as declarations of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.  In addition to translating the maps of Verilog statements, an always-block that will behave like the RAM also has to be created, which is only modelled abstractly at the HTL level.
+The challenge here is to translate our FSMD representation into a Verilog AST.  However, as all the instructions in HTL are already expressed as Verilog statements, only the top-level data-path and control logic maps need to be translated to valid Verilog. \NR{If the operators are in Verilog, how is the data-path not already in Verilog? Maybe some clarification on this.} We also require declarations for all the variables in the program, as well as declarations of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.  In addition to translating the maps of Verilog statements, an always-block that will behave like the RAM also has to be created, which is only modelled abstractly at the HTL level.
 Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated for our example.
 
 Although this translation seems quite straight\-forward, proving that this translation is correct is complex.
@@ -336,8 +336,8 @@ One big difference between C and Verilog is how memory is represented.  Although
 However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}.  If a byte-addressable memory was used in the target hardware, which is closer to \compcert{}'s memory model, then a load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle.  It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware.  Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way.  Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores will be multiples of four.  If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
 
 \subsubsection{Implementation of RAM interface}
-The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.
-Interestingly, the Verilog syntax for the RAM interface is quite strict, as the synthesis tool will pattern-match on it and only work for a predefined set of interfaces.
+The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. \NR{Worthwhile to discuss that this a coding style/template enforced by synthesis tools, which is a quirk in hardware. Cite the Altera manual too.} To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.
+Interestingly, the Verilog syntax for the RAM interface is quite strict, as the synthesis tool will pattern-match on it and only work for a predefined set of interfaces. \NR{Bring forward this sentence to help with flow.}
 
 %\JW{I think the following sentence could be cut as we've said this kind of thing a couple of times already.} Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.
 
@@ -348,9 +348,9 @@ Therefore, an extra compiler pass is added from HTL to HTL to extract all the di
 
 There are two interesting parts to the inserted RAM interface.  Firstly, the memory updates are triggered on the negative (falling) edge of the clock, out of phase with the rest of the design which is triggered on the positive (rising) edge of the clock.  The advantage of this is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they only take two clock cycles and one clock cycle instead, greatly improving their performance. %\JW{Is this a standard `trick' in hardware design? If so it might be nice to cite it.}\YH{Hmm, not really, because it has the downside of kind of halving your available clock period. However, RAMs normally come in both forms on the FPGA (Page 12, Figure 2, \url{https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug_ram_rom.pdf})} 
 % JW: thanks!
-Using the negative edge of the clock is widely supported by synthesis tools, and does not affect the maximum frequency of the final design.
+Using the negative edge of the clock is widely supported by synthesis tools, and does not affect the maximum frequency of the final design \NR{during timing closure?}. 
 
-Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical.  Enable signals are normally manually controlled and inserted into the appropriate states, by using a check like the following in the RAM:\@ \texttt{en == 1}.  This means that the RAM only turns on when the enable signal is set.  However, to make the proof simpler and to not have to reason about possible side effects introduced by the RAM being enabled but not used, a RAM which disables itself after every use would be ideal.  One method for implementing this would be to insert an extra state after each load or store that disables the RAM, but this extra state would eliminate the speed advantage of the negative-edge-triggered RAM. Another method would be to determine the next state after each load or store and disable the RAM in that state, but this could quickly become complicated, especially in the case where the next state also contains a memory operation, and hence the disable signal should not be added. The method we ultimately chose was to have the RAM become enabled not when the enable signal is high, but when it \emph{toggles} its value.  This can be arranged by keeping track of the old value of the enable signal in \texttt{en} and comparing it to the current value \texttt{u\_en} set by the data-path.  When the values are different, the RAM gets enabled, and then \texttt{en} is set to the value of \texttt{u\_en}. This ensures that the RAM will always be disabled straight after it was used, without having to insert or modify any other states.
+Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical \NR{atypical to which audience? I think you mean hardware designers.}.  Enable signals are normally manually controlled and inserted into the appropriate states, by using a check like the following in the RAM:\@ \texttt{en == 1}.  This means that the RAM only turns on when the enable signal is set.  However, to make the proof simpler and to not have to reason about possible side effects introduced by the RAM being enabled but not used, a RAM which disables itself after every use would be ideal.  One method for implementing this would be to insert an extra state after each load or store that disables the RAM, but this extra state would eliminate the speed advantage of the negative-edge-triggered RAM. Another method would be to determine the next state after each load or store and disable the RAM in that state, but this could quickly become complicated, especially in the case where the next state also contains a memory operation, and hence the disable signal should not be added. The method we ultimately chose was to have the RAM become enabled not when the enable signal is high, but when it \emph{toggles} its value.  This can be arranged by keeping track of the old value of the enable signal in \texttt{en} and comparing it to the current value \texttt{u\_en} set by the data-path.  When the values are different, the RAM gets enabled, and then \texttt{en} is set to the value of \texttt{u\_en}. This ensures that the RAM will always be disabled straight after it was used, without having to insert or modify any other states.
 
 %We can instead generate a second enable signal that is set by the user, and the original enable signal is then updated by the RAM to be equal to the value that the user set.  This means that the RAM should be enabled whenever the two signals are different, and disabled otherwise.
 
@@ -421,9 +421,9 @@ One might hope that the synthesis tool consuming our generated Verilog would con
   \end{cases}\\
 \end{equation*}
 where $\gg$ stands for a logical right shift. %Once this equivalence about the shifts and division operator is proven correct, it can be used to implement the \texttt{Oshrximm} using the efficient shift version instead of how the \compcert{} semantics described it.
-When proving this equivalence, we actually found a bug in our original implementation that was due to the fact that a na\"{i}ve shift rounds towards $-\infty$.
+When proving this equivalence, we actually found a bug in our original implementation that was due to the fact that a na\"{i}ve shift rounds towards $-\infty$. \NR{What do you mean byy naive shift here?}
 
-\JW{I don't really understand the following paragraph.}\YH{So my intention with this was to make this section more meaningful, as one of the reviewers mentioned that because compcert already did this, this section is not needed.  But I wanted to explain that because we do the reasoning of equivalence between shifts and division at the Integer level, our proof is more general than the language specific proofs that \compcert{} has in it's back ends, as we fix the specification instead of the implementation.  I'll try and reword this.} \compcert{} eventually performs a translation from this representation into assembly code which uses shifts to implement the division, however, the specification of the instruction itself still uses division instead of shifts, meaning the proof of the translation cannot be reused.  In \vericert{}, the equivalence of the representation in terms of divisions and shifts is proven over the integers and the specification, thereby making it simpler to prove the correctness of the Verilog implementation in terms of shifts.
+\JW{I don't really understand the following paragraph.}\YH{So my intention with this was to make this section more meaningful, as one of the reviewers mentioned that because compcert already did this, this section is not needed.  But I wanted to explain that because we do the reasoning of equivalence between shifts and division at the Integer level, our proof is more general than the language specific proofs that \compcert{} has in it's back ends, as we fix the specification instead of the implementation.  I'll try and reword this.}\NR{I am not following this paragraph below too.} \compcert{} eventually performs a translation from this representation into assembly code which uses shifts to implement the division, however, the specification of the instruction itself still uses division instead of shifts, meaning the proof of the translation cannot be reused.  In \vericert{}, the equivalence of the representation in terms of divisions and shifts is proven over the integers and the specification, thereby making it simpler to prove the correctness of the Verilog implementation in terms of shifts.
 
 %The \compcert{} semantics for the \texttt{Oshrximm} instruction expresses its operation exactly as shown in the equation above, even though in hardware the computation that would be performed would be different.  In \vericert{}, if the same operation would be implemented using Verilog operators, it is not guaranteed to be optimised correctly by the synthesis tools that convert the Verilog into a circuit.  To guarantee an output that does not include divides, we therefore have to express it in Verilog using shifts, and then prove that this representation is equivalent to the divide representation used in the \compcert{} semantics.  While conducting the proof, we discovered quite a few bugs in our initial implementation of optimisations, which rounded to $-\infty$ instead of 0.
author	Nadesh Ramanathan <n.ramanathan@ieee.org>	2021-08-08 18:14:00 +0000
committer	node <node@git-bridge-prod-0>	2021-08-08 18:28:44 +0000
commit	3aa90b4b91627bc851f79b47164280541f0c363c (patch)
tree	2bb7808d9f4f9c9e64f731264aff97627418d207 /algorithm.tex
parent	aeb8620c1f530d5a43302ea4333fa6abdc951a25 (diff)
download	oopsla21_fvhls-3aa90b4b91627bc851f79b47164280541f0c363c.tar.gz oopsla21_fvhls-3aa90b4b91627bc851f79b47164280541f0c363c.zip