path: root/algorithm.tex
authorYann Herklotz <git@yannherklotz.com>2021-08-02 00:26:15 +0200
committerYann Herklotz <git@yannherklotz.com>2021-08-02 00:26:15 +0200
commita89ffd74a82b225be85627926407ed9a05aefc5e (patch)
treee4f37b5c5dec7331323992477e9fe31f01055337 /algorithm.tex
parent5ba8d98c98ca17b543f212486a5a4e2e7b91ca84 (diff)
downloadoopsla21_fvhls-a89ffd74a82b225be85627926407ed9a05aefc5e.tar.gz
oopsla21_fvhls-a89ffd74a82b225be85627926407ed9a05aefc5e.zip
Clear up the Introduction and Algorithm section
Diffstat (limited to 'algorithm.tex')
-rw-r--r--algorithm.tex29
1 files changed, 17 insertions, 12 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 5fe6016..8b707e1 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -180,7 +180,9 @@ The next translation is from 3AC to a new hardware translation language (HTL). %
This involves going from a CFG representation of the computation to a finite state machine with data-path (FSMD) representation~\cite{hwang99_fsmd}. The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers. %\JP{I've become less comfortable with this term, but it's personal preference so feel free to ignore. I think `generalised finite state machine' (i.e.\ thinking of the entire `data-path' as contributing to the overall state) is more accurate.}\YH{Hmm, yes, I mainly chose FSMD because there is quite a lot of literature around it. I think for now I'll keep it but for the final draft we could maybe change it.}
%This means that the state transitions can be translated into a simple finite state machine (FSM) where each state contains data operations that update the memory and registers.
Hence, an HTL program consists of two maps from states to Verilog statements: control logic and data-path maps that express state transitions and computations respectively.
-Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The right-hand block is the control logic that computes the next state, while the left-hand block updates all the registers and RAM based on the current program state.
+Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The right-hand block is the control logic that computes the next state, while the left-hand block updates all the registers and RAM based on the current program state.
+
+This intermediate language was introduced mainly to simplify the proof of the translation from 3AC to Verilog, as the semantics of these two languages differ greatly. HTL therefore sits between them: like 3AC, it uses maps to represent what to execute in every state, and like Verilog, it already contains Verilog statements instead of more abstract instructions. A further benefit of this intermediate language is that it is simpler to manipulate and analyse, making it easier to prove optimisations such as the proper insertion of RAMs.
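The split into a control-logic map and a data-path map can be sketched in software. The following C simulation is a hypothetical illustration (the states, the single register, and the small counting program are invented, not taken from the paper's running example): one function plays the role of the control-logic map, computing only the next state, while the other plays the role of the data-path map, performing only register updates.

```c
#include <assert.h>

/* Illustrative FSMD sketch: states 0 (start), 1 (loop body), 2 (done),
   and one data register. All names and the program are invented. */
static int state = 0;
static int reg_x = 0;

/* Control-logic map: state -> next state, no data operations. */
static int control_step(int s) {
    switch (s) {
    case 0:  return 1;                   /* start: enter the loop     */
    case 1:  return reg_x < 3 ? 1 : 2;   /* loop while reg_x < 3      */
    default: return 2;                   /* done: stay in final state */
    }
}

/* Data-path map: state -> register updates, no state transitions. */
static void datapath_step(int s) {
    if (s == 1) reg_x = reg_x + 1;       /* increment in loop body    */
}

/* One iteration approximates one clock cycle. In hardware both maps
   fire simultaneously; this sequential approximation evaluates the
   data path first, which suffices for the illustration. */
int fsmd_run(void) {
    while (state != 2) {
        datapath_step(state);
        state = control_step(state);
    }
    return reg_x;
}
```

The point of the split is visible in the types: `control_step` never writes a data register, and `datapath_step` never changes the state.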
\begin{figure*}
\centering
@@ -284,23 +286,22 @@ Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The
\node[scale=0.4] at (3.5,3.6) {\texttt{reg\_4}};
\node[scale=0.4] at (3.5,3.4) {\texttt{reg\_5}};
\end{tikzpicture}}
- \caption{The FSMD for the example shown in Figure~\ref{fig:accumulator_c_rtl}, split into a data-path and control logic for the next state calculation. The Update block takes the current state, current values of all registers and at most one value stored in the array, and calculates a new value that can either be stored back in the array or in a register.}\label{fig:accumulator_diagram}
+ \caption{The FSMD for the example shown in Figure~\ref{fig:accumulator_c_rtl}, split into a data-path and control logic for the next state calculation. The Update block takes the current state, current values of all registers and at most one value stored in the RAM, and calculates a new value that can either be stored back in the RAM or in a register.}\label{fig:accumulator_diagram}
\end{figure*}
%\JP{Does it? Verilog has neither physical registers nor RAMs, just language constructs which the synthesiser might implement with registers and RAMs. We should be clear whether we're talking about the HDL representation, or the synthesised result: in our case these can be very different since we don't target any specific architectural features of an FPGA fabric of ASIC process.}
\paragraph{Translating memory}
-Typically, HLS-generated hardware consists of a sea of registers and RAM memories.
+Typically, HLS-generated hardware consists of a sea of registers and RAMs.
This memory view is very different to the C memory model, so we perform the following translation.
Variables that do not have their address taken are kept in registers, which correspond to the registers in 3AC.
All address-taken variables, arrays, and structs are kept in RAM.
-The stack of the main function becomes an unpacked array of 32-bit integers, which is translated to a RAM when the hardware description is passed through a synthesis tool. In a first pass, \vericert{} represents the RAM by direct reads and writes to an array. This is followed by a RAM insertion pass which inserts a proper RAM interface, and reads and writes are performed through that interface instead. This interface is shown on lines 9-15 in the Verilog code in Figure~\ref{fig:accumulator_v}.
-Finally, global variables are not translated in \vericert{} at the moment.
-A high-level overview of the architecture can be seen in Figure~\ref{fig:accumulator_diagram}.
+The stack of the main function becomes an unpacked array of 32-bit integers representing the RAM block. Any loads and stores are temporarily translated to direct accesses to this array, where each address has its offset removed and is divided by four. In a separate HTL-to-HTL conversion, these direct accesses are then translated to proper loads and stores that use a RAM interface to communicate with the RAM. This pass wraps the unpacked array in a RAM block with that interface. Without the interface and the RAM block, the synthesis tool processing the Verilog hardware description would not identify the array as a RAM, and would instead implement it using many registers. The interface is shown on lines 9--15 of the Verilog code in Figure~\ref{fig:accumulator_v}.
+A high-level overview of the architecture and of the RAM interface can be seen in Figure~\ref{fig:accumulator_diagram}.
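The address arithmetic described above (remove the offset, divide by four) can be shown with a short C sketch. The function name and the stack base value are ours, for illustration only; they are not taken from Vericert.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: turn a byte address into the C stack into a
   word index into the unpacked array of 32-bit integers. */
static uint32_t byte_to_word_index(uint32_t byte_addr, uint32_t stack_base) {
    uint32_t off = byte_addr - stack_base; /* remove the offset        */
    return off / 4;                        /* 4 bytes per 32-bit word  */
}
```

For example, with an assumed stack base of `0x100`, the byte address `0x108` maps to word index 2 of the array.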
\paragraph{Translating instructions}
-Each 3AC instruction either corresponds to a hardware construct, or does not have to be handled by the translation, such as function calls (because of inlining).
-For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in HTL at state 8 in both the control logic and data-path. always-blocks, shown in Figure~\ref{fig:accumulator_v}. Simple operator instructions are translated in a similar way.
+Each 3AC instruction either corresponds to a hardware construct or does not have to be handled by the translation, such as function calls (because of inlining).
+For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in the Verilog generated from HTL at state 8, in both the control-logic and data-path always-blocks shown in Figure~\ref{fig:accumulator_v}. Simple operator instructions are translated in a similar way: for example, the add instruction becomes the built-in add operator, and likewise for the multiply operator. All 32-bit instructions can be translated in this way; however, some special instructions require extra care. One is the \texttt{Oshrximm} instruction, which is discussed further in Section~\ref{sec:algorithm:optimisation:oshrximm}; another is the \texttt{Oshldimm} instruction, a left rotate for which there is no Verilog equivalent, so it has to be implemented in terms of other operations and proven equivalent.
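A left rotate can be expressed with the operators that do exist, using the standard shift-and-or identity. The C sketch below illustrates this decomposition; it is not Vericert's actual Coq definition.

```c
#include <assert.h>
#include <stdint.h>

/* Left rotate built from two shifts and an or, mirroring how an
   instruction with no direct Verilog operator can be decomposed. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;                /* keep the rotate amount in 0..31         */
    if (n == 0) return x;   /* avoid the undefined shift by 32 in C    */
    return (x << n) | (x >> (32 - n));
}
```

The bits shifted out at the top are reinserted at the bottom by the second shift, so for instance rotating `0x80000001` left by one gives `0x00000003`.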
\subsubsection{Translating HTL to Verilog}
@@ -336,7 +337,7 @@ Therefore, an extra compiler pass is added from HTL to HTL to extract all the di
There are two interesting parts to the inserted RAM interface. Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design, which is triggered on the positive edge of the clock. The main advantage is that instead of loads and stores taking three and two clock cycles respectively, they take only two clock cycles and one clock cycle, greatly improving their performance. In addition, using the negative edge of the clock is supported by many synthesis tools, and does not affect the maximum frequency of the final design.
-Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical. In existing hardware designs, enables would be manually controlled and turned on and off explicitly in the data-path by the hardware designer whenever it is necessary, by using a check like the following in the RAM:\@ \texttt{en == 1}. This means that the RAM only turns on when the enable signal is set. However, to make the proof simpler as many side effects need to be eliminated from the hardware as possible, which can be achieved by creating a RAM which disables itself after every use, by using a toggle instead. This ensures that the RAM will always be disabled directly after it was used without having to add extra signals in later states, as it is not a high or low value that triggers the RAM block, but the fact that the user enable signal \texttt{u\_en} changed in value. This simplifies the proof, because firstly it can assume that the RAM is disabled at the start and end of every clock cycle, and secondly only the state which contains the load and store statements need to be modified to ensure this property. The only other solutions would be to either insert extra states that disable the RAM accordingly, thereby eliminating the speed advantage of the negative edge triggered RAM, or to check the next state after a load and store and insert disables into that state. The latter solution can quickly become complicated though, especially as this new state could contain another memory operation, in which case the disable signal should not be added to that state. A solution to this problem is to create another enable signal that is controlled by the self-disabling RAM, which is always set to be equal to the enable signal set by the data-path. The RAM is then considered enabled if the data-path enable and the RAM enable are different.
+Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical. In existing hardware designs, enables are normally controlled manually and inserted into the appropriate states, using a check such as \texttt{en == 1} in the RAM. This means that the RAM only turns on when the enable signal is set. However, to make the proof simpler and to avoid reasoning about possible side effects introduced by the RAM being enabled but not used, a RAM which disables itself after every use would be ideal. One solution would be to insert an extra state after each load or store that disables the RAM, but this eliminates the speed advantage of the negative-edge-triggered RAM. Another would be to check the next state after a load or store and insert the disable into that state; this can quickly become complicated, especially as that state could itself contain another memory operation, in which case the disable signal should not be added to it. Instead, the RAM disables itself after each use: rather than being triggered by a high signal, it is triggered by a change in the value of the signal. This is done by keeping track of the old value of the enable in \texttt{en} and comparing it to the current value \texttt{u\_en} set by the data-path. When the two values differ, the RAM is enabled, and it then sets \texttt{en} equal to \texttt{u\_en}. This ensures that the RAM is always disabled directly after use, without having to modify any extra states.
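The toggle-based enable can be sketched as a small C simulation. The signal names \texttt{en} and \texttt{u\_en} follow the Verilog in the paper; the surrounding scaffolding (the struct and helper functions) is invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the self-disabling RAM enable. */
struct ram { bool en, u_en; int fires; };

/* Data-path side: toggling u_en requests exactly one RAM operation. */
static void request_op(struct ram *r) { r->u_en = !r->u_en; }

/* RAM side, evaluated on each (negative) clock edge. */
static void ram_step(struct ram *r) {
    if (r->en != r->u_en) {  /* enabled only while the signals differ  */
        r->fires++;          /* perform the load or store here         */
        r->en = r->u_en;     /* copy u_en into en: self-disable        */
    }
}
```

Stepping the RAM twice after a single request fires it exactly once: the second step sees `en == u_en` and does nothing, which is the property the proof relies on.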
%We can instead generate a second enable signal that is set by the user, and the original enable signal is then updated by the RAM to be equal to the value that the user set. This means that the RAM should be enabled whenever the two signals are different, and disabled otherwise.
@@ -382,9 +383,9 @@ Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is al
\caption{Timing diagrams showing the execution of loads and stores over multiple clock cycles.}\label{fig:ram_load_store}
\end{figure}
-Figure~\ref{fig:ram_load} shows an example of how the waveforms in the RAM shown in Figure~\ref{fig:accumulator_v} behave when a value is loaded. To initiate a load, the data-path enable signal \texttt{u\_en} flag is toggled, the address \texttt{addr} is set and the write enable \texttt{wr\_en} is set to low. This all happens at the positive edge of the clock, in the time slice 1. Then, on the next negative edge of the clock, at time slice 2, the \texttt{u\_en} is now different to the RAM enable \texttt{en}, so the RAM is enabled. A load is then performed by assigning the data-out register to the value stored at the address in the RAM and the \texttt{en} is set to the same value as \texttt{u\_en} to disable the RAM again. Finally, on the next positive edge of the clock, the value in \texttt{d\_out} is assigned to the destination register \texttt{r}. An example of a store is shown in Figure~\ref{fig:ram_store}, which instead assigns the \texttt{d\_in} register with the value to be stored. The store is then performed on the negative edge of the clock and is therefore complete by the next positive edge.
+Figure~\ref{fig:ram_load} shows an example of how the waveforms in the RAM shown in Figure~\ref{fig:accumulator_v} behave when a value is loaded. To initiate a load, the data-path enable signal \texttt{u\_en} is toggled, the address \texttt{addr} is set and the write enable \texttt{wr\_en} is set to low. This all happens at the positive edge of the clock, in time slice 1. Then, on the next negative edge of the clock, at time slice 2, \texttt{u\_en} is now different from the RAM enable \texttt{en}, so the RAM is enabled. A load is then performed by assigning the \texttt{d\_out} register the value stored at the address in the RAM, and \texttt{en} is set to the same value as \texttt{u\_en} to disable the RAM again. Finally, on the next positive edge of the clock, the value in \texttt{d\_out} is assigned to the destination register \texttt{r}. An example of a store is shown in Figure~\ref{fig:ram_store}, which instead assigns the \texttt{d\_in} register the value to be stored. The store is then performed on the negative edge of the clock and is therefore complete by the next positive edge.
-\subsubsection{Implementing the \texttt{Oshrximm} instruction}
+\subsubsection{Implementing the \texttt{Oshrximm} instruction}\label{sec:algorithm:optimisation:oshrximm}
% Mention that this optimisation is not performed sometimes (clang -03).
@@ -394,6 +395,8 @@ Many of the \compcert{} instructions map well to hardware, but \texttt{Oshrximm}
\end{equation*}
(where $x, y \in \mathbb{Z}$, $0 \leq y < 31$, and $-2^{31} \leq x < 2^{31}$) and instantiating divider circuits in hardware is well-known to cripple performance. Moreover, since \vericert{} requires the result of a divide operation to be ready within a single clock cycle, the divide circuit needs to be entirely combinational. This is inefficient in terms of area, but also in terms of latency, because it means that the maximum frequency of the hardware must be reduced dramatically so that the divide circuit has enough time to finish.
+\compcert{} eventually performs a similar translation when generating the assembly code; however, the specification of the instruction itself still uses division instead of shifts, meaning the proof of that translation cannot be reused. In \vericert{}, the equivalence of the division-based and shift-based representations is proven over the integers and over the specification, making it simpler to prove the correctness of the Verilog implementation in terms of shifts.
+
%\JP{Multi-cycle paths might be something worth exploring in future work: fairly error-prone/dangerous for hand-written code, but might be interesting in generated code.}\YH{Definitely is on the list for next things to look into, will make divide so much more efficient.}
%These small optimisations were found to be the most error prone, and guaranteeing that the new representation is equivalent to representation used in the \compcert{} semantics is difficult without proving this for all possible inputs.
@@ -427,7 +430,9 @@ In a similar vein, the introduction of pipelined operators, especially for divis
\paragraph{Language restrictions}
-There are minor language restrictions: comparisons of unsigned integers as well as minor instructions are not supported.
+There are minor language restrictions: global variables, comparisons of unsigned integers, and certain minor instructions are not supported.
+
+In \compcert{}, each global variable is stored in its own memory. The translation of the stack into a RAM block could therefore be generalised to translate global variables in the same manner. This would require a slight generalisation of pointers, so that they carry provenance information identifying which RAM each pointer accesses, as well as a generalisation of the RAM interface, so that it decodes this provenance information and indexes the correct array.
C and Verilog handle signedness quite differently. By default, all operators and registers in Verilog (and HTL) are unsigned, so to force an operation to handle the bits as signed, both operands have to be forced to be signed. Moreover, Verilog implicitly resizes expressions to the largest needed size by default, which can affect the result of the computation. This feature is not supported by the Verilog semantics we adopted, so to match the semantics to the behaviour of the simulator and synthesis tool, braces are placed around all expressions to inhibit implicit resizing. Instead, explicit resizing is used in the semantics, and operations can only be performed on two registers that have the same size.