path: root/algorithm.tex
author     John Wickerson <j.wickerson@imperial.ac.uk>    2021-08-03 08:53:41 +0000
committer  node <node@git-bridge-prod-0>                  2021-08-03 09:01:00 +0000
commit     d4d8ac89c2b2efa4c291d00010fd585922ce28e2 (patch)
tree       af01c9b619c3c8d953f7d9e689b36f62fb84ed84 /algorithm.tex
parent     a89ffd74a82b225be85627926407ed9a05aefc5e (diff)
Update on Overleaf.
Diffstat (limited to 'algorithm.tex')
-rw-r--r--  algorithm.tex  39
1 file changed, 20 insertions, 19 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 8b707e1..b974b97 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -182,7 +182,9 @@ This involves going from a CFG representation of the computation to a finite sta
Hence, an HTL program consists of two maps from states to Verilog statements: control logic and data-path maps that express state transitions and computations respectively.
Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The right-hand block is the control logic that computes the next state, while the left-hand block updates all the registers and RAM based on the current program state.
-This intermediate language was mainly introduced to prove the translation from 3AC to Verilog, as the semantics of these languages differ greatly. It therefore serves as an intermediate language with similar semantics as 3AC as the top-level, using maps to represents what to execute at every state, and similar semantics as Verilog at the low-level by already using Verilog statements instead of more abstract instructions. However, the benefit of having such an intermediary language is that it is simpler to manipulate and analyse it, thereby making it easier to prove optimisations like proper RAM insertion.
+The HTL language was mainly introduced to make it easier to prove the translation from 3AC to Verilog, as these languages have very different semantics.
+It serves as an intermediate language with semantics similar to 3AC at the top level, using maps to represent what to execute at every state, and with semantics similar to Verilog at the low level, by already using Verilog statements instead of more abstract instructions.
+Compared to plain Verilog, HTL is simpler to manipulate and analyse, thereby making it easier to prove optimisations like proper RAM insertion.
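To make this concrete, the following sketch shows the general two-always-block shape that the two HTL maps correspond to once printed as Verilog (the states and statements here are placeholders; the actual generated code is the one shown in Figure~\ref{fig:accumulator_v}):
\begin{verbatim}
// control-logic map: one case arm per state, computing the next state
always @(posedge clk)
  case (state)
    /* state */ : state <= /* next state */;
  endcase

// data-path map: one case arm per state, holding the Verilog
// statement executed in that state (register update, RAM access, ...)
always @(posedge clk)
  case (state)
    /* state */ : /* statement for this state */;
  endcase
\end{verbatim}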
\begin{figure*}
\centering
@@ -292,16 +294,16 @@ This intermediate language was mainly introduced to prove the translation from 3
%\JP{Does it? Verilog has neither physical registers nor RAMs, just language constructs which the synthesiser might implement with registers and RAMs. We should be clear whether we're talking about the HDL representation, or the synthesised result: in our case these can be very different since we don't target any specific architectural features of an FPGA fabric of ASIC process.}
\paragraph{Translating memory}
Typically, HLS-generated hardware consists of a sea of registers and RAMs.
-This memory view is very different to the C memory model, so we perform the following translation.
+This memory view is very different from the C memory model, so we perform the following translation.
Variables that do not have their address taken are kept in registers, which correspond to the registers in 3AC.
All address-taken variables, arrays, and structs are kept in RAM.
-The stack of the main function becomes an unpacked array of 32-bit integers representing the RAM block. Any loads and stores are temporarily translated to direct accesses to this array, where each address has it's offset removed and is divided by four. In a separate HTL to HTL conversion, these direct accesses are then translated to proper loads and stores that use a RAM interface to communicate with the RAM. This pass inserts a RAM block with the interface around the unpacked array. Without this interface and without the RAM block, the synthesis tool processing the Verilog hardware description would not identify the array as a RAM, and would instead implement it using many registers. This interface is shown on lines 9-15 in the Verilog code in Figure~\ref{fig:accumulator_v}.
+The stack of the main function becomes an unpacked array of 32-bit integers representing the RAM block. Any loads and stores are temporarily translated to direct accesses to this array, where each address has its offset removed and is divided by four. In a separate HTL-to-HTL conversion, these direct accesses are then translated to proper loads and stores that use a RAM interface to communicate with the RAM. This pass inserts a RAM block with the interface around the unpacked array. Without this interface and without the RAM block, the synthesis tool processing the Verilog hardware description would not identify the array as a RAM, and would instead implement it using many registers. This interface is shown on lines 9--15 in the Verilog code in Figure~\ref{fig:accumulator_v}.
A high-level overview of the architecture and of the RAM interface can be seen in Figure~\ref{fig:accumulator_diagram}.
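As a small worked example of this address translation (the register name and array size below are illustrative), a 3AC load from byte offset 8 of the stack frame is initially turned into a direct access to word $8 / 4 = 2$ of the array:
\begin{verbatim}
// stack of the main function as an unpacked array of 32-bit words
reg [31:0] stack [0:13];

// a 3AC load such as  x3 = int32[stack(8)]  first becomes
reg_3 <= stack[32'd2];
\end{verbatim}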
\paragraph{Translating instructions}
-Each 3AC instruction either corresponds to a hardware construct or does not have to be handled by the translation, such as function calls (because of inlining).
-For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in the Verilog generated from HTL at state 8 in both the control logic and data-path always-blocks, shown in Figure~\ref{fig:accumulator_v}. Simple operator instructions are translated in a similar way. For example, the add instruction is just translated to the built-in add operator, similarly for the multiply operator. All 32-bit instructions can be translated in this way, however, some special instructions require extra care, one being the \texttt{Oshrximm} instruction, which is discussed further in Section~\ref{sec:algorithm:optimisation:oshrximm}, or the \texttt{Oshldimm} instruction, which is a left rotate instruction for which there is no Verilog equivalent and it therefore has to be implemented in terms of other operations and proven to be equivalent.
+Each 3AC instruction either corresponds to a hardware construct or does not have to be handled by the translation, such as function calls (because of inlining). \JW{Are function calls the only 3AC instruction that we ignore? (And I guess return statements too for the same reason.)}
+For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in the Verilog generated from HTL at state 8 in both the control-logic and data-path always-blocks, shown in Figure~\ref{fig:accumulator_v}. Simple operator instructions are translated in a similar way. For example, the add instruction is translated to the built-in add operator, and likewise for the multiply operator. All 32-bit instructions can be translated in this way, but some special instructions require extra care. One such is the \texttt{Oshrximm} instruction, which is discussed further in Section~\ref{sec:algorithm:optimisation:oshrximm}. Another is the \texttt{Oshldimm} instruction, which is a left rotate instruction that has no Verilog equivalent and therefore has to be implemented in terms of other operations and proven to be equivalent.
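To illustrate how one 3AC instruction becomes one entry in each of the two maps, the sketch below shows how the initialisation and a subsequent addition might appear (the state numbers and register names are illustrative rather than copied from the figures):
\begin{verbatim}
// data-path map entries              // control-logic map entries
32'd8: reg_5 <= 32'd3;                32'd8: state <= 32'd7;  // x5 := 3
32'd7: reg_6 <= reg_6 + reg_5;        32'd7: state <= 32'd6;  // add
\end{verbatim}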
\subsubsection{Translating HTL to Verilog}
@@ -328,16 +330,16 @@ However, the memory model that \compcert{} uses for its intermediate languages i
\subsubsection{Implementation of RAM interface}
The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.
-Interestingly, the syntax for the RAM interface is quite strict, as the synthesis tool will pattern match on it and only work for a predefined set of interfaces. Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.
+Interestingly, the syntax for the RAM interface is quite strict, as the synthesis tool will pattern-match on it and will only recognise a predefined set of interfaces. \JW{I think the following sentence could be cut as we've said this kind of thing a couple of times already.} Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.
-Therefore, an extra compiler pass is added from HTL to HTL to extract all the direct accesses to the Verilog array and replace them by signals which access the RAM interface in a separate always-block. The translation is performed by going through all the instructions and replacing each load and store expression in turn. Stores can simply be replaced by the necessary wires directly, however, because loads using the RAM interface take two clock cycles where a direct load from an array takes only one, this pass inserts an extra state after each load.
+Therefore, an extra compiler pass is added from HTL to HTL to extract all the direct accesses to the Verilog array and replace them by signals that access the RAM interface in a separate always-block. The translation is performed by going through all the instructions and replacing each load and store expression in turn. Stores can simply be replaced by the necessary wires directly. Loads are a little more subtle: loads that use the RAM interface take two clock cycles where a direct load from an array takes only one, so this pass inserts an extra state after each load.
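The following sketch illustrates the replacement (signal names and state numbers are illustrative): a store becomes a single state that drives the interface signals and toggles \texttt{u\_en}, whereas a load requests the value in one state and reads it back in a freshly inserted state on the following cycle.
\begin{verbatim}
// store: drive the RAM interface and toggle u_en (one state)
32'd6: begin
  u_en <= ~u_en; wr_en <= 1'd1; addr <= 32'd2; d_in <= reg_4;
end

// load: issue the request in one state ...
32'd5: begin
  u_en <= ~u_en; wr_en <= 1'd0; addr <= 32'd2;
end
// ... and read the result in the extra state inserted by this pass
32'd4: reg_3 <= d_out;
\end{verbatim}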
%\JW{I've called that negedge always-block the `RAM driver' in my proposed text above -- that feels like quite a nice a word for it to my mind -- what do you think?}\YH{Yes I quite like it!}
%Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs. If, for example, arrays in Verilog are accessed immediately in the data-path, then the synthesis tool is not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers. This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit. The synthesis tools therefore provide code snippets that they know how to transform into various constructs, including snippets that will generate proper RAMs in the final hardware. This process is called memory inference. The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and logic. We therefore have another pass from HTL to itself which performs the translation from this na\"ive use of arrays to a representation which always allows for memory inference. This pass creates a separate always block to perform the loads and stores to the memory, and adds the necessary data, address and enable signals to communicate with that always-block from other always-blocks. This always-block is shown between lines 10-15 in Figure~\ref{fig:accumulator_v}.
-There are two interesting parts to the inserted RAM interface. Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design which is triggered on the positive edge of the clock. The main advantage is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they only take two clock cycles and one clock cycle instead, greatly improving their performance. In addition to that, using the negative edge for the clock is supported by many synthesis tools, and does not affect the maximum frequency of the final design.
+There are two interesting parts to the inserted RAM interface. Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design, which is triggered on the positive edge of the clock. The main advantage is that loads and stores take only two clock cycles and one clock cycle respectively, instead of three and two, greatly improving their performance. \JW{Is this a standard `trick' in hardware design? If so it might be nice to cite it.} In addition, using the negative edge of the clock is supported by many synthesis tools, and does not affect the maximum frequency of the final design.
-Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical. In existing hardware designs, enables normally manually controlled and inserted into the appropriate states, by using a check like the following in the RAM:\@ \texttt{en == 1}. This means that the RAM only turns on when the enable signal is set. However, to make the proof simpler and to not have to reason about possible side effects introduced by the RAM being enabled but not used, a RAM which disables itself after every use would be ideal. One solutions would be to either insert an extra state after each load or store that disable the RAM accordingly, thereby eliminating the speed advantage of the negative edge triggered RAM, or to check the next state after a load and store and insert disables into that state. The latter solution can quickly become complicated though, especially as this new state could contain another memory operation, in which case the disable signal should not be added to that state. Instead, the RAM can disable itself after each use, by instead of being toggled on a high signal, it gets toggled by the change of the value of the signal. This can be done by keeping track of the old value of the enable in \texttt{en} and comparing it to the current value of the enable signal \texttt{u\_en} set by the data-path. When the values are different, the RAM gets enabled, and then sets \texttt{en} equal to the value of \texttt{u\_en}. This ensures that the RAM will always be disabled directly after it was used without having to modify any extra states.
+Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical. In existing hardware designs, enable signals are normally controlled manually and inserted into the appropriate states, using a check like \texttt{en == 1} in the RAM. This means that the RAM only turns on when the enable signal is set. However, to make the proof simpler and to avoid having to reason about possible side effects introduced by the RAM being enabled but not used, a RAM that disables itself after every use would be ideal. One method for implementing this would be to insert an extra state after each load or store that disables the RAM accordingly, but this would eliminate the speed advantage of the negative-edge-triggered RAM. Another method would be to determine the next state after each load or store and add logic to disable the RAM in that state, but this could quickly become complicated, especially in the case where the next state contains another memory operation, and hence the disable signal should not be added. The method we ultimately chose was to have the RAM become enabled not when the enable signal is high, but when its value is toggled. This can be arranged by keeping track of the old value of the enable signal in \texttt{en} and comparing it to the current value \texttt{u\_en} set by the data-path. When the values are different, the RAM gets enabled, and then \texttt{en} is set to the value of \texttt{u\_en}. This ensures that the RAM is always disabled directly after it is used, without having to modify any extra states.
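Putting these two points together, the inserted RAM driver is essentially the following always-block (signal names are illustrative; the actual interface is the one on lines 9--15 of Figure~\ref{fig:accumulator_v}):
\begin{verbatim}
// RAM driver: negative-edge triggered and self-disabling
always @(negedge clk)
  if (en != u_en) begin        // enabled only when u_en has been toggled
    if (wr_en) stack[addr] <= d_in;
    else       d_out <= stack[addr];
    en <= u_en;                // catch up with u_en, disabling the RAM
  end
\end{verbatim}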
%We can instead generate a second enable signal that is set by the user, and the original enable signal is then updated by the RAM to be equal to the value that the user set. This means that the RAM should be enabled whenever the two signals are different, and disabled otherwise.
@@ -393,9 +395,9 @@ Many of the \compcert{} instructions map well to hardware, but \texttt{Oshrximm}
\begin{equation*}
\texttt{Oshrximm } x\ y = \text{round\_towards\_zero}\left(\frac{x}{2^{y}}\right)
\end{equation*}
-(where $x, y \in \mathbb{Z}$, $0 \leq y < 31$, and $-2^{31} \leq x < 2^{31}$) and instantiating divider circuits in hardware is well-known to cripple performance. Moreover, since \vericert{} requires the result of a divide operation to be ready within a single clock cycle, the divide circuit needs to be entirely combinational. This is inefficient in terms of area, but also in terms of latency, because it means that the maximum frequency of the hardware must be reduced dramatically so that the divide circuit has enough time to finish.
+(where $x, y \in \mathbb{Z}$, $0 \leq y < 31$, and $-2^{31} \leq x < 2^{31}$) and instantiating divider circuits in hardware is well known to cripple performance. Moreover, since \vericert{} requires the result of a divide operation to be ready within a single clock cycle, the divide circuit needs to be entirely combinational. This is inefficient in terms of area, but also in terms of latency, because it means that the maximum frequency of the hardware must be reduced dramatically so that the divide circuit has enough time to finish.
-\compcert{} eventually performs a similar translation when generating the assembly code, however, the specification of the instruction itself still uses division instead of shifts, meaning the proof of the translation cannot be reused. In \vericert{}, the equivalence of the representation in terms of divisions and shifts is proven over the integers and the specification, thereby it simpler to prove the correctness of the Verilog implementation in terms of shifts.
+\compcert{} eventually performs a similar translation \JW{I'm not sure which `translation' this is referring to.} when generating the assembly code; however, the specification of the instruction itself still uses division instead of shifts, meaning the proof of the translation cannot be reused. In \vericert{}, the equivalence of the representation in terms of divisions and shifts is proven over the integers and the specification, thereby making it simpler to prove the correctness of the Verilog implementation in terms of shifts.
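For illustration, one standard way of expressing this operation using only shifts and an addition, which is the kind of implementation that the proven equivalence justifies (not necessarily the exact code \vericert{} emits), is to add a bias of $2^{y}-1$ before shifting whenever $x$ is negative:
\begin{verbatim}
// round-towards-zero division by 2^y, for signed 32-bit x and a
// constant shift amount y (the immediate), with 0 <= y < 31
wire signed [31:0] bias = x[31] ? ((32'sd1 << y) - 32'sd1) : 32'sd0;
wire signed [31:0] shrx = (x + bias) >>> y;
\end{verbatim}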
%\JP{Multi-cycle paths might be something worth exploring in future work: fairly error-prone/dangerous for hand-written code, but might be interesting in generated code.}\YH{Definitely is on the list for next things to look into, will make divide so much more efficient.}
@@ -418,22 +420,21 @@ When proving this equivalence, we actually found a bug in our original implement
There are various limitations in \vericert{} compared to other HLS tools due to the fact that the main focus was to formally verify the translation from 3AC to Verilog.
-\paragraph{Parallel execution of instructions}
+\paragraph{Lack of instruction-level parallelism}
-The main limitation of \vericert{} is that it does not perform scheduling of instructions into the same state, meaning that instructions cannot be executed in parallel. However, the design of the intermediate languages in \vericert{} take this optimisation into account and are designed to support scheduling. HTL is already an intermediate language which represents an FSMD which can have any number of statements in each state, and the semantics of HTL and Verilog allow for parallel assignments to registers. To simplify the proof of the scheduling algorithm, and to minimise the changes necessary for the current translation from 3AC to HTL, a new language must be introduced, called 3ACPar, which would be equivalent to 3AC but instead of consisting of a map from program counter to instruction, it would consist of a map from program counter to list of instructions, which can all be executed in parallel. A new proof for the scheduling algorithm would have to be written for the translation from 3AC to 3ACPar, for which a verified translation validation approach might be appropriate, however, the translation form 3ACPar to HTL would not change conceptually, except for the fact that multiple instructions can now be translated into the same state. This small difference means that most of the proof can be reused without any changes, as the translation of instructions was the main body of the proof and did not change.
+The main limitation of \vericert{} is that it does not perform scheduling of instructions into the same state, meaning that instructions cannot be executed in parallel. However, the intermediate languages in \vericert{} take this optimisation into account and are designed to support scheduling. HTL is already an intermediate language representing an FSMD that can have any number of statements in each state, and the semantics of HTL and Verilog allow for parallel assignments to registers. To simplify the proof of the scheduling algorithm, and to minimise the changes necessary for the current translation from 3AC to HTL, a new language must be introduced, called 3ACPar, \JW{I think this is getting too far into the details of how to add scheduling, and risks stealing the thunder of our next submission. I think the basic point to make is: Vericert is currently all sequential, that's pretty bad, but it's not a fundamental limitation. So it's good to point out that HTL is a very general language -- anything that can be put into an always-block in Verilog can be put into an HTL state -- but I wouldn't go into details about new languages like 3ACPar here.} which would be equivalent to 3AC, but instead of consisting of a map from program counter to instruction, it would consist of a map from program counter to a list of instructions that can all be executed in parallel. A new proof for the scheduling algorithm would have to be written for the translation from 3AC to 3ACPar, for which a verified translation validation approach might be appropriate; however, the translation from 3ACPar to HTL would not change conceptually, except that multiple instructions can now be translated into the same state. This small difference means that most of the proof can be reused without any changes, as the translation of instructions was the main body of the proof and did not change.
+\paragraph{Lack of pipelined division}
In a similar vein, the introduction of pipelined operators, especially for division, can alleviate many of the timing issues shown in the \polybench{} benchmarks with divisions included in Section~\ref{sec:evaluation}. These operators can execute different stages of the operation in parallel, and therefore run long operations in parallel while sharing the same hardware. In HTL, operations like this can be represented in a similar fashion to the load and store instructions to the RAM, by using wires to communicate with an abstract computation block modelled in HTL and later replaced by a hardware implementation. However, 3ACPar would have to be modified to also describe such instructions so that these can be placed optimally using the external scheduling algorithm.
\paragraph{Limitations with I/O}
-\vericert{} is currently limited in terms of I/O as well, because the main function cannot accept any arguments for the Clight program to be well-formed. In addition to that, external function calls that produce traces have also not been implemented yet, however, these could enable the C program to read and write values to a bus that is shared by various other components in the hardware design.
+\vericert{} is currently limited in terms of I/O as well, because the main function cannot accept any arguments for the Clight program to be well-formed. \JW{Is it worth mentioning that it will compile programs with main(a,b), just as CompCert does, but just not prove them correct? This might be a nice `little-known fact' about CompCert?} In addition to that, external function calls that produce traces have also not been implemented yet; however, these could enable the C program to read and write values to a bus that is shared by various other components in the hardware design.
-\paragraph{Language restrictions}
-
-There are minor language restrictions: global variables, comparisons of unsigned integers and minor instructions are not supported.
-
-In \compcert{} global variables are each stored in their own memory. A generalisation of the stack translation into a RAM block could therefore be performed to also translate global variables in the same manner. This would require a slight generalisation of pointers so that these store provenance information so that each pointer accesses the right RAM, as well as generalising the RAM interface so that it decodes the provenance information and indexes the right array.
+\paragraph{Lack of support for global variables}
+In \compcert{}, each global variable is stored in its own memory. A generalisation of the stack translation into a RAM block could therefore be performed to translate global variables in the same manner. This would require a slight generalisation of pointers so that they store provenance information to ensure that each pointer accesses the right RAM. It would also be necessary to generalise the RAM interface so that it decodes the provenance information and indexes the right array.
+\paragraph{Other language restrictions}
C and Verilog handle signedness quite differently. By default, all operators and registers in Verilog (and HTL) are unsigned, so to force an operation to handle the bits as signed, both operands have to be forced to be signed. Moreover, Verilog implicitly resizes expressions to the largest needed size by default, which can affect the result of the computation. This feature is not supported by the Verilog semantics we adopted, so to match the semantics to the behaviour of the simulator and synthesis tool, braces are placed around all expressions to inhibit implicit resizing. Instead, explicit resizing is used in the semantics, and operations can only be performed on two registers that have the same size.
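For example (register names are illustrative), a signed comparison and an addition whose result must not be silently widened would be written as follows under the adopted semantics:
\begin{verbatim}
// both operands must be cast for the comparison to be signed
if ($signed(reg_1) < $signed(reg_2)) ...

// braces (a one-element concatenation) make the inner expression
// self-determined, inhibiting Verilog's implicit resizing
reg_3 <= {reg_1 + reg_2};
\end{verbatim}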
In addition to that, equality checks between \emph{unsigned} variables are actually not supported, because this requires supporting the comparison of pointers, which should only be performed between pointers with the same provenance. In \vericert{} there is currently no way to determine the provenance of a pointer, and it therefore cannot model the semantics of unsigned comparison in \compcert{}. This is not a severe restriction in practice, however, because, as dynamic allocation is not supported either, equality comparison of pointers is rarely needed, and equality comparison of integers can still be performed by casting them both to signed integers.