Figure -> Fig.

author: Yann Herklotz <git@yannherklotz.com> 2021-09-09 21:09:05 +0100
committer: Yann Herklotz <git@yannherklotz.com> 2021-09-09 21:09:05 +0100
commit: 55528c1f1a11988897d46993756d6aa5873095af (patch)
tree: 3f77f427e8aba3712ac9a145a28471c9ddf2ead7
parent: 301291487f61c924ec16ec73c632b19ac5395a6d (diff)
download: oopsla21_fvhls-55528c1f1a11988897d46993756d6aa5873095af.tar.gz
oopsla21_fvhls-55528c1f1a11988897d46993756d6aa5873095af.zip
5 files changed, 27 insertions, 27 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 1783c29..abec2bb 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -72,7 +72,7 @@ The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\c
 \end{figure}
 
 \paragraph{Architecture of \vericert{}}
-The main work flow of \vericert{} is given in Figure~\ref{fig:rtlbranch}, which shows those parts of the translation that are performed in \compcert{}, and those that have been added.  This includes translations to two new intermediate languages added in \vericert{}, HTL and Verilog, as well as an additional optimisation pass labelled as ``RAM insertion''.
+The main work flow of \vericert{} is given in Fig.~\ref{fig:rtlbranch}, which shows those parts of the translation that are performed in \compcert{}, and those that have been added.  This includes translations to two new intermediate languages added in \vericert{}, HTL and Verilog, as well as an additional optimisation pass labelled as ``RAM insertion''.
 
 \def\numcompcertlanguages{ten}
 
@@ -138,7 +138,7 @@ endmodule
 \end{figure}
 
 
-A simple state machine can be implemented as shown in Figure~\ref{fig:tutorial:state_machine}.
+A simple state machine can be implemented as shown in Fig.~\ref{fig:tutorial:state_machine}.
 At every positive edge of the clock (\texttt{clk}), both of the always-blocks will trigger simultaneously.  The first always-block controls the values in the register \texttt{x} and the output \texttt{z}, while the second always-block controls the next state the state machine should go to.  When the \texttt{state} is 0, \texttt{x} will be assigned to the input \texttt{y} using nonblocking assignment, denoted by \texttt{<=}.  Nonblocking assignment assigns registers in parallel at the end of the clock cycle, rather than sequentially throughout the always-block. In the second always-block, the input \texttt{y} will be checked, and if it's high it will move on to the next state, otherwise it will stay in the current state.  When \texttt{state} is 1, the first always-block will reset the value of \texttt{x} and then set \texttt{z} to the original value of \texttt{x}, since nonblocking assignment does not change its value until the end of the clock cycle.  Finally, the last always-block will set the state to be 0 again.
 
 \begin{figure}
@@ -224,12 +224,12 @@ endmodule
 \end{figure}
 
 \subsection{Translating C to Verilog by Example}
-Figure~\ref{fig:accumulator_c_rtl} illustrates the translation of a simple program that stores and retrieves values from an array.
+Fig.~\ref{fig:accumulator_c_rtl} illustrates the translation of a simple program that stores and retrieves values from an array.
 In this section, we describe the stages of the \vericert{} translation, referring to this program as an example.
 
 \subsubsection{Translating C to 3AC}
 
-The first stage of the translation uses unmodified \compcert{} to transform the C input, shown in Figure~\ref{fig:accumulator_c}, into a 3AC intermediate representation, shown in Figure~\ref{fig:accumulator_rtl}.
+The first stage of the translation uses unmodified \compcert{} to transform the C input, shown in Fig.~\ref{fig:accumulator_c}, into a 3AC intermediate representation, shown in Fig.~\ref{fig:accumulator_rtl}.
 As part of this translation, function inlining is performed on all functions, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  Although the duplication of the function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency and is therefore a common HLS optimisation~\cite{noronha17_rapid_fpga}. Inlining precludes support for recursive function calls, but this feature is not supported in most other HLS tools either~\cite{davidthomas_asap16}.
 
 %\JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're  not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling.  With scheduling I think that's true, inlining actually becomes quite good.}
@@ -248,7 +248,7 @@ The next translation is from 3AC to a new hardware translation language (HTL). %
 This involves going from a CFG representation of the computation to a finite state machine with data-path (FSMD) representation~\cite{hwang99_fsmd}. The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers. %\JP{I've become less comfortable with this term, but it's personal preference so feel free to ignore. I think `generalised finite state machine' (i.e.\ thinking of the entire `data-path' as contributing to the overall state) is more accurate.}\YH{Hmm, yes, I mainly chose FSMD because there is quite a lot of literature around it.  I think for now I'll keep it but for the final draft we could maybe change it.}
 %This means that the state transitions can be translated into a simple finite state machine (FSM) where each state contains data operations that update the memory and registers. 
 Hence, an HTL program consists of two maps from states to Verilog statements: the \emph{control logic} map, which expresses state transitions, and the \emph{data-path} map, which expresses computations.
-Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The right-hand block is the control logic that computes the next state, while the left-hand block updates all the registers and RAM based on the current program state.
+Fig.~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The right-hand block is the control logic that computes the next state, while the left-hand block updates all the registers and RAM based on the current program state.
 
 The HTL language was mainly introduced to simplify the proof of translation from 3AC to Verilog, as these languages have very different semantics.
 It serves as an intermediate language with similar semantics to 3AC at the top level, using maps to represents what to execute at every state, and similar semantics to Verilog at the lower level by already using Verilog statements instead of more abstract instructions.
@@ -356,7 +356,7 @@ Compared to plain Verilog, HTL is simpler to manipulate and analyse, thereby mak
   \node[scale=0.4] at (3.5,3.6) {\texttt{reg\_4}};
   \node[scale=0.4] at (3.5,3.4) {\texttt{reg\_5}};
 \end{tikzpicture}}
-  \caption{The FSMD for the example shown in Figure~\ref{fig:accumulator_c_rtl}, split into a data-path and control logic for the next state calculation.  The Update block takes the current state, current values of all registers and at most one value stored in the RAM, and calculates a new value that can either be stored back in the  or in a register.}\label{fig:accumulator_diagram}
+  \caption{The FSMD for the example shown in Fig.~\ref{fig:accumulator_c_rtl}, split into a data-path and control logic for the next state calculation.  The Update block takes the current state, current values of all registers and at most one value stored in the RAM, and calculates a new value that can either be stored back in the  or in a register.}\label{fig:accumulator_diagram}
 \end{figure*}
 
 %\JP{Does it? Verilog has neither physical registers nor RAMs, just language constructs which the synthesiser might implement with registers and RAMs. We should be clear whether we're talking about the HDL representation, or the synthesised result: in our case these can be very different since we don't target any specific architectural features of an FPGA fabric of ASIC process.}
@@ -365,15 +365,15 @@ Typically, HLS-generated hardware consists of a sea of registers and RAMs.
 This memory view is very different from the C memory model, so we perform the following translation from \compcert{}'s abstract memory model to a concrete RAM.\@
 Variables that do not have their address taken are kept in registers, which correspond to the registers in 3AC.
 All address-taken variables, arrays, and structs are kept in RAM.
-The stack of the main function becomes an unpacked array of 32-bit integers representing the RAM block.  Any loads and stores are temporarily translated to direct accesses to this array, where each address has its offset removed and is divided by four.  In a separate HTL-to-HTL conversion, these direct accesses are then translated to proper loads and stores that use a RAM interface to communicate with the RAM, shown on lines 21, 24 and 28 of Figure~\ref{fig:accumulator_v}.  This pass inserts a RAM block with the interface around the unpacked array.  Without this interface and without the RAM block, the synthesis tool processing the Verilog hardware description would not identify the array as a RAM, and would instead implement it using many registers.  This interface is shown on lines 9--15 in the Verilog code in Figure~\ref{fig:accumulator_v}.
-A high-level overview of the architecture and of the RAM interface can be seen in Figure~\ref{fig:accumulator_diagram}.
+The stack of the main function becomes an unpacked array of 32-bit integers representing the RAM block.  Any loads and stores are temporarily translated to direct accesses to this array, where each address has its offset removed and is divided by four.  In a separate HTL-to-HTL conversion, these direct accesses are then translated to proper loads and stores that use a RAM interface to communicate with the RAM, shown on lines 21, 24 and 28 of Fig.~\ref{fig:accumulator_v}.  This pass inserts a RAM block with the interface around the unpacked array.  Without this interface and without the RAM block, the synthesis tool processing the Verilog hardware description would not identify the array as a RAM, and would instead implement it using many registers.  This interface is shown on lines 9--15 in the Verilog code in Fig.~\ref{fig:accumulator_v}.
+A high-level overview of the architecture and of the RAM interface can be seen in Fig.~\ref{fig:accumulator_diagram}.
 
 \paragraph{Translating instructions}
 
 Most 3AC instructions correspond to hardware constructs.
 %Each 3AC instruction either corresponds to a hardware construct or does not have to be handled by the translation, such as function calls (because of inlining). \JW{Are function calls the only 3AC instruction that we ignore? (And I guess return statements too for the same reason.)}\YH{Actually, return instructions are translated (because you can return from main whenever), so call instructions (Icall, Ibuiltin and Itailcall) are the only functions that are not handled.}
 % JW: Thanks; please check proposed new text.
-For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in the Verilog generated from HTL at state 8 in both the control logic and data-path always-blocks, shown at lines 33 and 16 respectively in Figure~\ref{fig:accumulator_v}.  Simple operator instructions are translated in a similar way.  For example, the add instruction is just translated to the built-in add operator, similarly for the multiply operator.  All 32-bit instructions can be translated in this way, but some special instructions require extra care. One such is the \texttt{Oshrximm} instruction, which is discussed further in Section~\ref{sec:algorithm:optimisation:oshrximm}. Another is the \texttt{Oshldimm} instruction, which is a left rotate instruction that has no Verilog equivalent and therefore has to be implemented in terms of other operations and proven to be equivalent.
+For example, line 2 in Fig.~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in the Verilog generated from HTL at state 8 in both the control logic and data-path always-blocks, shown at lines 33 and 16 respectively in Fig.~\ref{fig:accumulator_v}.  Simple operator instructions are translated in a similar way.  For example, the add instruction is just translated to the built-in add operator, similarly for the multiply operator.  All 32-bit instructions can be translated in this way, but some special instructions require extra care. One such is the \texttt{Oshrximm} instruction, which is discussed further in Section~\ref{sec:algorithm:optimisation:oshrximm}. Another is the \texttt{Oshldimm} instruction, which is a left rotate instruction that has no Verilog equivalent and therefore has to be implemented in terms of other operations and proven to be equivalent.
 % In addition to any non-32-bit operations, the remaining
 The only 32-bit instructions that we do not translate are case-statements (\texttt{Ijumptable}) and those instructions related to function calls (\texttt{Icall}, \texttt{Ibuiltin}, and \texttt{Itailcall}), because we enable inlining by default.
 
@@ -381,7 +381,7 @@ The only 32-bit instructions that we do not translate are case-statements (\text
 
 Finally, we have to translate the HTL code into proper Verilog. % and prove that it behaves the same as the 3AC according to the Verilog semantics.
 The challenge here is to translate our FSMD representation into a Verilog AST.  However, as all the instructions in HTL are already expressed as Verilog statements, only the top-level data-path and control logic maps need to be translated to valid Verilog case-statements.  We also require declarations for all the variables in the program, as well as declarations of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.  In addition to translating the maps of Verilog statements, an always-block that will behave like the RAM also has to be created, which is only modelled abstractly at the HTL level.
-Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated for our example.
+Fig.~\ref{fig:accumulator_v} shows the final Verilog output that is generated for our example.
 
 Although this translation seems quite straight\-forward, proving that this translation is correct is complex.
 All the implicit assumptions that were made in HTL need to be translated explicitly to Verilog statements and it needs to be shown that these explicit behaviours are equivalent to the assumptions made in the HTL semantics.  One main example of this is proving that specification of the RAM in HTL does indeed behave the same as its Verilog implementation.
@@ -401,7 +401,7 @@ One big difference between C and Verilog is how memory is represented.  Although
 However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}.  If a byte-addressable memory was used in the target hardware, which is closer to \compcert{}'s memory model, then a load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle.  It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware.  Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way.  Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores will be multiples of four.  Translating from byte-addressed memory to word-addressed memory can then be done by dividing the address by four.
 
 \subsubsection{Implementation of RAM Interface}\label{sec:algorithm:optimisation:ram}
-The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance.  To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.\footnote{Interestingly, the Verilog code shown for the RAM interface must not be modified, because the synthesis tool will only generate a RAM when the code matches a small set of specific patterns.}
+The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Fig.~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance.  To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Fig.~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.\footnote{Interestingly, the Verilog code shown for the RAM interface must not be modified, because the synthesis tool will only generate a RAM when the code matches a small set of specific patterns.}
 %\JW{I tweaked this slightly in an attempt to clarify; please check.} %\NR{Bring forward this sentence to help with flow.}
 
 %\JW{I think the following sentence could be cut as we've said this kind of thing a couple of times already.} Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.
@@ -409,9 +409,9 @@ The simplest way to implement loads and stores in \vericert{} would be to access
 Therefore, an extra compiler pass is added from HTL to HTL to extract all the direct accesses to the Verilog array and replace them by signals that access the RAM interface in a separate always-block. The translation is performed by going through all the instructions and replacing each load and store expression in turn.  Stores can simply be replaced by the necessary wires directly. Loads are a little more subtle: loads that use the RAM interface take two clock cycles where a direct load from an array takes only one, so this pass inserts an extra state after each load.
 
 %\JW{I've called that negedge always-block the `RAM driver' in my proposed text above -- that feels like quite a nice a word for it to my mind -- what do you think?}\YH{Yes I quite like it!}
-%Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs.  If, for example, arrays in Verilog are accessed immediately in the data-path, then the synthesis tool is not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers.  This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit.  The synthesis tools therefore provide code snippets that they know how to transform into various constructs, including snippets that will generate proper RAMs in the final hardware.  This process is called memory inference.  The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and logic.  We therefore have another pass from HTL to itself which performs the translation from this na\"ive use of arrays to a representation which always allows for memory inference.  This pass creates a separate always-block to perform the loads and stores to the memory, and adds the necessary data, address and enable signals to communicate with that always-block from other always-blocks.  This always-block is shown between lines 10-15 in Figure~\ref{fig:accumulator_v}.
+%Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs.  If, for example, arrays in Verilog are accessed immediately in the data-path, then the synthesis tool is not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers.  This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit.  The synthesis tools therefore provide code snippets that they know how to transform into various constructs, including snippets that will generate proper RAMs in the final hardware.  This process is called memory inference.  The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and logic.  We therefore have another pass from HTL to itself which performs the translation from this na\"ive use of arrays to a representation which always allows for memory inference.  This pass creates a separate always-block to perform the loads and stores to the memory, and adds the necessary data, address and enable signals to communicate with that always-block from other always-blocks.  This always-block is shown between lines 10-15 in Fig.~\ref{fig:accumulator_v}.
 
-There are two interesting parts to the inserted RAM interface.  Firstly, the memory updates are triggered on the negative (falling) edge of the clock, out of phase with the rest of the design which is triggered on the positive (rising) edge of the clock.  The advantage of this is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they only take two clock cycles and one clock cycle instead, greatly improving their performance. %\JW{Is this a standard `trick' in hardware design? If so it might be nice to cite it.}\YH{Hmm, not really, because it has the downside of kind of halving your available clock period. However, RAMs normally come in both forms on the FPGA (Page 12, Figure 2, \url{https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug_ram_rom.pdf})} 
+There are two interesting parts to the inserted RAM interface.  Firstly, the memory updates are triggered on the negative (falling) edge of the clock, out of phase with the rest of the design which is triggered on the positive (rising) edge of the clock.  The advantage of this is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they only take two clock cycles and one clock cycle instead, greatly improving their performance. %\JW{Is this a standard `trick' in hardware design? If so it might be nice to cite it.}\YH{Hmm, not really, because it has the downside of kind of halving your available clock period. However, RAMs normally come in both forms on the FPGA (Page 12, Fig. 2, \url{https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug_ram_rom.pdf})}
 % JW: thanks!
 Using the negative edge of the clock is widely supported by synthesis tools, and does not affect the maximum frequency of the final design.
 
@@ -461,7 +461,7 @@ Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is al
   \caption{Timing diagrams showing the execution of loads and stores over multiple clock cycles.}\label{fig:ram_load_store}
 \end{figure}
 
-Figure~\ref{fig:ram_load_store} gives an example of how the RAM interface behaves when values are loaded and stored.
+Fig.~\ref{fig:ram_load_store} gives an example of how the RAM interface behaves when values are loaded and stored.
 
 \subsubsection{Implementing the \texttt{Oshrximm} Instruction}\label{sec:algorithm:optimisation:oshrximm}
 
diff --git a/evaluation.tex b/evaluation.tex
index f5deff8..2a98aa7 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -25,7 +25,7 @@ We were able to use 27 of the 30 programs; three had to be discarded (\texttt{co
 %In summary, we evaluate 27 programs from the latest Polybench suite. 
 We configured \polybench{}'s parameters so that only integer types are used.  We use \polybench{}'s smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses. We have not modified the benchmarks to make them run through \legup{} optimally, e.g. by adding pragmas that trigger more advanced optimisations.
 
-\vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including LegUp, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. Implementing pipelined operators such as the divide and modulus operator can be solved by scheduling the instructions so that these can execute in parallel, which is the main optimisation that needs to be added to \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is replaced with our own implementation that uses repeated division and multiplications by 2.  Figure~\ref{fig:polybench-div} shows the results of comparing \vericert{} with optimised LegUp 4.0 on the \polybench{} benchmarks, where divisions have been left intact.  Figure~\ref{fig:polybench-nodiv} performs the comparison where the division/modulo operations have been replaced by the iterative algorithm.
+\vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including LegUp, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. Implementing pipelined operators such as the divide and modulus operator can be solved by scheduling the instructions so that these can execute in parallel, which is the main optimisation that needs to be added to \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is replaced with our own implementation that uses repeated division and multiplications by 2.  Fig.~\ref{fig:polybench-div} shows the results of comparing \vericert{} with optimised LegUp 4.0 on the \polybench{} benchmarks, where divisions have been left intact.  Fig.~\ref{fig:polybench-nodiv} performs the comparison where the division/modulo operations have been replaced by the iterative algorithm.
 
 \paragraph{Synthesis setup} The Verilog that is generated by \vericert{} or \legup{} is provided to Xilinx Vivado v2017.1~\cite{xilinx_vivad_desig_suite}, which synthesises it to a netlist, before placing-and-routing this netlist onto a Xilinx XC7Z020 FPGA device that contains approximately 85000 LUTs.
 
@@ -155,7 +155,7 @@ We configured \polybench{}'s parameters so that only integer types are used.  We
 Firstly, before comparing any performance metrics, it is worth highlighting that any Verilog produced by \vericert{} is guaranteed to be \emph{correct}, whilst no such guarantee can be provided by \legup{}.
 This guarantee in itself provides a significant leap in terms of HLS reliability, compared to any other HLS tools available.
 
-The top graphs of Figure~\ref{fig:polybench-div} and Figure~\ref{fig:polybench-nodiv} compare the execution time of the 27 programs executed by \vericert{} and the different optimisation levels of \legup{}.  Each graph uses optimised \legup{} as the baseline.  When division/modulo operations are present \legup{} designs execute around 27$\times$ faster than \vericert{} designs.  However, when division/modulo operations are replaced by the iterative algorithm, \legup{} designs are only 2$\times$ faster than \vericert{} designs.  However, the benchmarks with division/modulo replaced show that \vericert{} actually achieves the same execution speed as \legup{} without LLVM optimisations and without operation chaining, which is encouraging, and shows that the hardware generation is following the right steps.  The execution time is calculated by multiplying the maximum frequency that the FPGA can run at with this design, by the number of clock cycles that are needed to complete the execution.  We can therefore analyse each separately.
+The top graphs of Fig.~\ref{fig:polybench-div} and Fig.~\ref{fig:polybench-nodiv} compare the execution time of the 27 programs executed by \vericert{} and the different optimisation levels of \legup{}.  Each graph uses optimised \legup{} as the baseline.  When division/modulo operations are present \legup{} designs execute around 27$\times$ faster than \vericert{} designs.  However, when division/modulo operations are replaced by the iterative algorithm, \legup{} designs are only 2$\times$ faster than \vericert{} designs.  However, the benchmarks with division/modulo replaced show that \vericert{} actually achieves the same execution speed as \legup{} without LLVM optimisations and without operation chaining, which is encouraging, and shows that the hardware generation is following the right steps.  The execution time is calculated by multiplying the maximum frequency that the FPGA can run at with this design, by the number of clock cycles that are needed to complete the execution.  We can therefore analyse each separately.
 
 First, looking at the difference in clock cycles, \vericert{} produces designs that have around 4.5$\times$ as many clock cycles as \legup{} designs in both cases, when division/modulo operations are enabled as well as when they are replaced.  This performance gap can be explained in part by LLVM optimisations, which seem to account for a 2$\times$ decrease in clock cycles, as well as operation chaining, which decreases the clock cycles by another 2$\times$.  The rest of the speed-up is mostly due to \legup{} optimisations such as scheduling and memory analysis, which are designed to extract parallelism from input programs.
 This gap does not represent the performance cost that comes with formally proving a HLS tool.
@@ -164,12 +164,12 @@ As we improve \vericert{} by incorporating further optimisations, this gap shoul
 
 Secondly, looking at the maximum clock frequency that each design can achieve, \legup{} designs achieve 8.2$\times$ the maximum clock frequency of \vericert{} when division/modulo operations are present.  This is in great contrast to the maximum clock frequency that \vericert{} can achieve when no divide/modulo operations are present, where \vericert{} generates designs that are actually 2$\times$ better than the frequency achieved by \legup{} designs.  The dramatic discrepancy in performance for the former case can be largely attributed to \vericert{}'s na\"ive implementations of division and modulo operations, as explained in Section~\ref{sec:evaluation:setup}. Indeed, \vericert{} achieved an average clock frequency of just 13MHz, while \legup{} managed about 111MHz. After replacing the division/modulo operations with our own C-based implementations, \vericert{}'s average clock frequency becomes about 220MHz.  This improvement in frequency can be explained by the fact that \legup{} uses a memory controller to manage multiple RAMs using one interface, which is not needed in \vericert{} as a single RAM is used for the memory.
 
-Looking at a few benchmarks in particular in Figure~\ref{fig:polybench-nodiv} for some interesting cases.  For the \texttt{trmm} benchmark, \vericert{} produces hardware that executes with the same cycle count as \legup{}, and manages to create hardware that achieves twice the frequency compared to \legup{}, thereby actually producing a design that executes twice as fast as \legup{}.  Another interesting benchmark is \texttt{doitgen}, where \vericert{} is comparable to \legup{} without LLVM optimisations, however, LLVM optimisations seem to have a large effect on the cycle count.
+Looking at a few benchmarks in particular in Fig.~\ref{fig:polybench-nodiv} for some interesting cases.  For the \texttt{trmm} benchmark, \vericert{} produces hardware that executes with the same cycle count as \legup{}, and manages to create hardware that achieves twice the frequency compared to \legup{}, thereby actually producing a design that executes twice as fast as \legup{}.  Another interesting benchmark is \texttt{doitgen}, where \vericert{} is comparable to \legup{} without LLVM optimisations, however, LLVM optimisations seem to have a large effect on the cycle count.
 
 \subsection{RQ2: How Area-Efficient is \vericert{}-Generated Hardware?}
 
-The bottom graphs in both Figure~\ref{fig:polybench-div} and Figure~\ref{fig:polybench-nodiv} compare the resource utilisation of the \polybench{} programs generated by \vericert{} and \legup{} at various optimisation levels.
-By looking at the median, when division/modulo operations are enabled, we see that \vericert{} produces hardware that is about the same size as optimised \legup{}, whereas the unoptimised versions of \legup{} actually produce slightly smaller hardware.  This is because optimisations can often increase the size of the hardware to make it faster.  Especially in Figure~\ref{fig:polybench-div}, there are a few benchmarks where the size of the \legup{} design is much smaller than that produced by \vericert{}.  This can mostly be explained because of resource sharing in LegUp.  Division/modulo operations need large circuits, and it is therefore usual to only have one circuit per design.  As \vericert{} uses the na\"ive implementation of division/modulo, there will be multiple circuits present in the design, which blows up the size.  Looking at Figure~\ref{fig:polybench-nodiv}, one can see that without division, the size of \vericert{} designs are almost always around the same size as \legup{} designs, never being more than 2$\times$ larger, and sometimes even being smaller.  The similarity in area also shows that RAM is correctly being inferred by the synthesis tool, and is therefore not implemented as registers.
+The bottom graphs in both Fig.~\ref{fig:polybench-div} and Fig.~\ref{fig:polybench-nodiv} compare the resource utilisation of the \polybench{} programs generated by \vericert{} and \legup{} at various optimisation levels.
+By looking at the median, when division/modulo operations are enabled, we see that \vericert{} produces hardware that is about the same size as optimised \legup{}, whereas the unoptimised versions of \legup{} actually produce slightly smaller hardware.  This is because optimisations can often increase the size of the hardware to make it faster.  Especially in Fig.~\ref{fig:polybench-div}, there are a few benchmarks where the size of the \legup{} design is much smaller than that produced by \vericert{}.  This can mostly be explained because of resource sharing in LegUp.  Division/modulo operations need large circuits, and it is therefore usual to only have one circuit per design.  As \vericert{} uses the na\"ive implementation of division/modulo, there will be multiple circuits present in the design, which blows up the size.  Looking at Fig.~\ref{fig:polybench-nodiv}, one can see that without division, the size of \vericert{} designs are almost always around the same size as \legup{} designs, never being more than 2$\times$ larger, and sometimes even being smaller.  The similarity in area also shows that RAM is correctly being inferred by the synthesis tool, and is therefore not implemented as registers.
 
 \subsection{RQ3: How Quickly does \vericert{} Translate the C into Verilog?}
 
@@ -207,7 +207,7 @@ By looking at the median, when division/modulo operations are enabled, we see th
 
 To gain further confidence that the Verilog designs generated by \vericert{} are actually correct, and that the correctness theorem is indeed effective, we fuzzed \vericert{} using Csmith~\cite{yang11_findin_under_bugs_c_compil}. \citeauthor{yang11_findin_under_bugs_c_compil} previously used Csmith in an extensive fuzzing campaign on CompCert and found a handful of bugs in the unverified parts of that compiler, so it is natural to explore whether it can find bugs in \vericert{} too. \citet{herklotz21_empir_study_reliab_high_level_synth_tools} have recently used Csmith to fuzz other HLS tools including \legup{}, so we configured Csmith in a similar way. In addition to the features turned off by \citeauthor{du21_fuzzin_high_level_synth_tools}, we turned off the generation of global variables and non-32-bit operations. The generated designs were tested by simulating them and comparing the output value to the results of compiling the test-cases with GCC 10.3.0.
 
-The results of the fuzzing run are shown in Figure~\ref{tab:fuzzing}.  Out of 155267 test-cases generated by Csmith, 26\% of them passed, meaning they compiled without error and resulted in the same final value as GCC. Most of the test-cases, 73.97\%, failed at compile time.  The most common reasons for this were unsigned comparisons between integers (\vericert{} requires them to be signed), and the presence of 8-bit operations (which \vericert{} does not support, and which we could not turn off due to a limitation in Csmith). % The latter programs are present because of a slight limitation with how Csmith can be configured, as it does not allow 8-bit operations to be turned off completely, and will get stuck when it generates the C code.  
+The results of the fuzzing run are shown in Fig.~\ref{tab:fuzzing}.  Out of 155267 test-cases generated by Csmith, 26\% of them passed, meaning they compiled without error and resulted in the same final value as GCC. Most of the test-cases, 73.97\%, failed at compile time.  The most common reasons for this were unsigned comparisons between integers (\vericert{} requires them to be signed), and the presence of 8-bit operations (which \vericert{} does not support, and which we could not turn off due to a limitation in Csmith). % The latter programs are present because of a slight limitation with how Csmith can be configured, as it does not allow 8-bit operations to be turned off completely, and will get stuck when it generates the C code.
 Because the test-cases generated by Csmith could not be tailored exactly to the C fragment that \vericert{} supports, such a high compile-time failure rate is not unexpected. Finally, and most interestingly, there were a total of 39 run-time failures, which the correctness theorem should be proving impossible.  However, all 39 of these failures are due to a bug in the pretty-printing of the final Verilog code, where a logical negation (\texttt{!}) was accidentally used instead of a bitwise negation (\verb|~|).  Once this bug was fixed, all test-cases passed.
 
 %%% Local Variables:
diff --git a/proof.tex b/proof.tex
index 6446c9b..19d21d5 100644
--- a/proof.tex
+++ b/proof.tex
@@ -33,7 +33,7 @@ The main correctness theorem is analogous to that stated in \compcert{}~\cite{le
   \end{equation*}
 \end{theorem}
 
-Why is this correctness theorem also the right one for HLS? It could be argued that hardware inherently runs forever and therefore does not produce a definitive final result.  This would mean that the \compcert{} correctness theorem would likely not help with proving that the hardware is actually working correctly, as the behaviour would always be divergent.  However, in practice, HLS does not normally produce the top-level of the design that needs to connect to other components and would therefore need to run forever.  Rather, HLS often produces smaller components that take an input, execute, and then terminate with an answer.  To start the execution of the hardware and to signal to the HLS component that the inputs are ready, the $\mathit{rst}$ signal is set and unset.  Then, once the result is ready, the $\mathit{fin}$ signal is set and the result value is placed in $\mathit{ret}$.  These signals are also present in the semantics of execution shown in Figure~\ref{fig:inference_module}.  The correctness theorem therefore also uses these signals, and the proof shows that once the $\mathit{fin}$ flag is set, the value in $\mathit{ret}$ is correct according to the semantics of Verilog and Clight.  Note that the compiler is allowed to fail and not produce any output; the correctness theorem only applies when the translation succeeds.
+Why is this correctness theorem also the right one for HLS? It could be argued that hardware inherently runs forever and therefore does not produce a definitive final result.  This would mean that the \compcert{} correctness theorem would likely not help with proving that the hardware is actually working correctly, as the behaviour would always be divergent.  However, in practice, HLS does not normally produce the top-level of the design that needs to connect to other components and would therefore need to run forever.  Rather, HLS often produces smaller components that take an input, execute, and then terminate with an answer.  To start the execution of the hardware and to signal to the HLS component that the inputs are ready, the $\mathit{rst}$ signal is set and unset.  Then, once the result is ready, the $\mathit{fin}$ signal is set and the result value is placed in $\mathit{ret}$.  These signals are also present in the semantics of execution shown in Fig.~\ref{fig:inference_module}.  The correctness theorem therefore also uses these signals, and the proof shows that once the $\mathit{fin}$ flag is set, the value in $\mathit{ret}$ is correct according to the semantics of Verilog and Clight.  Note that the compiler is allowed to fail and not produce any output; the correctness theorem only applies when the translation succeeds.
 
 How can we prove this theorem? First, note that the theorem is a `backwards simulation' result (every target behaviour must also be a source behaviour), following the terminology used in the \compcert{} literature~\cite{leroy09_formal_verif_realis_compil}. The reverse direction (every source behaviour must also be a target behaviour) is not demanded because compilers are permitted to resolve any non-determinism present in their source programs. However, since Clight programs are all deterministic, as are the Verilog programs in the fragment we consider, we can actually reformulate the correctness theorem above as a forwards simulation result (following standard \compcert{} practice), which makes it easier to prove.
 
@@ -143,7 +143,7 @@ We can now define the simulation diagram for the translation. The 3AC state can
   \caption{Specification for the memory implementation in HTL, where $r$ is the RAM, which is then implemented by equivalent Verilog code.}\label{fig:htl_ram_spec}
 \end{figure}
 
-HTL can only represent a single state machine, it is therefore necessary to model the RAM abstractly to reason about the correctness of replacing the direct read and writes to the array by loads and stores to a RAM.  The specification used to model the RAM is shown in Figure~\ref{fig:htl_ram_spec}, which defines how the RAM $r$ will behave for all the possible combinations of the input signals.
+HTL can only represent a single state machine, it is therefore necessary to model the RAM abstractly to reason about the correctness of replacing the direct read and writes to the array by loads and stores to a RAM.  The specification used to model the RAM is shown in Fig.~\ref{fig:htl_ram_spec}, which defines how the RAM $r$ will behave for all the possible combinations of the input signals.
 
 \subsubsection{From Implementation to Specification}
 
@@ -207,7 +207,7 @@ The final lemma we need is that the Verilog we generate is deterministic. This r
 \end{lemma}
 
 \begin{proof}[Proof sketch]
-  The Verilog semantics is deterministic because the order of operation of all the constructs is defined, and there is therefore only one way to evaluate the module and hence only one possible behaviour. This was proven for the small-step semantics shown in Figure~\ref{fig:inference_module}.
+  The Verilog semantics is deterministic because the order of operation of all the constructs is defined, and there is therefore only one way to evaluate the module and hence only one possible behaviour. This was proven for the small-step semantics shown in Fig.~\ref{fig:inference_module}.
 \end{proof}
 
 %\subsection{Coq Mechanisation}
diff --git a/related.tex b/related.tex
index ee8e6b4..2ec5114 100644
--- a/related.tex
+++ b/related.tex
@@ -42,7 +42,7 @@
   \caption{Summary of related work}\label{fig:related_euler}
 \end{figure}
 
-A summary of the related works can be found in Figure~\ref{fig:related_euler}, which is represented as an Euler diagram.  The categories chosen for the Euler diagram are: whether the tool is usable, whether it takes a high-level software language as input, whether it has a correctness proof, and finally whether that proof is mechanised.  The goal of \vericert{} is to cover all of these categories.
+A summary of the related works can be found in Fig.~\ref{fig:related_euler}, which is represented as an Euler diagram.  The categories chosen for the Euler diagram are: whether the tool is usable, whether it takes a high-level software language as input, whether it has a correctness proof, and finally whether that proof is mechanised.  The goal of \vericert{} is to cover all of these categories.
 
 Most practical HLS tools~\citep{canis11_legup,xilinx20_vivad_high_synth,intel20_sdk_openc_applic,nigam20_predic_accel_desig_time_sensit_affin_types} fit into the category of usable tools that take high-level inputs.  On the other end of the spectrum, there are tools such as BEDROC~\citep{chapman92_verif_bedroc} for which there is no practical tool, and even though it is described as high-level synthesis, it more closely resembles today's logic synthesis tools.
 
diff --git a/verilog.tex b/verilog.tex
index 67ae148..772fec7 100644
--- a/verilog.tex
+++ b/verilog.tex
@@ -105,10 +105,10 @@ To support this computational model, we extend the Verilog module we generate wi
 %\JW{Is there a mismatch between `st' in the figure and `stk' in the text?}\YH{It was actually between $\Gamma_{a}$ and \mathit{stk}.  The \mathit{st} should have been $\sigma$.}
 \end{description}
 
-Figure~\ref{fig:inference_module} shows the inference rules for moving between the computational states.  The first, \textsc{Step}, is the normal rule of execution.  It defines one step in the \texttt{State} state, assuming that the module is not being reset, that the finish state has not been reached yet, that the current and next state are $v$ and $v'$, and that the module runs from state $\Gamma$ to $\Gamma'$ using the \textsc{Step} rule.  The \textsc{Finish} rule returns the final value of running the module and is applied when the $\mathit{fin}$ register is set; the return value is then taken from the $\mathit{ret}$ register.
+Fig.~\ref{fig:inference_module} shows the inference rules for moving between the computational states.  The first, \textsc{Step}, is the normal rule of execution.  It defines one step in the \texttt{State} state, assuming that the module is not being reset, that the finish state has not been reached yet, that the current and next state are $v$ and $v'$, and that the module runs from state $\Gamma$ to $\Gamma'$ using the \textsc{Step} rule.  The \textsc{Finish} rule returns the final value of running the module and is applied when the $\mathit{fin}$ register is set; the return value is then taken from the $\mathit{ret}$ register.
 
 Note that there is no step from \texttt{State} to \texttt{Callstate}; this is because function calls are not supported, and it is therefore impossible in our semantics ever to reach a \texttt{Callstate} except for the initial call to main. So the \textsc{Call} rule is only used at the very beginning of execution; likewise, the \textsc{Return} rule is only matched for the final return value from the main function.
-Therefore, in addition to the rules shown in Figure~\ref{fig:inference_module}, an initial state and final state need to be defined:
+Therefore, in addition to the rules shown in Fig.~\ref{fig:inference_module}, an initial state and final state need to be defined:
 
 \begin{gather*}
   \inferrule[Initial]{\yhfunction{is\_internal}\ (P.\texttt{main})}{\yhfunction{initial\_state}\ (\yhconstant{Callstate } []\ (P.\texttt{main})\ [])}\qquad
@@ -197,7 +197,7 @@ The Verilog semantics do not define a memory model for Verilog, as this is not n
 
 %\JW{It's not completely clear what the relationship is between your work and those works. The use of `only' suggests that you've re-done a subset of work that has already been done -- is that the right impression?}\YH{Hopefully that's more clear.}
 
-This translation is represented in Figure~\ref{fig:memory_model_transl}.  \compcert{} defines a map from blocks to maps from memory addresses to memory contents.  Each block represents an area in memory; for example, a block can represent a global variable or a stack for a function. As there are no global variables, the main stack can be assumed to be block 0, and this is the only block we translate.
+This translation is represented in Fig.~\ref{fig:memory_model_transl}.  \compcert{} defines a map from blocks to maps from memory addresses to memory contents.  Each block represents an area in memory; for example, a block can represent a global variable or a stack for a function. As there are no global variables, the main stack can be assumed to be block 0, and this is the only block we translate.
 %\JW{So the stack frame for a function called by main would be in a different block, is that the idea? Seems unusual not to have a single stack.}
 %\YH{Yeah exactly, it makes it much easier to reason about though, because everything is nicely isolated.  This is exactly what CompCertELF and CompCertS try and solve though.} 
 %\JW{Would global variables normally be put in blocks 1, 2, etc.?}
author	Yann Herklotz <git@yannherklotz.com>	2021-09-09 21:09:05 +0100
committer	Yann Herklotz <git@yannherklotz.com>	2021-09-09 21:09:05 +0100
commit	55528c1f1a11988897d46993756d6aa5873095af (patch)
tree	3f77f427e8aba3712ac9a145a28471c9ddf2ead7
parent	301291487f61c924ec16ec73c632b19ac5395a6d (diff)
download	oopsla21_fvhls-55528c1f1a11988897d46993756d6aa5873095af.tar.gz oopsla21_fvhls-55528c1f1a11988897d46993756d6aa5873095af.zip