authorJohn Wickerson <j.wickerson@imperial.ac.uk>2020-11-19 20:51:45 +0000
committeroverleaf <overleaf@localhost>2020-11-19 20:52:11 +0000
commit70df35bc74805473cd4a1e48293cb29d09b3767c (patch)
tree6526879d2701ecbdee18f256779f08789a093c52
parent1d66503454f22db76b8a314ea1f30babca8f7c93 (diff)
Update on Overleaf.
-rw-r--r--algorithm.tex37
-rw-r--r--evaluation.tex14
-rw-r--r--references.bib9
-rw-r--r--results/coqup.csv3
-rw-r--r--results/legup.csv4
-rw-r--r--results/poly.csv3
6 files changed, 44 insertions, 26 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 1aabe01..4c4971e 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -110,11 +110,11 @@ main() {
\subsection{Translating C to Verilog, by example}
-Using the simple accumulator program shown in Figure~\ref{fig:accumulator_c} as a worked example, this section describes the main translation that is performed in \vericert{} to go from the behavioural description in C into a hardware design in Verilog.
+Using the simple program in Figure~\ref{fig:accumulator_c} as a worked example, this section describes how \vericert{} translates a behavioural description in C into a hardware design in Verilog.
\subsubsection{Translating C to 3AC}
-The first step of the translation is to use \compcert{} to transform the input C code into the 3AC shown in Figure~\ref{fig:accumulator_rtl}. As part of this, \compcert{} performs optimisations such as constant propagation and dead-code elimination. Function inlining is also performed, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction. The duplication of the function bodies caused by inlining does affect the total area of the hardware, however, latency is improved instead. In addition to that, inlining removes the possibility of supporting recursive function calls, however, this is a feature that isn't supported in most other HLS tools either.
+The first step of the translation is to use \compcert{} to transform the input C code into the 3AC shown in Figure~\ref{fig:accumulator_rtl}. As part of this, \compcert{} performs optimisations such as constant propagation and dead-code elimination. Function inlining is also performed, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction. Although the duplication of function bodies caused by inlining can increase the area of the hardware, it can have a positive effect on latency. Moreover, inlining excludes support for recursive function calls, but this feature is not supported in most other HLS tools either~\cite{davidthomas_asap16}.
%\JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling. With scheduling I think that's true, inlining actually becomes quite good.}
@@ -128,7 +128,9 @@ The first step of the translation is to use \compcert{} to transform the input C
%\JW{I feel like this could use some sort of citation, but I'm not sure what. I guess this is all from "Hardware Design 101", right?}\YH{I think I found a good one actually, which goes over the basics.}
%\JW{I think it would be worth having a sentence to explain how the C model of memory is translated to a hardware-centric model of memory. For instance, in C we have global variables/arrays, stack-allocated variables/arrays, and heap-allocated variables/arrays (anything else?). In Verilog we have registers and RAM blocks. So what's the correspondence between the two worlds? Globals and heap-allocated are not handled, stack-allocated variables become registers, and stack-allocated arrays become RAM blocks? Am I close?}\YH{Stack allocated variables become RAM as well, so that we can deal with addresses easily and take addresses of any variable.} \JW{I see, thanks. So, in short, the only registers in your hardware designs are those that store things like the current state, etc. You generate a fixed number of registers every time you synthesis -- you don't generate extra registers to store any of the program variables. Right?}
-The first translation performed in \vericert{} is from 3AC to a new hardware translation language (HTL), which is one step towards being completely translated to hardware described in Verilog. The main translation that is performed is going from a CFG representation of the computation to a finite state machine with datapath (FSMD)~\cite{hwang99_fsmd} representation in HTL.\@ The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers, so that the state transitions can be translated into a simple finite state machine (FSM) and each state then contains data operations that update the memory and registers. Figure~\ref{fig:accumulator_diagram} shows the resulting architecture of the FSMD. Hardware does not have the same memory model as C, the memory model therefore needs to be translated in the following way. Global variables are not translated in \vericert{} at the moment, however, the stack of the main function will become the RAM seen in Figure~\ref{fig:accumulator_diagram}. Variables that have their address is taken will therefore be stored in the RAM, as well as any arrays or structs defined in the function. Variables that did not have their address taken will be kept in registers.
+The first translation performed in \vericert{} is from 3AC to a new hardware translation language (HTL). %, which is one step towards being completely translated to hardware described in Verilog.
+This involves going from a CFG representation of the computation to a finite state machine with datapath (FSMD) representation~\cite{hwang99_fsmd}. The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers. %This means that the state transitions can be translated into a simple finite state machine (FSM) where each state contains data operations that update the memory and registers.
+An HTL program thus consists of two maps: a control map that describes how to calculate the next state from the current state, and a datapath map that describes how to update the registers and RAM given the current state. Figure~\ref{fig:accumulator_diagram} shows the resulting architecture of the FSMD.
\begin{figure*}
\centering
@@ -136,19 +138,22 @@ The first translation performed in \vericert{} is from 3AC to a new hardware tra
\caption{The FSMD for our running example. \JW{Maybe replace `State' with `Current State'? And maybe `Calculate State' could be clearer as `Calculate Next State'?} \JW{Can state 15 (or should it be state 16??) have a dangling incoming arrow to indicate that it is the start state? And perhaps state 1 could have a double outline to indicate that it is an `accepting' state? Since there's space above the `Calculate State' box, I'd be mildly in favour of expanding that box a bit so that it included all 15 states explicitly (snaking back and forth).}\YH{If this is better I can mock up a tikz version of it maybe and fix the last bits then too.}}\label{fig:accumulator_diagram}
\end{figure*}
-The translation from 3AC to HTL is straightforward, as each 3AC instruction either matches up quite well to a hardware construct, or does not have to be handled by the translation, such as function calls.
-%At each instruction, the control flow is separated from the data computation and is then added to the control logic and data-flow map respectively.
-%\JW{I suspect that you could safely chop that sentence.}
+\paragraph{Translating memory}
+Hardware does not have the same memory model as C, so the memory model needs to be translated, as follows. Global variables are not translated in \vericert{} at the moment. The stack of the main function becomes a block of RAM, as seen in Figure~\ref{fig:accumulator_diagram}. Program variables that have their address taken are stored in this RAM, as are any arrays or structs defined in the function. Variables that do not have their address taken are kept in registers.
+
+\paragraph{Translating instructions}
+Each 3AC instruction either corresponds directly to a hardware construct or, as in the case of function calls, does not need to be handled by the translation at all.
For example, in state 16 in Figure~\ref{fig:accumulator_rtl}, the register \texttt{x9} is initialised to 1, after which the control flow moves to state 15. This is encoded in HTL by initialising a 32-bit register \texttt{reg\_9} to 1 in the data-flow section, and also adding a transition to the state 15 in the control logic section. Simple operator instructions are translated in a similar way. For example, in state 5, the value in the array is added to the current value of the accumulated sum, which is simply translated to an addition of the equivalent registers in the HTL code.
-\paragraph{Key challenge: signedness} Note that the comparison in state 3 is signed. By default, all operators and registers in Verilog and HTL are unsigned, so to force an operation to handle the bits as signed, both operators have to be forced to signed. In addition to that, Verilog resizes expressions to the largest needed size by default, which can affect the result of the computation. This feature is also not supported by the Verilog semantics we used, and there would therefore be a mismatch between the Verilog semantics and the actual behaviour of Verilog according to the standard. To bypass this issue braces are used to stop the Verilog simulator or synthesis tool from resizing anything inside the braces. Instead, explicit resizing is used in the semantics and operations can only be performed on two registers that have the same size.
+\paragraph{Key challenge: signedness} Note that the comparison in state 3 is signed. This is because C and Verilog handle signedness quite differently. By default, all operators and registers in Verilog (and HTL) are unsigned, so to force an operation to handle the bits as signed, both operands have to be forced to be signed. In addition, Verilog resizes expressions to the largest needed size by default, which can affect the result of the computation. This feature is also not supported by the Verilog semantics we adopted, so there would otherwise be a mismatch between our Verilog semantics and the actual behaviour of Verilog according to the standard. To bypass this issue, braces are used to stop the Verilog simulator or synthesis tool from resizing anything inside the braces. Instead, explicit resizing is used in the semantics, and operations can only be performed on two registers that have the same size.
\subsection{Translating HTL to Verilog}
-Finally, we have to translate the HTL code into proper Verilog and prove that it behaves the same as the 3AC according to the Verilog semantics. Whereas HTL is a language that is specifically designed to represent the FSMDs we are interested in, Verilog is a general-purpose HDL.\@ So the challenge here is to translate our FSMD representation into a Verilog AST. However, as all the instructions are already expressed in Verilog, only the maps need to be translated to valid Verilog, and correct declarations for all the variables in the program need to be added as well.
+Finally, we have to translate the HTL code into proper Verilog. % and prove that it behaves the same as the 3AC according to the Verilog semantics.
+Whereas HTL is a language that is specifically designed to represent the FSMDs we are interested in, Verilog is a general-purpose HDL.\@ So the challenge here is to translate our FSMD representation into a Verilog AST. However, as all the instructions are already expressed in Verilog, only the maps need to be translated to valid Verilog, and correct declarations for all the variables in the program need to be added as well.
-This translation seems quite straightforward, however, proving that it is correct is not that simple, as all the implicit assumptions that were made in HTL need to be translated explicitly to Verilog statements and needs it needs to be shown that these explicit behaviours are equivalent to the assumptions made in the HTL semantics.
-Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated. In general, the structure is similar to that of the HTL code, however, the control and datapath maps have been translated to case statements. The other main addition to the code is the initialisation of all the variables in the code to the correct bitwidths and the declaration of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.
+This translation seems quite straightforward, but proving it correct is not that simple, as all the implicit assumptions that were made in HTL need to be translated explicitly to Verilog statements and it needs to be shown that these explicit behaviours are equivalent to the assumptions made in the HTL semantics.
+Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated for our worked example. In general, the structure is similar to that of the HTL code, but the control and datapath maps become Verilog case-statements. The other main addition to the code is the initialisation of all the variables in the code to the correct bitwidths and the declaration of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.
\begin{figure}
\centering
@@ -223,24 +228,24 @@ endmodule
\subsection{Optimisations}
-Although \vericert{} is not yet a proper `optimising' HLS compiler, we have implemented a few optimisations that aim to improve the quality of the hardware designs it produces.
+Although we would not claim that \vericert{} is a proper `optimising' HLS compiler yet, we have nonetheless implemented a few optimisations that aim to improve the quality of the hardware designs it produces.
\subsubsection{Byte- and word-addressable memories}
-One big difference between C and Verilog is how memory is represented. In hardware, efficient RAMs are not as available as in software, and need to be explicitly implemented by declaring two-dimensional arrays with specific properties. A major limitation is that RAMs often only allow one read and one write per clock cycle. To make loads and stores as efficient as possible, the RAM needs to be word-addressable, so that an entire integer can be loaded or stored in one clock cycle.
-However, the memory model that \compcert{} uses for its intermediate languages~\cite{blazy05_formal_verif_memor_model_c} is byte-addressable. It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware. Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way. As only integer loads and stores are currently supported in our HLS back end, it follows that the addresses given to the loads and stores should be divisible by four. If that is the case, then the translation from byte-addressed memory to word-addressed memory could be done by dividing the address by four.
+One big difference between C and Verilog is how memory is represented. In hardware, efficient RAMs are not as readily available as in software, and need to be explicitly implemented by declaring two-dimensional arrays with specific properties. A major limitation is that RAMs often only allow one read and one write per clock cycle. So, to make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.
+However, the memory model that \compcert{} uses for its intermediate languages is byte-addressable~\cite{blazy05_formal_verif_memor_model_c}. It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware. Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way. Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four. If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
\subsubsection{Implementing the \texttt{Oshrximm} instruction}
% Mention that this optimisation is not performed sometimes (clang -03).
-\vericert{} performs some optimisations at the level of the instructions that are generated, so that the hardware performs the instructions as quickly as possible and so that the maximum frequency at which the hardware can run is increased. One of the main constructs that cripple performance of the generated hardware is the instantiation of divider circuits in the hardware. In the case of \vericert{}, it requires the result of the divide operation to be ready in the same clock cycle, meaning the divide circuit needs to be implemented fully combinationally. This is inefficient in terms of hardware size, but also in terms of latency, because it means that the maximum frequency of the hardware needs to be reduced dramatically so that the divide circuit has enough time to finish.
+\vericert{} performs some optimisations at the level of the instructions that are generated, so that the hardware performs the instructions as quickly as possible and so that the maximum frequency at which the hardware can run is increased. One of the main constructs that cripples performance of the generated hardware is the instantiation of divider circuits. Since \vericert{} requires the result of a divide operation to be ready within a single clock cycle, the divide circuit needs to be entirely combinational. This is inefficient in terms of area, but also in terms of latency, because it means that the maximum frequency of the hardware must be reduced dramatically so that the divide circuit has enough time to finish.
%\JP{Multi-cycle paths might be something worth exploring in future work: fairly error-prone/dangerous for hand-written code, but might be interesting in generated code.}\YH{Definitely is on the list for next things to look into, will make divide so much more efficient.}
-These small optimisations were found to be the most error prone, and guaranteeing that the new representation is equivalent to representation used in the \compcert{} semantics is difficult without proving this for all possible inputs.
+%These small optimisations were found to be the most error prone, and guaranteeing that the new representation is equivalent to representation used in the \compcert{} semantics is difficult without proving this for all possible inputs.
-Dividing by a constant can often be optimised to a more efficient operation, especially if the denominator is a factor of two. In \compcert{}, the \texttt{Oshrximm} is a strength-reduced version of a signed divide operation, performing the following operation, which is transformed to our optimal representation on the right, where $\div$ stands for integer signed division and $>>$ stands for a logical right shift:
+But dividing by a constant can often be optimised to a more efficient operation, especially if the denominator is a power of two. In \compcert{}, the \texttt{Oshrximm} instruction is a strength-reduced version of a signed divide operation that performs the following operation, which is transformed to our optimal representation on the right, where $\div$ stands for integer signed division and $>>$ stands for a logical right shift:
\begin{align*}
&\forall x, y \in \mathbb{Z},\ \ 0 \leq y < 31,\ \ -2^{31} \leq x < 2^{31},\\
diff --git a/evaluation.tex b/evaluation.tex
index 8981f11..5970de1 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -16,16 +16,16 @@ Our evaluation is designed to answer the following three research questions.
We chose Polybench 4.2.1 for our experiments, which consists of 30 programs.
Out of these 30 programs, three programs utilise square root functions: \texttt{co-relation}, \texttt{gramschmidt} and \texttt{deriche}.
Hence, we were unable to evaluate these programs, since they require \texttt{float}s.
-Interestingly, we were also unable to evaluate \texttt{cholesky} on \legup{}, since it produce an error during its HLS compilation.
-In summary, we evaluate 26 programs from the latest Polybench suite.
+% Interestingly, we were also unable to evaluate \texttt{cholesky} on \legup{}, since it produce an error during its HLS compilation.
+In summary, we evaluate 27 programs from the latest Polybench suite.
\paragraph{Configuring Polybench for experimentation}
We configure Polybench's metadata and slightly modify the source code to suit our purposes.
Firstly, we restrict Polybench to only generate integer data types, since we do not currently support floats or doubles.
Secondly, we utilise Polybench's smallest data set size for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
Furthermore, the C divide and modulo operators translate directly to the built-in Verilog divide and modulo operators.
-Unfortunately, the built-in operators are designed as single-cycle operation, causing large penalties in latency and area.
-To work around this issue, we use a C implementation of the divide and modulo operations, which is indirectly compiles them as multi-cycle operations on the FPGA, reducing their latency penalties drastically.
+Unfortunately, the built-in operators are designed as single-cycle operations, causing large penalties in clock frequency.
+To work around this issue, we use a C implementation of the divide and modulo operations, which indirectly compiles them to multi-cycle operations on the FPGA.
In addition, we initialise the input arrays and check the output arrays of all programs entirely on-chip.
% For completeness, we use the full set of 24 benchmarks. We set the benchmark parameters so that all datatypes are integers (since \vericert{} only supports integers) and all datasets are `small' (to fit into the small on-chip memories). A current limitation of \vericert{}, as discussed in Section~\ref{?}, is that it does not support addition and subtraction operations involving integer literals not divisible by 4. To work around this, we lightly modified each benchmark program so that literals other than multiples of 4 are stored into variables before being added or subtracted. \JW{Any other notable changes to the benchmarks?}
@@ -95,13 +95,13 @@ In addition, we initial the input arrays and check the output arrays of all prog
Firstly, before comparing any performance metrics, it is worth highlighting that any Verilog produced by \vericert{} is guaranteed to be \emph{correct}, whilst no such guarantee can be provided by \legup{}.
This guarantee in itself provides a significant leap in the reliability of HLS, compared to any other HLS tool available.
-Figure~\ref{fig:comparison_cycles} compares the cycle counts of our 26 programs executed by \vericert{} and \legup{} respectively.
+Figure~\ref{fig:comparison_cycles} compares the cycle counts of our 27 programs executed by \vericert{} and \legup{} respectively.
In most cases, we see that the data points are above the diagonal, demonstrating that the \legup{}-generated hardware is faster than \vericert{}-generated Verilog.
This performance gap is mostly due to \legup{} optimisations such as scheduling and memory analysis, which are designed to exploit parallelism from input programs.
-On average, \legup{} designs are $4\times$ faster than \vericert{} designs.
+On average, \legup{} designs are $4\times$ faster than \vericert{} designs on Polybench programs.
This gap does not represent the performance cost that comes with formally verifying an HLS tool.
Instead, it is simply a gap between an unoptimised \vericert{} versus an optimised \legup{}.
-In fact, without any optimisation, a few data points are close to diagonal and even below diagonal, which means \vericert{} is competitive to \legup{}.
+In fact, even without any optimisations, a few data points are close to the diagonal and some are even below it, which means \vericert{} is competitive with \legup{}.
We are very encouraged by these data points.
As we optimise \vericert{} to incorporate other HLS optimisations in a formally-proved manner, this gap should reduce whilst preserving our correctness guarantees.
diff --git a/references.bib b/references.bib
index b18de8f..e2b7468 100644
--- a/references.bib
+++ b/references.bib
@@ -59,6 +59,15 @@ year = {2020},
pages={69-70},
doi={10.1109/MEMCOD.2004.1459818}}
+@inproceedings{davidthomas_asap16,
+ author = {Thomas, David B.},
+ title = {Synthesisable recursion for {C++} {HLS} tools},
+ booktitle = {{ASAP}},
+ pages = {91--98},
+ publisher = {{IEEE} Computer Society},
+ year = {2016}
+}
+
@article{cong+11,
author = {Jason Cong and
Bin Liu and
diff --git a/results/coqup.csv b/results/coqup.csv
index d570563..063a667 100644
--- a/results/coqup.csv
+++ b/results/coqup.csv
@@ -23,4 +23,5 @@ jacobi-1d,17980,262.12,1652,3261,0,0
jacobi-2d,385212,115.1,30426,59844,0,4
seidel-2d,979334,128.88,26988,53293,0,2
floyd-warshall,5038446,96.62,60609,119811,0,2
-nussinov,882120,73.22,60238,119719,0,0 \ No newline at end of file
+nussinov,882120,73.22,60238,119719,0,0
+cholesky,2347850,90.88,53992,106180,0,8 \ No newline at end of file
diff --git a/results/legup.csv b/results/legup.csv
index d98a6f2..6be3136 100644
--- a/results/legup.csv
+++ b/results/legup.csv
@@ -23,4 +23,6 @@ jacobi-1d,6914,386.25,1355,1885,0,0,4.72
jacobi-2d,84609,240.79,2347,3185,0,2,4.81
seidel-2d,345294,232.4,2128,3337,0,2,4.68
floyd-warshall,1238764,235.52,1869,2367,0,2,4.71
-nussinov,216467,273.07,1078,1431,0,2,4.79 \ No newline at end of file
+nussinov,216467,273.07,1078,1431,0,2,4.79
+adi,90644,256.28,1111,2391,0,5.32
+cholesky,148443,273.6,1528,3419,0,15,4.89 \ No newline at end of file
diff --git a/results/poly.csv b/results/poly.csv
index dee66fb..1aaf90b 100644
--- a/results/poly.csv
+++ b/results/poly.csv
@@ -23,4 +23,5 @@ jacobi-1d,6914,386.25,1355,1885,0,0,4.72,16606,277.93,1636,3305,0,0,0.071
jacobi-2d,84609,240.79,2347,3185,0,2,4.81,357100,113.53,30393,59782,0,4,0.079
seidel-2d,345294,232.4,2128,3337,0,2,4.68,875758,127.99,26948,53133,0,2,0.091
floyd-warshall,1238764,235.52,1869,2367,0,2,4.71,4762766,109.4,59859,118101,0,2,0.094
-nussinov,216467,273.07,1078,1431,0,2,4.79,837958,90.73,60663,119303,0,0,0.080 \ No newline at end of file
+nussinov,216467,273.07,1078,1431,0,2,4.79,837958,90.73,60663,119303,0,0,0.080
+cholesky,148443,273.6,1528,3419,0,15,4.89,2347850,90.88,53992,106180,0,8,0.100 \ No newline at end of file