1 files changed, 14 insertions, 6 deletions
diff --git a/algorithm.tex b/algorithm.tex
index c38d7e0..52f3f7a 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -7,15 +7,17 @@ This section covers the main architecture of the HLS tool, and the way in which
 First of all, the choice of C for the input language of \vericert{} is simply because it is what most major HLS tools use~\cite{canis11_legup, xilinx20_vivad_high_synth, intel_hls, bambu_hls}. This, in turn, may be because C is ``[t]he starting point for the vast majority of algorithms to be implemented in hardware''~\cite{5522874}.
 %Since a lot of existing code for HLS is written in C, supporting C as an input language, rather than a custom domain-specific language, means that \vericert{} is more practical. 
 %An alternative was to support LLVM IR as an input language, however, to get a full work flow from a higher level language to hardware, a front end for that language to LLVM IR would also have to be verified. \JW{Maybe save LLVM for the `Choice of implementation language'?}
-\JW{We considered Bluespec~\cite{bluespec}, but decided that although it ``can be classed as a high-level language''~\cite{greaves_note}, it is too hardware-oriented to be used for traditional HLS.
-We also considered using a language with built-in parallel constructs that map well to parallel hardware, such as occam~\cite{page91_compil_occam} or Spatial~\cite{spatial}, but found these languages are too niche.}
+We considered Bluespec~\cite{nikhil04_blues_system_veril}, but decided that although it ``can be classed as a high-level language''~\cite{greaves_note}, it is too hardware-oriented to be used for traditional HLS.
+We also considered using a language with built-in parallel constructs that map well to parallel hardware, such as occam~\cite{page91_compil_occam} or Spatial~\cite{spatial}, but found these languages are too niche.
 % However, this would not qualify as being HLS due to the manual parallelism that would have to be performed. \JW{I don't think the presence of parallelism stops it being proper HLS.} 
 %\JP{I think I agree with Yann here, but it could be worded better. At any rate not many people have experience writing what is essentially syntactic sugar over a process calculus.} 
 %\JW{I mean: there are plenty of software languages that involve parallel constructs. Anyway, perhaps we can just dismiss occam for being too obscure.}
 
 
 \paragraph{Choice of target language}
-Verilog~\cite{06_ieee_stand_veril_hardw_descr_languag} is an HDL that can be synthesised into logic cells which can be either placed onto a field-programmable gate array (FPGA) or turned into an application-specific integrated circuit (ASIC).  Verilog was chosen as the output language for \vericert{} because it is one of the most popular HDLs and there already exist a few formal semantics for it that could be used as a target~\cite{loow19_verif_compil_verif_proces, meredith10_veril}. Other possible targets could have been Bluespec, a higher level hardware description language, for which there exists a formally verified translation to circuits using K\^{o}ika~\cite{bourgeat20_essen_blues}, however, targeting this language would not be trivial as it is not meant to be targeted by an automatic tool.  Finally, a custom circuit language could also have been targeted, which can then be translated to Verilog in an unverified way, however, some guarantees would be lost and it would not be possible to completely trust the output. \JP{Is this meant to be a dig at Koika :p It feels a bit odd mentioning Koika and then saying this without any comment or evaluation.}\YH{Yeah not at all, I guess it could have been another avenue to go down, don't know how to not make it sound like a dig, it's quite likely an author of koika might review this :)} \JP{What about that IR from ETH (I think) at PLDI 2020? Obviously didn't exist at the start, but might be worth comment.}\YH{Yeah, LLHD might be worth mentioning, but I think we can just say lack of formalisation makes it difficult} %\JW{Can we mention one or two alternatives that we considered? Bluespec or Chisel or one of Adam Chlipala's languages, perhaps?}
+Verilog~\cite{06_ieee_stand_veril_hardw_descr_languag} is an HDL that can be synthesised into logic cells, which can either be placed onto a field-programmable gate array (FPGA) or turned into an application-specific integrated circuit (ASIC).  Verilog was chosen as the output language for \vericert{} because it is one of the most popular HDLs and there already exist a few formal semantics for it that could be used as a target~\cite{loow19_verif_compil_verif_proces, meredith10_veril}.  Another possible target could have been Bluespec, a higher-level hardware description language for which there exists a formally verified translation to circuits using K\^{o}ika~\cite{bourgeat20_essen_blues}.  However, targeting this language would not be trivial, as it is not meant to be targeted by an automatic tool; it strives instead to be a formally verified high-level hardware description language.  Finally, an intermediate language like LLHD~\cite{schuiki20_llhd} was also considered; however, it currently has no formal semantics.
+
+%\JW{Can we mention one or two alternatives that we considered? Bluespec or Chisel or one of Adam Chlipala's languages, perhaps?}
 
 \paragraph{Choice of implementation language}
 The framework that was chosen for the frontend was \compcert{}, as it is a mature framework for simulation proofs about intermediate languages, in addition to already providing a validated parser~\cite{jourdan12_valid_lr_parser} from C into the internal representation of Clight.  Other frameworks were also considered, such as Vellvm~\cite{zhao12_formal_llvm_inter_repres_verif_progr_trans}, as LLVM IR in particular is often used by HLS tools anyways, however, these would require more work to support a higher level language such as C as input, or even providing a parser for LLVM IR.\@ \JP{No mention of Coq here.} \JW{Could also mention Kiwi here~\cite{kiwi} as an example of an HLS tool built upon the .NET framework.}
@@ -110,7 +112,9 @@ Using the simple accumulator program shown in Figure~\ref{fig:accumulator_c} as
 
 \subsubsection{Translating C to 3AC}
 
-The first step of the translation is to use \compcert{} to transform the input C code into the 3AC shown in Figure~\ref{fig:accumulator_rtl}. As part of this, \compcert{} performs optimisations such as constant propagation and dead-code elimination.  Function inlining is also performed, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  The duplication of the function bodies caused by inlining does affect the total area of the hardware \JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're  not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling.  With scheduling I think that's true, inlining actually becomes quite good.}, however it improves the latency of the hardware.  In addition to that, inlining removes the possibility of supporting recursive function calls, however, this is a feature that isn't supported in most other HLS tools either.
+The first step of the translation is to use \compcert{} to transform the input C code into the 3AC shown in Figure~\ref{fig:accumulator_rtl}. As part of this, \compcert{} performs optimisations such as constant propagation and dead-code elimination.  Function inlining is also performed, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  The duplication of the function bodies caused by inlining does affect the total area of the hardware; in return, however, latency is improved.  In addition, inlining removes the possibility of supporting recursive function calls, but this is a feature that is not supported in most other HLS tools either.
+
+%\JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're  not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling.  With scheduling I think that's true, inlining actually becomes quite good.}
 
 \subsubsection{Translating 3AC to HTL}
 
@@ -119,8 +123,10 @@ The first step of the translation is to use \compcert{} to transform the input C
 %   + TODO Clarify connection between CFG and FSMD
 
 %   + TODO Explain how memory is mapped
+%\JW{I feel like this could use some sort of citation, but I'm not sure what. I guess this is all from "Hardware Design 101", right?}\YH{I think I found a good one actually, which goes over the basics.}
+%\JW{I think it would be worth having a sentence to explain how the C model of memory is translated to a hardware-centric model of memory. For instance, in C we have global variables/arrays, stack-allocated variables/arrays, and heap-allocated variables/arrays (anything else?). In Verilog we have registers and RAM blocks. So what's the correspondence between the two worlds? Globals and heap-allocated are not handled, stack-allocated variables become registers, and stack-allocated arrays become RAM blocks? Am I close?}\YH{Stack allocated variables become RAM as well, so that we can deal with addresses easily and take addresses of any variable.} \JW{I see, thanks. So, in short, the only registers in your hardware designs are those that store things like the current state, etc. You generate a fixed number of registers every time you synthesis -- you don't generate extra registers to store any of the program variables. Right?}
 
-The first translation performed in \vericert{} is from 3AC to a new hardware translation language (HTL), which is one step towards being completely translated to hardware described in Verilog.  The main translation that is performed is going from a CFG representation of the computation to a finite state machine with datapath (FSMD)~\cite{hwang99_fsmd}\JW{I feel like this could use some sort of citation, but I'm not sure what. I guess this is all from "Hardware Design 101", right?}\YH{I think I found a good one actually, which goes over the basics.} representation in HTL.\@  The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers, so that the state transitions can be translated into a simple finite state machine (FSM) and each state then contains data operations that update the memory and registers.  Figure~\ref{fig:accumulator_diagram} shows the resulting architecture of the FSMD. \JW{I think it would be worth having a sentence to explain how the C model of memory is translated to a hardware-centric model of memory. For instance, in C we have global variables/arrays, stack-allocated variables/arrays, and heap-allocated variables/arrays (anything else?). In Verilog we have registers and RAM blocks. So what's the correspondence between the two worlds? Globals and heap-allocated are not handled, stack-allocated variables become registers, and stack-allocated arrays become RAM blocks? Am I close?}\YH{Stack allocated variables become RAM as well, so that we can deal with addresses easily and take addresses of any variable.} \JW{I see, thanks. So, in short, the only registers in your hardware designs are those that store things like the current state, etc. You generate a fixed number of registers every time you synthesis -- you don't generate extra registers to store any of the program variables. Right?} Hardware does not have the same memory model as C, the memory model therefore needs to be translated in the following way.  Global variables are not translated in \vericert{} at the moment, however, the stack of the main function will become the RAM seen in Figure~\ref{fig:accumulator_diagram}.  Variables that have their address is taken will therefore be stored in the RAM, as well as any arrays or structs defined in the function.  Variables that did not have their address taken will be kept in registers.
+The first translation performed in \vericert{} is from 3AC to a new hardware translation language (HTL), which is one step towards being completely translated to hardware described in Verilog.  The main translation that is performed is going from a CFG representation of the computation to a finite state machine with datapath (FSMD)~\cite{hwang99_fsmd} representation in HTL.\@  The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers, so that the state transitions can be translated into a simple finite state machine (FSM) and each state then contains data operations that update the memory and registers.  Figure~\ref{fig:accumulator_diagram} shows the resulting architecture of the FSMD.  Hardware does not have the same memory model as C, so the memory model needs to be translated in the following way.  Global variables are not translated in \vericert{} at the moment; however, the stack of the main function will become the RAM seen in Figure~\ref{fig:accumulator_diagram}.  Variables that have their address taken will therefore be stored in the RAM, as well as any arrays or structs defined in the function.  Variables that do not have their address taken will be kept in registers.
 
 \begin{figure*}
   \centering
@@ -226,7 +232,9 @@ However, the memory model that \compcert{} uses for its intermediate languages~\
 
 % Mention that this optimisation is not performed sometimes (clang -03).
 
-\vericert{} performs some optimisations at the level of the instructions that are generated, so that the hardware performs the instructions as quickly as possible and so that the maximum frequency at which the hardware can run is increased.  One of the main constructs that cripple performance of the generated hardware is the instantiation of divider circuits in the hardware.  In the case of \vericert{}, it requires the result of the divide operation to be ready in the same clock cycle, meaning the divide circuit needs to be implemented fully combinationally.  This is inefficient in terms of hardware size, but also in terms of latency, because it means that the maximum frequency of the hardware needs to be reduced dramatically so that the divide circuit has enough time to finish. \JP{Multi-cycle paths might be something worth exploring in future work: fairly error-prone/dangerous for hand-written code, but might be interesting in generated code.}\YH{Definitely is on the list for next things to look into, will make divide so much more efficient.}
+\vericert{} performs some optimisations at the level of the instructions that are generated, so that the hardware performs the instructions as quickly as possible and so that the maximum frequency at which the hardware can run is increased.  One of the main constructs that cripple performance of the generated hardware is the instantiation of divider circuits in the hardware.  In the case of \vericert{}, it requires the result of the divide operation to be ready in the same clock cycle, meaning the divide circuit needs to be implemented fully combinationally.  This is inefficient in terms of hardware size, but also in terms of latency, because it means that the maximum frequency of the hardware needs to be reduced dramatically so that the divide circuit has enough time to finish.
+
+%\JP{Multi-cycle paths might be something worth exploring in future work: fairly error-prone/dangerous for hand-written code, but might be interesting in generated code.}\YH{Definitely is on the list for next things to look into, will make divide so much more efficient.}
 
 These small optimisations were found to be the most error prone, and guaranteeing that the new representation is equivalent to representation used in the \compcert{} semantics is difficult without proving this for all possible inputs.