From 2a716db2fb59f448590b189f7b953a80d02bc5ee Mon Sep 17 00:00:00 2001
From: Yann Herklotz
Date: Fri, 15 Apr 2022 11:46:10 +0100
Subject: Add changes

---
 chapters/background.tex | 306 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 304 insertions(+), 2 deletions(-)

(limited to 'chapters/background.tex')

diff --git a/chapters/background.tex b/chapters/background.tex
index 4636918..81ca653 100644
--- a/chapters/background.tex
+++ b/chapters/background.tex
@@ -1,7 +1,309 @@
\environment fonts_env
\environment lsr_env

\startcomponent background

\chapter[sec:background]{Background}

%\begin{table}
%  \centering
%  \begin{tabular}{lccccc}\toprule
%    \textbf{Tool} & \textbf{Implementation} & \textbf{Mainstream} & \textbf{Proof} & \textbf{Mechanised} & \textbf{HLS}\\
%    \midrule
%    Catapult C~\cite{mentor20_catap_high_level_synth} & \cmark{} & \cmark{} & \cmark{} & \xmark{} & \cmark{}\\
%    Chapman \textit{et al.}~\cite{chapman92_verif_bedroc} & \cmark{} & \xmark{} & \cmark{} & \cmark{} & \cmark{}\\
%    Clarke \textit{et al.}~\cite{clarke03_behav_c_veril} & \cmark{} & \cmark{} & \cmark{} & \xmark{} & \cmark{}\\
%    Ellis~\cite{ellis08} & \xmark{} & \xmark{} & \cmark{} & \cmark{} & \cmark{}\\
%    Karfa \textit{et al.}~\cite{karfa06_formal_verif_method_sched_high_synth} & \cmark{} & \xmark{} & \cmark{} & \xmark{} & \cmark{}\\
%    Kundu \textit{et al.}~\cite{kundu08_valid_high_level_synth} & \cmark{} & \xmark{} & \cmark{} & \xmark{} & \cmark{}\\
%    LegUp~\cite{canis13_legup} & \cmark{} & \cmark{} & \xmark{} & \xmark{} & \cmark{}\\
%    Lööw \textit{et al.}~\cite{loow19_verif_compil_verif_proces} & \cmark{} & \xmark{} & \cmark{} & \cmark{} & \xmark{}\\
%    OpenCL SDK~\cite{intel20_sdk_openc_applic} & \cmark{} & \xmark{} & \xmark{} & \xmark{} & \cmark{}\\
%    Perna \textit{et al.}~\cite{perna12_mechan_wire_wise_verif_handel_c_synth} & \cmark{} & \xmark{} & \cmark{} & \cmark{} & \xmark{}\\
%    Vivado HLS~\cite{xilinx20_vivad_high_synth} &
%    \cmark{} & \cmark{} & \xmark{} & \xmark{} & \cmark{}\\
%    \bottomrule
%  \end{tabular}
%  \caption{Comparison of the related works discussed in this chapter.}\label{tab:existing_tools}
%\end{table}

This chapter describes the relevant literature on high-level synthesis, verified high-level
synthesis and, finally, mechanised proofs of HLS.  Table~\ref{tab:existing_tools} classifies
existing tools in the area of formal verification and high-level synthesis.  The following sections
cover the main categories into which the tools in Table~\ref{tab:existing_tools} fall: tools that
provide an \emph{implementation}, tools that provide a \emph{proof}, and finally tools that are
\emph{mechanically verified}.

\section{Implementation: High-level Synthesis}

High-level synthesis (HLS) is the transformation of software directly into hardware.  There are
many different types of HLS, which can vary greatly in the languages they accept; however, they
often share similar steps in how the translation is performed, as they all go from a higher-level,
behavioural description of an algorithm to a timed hardware description.  The main steps performed
in the translation are the following~\cite{coussy09_introd_to_high_level_synth,canis13_legup}:

\desc{Compilation of specification} First, the input specification has to be compiled into a
representation on which all the necessary optimisations can be applied.  This can be an
intermediate language such as the low-level virtual machine intermediate representation (LLVM IR),
or a custom representation of the code.

\desc{Hardware resource allocation} It is possible to select a target device, which gives
information about how many LUTs are available to implement the logic.  However, tools often assume
instead that an unlimited amount of resources is available.  This is because resource sharing can
be too expensive for adders, as the logic for sharing an adder has approximately the same cost as
adding a separate adder.
However, multiply and divide blocks should be shared, especially as
divide blocks are often implemented in LUTs and are therefore quite expensive.

\desc{Operation scheduling} The operations then need to be scheduled into specific clock cycles, as
the input specification language is normally an untimed behavioural representation.  This is where
the spatial parallelism of the hardware can be taken advantage of, meaning operations that do not
depend on each other can run in parallel.  There are various possible scheduling methods that could
be used, such as static or dynamic scheduling, which are described further in
sections~\ref{sec:static_scheduling} and~\ref{sec:dynamic_scheduling}.

\desc{Operation and variable binding} After the operations have been scheduled, the operations and
the variables need to be bound to hardware units and registers respectively.  It is often not
practical to share many hardware units, as this requires adding multiplexers, which are often the
same size as the hardware units themselves.  It is therefore more practical to only share
operations that take up a lot of space, such as modulo or divide circuits, as well as multipliers
if none are available on the FPGA any more.

\desc{Hardware description generation} Finally, the hardware description is generated from the code
that was described in the intermediate language.  This is often a direct translation of the
instructions into equivalent hardware constructs.

There are many examples of existing high-level synthesis tools, the most popular ones being
LegUp~\cite{canis13_legup}, Vivado HLS~\cite{xilinx20_vivad_high_synth}, Catapult
C~\cite{mentor20_catap_high_level_synth} and Intel's OpenCL SDK~\cite{intel20_sdk_openc_applic}.
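As a concrete illustration of the operation-scheduling step described above, the following sketch
performs ASAP scheduling over a small data-flow graph, placing each operation in the earliest clock
cycle in which all of its dependencies have completed.  The operation names and latencies are
invented for illustration and do not correspond to any particular tool.

```python
def asap_schedule(deps, latency):
    """Assign each operation the earliest start cycle in which all of its
    predecessors have finished.  `deps` maps an operation to the list of
    operations it depends on; `latency` maps an operation to the number of
    clock cycles it occupies."""
    start = {}

    def cycle_of(op):
        if op not in start:
            # An op can start once every predecessor has produced its result.
            start[op] = max((cycle_of(d) + latency[d] for d in deps[op]),
                            default=0)
        return start[op]

    for op in deps:
        cycle_of(op)
    return start

# a * b followed by a dependent addition, plus an independent addition
# that the scheduler can place in parallel in cycle 0.
deps = {"mul": [], "add1": ["mul"], "add2": []}
latency = {"mul": 2, "add1": 1, "add2": 1}
print(asap_schedule(deps, latency))
# {'mul': 0, 'add1': 2, 'add2': 0}
```

ALAP scheduling can be obtained symmetrically by working backwards from the last cycle, and list
scheduling additionally caps how many operations share a cycle based on available resources.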
These HLS tools all accept general programming languages such as C or OpenCL; however, some HLS
tools take in languages that were designed for hardware, such as
Handel-C~\cite{aubury96_handel_c_languag_refer_guide}, where time is encoded as part of the
semantics.

\subsection[sec:static_scheduling]{Static Scheduling}

Static scheduling is used by the majority of synthesis
tools~\cite{canis13_legup,xilinx20_vivad_high_synth,mentor20_catap_high_level_synth,intel20_sdk_openc_applic}
and means that the time at which each operation will execute is known at compile time.  Static
analysis is used to gather the data dependencies between all the operations to determine in which
clock cycle each operation should be placed, or whether operations can be parallelised.  Once the
data-flow analysis has been performed, various scheduling schemes can be used to place operations
into clock cycles.  Some common static-scheduling schemes are the following:

\desc{As soon as possible (ASAP)} scheduling places operations into the first possible clock cycle
in which they can be executed.

\desc{As late as possible (ALAP)} scheduling places operations into the last possible clock cycle,
right before their results are used by later operations.

\desc{List scheduling} uses priorities associated with each operation and chooses operations from
each priority level to be placed into the current clock cycle if all their data dependencies are
met.  This is done until all the resources for the current clock cycle are used up.

Static scheduling can normally produce extremely small circuits; however, it is often not possible
to guarantee that the circuits will have the best
throughput~\cite{cheng20_combin_dynam_static_sched_high_level_synth}, as this requires extensive
control-flow analysis and complex optimisations.  Especially for loops, finding the optimal
Especially for loops, finding the optimal +initiation interval (II) can be tricky if there are loads and stores in the loop or if the loops are +nested. + +\subsection[sec:dynamic_scheduling]{Dynamic Scheduling} + +On the other hand, Dynamic scheduling~\cite{josipovic18_dynam_sched_high_level_synth} does not +require the schedule to be known at compile time and it instead generates circuits using tokens to +schedule the operations in parallel at run time. Whenever the data for an operation is available, +it sends a token to the next operation, signalling that the data is ready to be read. The next +operation does not start until all the required inputs to the operation are available, and once that +is the case, it computes the result and then sends another token declaring that the result of that +operation is also ready. The benefit of this approach is that only basic data-flow analysis is +needed to connect the tokens correctly, however, the scheduling is then done dynamically at run +time, depending on how long each primitive takes to finish and when the tokens activate the next +operations. + +The benefits of this approach over static scheduling is that the latency of these circuits is +normally significantly lower than the latency of static scheduled circuits, because they can take +advantage of runtime information of the circuit. However, because of the signalling required to +perform the runtime scheduling, the area of these circuits is usually much larger than the area of +static scheduled circuits. In addition to that, much more analysis is needed to properly +parallelise loads and stores to prevent bugs, which requires the addition of buffers in certain +locations. 
An example of a dynamically scheduled synthesis tool is
Dynamatic~\cite{josipovic18_dynam_sched_high_level_synth}, which uses a load-store queue
(LSQ)~\cite{josipovic17_out_of_order_load_store} to order memory operations correctly even when
loops are pipelined and there are dependencies between iterations.  In addition, the performance of
the dynamically scheduled circuit is improved by careful buffer
placement~\cite{josipovic20_buffer_placem_sizin_high_perfor_dataf_circuit}, which allows for better
parallelisation and pipelining of loops.

\section{Proof: Verified HLS}

Work is being done to prove the equivalence between the generated hardware and the original
behavioural description in C.  An example of a tool that implements this is Mentor's
Catapult~\cite{mentor20_catap_high_level_synth}, which tries to match the states in the register
transfer level (RTL) description to states in the original C code after an unverified translation.
This technique is called translation validation~\cite{pnueli98_trans}, whereby the translation that
the HLS tool performed is proven to have been correct for that particular run, by showing that the
input and the output behave in the same way for all possible inputs.  Translation validation is
quite effective for proving complex optimisations such as
scheduling~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth}
or code
motion~\cite{banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops}
correct; however, the validation has to be run every time the high-level synthesis is performed.
In addition, the proofs are often neither mechanised nor directly related to the actual
implementation, meaning the verifying algorithm itself might be wrong and could give false
positives or false negatives.
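The shape of translation validation can be conveyed by a toy checker that compares the behaviour of
a program before and after an unverified transformation.  Real validators establish equivalence
symbolically; this sketch merely tests on random inputs, and all function names are invented.

```python
import random

random.seed(0)  # deterministic runs for this illustration

def validate(original, transformed, n_inputs, trials=100):
    """Search for an input on which the two programs disagree.  A True
    result only certifies this particular translation, which is why
    translation validation must be re-run on every synthesis run."""
    for _ in range(trials):
        args = [random.randint(-1000, 1000) for _ in range(n_inputs)]
        if original(*args) != transformed(*args):
            return False         # counterexample found: reject the output
    return True                  # no counterexample found

source = lambda a, b: (a + b) * 2
good   = lambda a, b: 2 * a + 2 * b   # a correct rewrite of `source`
bad    = lambda a, b: a + 2 * b       # a miscompiled version

print(validate(source, good, 2))  # True
print(validate(source, bad, 2))   # False
```

Note that a True result from such a checker is only a failure to find a counterexample; an
incorrect verification algorithm can therefore accept a wrong translation, which is exactly the
false-positive risk mentioned above.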
More examples of translation validation for HLS
algorithms~\cite{karfa06_formal_verif_method_sched_high_synth,karfa07_hand_verif_high_synth,kundu07_autom,karfa08_equiv_check_method_sched_verif,kundu08_valid_high_level_synth,karfa10_verif_datap_contr_gener_phase,karfa12_formal_verif_code_motion_techn,chouksey19_trans_valid_code_motion_trans_invol_loops,chouksey20_verif_sched_condit_behav_high_level_synth}
are performed using an HLS tool called SPARK~\cite{gupta03_spark}.  These translation validation
algorithms can check the correctness of complicated optimisations such as code motion or loop
inversions.  However, even though the correctness of the verifier is proven in the papers, the
proof does not translate directly to the algorithm that was implemented for the verifier.  It is
therefore possible that an output is accepted even though it is not equivalent to the input.  In
addition, these papers reason about the correctness of the algorithms in terms of the intermediate
language of SPARK; the reasoning extends neither to the high-level input language that SPARK takes
in, nor to the hardware description language that SPARK outputs.

Finally, there has also been work on proving HLS correct without using translation validation, by
directly showing that the translation is correct.  The first instance of this is a proof that the
BEDROC~\cite{chapman92_verif_bedroc} HLS tool is correct.  This HLS tool converts a high-level
description of an algorithm, supporting loops and conditional statements, into a netlist, and the
output is proven correct.  It works in two stages: first, a DFG is generated from the input
language, HardwarePal; the DFG is then optimised to improve the routing of the design and the
scheduling of operations.  Finally, the netlist is generated from the DFG by placing all the
operations that do not depend on each other into the same clock cycle.
Datapath and register allocation is performed by an unverified clique partitioning algorithm.  The
equivalence proof between the DFG and HardwarePal is a proof by simulation, where it is proven
that, given a valid input configuration, applying a translation or optimisation rule results in a
valid DFG with the same behaviour as the input.

There has also been work on proving the translation from Occam to gates~\cite{page91_compil_occam}
correct using algebraic proofs~\cite{jifeng93_towar}.  This translation resembles dynamic
scheduling, as tokens are used to start the next operations.  However, Occam is timed by design and
executes assignments sequentially.  To take advantage of the parallel nature of hardware, Occam
uses explicit parallel constructs with channels to share state.  Handel-C, a version of Occam with
many more features, such as memory and pointers, has also been used to prove HLS correct, and is
described further in Section~\ref{sec:wire-to-wire}.

\section{Mechanised: Compiler Proofs}

Even though a correctness proof for an algorithm might exist, this does not guarantee that the
implementation of the algorithm behaves in the same way as the algorithm assumed in the proof.  The
implementation is separate from the algorithm that was proven correct, meaning it could contain
various implementation bugs that make it behave incorrectly.  C compilers are a good example of
this: many optimisations performed by the compilers have been proven correct, yet these proofs are
not linked directly to the actual implementations of the optimisations in GCC or Clang.  Yang et
al.~\cite{yang11_findin_under_bugs_c_compil} found more than 300 bugs in GCC and Clang, many of
them in the optimisation phases of the compilers.
One way to link the proofs to the actual implementations in these compilers is to write the
compiler in a language that allows a theorem prover to check properties about the algorithms.  The
proofs are then directly tied to the algorithms, ensuring that the actual implementations are
proven correct.  Notably, Yang et al.~\cite{yang11_findin_under_bugs_c_compil} found that CompCert,
a formally verified C compiler, only had five bugs, all in the unverified parts of the compiler,
suggesting that this method of proving algorithms correct is effective for building a correct
compiler.

\subsection{CompCert}

CompCert~\cite{leroy09_formal_verif_compil_back_end} is a formally verified C compiler written in
Coq~\cite{bertot04_inter_theor_provin_progr_devel}.  Coq is a theorem prover, meaning algorithms
written in Coq can be reasoned about in Coq itself by proving various properties about them.  To
run these algorithms that have been proven correct, they can be extracted directly to OCaml code
and then executed, as there is a straightforward correspondence between Coq and OCaml code.  During
this extraction, all the proofs are erased, as they are not needed during the execution of the
compiler; they are only needed when the correctness of the compiler is being checked.  With this
process, one obtains a compiler that satisfies various correctness properties and is therefore
proven to preserve the behaviour of the code.

CompCert contains eleven intermediate languages, which are used to gradually translate C code into
assembly with the same behaviour.  Proving the translation directly, without going through the
intermediate languages, would be infeasible, especially with the many optimisations that are
performed during the translation, as assembly is very different to the abstract C code the compiler
receives.
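The structure of such a behaviour-preservation argument can be conveyed by a toy analogue:
compiling arithmetic expressions to a stack machine and checking that evaluation is preserved.  In
CompCert this preservation is a machine-checked theorem covering every intermediate language; in
this sketch it is merely asserted on an example, and all names are invented.

```python
def compile_expr(e):
    """Compile an expression, given as nested tuples ("const", n) or
    ("add", e1, e2), to stack-machine code."""
    if e[0] == "const":
        return [("push", e[1])]
    return compile_expr(e[1]) + compile_expr(e[2]) + [("add",)]

def eval_expr(e):
    """Reference semantics of the source language."""
    return e[1] if e[0] == "const" else eval_expr(e[1]) + eval_expr(e[2])

def run(code):
    """Semantics of the target stack machine."""
    stack = []
    for instr in code:
        if instr[0] == "push":
            stack.append(instr[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack[-1]

# The correctness statement, checked on one input rather than proven:
# running the compiled code yields the value of the source expression.
e = ("add", ("const", 2), ("add", ("const", 3), ("const", 4)))
assert run(compile_expr(e)) == eval_expr(e) == 9
```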
The first three intermediate languages (C\#minor, Cminor, CminorSel) are used to transform Clight
into a more assembly-like language called register transfer language (RTL).  This language consists
of a control-flow graph of instructions, and is therefore well suited to various compiler
optimisations such as constant propagation, dead-code elimination or function
inlining~\cite{tristan08_formal_verif_trans_valid}.  After RTL, each intermediate language gets
closer to the assembly language of the architecture, performing operations such as register
allocation and making sure that the stack variables are correctly aligned.

\subsection{Isabelle HLS}

Martin Ellis' work on correct synthesis~\cite{ellis08} is the first example of a mechanically
verified high-level synthesis tool.  Such tools are also called hardware/software compilers, as
they produce software as well as hardware for special functional units that should be hardware
accelerated.  The main goal of the thesis is to provide a framework for proving hardware/software
compilers correct, by defining semantics for an intermediate language (IR) which supports
partitioning code into hardware and software parts, as well as a custom netlist format which
describes the hardware parts and is the final target of the hardware/software compiler.  The proof
of correctness then says that the hardware/software design should have the same behaviour as if the
design had been implemented in software only.  The framework used to prove the correctness of the
compilation from the IR to the netlist format is written in Isabelle, a theorem prover comparable
to Coq; any proofs in the framework are therefore automatically checked by Isabelle for
correctness.

This work first defines a static single assignment (SSA) IR, with the ability to mark sections that
should be hardware accelerated and sections that should be executed on the CPU.
This language supports \emph{hyperblocks}, which are lists of basic blocks with one entry node and
multiple exit nodes, and are well suited to various HLS optimisations and better scheduling.
However, as this is quite a low-level intermediate language, the first thing that would be required
is a tool that can actually translate a high-level language to this intermediate language.  In
addition, a netlist format is described in which the hardware-accelerated code is represented.

As both the input IR and the output netlist format have been designed from scratch, they are not
very useful for real-world applications, as they would require a new back end to be implemented in
existing compilers.  In addition, this would most likely mean that the translation from a
higher-level language to the IR is unverified and could therefore contain bugs.  As the resulting
netlist also uses a custom format, it cannot be fed directly to tools that translate it into a
bitstream to place onto an FPGA.  The reason for designing a custom IR and netlist format was to
make the two compatible with each other, making proofs of equivalence between them simpler.

Finally, it is unclear whether a translation algorithm from the IR to the netlist format was
actually implemented, as the only example in the thesis seems to have been compiled by hand to
explain the proof.  There are also no benchmarks on real input programs showing the efficiency of
the translation algorithm, so it is unclear whether the framework could prove correct the more
complicated optimisations that a compiler might perform on the source code.  The thesis seems to
describe the correctness proofs by assuming that a compiler exists which outputs various properties
needed by the equivalence proof, such as a mapping between variables and netlist wires and
registers.
Even though this Isabelle framework provides an equivalence relation between an IR and a netlist,
and describes how the translation would be proven correct by matching states to one another, it is
not usable in practice.  First of all, the IR is only represented in Isabelle, and there is no
parser or compiler that can target it.  In addition, the netlist format cannot be passed to any
existing tool to then be placed onto an FPGA, meaning an additional, unverified translation would
have to take place.  This further reduces the effectiveness of the correctness proof, as there are
various stages that are not addressed by the proofs and therefore have to be assumed correct.

\subsection[sec:wire-to-wire]{Mechanised Handel-C to Netlist Translation}

Handel-C~\cite{bowen98_handel_c_languag_refer_manual} is a C-like language for hardware
development.  It supports many C features such as assignments, if-statements, loops, pointers and
functions.  In addition to these constructs, Handel-C also supports explicit timing constructs,
such as explicit parallelism, as well as sequential or parallel assignment, similar to blocking and
nonblocking assignments in Verilog.  Perna et
al.~\cite{perna12_mechan_wire_wise_verif_handel_c_synth} developed a mechanically verified Handel-C
to netlist translation written in HOL.  The translation is based on previous work describing a
translation from Occam to gates by Page et al.~\cite{page91_compil_occam}, which was proven correct
by Jifeng et al.~\cite{jifeng93_towar} using algebraic proofs.  As Handel-C is mostly an extension
of Occam with C-like operations, the translation from Handel-C to gates can proceed in a similar
way.

Perna et al. verify the compilation of a subset of Handel-C to gates, which does not include
memory, arrays or function calls.
In addition to the constructs presented by Page et al., the prioritised choice construct is also
added to the supported Handel-C subset.  The verification proceeds by first defining the algorithm
that performs the compilation, chaining operations together with start and end signals that
determine the next construct to be executed.  The circuits themselves are treated as black boxes by
the compilation algorithm and are chosen based on the current Handel-C statement being translated.
The compilation algorithm has to correctly link the black boxes to one another using the start and
end signals that the circuits expose.

The verification of the compilation is done by proving that the control signal is propagated
through the circuits correctly, and that each circuit is activated at the right time.  It is also
proven that the internal circuits of the constructs propagate the control signal in the correct way
and are active at the right clock cycle.  However, it is not proven that the circuits have the same
functionality as the Handel-C constructs, only that the control signals are propagated in the
correct manner.  In addition, the assumption is made that the semantics for time are the same in
Handel-C and in the generated netlist format, which could be proven if the constructs were shown to
behave in exactly the same way as the Handel-C constructs.

\stopcomponent
-- cgit