\environment fonts_env
\environment lsr_env

\startcomponent background

\chapter[sec:background]{Background}

\startsynopsis
This chapter describes the current state-of-the-art optimisations implemented in high-level synthesis tools, focusing in particular on static scheduling and loop pipelining, and finally describing dynamic scheduling, which is becoming a popular alternative.
\stopsynopsis

\section{Verification}

There are different kinds of verification tools, which can mostly be placed into two categories: automatic theorem provers, described in \in{Section}[sec:background:automatic], and interactive theorem provers, described in \in{Section}[sec:background:interactive].

\subsection[sec:background:automatic]{Automatic theorem provers}

Automatic theorem provers such as \SAT\ or \SMT\ solvers can be characterised as decision procedures that answer, for example, whether a formula is satisfiable. However, by default these tools only give the answer to the initial query, without showing the reasoning behind it. The reasoning is often quite complex, as the \SAT\ or \SMT\ tool implements many optimisations to improve the performance of the decision procedure.

The main advantage of using an automatic theorem prover is that, as long as one is working within its constrained decidable theory, it will be efficient at proving or disproving that a formula is a theorem. However, if the theorem requires inductive arguments, then the theorem prover might need some manual help from the user, who has to add the right lemmas to the collection of facts that the automatic procedure can use. The proof itself will still be automatic, which means that many of the tedious cases in the proof can be ignored. This is also the main disadvantage of automatic theorem provers, though: they do not provide details about the proof itself and often cannot communicate why they fail to prove a theorem. This means that, as a user, one has to guess which facts the prover is missing and try to add these to the fact database.

Finally, automatic theorem provers do not provide reasoning for their final answer by default, meaning one cannot check whether the result is actually correct. However, some \SMT\ solvers support the generation of proof witnesses, which can then be checked and reused in other theorem provers. One example is veriT~\cite[bouton09], whose proof witnesses can be integrated into Coq using SMTCoq~\cite[armand11_modul_integ_sat_smt_solver].

\subsection[sec:background:interactive]{Interactive theorem provers}

Interactive theorem provers, on the other hand, focus on checking proofs that are provided to them. These can either be written manually by the user or generated automatically by some decision procedure. The two ways of producing proofs can also be combined, so that the general proof structure is written manually and simpler steps inside the proof are solved automatically.

The main benefit of using an interactive theorem prover is that the whole proof is available and can be checked by a small, trusted kernel. This kernel does not need to be heavily optimised and can therefore itself be reasoned about. The main cost of using an interactive theorem prover is the time it takes to prove theorems, and the amount of formalisation that is needed to make the proofs go through. For a proof to be complete, it cannot rely on any unproven axioms, meaning even the smallest details must be proven.
\section[sec:background:hls]{High-level Synthesis}

\HLS\ is the transformation of software directly into hardware. There are many different types of \HLS\ tools, which vary greatly in which languages they accept; however, they often share similar steps in how the translation is performed, as they all go from a high-level, behavioural description of an algorithm to a timed hardware description. The main steps performed in the translation are the following~\cite[coussy09_introd_to_high_level_synth,canis13_legup]:

\desc{Compilation of specification} First, the input specification has to be compiled into a representation that can be used to apply all the necessary optimisations. This can be an intermediate language such as the low-level virtual machine intermediate representation (LLVM IR), or a custom representation of the code.

\desc{Hardware resource allocation} It is possible to select a device to target, which gives information about how many resources are available to implement the logic. However, tools often assume that an unlimited amount of resources is available instead, because sharing resources between operations is not always worthwhile: for adders, the multiplexing logic needed to share a single adder has approximately the same cost as adding a separate adder. Multipliers and divider blocks, however, should be shared, especially as divider blocks are often implemented in logic and are therefore quite expensive.

\desc{Operation scheduling} The operations then need to be scheduled into specific clock cycles, as the input specification language is normally an untimed behavioural representation. This is where the spatial parallelism of the hardware can be taken advantage of, meaning operations that do not depend on each other can run in parallel. There are various possible scheduling methods that could be used, such as static or dynamic scheduling, which are described further in \in{Section}[sec:static_scheduling] and \in{Section}[sec:dynamic_scheduling].

\desc{Operation and variable binding} After the operations have been scheduled, the operations and the variables need to be bound to hardware units and registers respectively. It is often not that practical to share many hardware units, as this requires adding multiplexers, which are often the same size as the hardware units themselves. It is therefore more practical to only share operations that take up a lot of space, such as modulo or divide circuits, as well as multipliers if the dedicated multiplier blocks on the FPGA have all been used up.

\desc{Hardware description generation} Finally, the hardware description is generated from the code that was described in the intermediate language. This is often a direct translation of the instructions into equivalent hardware constructs.

There are many examples of existing high-level synthesis tools, the most popular ones being LegUp~\cite[canis13_legup], Vivado HLS~\cite[xilinx20_vivad_high_synth], Catapult C~\cite[mentor20_catap_high_level_synth] and Intel's OpenCL SDK~\cite[intel20_sdk_openc_applic]. These HLS tools all accept general programming languages such as C or OpenCL; however, some HLS tools instead take in languages that were designed for hardware, such as Handel-C~\cite[aubury96_handel_c_languag_refer_guide], where time is encoded as part of the semantics.

\subsection[sec:language-blocks]{Intermediate Language}

This section describes some possible characteristics of a language that could be used as an input to an \HLS\ tool.
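To make the following discussion more concrete, the listing below sketches a small C function together with one possible lowering into a three-address intermediate representation split into basic blocks. The function name and the IR syntax are purely illustrative, loosely modelled on LLVM IR, and are not the output of any particular tool.

\starttyping
/* A small, untimed behavioural description that an HLS tool might accept. */
int f(int a, int b, int c) {
  int x = a * b;
  if (x > c)
    x = x + c;
  return x;
}

/* One possible lowering into a basic-block IR (illustrative only).  Each
   block is a straight list of instructions with no internal control-flow;
   branches only appear at the end of a block.

   bb0:  x0 = mul a, b
         t0 = cmpgt x0, c
         br t0, bb1, bb2
   bb1:  x1 = add x0, c
         br bb2
   bb2:  x2 = phi [x0, bb0], [x1, bb1]
         ret x2                                                           */
\stoptyping

The representations discussed below differ mainly in how much control-flow such a block is allowed to contain.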
These languages normally require some structure to make optimisation and analysis passes easier. In particular, it is often useful to have contiguous blocks of instructions that do not contain any control-flow gathered in one list. These instructions can then safely be rearranged by only looking at local information about the block itself, and control-flow can be ignored completely, as only the data-flow matters within the block.

\subsubsection{Basic blocks}

%TODO: Finish the basic blocks section

Basic blocks are the simplest form of structure: they are straight-line lists of instructions that do not include any control-flow, with a single entry point at the top and a single exit at the bottom.

\subsubsection{Superblocks}

Superblocks extend the notion of basic blocks to contiguous regions that control-flow can only enter at the top, but which may have multiple exits out of the block. The main benefit of this definition is that, due to the extra flexibility of allowing multiple exits, the blocks can be made larger, which often improves the optimisations that make use of basic blocks. The downside is that the representation of the blocks becomes more complex due to the introduction of the extra control-flow. Any analysis passes have to take into account the possibility of control-flow being present, and many simplifications are not possible anymore.

\subsubsection{Hyperblocks}

Hyperblocks are also an extension of basic blocks, similar to superblocks, but instead of introducing special control-flow instructions into the block, every instruction is predicated. This can lead to more complex control-flow than in both of the previous cases; however, it can be reasoned about using a \SAT\ or \SMT\ solver.

\subsection[sec:static_scheduling]{Static Scheduling}

Static scheduling is used by the majority of synthesis tools~\cite[canis13_legup,xilinx20_vivad_high_synth,mentor20_catap_high_level_synth,intel20_sdk_openc_applic] and means that the time at which each operation will execute is known at compile time. Static analysis is used to gather the data dependencies between all the operations, to determine in which clock cycle each operation should be placed and whether operations can be parallelised. Once the data-flow analysis has been performed, various scheduling schemes can be used to place operations into clock cycles. Some common static-scheduling schemes are the following (a small worked example is sketched at the end of this subsection):

\desc{As soon as possible (ASAP)} scheduling places operations into the first clock cycle in which they can be executed.

\desc{As late as possible (ALAP)} scheduling places operations into the last possible clock cycle, right before their results are needed by later operations.

\desc{List scheduling} associates a priority with each operation and, going down the priority levels, places operations into the current clock cycle if all their data dependencies are met. This is done until all the resources for the current clock cycle are used up, at which point the scheduler moves on to the next cycle.

Static scheduling can normally produce extremely small circuits; however, it is often not possible to guarantee that the circuits will have the best throughput~\cite[cheng20_combin_dynam_static_sched_high_level_synth], as this requires extensive control-flow analysis and complex optimisations. Especially for loops, finding the optimal \II\ can be tricky if there are loads and stores in the loop or if the loops are nested.
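As a small worked example of these scheduling schemes, consider the multiply-accumulate code below. The function name and the cycle annotations are a sketch only, assuming that every operation takes a single clock cycle and, for the ASAP schedule, that unlimited functional units are available.

\starttyping
/* Untimed input: four independent multiplications followed by a
   reduction tree of additions. */
int mac4(int a0, int b0, int a1, int b1,
         int a2, int b2, int a3, int b3) {
  int p0 = a0 * b0;  /* ASAP: cycle 1 (no dependencies)   */
  int p1 = a1 * b1;  /* ASAP: cycle 1                     */
  int p2 = a2 * b2;  /* ASAP: cycle 1                     */
  int p3 = a3 * b3;  /* ASAP: cycle 1                     */
  int s0 = p0 + p1;  /* ASAP: cycle 2 (needs p0 and p1)   */
  int s1 = p2 + p3;  /* ASAP: cycle 2 (needs p2 and p3)   */
  return s0 + s1;    /* ASAP: cycle 3 (needs s0 and s1)   */
}
/* ALAP would give the same schedule here, as every operation lies on a
   critical path.  A list scheduler constrained to two multipliers would
   instead place p0 and p1 in cycle 1, p2 and p3 in cycle 2, s0 in cycle 2
   or 3, s1 in cycle 3, and the final addition in cycle 4. */
\stoptyping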
\subsection[sec:dynamic_scheduling]{Dynamic Scheduling}

Dynamic scheduling~\cite[josipovic18_dynam_sched_high_level_synth], on the other hand, does not require the schedule to be known at compile time; instead, it generates circuits that use tokens to schedule the operations in parallel at run time. Whenever the data for an operation is available, it sends a token to the next operation, signalling that the data is ready to be read. The next operation does not start until all its required inputs are available, and once that is the case, it computes the result and sends another token declaring that its result is also ready. The benefit of this approach is that only basic data-flow analysis is needed to connect the tokens correctly; the scheduling is then performed dynamically at run time, depending on how long each primitive takes to finish and on when the tokens activate the next operations.

The advantage of this approach over static scheduling is that the latency of these circuits is normally significantly lower than that of statically scheduled circuits, because they can take advantage of run-time information. However, because of the signalling required to perform the run-time scheduling, the area of these circuits is usually much larger than that of statically scheduled circuits. In addition to that, much more analysis is needed to properly parallelise loads and stores without introducing bugs, which requires the addition of buffers in certain locations. An example of a dynamically scheduled synthesis tool is Dynamatic~\cite[josipovic18_dynam_sched_high_level_synth], which uses a load-store queue (LSQ)~\cite[josipovic17_out_of_order_load_store] to order memory operations correctly even when loops are pipelined and there are dependencies between iterations. In addition to that, the performance of the dynamically scheduled circuits is improved by careful buffer placement~\cite[josipovic20_buffer_placem_sizin_high_perfor_dataf_circuit], which allows for better parallelisation and pipelining of loops.

\section{Proof: Verified HLS}

Work is being done to prove the equivalence between the generated hardware and the original behavioural description in C. An example of a tool that implements this is Mentor's Catapult~\cite[mentor20_catap_high_level_synth], which tries to match the states in the register transfer level (RTL) description to states in the original C code after an unverified translation. This technique is called translation validation~\cite[pnueli98_trans], whereby the translation that the HLS tool performed is proven to have been correct for that particular input, by showing that the input program and the generated design behave in the same way for all possible inputs. Translation validation is quite effective for proving complex optimisations such as scheduling~\cite[kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth] or code motion~\cite[banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops] correct; however, the validation has to be run every time the high-level synthesis is performed. In addition to that, the proofs are often not mechanised or directly related to the actual implementation, meaning the verifying algorithm itself might be wrong and could give false positives or false negatives.
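To make the idea behind translation validation concrete, the sketch below checks that a source function and a stand-in for the design generated from it agree on every input, accepting or rejecting this one particular translation. The function names are hypothetical and the check is a deliberately naive illustration that simply enumerates a small input space; the validators cited above instead establish equivalence symbolically, for example through simulation relations or \SMT\ solvers.

\starttyping
#include <stdint.h>
#include <stdio.h>

/* Source behaviour: the untimed description given to the HLS tool. */
static uint8_t source(uint8_t a, uint8_t b) {
  return (uint8_t)((a + b) * 2);
}

/* Stand-in for the behaviour of the generated design, here with the
   multiplication strength-reduced to a shift by the "tool".          */
static uint8_t generated(uint8_t a, uint8_t b) {
  return (uint8_t)((a + b) << 1);
}

/* Translation validation for this particular run of the tool: accept
   the output only if it matches the source on all possible inputs.   */
int main(void) {
  for (unsigned a = 0; a < 256; a++)
    for (unsigned b = 0; b < 256; b++)
      if (source((uint8_t)a, (uint8_t)b) != generated((uint8_t)a, (uint8_t)b)) {
        printf("mismatch at a=%u, b=%u\n", a, b);
        return 1; /* reject this translation */
      }
  printf("behaviours match for all inputs\n");
  return 0;
}
\stoptyping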
Further examples of translation validation being used to verify HLS algorithms~\cite[karfa06_formal_verif_method_sched_high_synth,karfa07_hand_verif_high_synth,kundu07_autom,karfa08_equiv_check_method_sched_verif,kundu08_valid_high_level_synth,karfa10_verif_datap_contr_gener_phase,karfa12_formal_verif_code_motion_techn,chouksey19_trans_valid_code_motion_trans_invol_loops,chouksey20_verif_sched_condit_behav_high_level_synth] are based on an HLS tool called SPARK~\cite[gupta03_spark]. These translation validation algorithms can check the correctness of complicated optimisations such as code motion or loop inversion. However, even though the correctness of the verifier is proven in the papers, the proof does not translate directly to the algorithm that was implemented for the verifier. It is therefore possible that an output is accepted even though it is not equivalent to the input. In addition to that, these papers reason about the correctness of the algorithms in terms of the intermediate language of SPARK, and the reasoning does not extend to the high-level input language that SPARK takes in, or to the hardware description language that SPARK outputs.

Finally, there has also been work on proving HLS correct without using translation validation, by directly showing that the translation itself is correct. The first instance of this is a proof of correctness of the BEDROC~\cite[chapman92_verif_bedroc] HLS tool, which converts a high-level description of an algorithm, supporting loops and conditional statements, into a netlist. It works in two stages: first, a data-flow graph (DFG) is generated from the input language, HardwarePal, and optimised to improve the routing of the design and the scheduling of operations. The netlist is then generated from the DFG by placing all the operations that do not depend on each other into the same clock cycle. Datapath and register allocation is performed by an unverified clique partitioning algorithm. The equivalence proof between the DFG and HardwarePal is a proof by simulation, where it is proven that, given a valid input configuration, applying a translation or optimisation rule will result in a valid DFG with the same behaviour as the input.

There has also been work on proving the translation from Occam to gates~\cite[page91_compil_occam] correct using algebraic proofs~\cite[jifeng93_towar]. This translation resembles dynamic scheduling, as tokens are used to start the next operations. However, Occam is timed by design and executes assignments sequentially. To take advantage of the parallel nature of hardware, Occam uses explicit parallel constructs with channels to share state. Handel-C, a version of Occam with many more features such as memory and pointers, has also been the subject of a verified translation to netlists, which is described further in \in{Section}[sec:wire-to-wire].

\section{Mechanised: Compiler Proofs}

Even though a proof of the correctness of an algorithm might exist, this does not guarantee that the implemented algorithm behaves in the same way as the algorithm assumed in the proof. The algorithm described in the proof is separate from the actual implementation, meaning there could be various implementation bugs that cause the implementation to behave incorrectly. C compilers are a good example of this: many optimisations performed by the compilers have been proven correct, but these proofs are not linked directly to the actual implementations of these algorithms in GCC or Clang.
Yang et al.~\cite[yang11_findin_under_bugs_c_compil] found more than 300 bugs in GCC and Clang, many of them appearing in the optimisation phases of the compiler. One way to link the proofs to the actual implementations in these compilers is to write the compiler in a language that allows a theorem prover to check properties about the algorithms. This allows the proofs to be directly linked to the algorithms, ensuring that the actual implementations are proven correct. Yang et al.~\cite[yang11_findin_under_bugs_c_compil] found that CompCert, a formally verified C compiler, only had five bugs, located in the unverified parts of the compiler, which indicates that this method of proving algorithms correct is an effective way of obtaining a correct compiler.

\subsection{CompCert}

\index{CompCert}CompCert~\cite[leroy09_formal_verif_compil_back_end] is a formally verified C compiler written in Coq~\cite[bertot04_inter_theor_provin_progr_devel]. Coq is a theorem prover, meaning algorithms written in Coq can be reasoned about in Coq itself by proving various properties about them. To run the algorithms that have been proven correct, they can be extracted directly to OCaml code and then executed, as there is a straightforward correspondence between Coq and OCaml code. During this extraction, all the proofs are erased, as they are not needed during the execution of the compiler; they are only needed when the correctness of the compiler is being checked. With this process, one obtains a compiler that satisfies various correctness properties and can therefore be proven to preserve the behaviour of the code.

CompCert contains eleven intermediate languages, which are used to gradually translate C code into assembly that has the same behaviour. Proving the translation directly, without going through the intermediate languages, would be infeasible, especially with the many optimisations that are performed during the translation, as assembly is very different from the abstract C code the compiler receives. The first three intermediate languages (C\#minor, Cminor, CminorSel) are used to transform Clight into a more assembly-like language called register transfer language (RTL). This language consists of a control-flow graph of instructions and is therefore well suited to various compiler optimisations such as constant propagation, dead-code elimination or function inlining~\cite[tristan08_formal_verif_trans_valid]. After RTL, each intermediate language is used to get closer to the assembly language of the target architecture, performing operations such as register allocation and making sure that the stack variables are correctly aligned.

\subsubsection{CompCertSSA}

\index{CompCertSSA}CompCertSSA is an extension of CompCert with an additional \SSA\ intermediate language. This language enforces the \SSA\ properties and therefore allows for many convenient proofs about optimisations performed on this intermediate language, as many assumptions can be made about variables when they are encountered. The main interesting property that arises from using \SSA\ is the \emph{equational lemma}, stating that, given a variable that was assigned by an operation which does not depend on memory, loading the value of that variable is the same as recomputing the operation in the current context.
Given a well-formed \SSA\ program $p$ and a reachable state $\Sigma\ s\ f\ \sigma\ R\ M$, if a memory-independent operation was defined at a node $d$ as $\mono{Iop}\ \mathit{op}\ \vec{a}\ x\ n$, and $\sigma$ is dominated by $d$ ($d \le_{p} \sigma$), then the following equation holds:

\placeformula[eq:equational-lemma]\startformula \left(\mathit{op}, \vec{a}\right) \Downarrow (R, M) = \left\lfloor R[x] \right\rfloor \stopformula

This is an important lemma, as it essentially allows one to know the value of a register as long as one knows that its assignment dominates the current node and one knows which expression it was assigned. For example, if $x$ was assigned the result of $a + b$ at a node that dominates the current node, then reading $R[x]$ is guaranteed to give the same value as evaluating $a + b$ in the current register state.

\subsection{HLS Formalised in Isabelle}

Martin Ellis' work on correct synthesis~\cite[ellis08_correc] is the first example of a mechanically verified high-level synthesis tool, or rather a hardware/software compiler, as it produces software as well as hardware for special functional units that should be hardware accelerated. The main goal of the thesis is to provide a framework to prove hardware/software compilers correct, by defining semantics for an intermediate representation (IR) which supports partitioning of code into hardware and software parts, as well as a custom netlist format which is used to describe the hardware parts and is the final target of the hardware/software compiler. The proof of correctness then states that the hardware/software design should have the same behaviour as if the design had been implemented in software only. The framework used to prove the correctness of the compilation from the IR to the netlist format is written in Isabelle, a theorem prover comparable to Coq. Any proofs in the framework are therefore automatically checked by Isabelle for correctness.

This work first defines a static single assignment (SSA) IR, with the capability of marking sections that should be hardware accelerated and sections that should be executed on the CPU. This language supports \emph{hyperblocks}, which are lists of basic blocks with one entry node and multiple exit nodes, a structure that is well suited to various HLS optimisations and better scheduling. However, as this is quite a low-level intermediate language, the first thing that would be required is a tool that can actually translate a high-level language to this intermediate language. In addition to that, a netlist format is described in which the hardware-accelerated code is represented.

As both the input IR and the output netlist format have been designed from scratch, they are not very useful for real-world applications, as existing compilers would need a new back end to target the IR. It would also most likely mean that the translation from a higher-level language to the IR is unverified and could therefore contain bugs. As the resulting netlist also uses a custom format, it cannot be fed directly to tools that translate it to a bitstream to place onto an FPGA. The reason for designing a custom IR and netlist was so that they were compatible with each other, making proofs of equivalence between the two simpler. Finally, it is unclear whether a translation algorithm from the IR to the netlist format was actually implemented, as the only example in the thesis seems to have been compiled by hand to explain the proof.
There are also no benchmarks on real input programs showing the efficiency of the translation algorithm, and it is therefore unclear whether the framework would be able to prove correct the more complicated optimisations that a compiler might perform on the source code. The thesis seems to describe the correctness proofs by assuming that a compiler exists which outputs the various properties that are needed by the equivalence proof, such as a mapping between variables and netlist wires and registers.

Even though this Isabelle framework does provide an equivalence relation between an IR and a netlist, and describes how the translation would be proven correct by matching states to one another, it is not actually usable in practice. First of all, the IR is only represented in Isabelle and does not have a parser or a compiler which can target it. In addition to that, the netlist format cannot be passed to any existing tool to be placed onto an FPGA, meaning an additional, unverified translation would have to take place. This further reduces the effectiveness of the correctness proof, as there are various stages that are not addressed by the proofs and therefore have to be assumed to be correct.

\subsection[sec:wire-to-wire]{Mechanised Handel-C to Netlist Translation}

Handel-C~\cite[bowen98_handel_c_languag_refer_manual] is a C-like language for hardware development. It supports many C features such as assignments, if-statements, loops, pointers and functions. In addition to these constructs, Handel-C also supports explicit timing constructs, such as explicit parallelism as well as sequential or parallel assignment, similar to blocking and nonblocking assignments in Verilog.

Perna et al.~\cite[perna12_mechan_wire_wise_verif_handel_c_synth] developed a mechanically verified Handel-C to netlist translation written in HOL. The translation is based on previous work describing the translation from Occam to gates by Page et al.~\cite[page91_compil_occam], which was proven correct by Jifeng et al.~\cite[jifeng93_towar] using algebraic proofs. As Handel-C is mostly an extension of Occam with C-like operations, the translation from Handel-C to gates can proceed in a similar way. Perna et al. verify the compilation of a subset of Handel-C to gates, which does not include memory, arrays or function calls. In addition to the constructs presented by Page et al., the prioritised choice construct is also added to the supported Handel-C subset.

The verification proceeds by first defining the algorithm that performs the compilation, chaining operations together with start and end signals that determine the next construct to be executed. The circuits themselves are treated as black boxes by the compilation algorithm and are chosen based on the current Handel-C statement being translated. To compile the Handel-C code correctly, the compilation algorithm has to link the black boxes to one another using the start and end signals that the circuits expose. The verification of the compilation is done by proving that the control signal is propagated through the circuits correctly, and that each circuit is activated at the right time. It is also proven that the internal circuits of the constructs propagate the control signal in the correct way and are active at the right clock cycle. However, it is not proven that the circuits have the same functionality as the Handel-C constructs, only that the control signals are propagated in the correct manner.
In addition to that, the assumption is made that the semantics for time are the same in Handel-C as in the generated netlist format; this could be proven if the generated circuits were shown to behave in exactly the same way as the Handel-C constructs.

\startmode[chapter]
\section{Bibliography}
\placelistofpublications
\stopmode

\stopcomponent