\environment fonts_env
\environment lsr_env

\startcomponent background

\chapter[sec:background]{Background}

\startsynopsis
This chapter describes the current state-of-the-art optimisations implemented in high-level synthesis tools, focusing in particular on static scheduling and loop pipelining, and finally describing dynamic scheduling, which is becoming a popular alternative.
\stopsynopsis

\section{Verification}

There are different kinds of verification tools, which can mostly be placed into two categories: automatic theorem provers, described in \in{Section}[sec:background:automatic], and interactive theorem provers, described in \in{Section}[sec:background:interactive].

\subsection[sec:background:automatic]{Automatic theorem provers}

Automatic theorem provers such as \SAT\ or \SMT\ solvers can be characterised as decision procedures that answer, for example, whether a formula is satisfiable. However, by default these tools only give the answer to the initial query, without showing the reasoning behind it. The reasoning is often quite complex, as the \SAT\ or \SMT\ tool implements many optimisations to improve the performance of the decision procedure.

The main advantage of using an automatic theorem prover is that, as long as one is working within its constrained decidable theory, it will be efficient at proving or disproving that a formula is a theorem. However, if the theorem requires inductive arguments, then the theorem prover might need some manual help from the user, who has to add the right lemmas to the collection of facts that the automatic procedure can use. The proof itself will still be automatic, which means that many of the tedious cases in the proof can be ignored. This is also the main disadvantage of automatic theorem provers, though: they do not provide details about the proof itself and often cannot communicate why they fail to prove a theorem. This means that, as a user, one has to guess which facts the prover is missing and try to add these to the fact database.

Finally, automatic theorem provers do not provide reasoning for their final answer by default, meaning one cannot check whether the result is actually correct. However, some \SMT\ solvers support the generation of proof witnesses, which can then be checked and reused in other theorem provers. One example is veriT~\cite[bouton09], whose proof witnesses can be integrated into Coq using SMTCoq~\cite[armand11_modul_integ_sat_smt_solver].

\subsection[sec:background:interactive]{Interactive theorem provers}

Interactive theorem provers, on the other hand, focus on checking proofs that are provided to them. These can either be written manually by the user or generated automatically by some decision procedure. The two ways of producing proofs can also be combined, so that the general proof structure is written manually and simpler steps inside the proof are solved automatically.

The main benefit of using an interactive theorem prover is that the whole proof is available and can be checked by a small, trusted kernel. This kernel does not need to be heavily optimised and can therefore itself be reasoned about. The main cost of using an interactive theorem prover is the time it takes to prove theorems, and the amount of formalisation that is needed to make the proofs go through. For a proof to be complete, it cannot rely on any unproven axioms, meaning even the smallest details must be proven.
\section[sec:background:hls]{High-level Synthesis}

\HLS\ is the transformation of software directly into hardware. There are many different types of \HLS\ tools, which vary greatly in which languages they accept; however, they often share similar steps in how the translation is performed, as they all go from a high-level, behavioural description of an algorithm to a timed hardware description. The main steps performed in the translation are the following~\cite[coussy09_introd_to_high_level_synth,canis13_legup]:

\desc{Compilation of specification} First, the input specification has to be compiled into a representation that can be used to apply all the necessary optimisations. This can be an intermediate language such as the low-level virtual machine intermediate representation (LLVM IR), or a custom representation of the code.

\desc{Hardware resource allocation} It is possible to select a device to target, which gives information about how many resources are available to implement the logic. However, tools often assume that an unlimited amount of resources is available instead, because sharing resources between operations is not always worthwhile: for adders, the multiplexing logic needed to share a single adder has approximately the same cost as adding a separate adder. Multipliers and divider blocks, however, should be shared, especially as divider blocks are often implemented in logic and are therefore quite expensive.

\desc{Operation scheduling} The operations then need to be scheduled into specific clock cycles, as the input specification language is normally an untimed behavioural representation. This is where the spatial parallelism of the hardware can be taken advantage of, meaning operations that do not depend on each other can run in parallel. There are various possible scheduling methods that could be used, such as static or dynamic scheduling, which are described further in \in{Section}[sec:static_scheduling] and \in{Section}[sec:dynamic_scheduling].

\desc{Operation and variable binding} After the operations have been scheduled, the operations and the variables need to be bound to hardware units and registers respectively. It is often not that practical to share many hardware units, as this requires adding multiplexers, which are often the same size as the hardware units themselves. It is therefore more practical to only share operations that take up a lot of space, such as modulo or divide circuits, as well as multipliers if the dedicated multiplier blocks on the FPGA have all been used up.

\desc{Hardware description generation} Finally, the hardware description is generated from the code that was described in the intermediate language. This is often a direct translation of the instructions into equivalent hardware constructs.

There are many examples of existing high-level synthesis tools, the most popular ones being LegUp~\cite[canis13_legup], Vivado HLS~\cite[xilinx20_vivad_high_synth], Catapult C~\cite[mentor20_catap_high_level_synth] and Intel's OpenCL SDK~\cite[intel20_sdk_openc_applic]. These HLS tools all accept general programming languages such as C or OpenCL; however, some HLS tools instead take in languages that were designed for hardware, such as Handel-C~\cite[aubury96_handel_c_languag_refer_guide], where time is encoded as part of the semantics.

\subsection[sec:language-blocks]{Intermediate Language}

This section describes some possible characteristics of a language that could be used as an input to an \HLS\ tool.
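To make the following discussion more concrete, the listing below sketches a small C function together with one possible lowering into a three-address intermediate representation split into basic blocks. The function name and the IR syntax are purely illustrative, loosely modelled on LLVM IR, and are not the output of any particular tool.

\starttyping
/* A small, untimed behavioural description that an HLS tool might accept. */
int f(int a, int b, int c) {
  int x = a * b;
  if (x > c)
    x = x + c;
  return x;
}

/* One possible lowering into a basic-block IR (illustrative only).  Each
   block is a straight list of instructions with no internal control-flow;
   branches only appear at the end of a block.

   bb0:  x0 = mul a, b
         t0 = cmpgt x0, c
         br t0, bb1, bb2
   bb1:  x1 = add x0, c
         br bb2
   bb2:  x2 = phi [x0, bb0], [x1, bb1]
         ret x2                                                           */
\stoptyping

The representations discussed below differ mainly in how much control-flow such a block is allowed to contain.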
These languages normally require some structure to make optimisation and analysis passes easier. In particular, it is often useful to have contiguous blocks of instructions that do not contain any control-flow gathered in one list. These instructions can then safely be rearranged by only looking at local information about the block itself, and control-flow can be ignored completely, as only the data-flow matters within the block.

\subsubsection{Basic blocks}

%TODO: Finish the basic blocks section

Basic blocks are the simplest form of structure: they are straight-line lists of instructions that do not include any control-flow, with a single entry point at the top and a single exit at the bottom.

\subsubsection{Superblocks}

Superblocks extend the notion of basic blocks to contiguous regions that control-flow can only enter at the top, but which may have multiple exits out of the block. The main benefit of this definition is that, due to the extra flexibility of allowing multiple exits, the blocks can be made larger, which often improves the optimisations that make use of basic blocks. The downside is that the representation of the blocks becomes more complex due to the introduction of the extra control-flow. Any analysis passes have to take into account the possibility of control-flow being present, and many simplifications are not possible anymore.

\subsubsection{Hyperblocks}

Hyperblocks are also an extension of basic blocks, similar to superblocks, but instead of introducing special control-flow instructions into the block, every instruction is predicated. This can lead to more complex control-flow than in both of the previous cases; however, it can be reasoned about using a \SAT\ or \SMT\ solver.

\subsection[sec:static_scheduling]{Static Scheduling}

Static scheduling is used by the majority of synthesis tools~\cite[canis13_legup,xilinx20_vivad_high_synth,mentor20_catap_high_level_synth,intel20_sdk_openc_applic] and means that the time at which each operation will execute is known at compile time. Static analysis is used to gather the data dependencies between all the operations, to determine in which clock cycle each operation should be placed and whether operations can be parallelised. Once the data-flow analysis has been performed, various scheduling schemes can be used to place operations into clock cycles. Some common static-scheduling schemes are the following (a small worked example is sketched at the end of this subsection):

\desc{As soon as possible (ASAP)} scheduling places operations into the first clock cycle in which they can be executed.

\desc{As late as possible (ALAP)} scheduling places operations into the last possible clock cycle, right before their results are needed by later operations.

\desc{List scheduling} associates a priority with each operation and, going down the priority levels, places operations into the current clock cycle if all their data dependencies are met. This is done until all the resources for the current clock cycle are used up, at which point the scheduler moves on to the next cycle.

Static scheduling can normally produce extremely small circuits; however, it is often not possible to guarantee that the circuits will have the best throughput~\cite[cheng20_combin_dynam_static_sched_high_level_synth], as this requires extensive control-flow analysis and complex optimisations. Especially for loops, finding the optimal \II\ can be tricky if there are loads and stores in the loop or if the loops are nested.
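As a small worked example of these scheduling schemes, consider the multiply-accumulate code below. The function name and the cycle annotations are a sketch only, assuming that every operation takes a single clock cycle and, for the ASAP schedule, that unlimited functional units are available.

\starttyping
/* Untimed input: four independent multiplications followed by a
   reduction tree of additions. */
int mac4(int a0, int b0, int a1, int b1,
         int a2, int b2, int a3, int b3) {
  int p0 = a0 * b0;  /* ASAP: cycle 1 (no dependencies)   */
  int p1 = a1 * b1;  /* ASAP: cycle 1                     */
  int p2 = a2 * b2;  /* ASAP: cycle 1                     */
  int p3 = a3 * b3;  /* ASAP: cycle 1                     */
  int s0 = p0 + p1;  /* ASAP: cycle 2 (needs p0 and p1)   */
  int s1 = p2 + p3;  /* ASAP: cycle 2 (needs p2 and p3)   */
  return s0 + s1;    /* ASAP: cycle 3 (needs s0 and s1)   */
}
/* ALAP would give the same schedule here, as every operation lies on a
   critical path.  A list scheduler constrained to two multipliers would
   instead place p0 and p1 in cycle 1, p2 and p3 in cycle 2, s0 in cycle 2
   or 3, s1 in cycle 3, and the final addition in cycle 4. */
\stoptyping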
\subsection[sec:dynamic_scheduling]{Dynamic Scheduling}

Dynamic scheduling~\cite[josipovic18_dynam_sched_high_level_synth], on the other hand, does not require the schedule to be known at compile time; instead, it generates circuits that use tokens to schedule the operations in parallel at run time. Whenever the data for an operation is available, it sends a token to the next operation, signalling that the data is ready to be read. The next operation does not start until all its required inputs are available, and once that is the case, it computes the result and sends another token declaring that its result is also ready. The benefit of this approach is that only basic data-flow analysis is needed to connect the tokens correctly; the scheduling is then performed dynamically at run time, depending on how long each primitive takes to finish and on when the tokens activate the next operations.

The advantage of this approach over static scheduling is that the latency of these circuits is normally significantly lower than that of statically scheduled circuits, because they can take advantage of run-time information. However, because of the signalling required to perform the run-time scheduling, the area of these circuits is usually much larger than that of statically scheduled circuits. In addition to that, much more analysis is needed to properly parallelise loads and stores without introducing bugs, which requires the addition of buffers in certain locations. An example of a dynamically scheduled synthesis tool is Dynamatic~\cite[josipovic18_dynam_sched_high_level_synth], which uses a load-store queue (LSQ)~\cite[josipovic17_out_of_order_load_store] to order memory operations correctly even when loops are pipelined and there are dependencies between iterations. In addition to that, the performance of the dynamically scheduled circuits is improved by careful buffer placement~\cite[josipovic20_buffer_placem_sizin_high_perfor_dataf_circuit], which allows for better parallelisation and pipelining of loops.

\section{Proof: Verified HLS}

Work is being done to prove the equivalence between the generated hardware and the original behavioural description in C. An example of a tool that implements this is Mentor's Catapult~\cite[mentor20_catap_high_level_synth], which tries to match the states in the register transfer level (RTL) description to states in the original C code after an unverified translation. This technique is called translation validation~\cite[pnueli98_trans], whereby the translation that the HLS tool performed is proven to have been correct for that particular input, by showing that the input program and the generated design behave in the same way for all possible inputs. Translation validation is quite effective for proving complex optimisations such as scheduling~\cite[kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth] or code motion~\cite[banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops] correct; however, the validation has to be run every time the high-level synthesis is performed. In addition to that, the proofs are often not mechanised or directly related to the actual implementation, meaning the verifying algorithm itself might be wrong and could give false positives or false negatives.
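To make the idea behind translation validation concrete, the sketch below checks that a source function and a stand-in for the design generated from it agree on every input, accepting or rejecting this one particular translation. The function names are hypothetical and the check is a deliberately naive illustration that simply enumerates a small input space; the validators cited above instead establish equivalence symbolically, for example through simulation relations or \SMT\ solvers.

\starttyping
#include <stdint.h>
#include <stdio.h>

/* Source behaviour: the untimed description given to the HLS tool. */
static uint8_t source(uint8_t a, uint8_t b) {
  return (uint8_t)((a + b) * 2);
}

/* Stand-in for the behaviour of the generated design, here with the
   multiplication strength-reduced to a shift by the "tool".          */
static uint8_t generated(uint8_t a, uint8_t b) {
  return (uint8_t)((a + b) << 1);
}

/* Translation validation for this particular run of the tool: accept
   the output only if it matches the source on all possible inputs.   */
int main(void) {
  for (unsigned a = 0; a < 256; a++)
    for (unsigned b = 0; b < 256; b++)
      if (source((uint8_t)a, (uint8_t)b) != generated((uint8_t)a, (uint8_t)b)) {
        printf("mismatch at a=%u, b=%u\n", a, b);
        return 1; /* reject this translation */
      }
  printf("behaviours match for all inputs\n");
  return 0;
}
\stoptyping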
Further examples of translation validation being used to verify HLS algorithms~\cite[karfa06_formal_verif_method_sched_high_synth,karfa07_hand_verif_high_synth,kundu07_autom,karfa08_equiv_check_method_sched_verif,kundu08_valid_high_level_synth,karfa10_verif_datap_contr_gener_phase,karfa12_formal_verif_code_motion_techn,chouksey19_trans_valid_code_motion_trans_invol_loops,chouksey20_verif_sched_condit_behav_high_level_synth] are based on an HLS tool called SPARK~\cite[gupta03_spark]. These translation validation algorithms can check the correctness of complicated optimisations such as code motion or loop inversion. However, even though the correctness of the verifier is proven in the papers, the proof does not translate directly to the algorithm that was implemented for the verifier. It is therefore possible that an output is accepted even though it is not equivalent to the input. In addition to that, these papers reason about the correctness of the algorithms in terms of the intermediate language of SPARK, and the reasoning does not extend to the high-level input language that SPARK takes in, or to the hardware description language that SPARK outputs.

Finally, there has also been work on proving HLS correct without using translation validation, by directly showing that the translation itself is correct. The first instance of this is a proof of correctness of the BEDROC~\cite[chapman92_verif_bedroc] HLS tool, which converts a high-level description of an algorithm, supporting loops and conditional statements, into a netlist. It works in two stages: first, a data-flow graph (DFG) is generated from the input language, HardwarePal, and optimised to improve the routing of the design and the scheduling of operations. The netlist is then generated from the DFG by placing all the operations that do not depend on each other into the same clock cycle. Datapath and register allocation is performed by an unverified clique partitioning algorithm. The equivalence proof between the DFG and HardwarePal is a proof by simulation, where it is proven that, given a valid input configuration, applying a translation or optimisation rule will result in a valid DFG with the same behaviour as the input.

There has also been work on proving the translation from Occam to gates~\cite[page91_compil_occam] correct using algebraic proofs~\cite[jifeng93_towar]. This translation resembles dynamic scheduling, as tokens are used to start the next operations. However, Occam is timed by design and executes assignments sequentially. To take advantage of the parallel nature of hardware, Occam uses explicit parallel constructs with channels to share state. Handel-C, a version of Occam with many more features such as memory and pointers, has also been the subject of a verified translation to netlists, which is described further in \in{Section}[sec:wire-to-wire].

\section{Mechanised: Compiler Proofs}

Even though a proof of the correctness of an algorithm might exist, this does not guarantee that the implemented algorithm behaves in the same way as the algorithm assumed in the proof. The algorithm described in the proof is separate from the actual implementation, meaning there could be various implementation bugs that cause the implementation to behave incorrectly. C compilers are a good example of this: many optimisations performed by the compilers have been proven correct, but these proofs are not linked directly to the actual implementations of these algorithms in GCC or Clang.
Yang et al.~\cite[yang11_findin_under_bugs_c_compil] found more than 300 bugs in GCC and Clang, many of them appearing in the optimisation phases of the compiler. One way to link the proofs to the actual implementations in these compilers is to write the compiler in a language that allows a theorem prover to check properties about the algorithms. This allows the proofs to be directly linked to the algorithms, ensuring that the actual implementations are proven correct. Yang et al.~\cite[yang11_findin_under_bugs_c_compil] found that CompCert, a formally verified C compiler, only had five bugs, located in the unverified parts of the compiler, which indicates that this method of proving algorithms correct is an effective way of obtaining a correct compiler.

\subsection{CompCert}

\index{CompCert}CompCert~\cite[leroy09_formal_verif_compil_back_end] is a formally verified C compiler written in Coq~\cite[bertot04_inter_theor_provin_progr_devel]. Coq is a theorem prover, meaning algorithms written in Coq can be reasoned about in Coq itself by proving various properties about them. To run the algorithms that have been proven correct, they can be extracted directly to OCaml code and then executed, as there is a straightforward correspondence between Coq and OCaml code. During this extraction, all the proofs are erased, as they are not needed during the execution of the compiler; they are only needed when the correctness of the compiler is being checked. With this process, one obtains a compiler that satisfies various correctness properties and can therefore be proven to preserve the behaviour of the code.

CompCert contains eleven intermediate languages, which are used to gradually translate C code into assembly that has the same behaviour. Proving the translation directly, without going through the intermediate languages, would be infeasible, especially with the many optimisations that are performed during the translation, as assembly is very different from the abstract C code the compiler receives. The first three intermediate languages (C\#minor, Cminor, CminorSel) are used to transform Clight into a more assembly-like language called register transfer language (RTL). This language consists of a control-flow graph of instructions and is therefore well suited to various compiler optimisations such as constant propagation, dead-code elimination or function inlining~\cite[tristan08_formal_verif_trans_valid]. After RTL, each intermediate language is used to get closer to the assembly language of the target architecture, performing operations such as register allocation and making sure that the stack variables are correctly aligned.

\subsubsection{CompCertSSA}

\index{CompCertSSA}CompCertSSA is an extension of CompCert with an additional \SSA\ intermediate language. This language enforces the \SSA\ properties and therefore allows for many convenient proofs about optimisations performed on this intermediate language, as many assumptions can be made about variables when they are encountered. The main interesting property that arises from using \SSA\ is the \emph{equational lemma}, stating that, given a variable that was assigned by an operation which does not depend on memory, loading the value of that variable is the same as recomputing the operation in the current context.
Given a well-formed \SSA\ program $p$ and a reachable state $\Sigma\ s\ f\ \sigma\ R\ M$, if a memory-independent operation was defined at a node $d$ as $\mono{Iop}\ \mathit{op}\ \vec{a}\ x\ n$, and $\sigma$ is dominated by $d$ ($d \le_{p} \sigma$), then the following equation holds:

\placeformula[eq:equational-lemma]\startformula \left(\mathit{op}, \vec{a}\right) \Downarrow (R, M) = \left\lfloor R[x] \right\rfloor \stopformula

This is an important lemma, as it essentially allows one to know the value of a register as long as one knows that its assignment dominates the current node and one knows which expression it was assigned. For example, if $x$ was assigned the result of $a + b$ at a node that dominates the current node, then reading $R[x]$ is guaranteed to give the same value as evaluating $a + b$ in the current register state.

\subsection{HLS Formalised in Isabelle}

Martin Ellis' work on correct synthesis~\cite[ellis08_correc] is the first example of a mechanically verified high-level synthesis tool, or rather a hardware/software compiler, as it produces software as well as hardware for special functional units that should be hardware accelerated. The main goal of the thesis is to provide a framework to prove hardware/software compilers correct, by defining semantics for an intermediate representation (IR) which supports partitioning of code into hardware and software parts, as well as a custom netlist format which is used to describe the hardware parts and is the final target of the hardware/software compiler. The proof of correctness then states that the hardware/software design should have the same behaviour as if the design had been implemented in software only. The framework used to prove the correctness of the compilation from the IR to the netlist format is written in Isabelle, a theorem prover comparable to Coq. Any proofs in the framework are therefore automatically checked by Isabelle for correctness.

This work first defines a static single assignment (SSA) IR, with the capability of marking sections that should be hardware accelerated and sections that should be executed on the CPU. This language supports \emph{hyperblocks}, which are lists of basic blocks with one entry node and multiple exit nodes, a structure that is well suited to various HLS optimisations and better scheduling. However, as this is quite a low-level intermediate language, the first thing that would be required is a tool that can actually translate a high-level language to this intermediate language. In addition to that, a netlist format is described in which the hardware-accelerated code is represented.

As both the input IR and the output netlist format have been designed from scratch, they are not very useful for real-world applications, as existing compilers would need a new back end to target the IR. It would also most likely mean that the translation from a higher-level language to the IR is unverified and could therefore contain bugs. As the resulting netlist also uses a custom format, it cannot be fed directly to tools that translate it to a bitstream to place onto an FPGA. The reason for designing a custom IR and netlist was so that they were compatible with each other, making proofs of equivalence between the two simpler. Finally, it is unclear whether a translation algorithm from the IR to the netlist format was actually implemented, as the only example in the thesis seems to have been compiled by hand to explain the proof.
There are also no benchmarks on real input programs showing the efficiency of the translation algorithm, and it is therefore unclear whether the framework would be able to prove correct the more complicated optimisations that a compiler might perform on the source code. The thesis seems to describe the correctness proofs by assuming that a compiler exists which outputs the various properties that are needed by the equivalence proof, such as a mapping between variables and netlist wires and registers.

Even though this Isabelle framework does provide an equivalence relation between an IR and a netlist, and describes how the translation would be proven correct by matching states to one another, it is not actually usable in practice. First of all, the IR is only represented in Isabelle and does not have a parser or a compiler which can target it. In addition to that, the netlist format cannot be passed to any existing tool to be placed onto an FPGA, meaning an additional, unverified translation would have to take place. This further reduces the effectiveness of the correctness proof, as there are various stages that are not addressed by the proofs and therefore have to be assumed to be correct.

\subsection[sec:wire-to-wire]{Mechanised Handel-C to Netlist Translation}

Handel-C~\cite[bowen98_handel_c_languag_refer_manual] is a C-like language for hardware development. It supports many C features such as assignments, if-statements, loops, pointers and functions. In addition to these constructs, Handel-C also supports explicit timing constructs, such as explicit parallelism as well as sequential or parallel assignment, similar to blocking and nonblocking assignments in Verilog.

Perna et al.~\cite[perna12_mechan_wire_wise_verif_handel_c_synth] developed a mechanically verified Handel-C to netlist translation written in HOL. The translation is based on previous work describing the translation from Occam to gates by Page et al.~\cite[page91_compil_occam], which was proven correct by Jifeng et al.~\cite[jifeng93_towar] using algebraic proofs. As Handel-C is mostly an extension of Occam with C-like operations, the translation from Handel-C to gates can proceed in a similar way. Perna et al. verify the compilation of a subset of Handel-C to gates, which does not include memory, arrays or function calls. In addition to the constructs presented by Page et al., the prioritised choice construct is also added to the supported Handel-C subset.

The verification proceeds by first defining the algorithm that performs the compilation, chaining operations together with start and end signals that determine the next construct to be executed. The circuits themselves are treated as black boxes by the compilation algorithm and are chosen based on the current Handel-C statement being translated. To compile the Handel-C code correctly, the compilation algorithm has to link the black boxes to one another using the start and end signals that the circuits expose. The verification of the compilation is done by proving that the control signal is propagated through the circuits correctly, and that each circuit is activated at the right time. It is also proven that the internal circuits of the constructs propagate the control signal in the correct way and are active at the right clock cycle. However, it is not proven that the circuits have the same functionality as the Handel-C constructs, only that the control signals are propagated in the correct manner.
In addition to that, the assumption is made that the semantics for time are the same in Handel-C as in the generated netlist format; this could be proven if the generated circuits were shown to behave in exactly the same way as the Handel-C constructs.

\startmode[chapter]
\section{Bibliography}
\placelistofpublications
\stopmode

\stopcomponent