From 2a716db2fb59f448590b189f7b953a80d02bc5ee Mon Sep 17 00:00:00 2001
From: Yann Herklotz
Date: Fri, 15 Apr 2022 11:46:10 +0100
Subject: Add changes

---
 chapters/background.tex | 306 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 304 insertions(+), 2 deletions(-)

(limited to 'chapters/background.tex')

diff --git a/chapters/background.tex b/chapters/background.tex
index 4636918..81ca653 100644
--- a/chapters/background.tex
+++ b/chapters/background.tex
@@ -1,7 +1,309 @@
\environment fonts_env
\environment lsr_env

\startcomponent background

\chapter[sec:background]{Background}

%\begin{table}
%  \centering
%  \begin{tabular}{lccccc}\toprule
%    \textbf{Tool} & \textbf{Implementation} & \textbf{Mainstream} & \textbf{Proof} & \textbf{Mechanised} & \textbf{HLS}\\
%    \midrule
%    Catapult C~\cite{mentor20_catap_high_level_synth} & \cmark{} & \cmark{} & \cmark{} & \xmark{} & \cmark{}\\
%    Chapman \textit{et al.}~\cite{chapman92_verif_bedroc} & \cmark{} & \xmark{} & \cmark{} & \cmark{} & \cmark{}\\
%    Clarke \textit{et al.}~\cite{clarke03_behav_c_veril} & \cmark{} & \cmark{} & \cmark{} & \xmark{} & \cmark{}\\
%    Ellis~\cite{ellis08} & \xmark{} & \xmark{} & \cmark{} & \cmark{} & \cmark{}\\
%    Karfa \textit{et al.}~\cite{karfa06_formal_verif_method_sched_high_synth} & \cmark{} & \xmark{} & \cmark{} & \xmark{} & \cmark{}\\
%    Kundu \textit{et al.}~\cite{kundu08_valid_high_level_synth} & \cmark{} & \xmark{} & \cmark{} & \xmark{} & \cmark{}\\
%    LegUp~\cite{canis13_legup} & \cmark{} & \cmark{} & \xmark{} & \xmark{} & \cmark{}\\
%    Lööw \textit{et al.}~\cite{loow19_verif_compil_verif_proces} & \cmark{} & \xmark{} & \cmark{} & \cmark{} & \xmark{}\\
%    OpenCL SDK~\cite{intel20_sdk_openc_applic} & \cmark{} & \xmark{} & \xmark{} & \xmark{} & \cmark{}\\
%    Perna \textit{et al.}~\cite{perna12_mechan_wire_wise_verif_handel_c_synth} & \cmark{} & \xmark{} & \cmark{} & \cmark{} & \xmark{}\\
%    Vivado HLS~\cite{xilinx20_vivad_high_synth} &
%    \cmark{} & \cmark{} & \xmark{} & \xmark{} & \cmark{}\\
%    \bottomrule
%  \end{tabular}
%  \caption{Comparison of the related works discussed in this chapter.}\label{tab:existing_tools}
%\end{table}

This chapter describes the relevant literature on high-level synthesis, verified high-level
synthesis and, finally, mechanised proofs of HLS.  Table~\ref{tab:existing_tools} classifies
existing tools in the area of formal verification and high-level synthesis.  The following sections
cover the main categories into which the tools in Table~\ref{tab:existing_tools} fall: tools that
provide an \emph{implementation}, tools that provide a \emph{proof}, and finally tools that are
\emph{mechanically verified}.

\section{Implementation: High-level Synthesis}

High-level synthesis (HLS) is the transformation of software directly into hardware.  There are
many different types of HLS, which can vary greatly in the languages they accept; however, they
often share similar steps in how the translation is performed, as they all go from a higher-level,
behavioural description of an algorithm to a timed hardware description.  The main steps performed
in the translation are the following~\cite{coussy09_introd_to_high_level_synth,canis13_legup}:

\desc{Compilation of specification} First, the input specification has to be compiled into a
representation on which all the necessary optimisations can be applied.  This can be an
intermediate language such as the low-level virtual machine intermediate representation (LLVM IR),
or a custom representation of the code.

\desc{Hardware resource allocation} It is possible to select a target device, which gives
information about how many LUTs are available to implement the logic.  However, tools often assume
instead that an unlimited amount of resources is available.  This is because resource sharing can
be too expensive for adders, as the logic for sharing an adder has approximately the same cost as
adding a separate adder.
However, multiply and divide blocks should be shared, especially as
divide blocks are often implemented in LUTs and are therefore quite expensive.

\desc{Operation scheduling} The operations then need to be scheduled into specific clock cycles, as
the input specification language is normally an untimed behavioural representation.  This is where
the spatial parallelism of the hardware can be taken advantage of, meaning operations that do not
depend on each other can run in parallel.  There are various possible scheduling methods that could
be used, such as static or dynamic scheduling, which are described further in
sections~\ref{sec:static_scheduling} and~\ref{sec:dynamic_scheduling}.

\desc{Operation and variable binding} After the operations have been scheduled, the operations and
the variables need to be bound to hardware units and registers respectively.  It is often not
practical to share many hardware units, as this requires adding multiplexers, which are often the
same size as the hardware units themselves.  It is therefore more practical to only share
operations that take up a lot of space, such as modulo or divide circuits, as well as multipliers
if none are available on the FPGA any more.

\desc{Hardware description generation} Finally, the hardware description is generated from the code
that was described in the intermediate language.  This is often a direct translation of the
instructions into equivalent hardware constructs.

There are many examples of existing high-level synthesis tools, the most popular ones being
LegUp~\cite{canis13_legup}, Vivado HLS~\cite{xilinx20_vivad_high_synth}, Catapult
C~\cite{mentor20_catap_high_level_synth} and Intel's OpenCL SDK~\cite{intel20_sdk_openc_applic}.
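As a concrete illustration of the operation-scheduling step described above, the following sketch
performs ASAP scheduling over a small data-flow graph, placing each operation in the earliest clock
cycle in which all of its dependencies have completed.  The operation names and latencies are
invented for illustration and do not correspond to any particular tool.

```python
def asap_schedule(deps, latency):
    """Assign each operation the earliest start cycle in which all of its
    predecessors have finished.  `deps` maps an operation to the list of
    operations it depends on; `latency` maps an operation to the number of
    clock cycles it occupies."""
    start = {}

    def cycle_of(op):
        if op not in start:
            # An op can start once every predecessor has produced its result.
            start[op] = max((cycle_of(d) + latency[d] for d in deps[op]),
                            default=0)
        return start[op]

    for op in deps:
        cycle_of(op)
    return start

# a * b followed by a dependent addition, plus an independent addition
# that the scheduler can place in parallel in cycle 0.
deps = {"mul": [], "add1": ["mul"], "add2": []}
latency = {"mul": 2, "add1": 1, "add2": 1}
print(asap_schedule(deps, latency))
# {'mul': 0, 'add1': 2, 'add2': 0}
```

ALAP scheduling can be obtained symmetrically by working backwards from the last cycle, and list
scheduling additionally caps how many operations share a cycle based on available resources.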
These HLS tools all accept general programming languages such as C or OpenCL; however, some HLS
tools take in languages that were designed for hardware, such as
Handel-C~\cite{aubury96_handel_c_languag_refer_guide}, where time is encoded as part of the
semantics.

\subsection[sec:static_scheduling]{Static Scheduling}

Static scheduling is used by the majority of synthesis
tools~\cite{canis13_legup,xilinx20_vivad_high_synth,mentor20_catap_high_level_synth,intel20_sdk_openc_applic}
and means that the time at which each operation will execute is known at compile time.  Static
analysis is used to gather the data dependencies between all the operations to determine in which
clock cycle each operation should be placed, or whether operations can be parallelised.  Once the
data-flow analysis has been performed, various scheduling schemes can be used to place operations
into clock cycles.  Some common static-scheduling schemes are the following:

\desc{As soon as possible (ASAP)} scheduling places operations into the first possible clock cycle
in which they can be executed.

\desc{As late as possible (ALAP)} scheduling places operations into the last possible clock cycle,
right before their results are used by later operations.

\desc{List scheduling} uses priorities associated with each operation and chooses operations from
each priority level to be placed into the current clock cycle if all their data dependencies are
met.  This is done until all the resources for the current clock cycle are used up.

Static scheduling can normally produce extremely small circuits; however, it is often not possible
to guarantee that the circuits will have the best
throughput~\cite{cheng20_combin_dynam_static_sched_high_level_synth}, as this requires extensive
control-flow analysis and complex optimisations.  Especially for loops, finding the optimal
Especially for loops, finding the optimal +initiation interval (II) can be tricky if there are loads and stores in the loop or if the loops are +nested. + +\subsection[sec:dynamic_scheduling]{Dynamic Scheduling} + +On the other hand, Dynamic scheduling~\cite{josipovic18_dynam_sched_high_level_synth} does not +require the schedule to be known at compile time and it instead generates circuits using tokens to +schedule the operations in parallel at run time. Whenever the data for an operation is available, +it sends a token to the next operation, signalling that the data is ready to be read. The next +operation does not start until all the required inputs to the operation are available, and once that +is the case, it computes the result and then sends another token declaring that the result of that +operation is also ready. The benefit of this approach is that only basic data-flow analysis is +needed to connect the tokens correctly, however, the scheduling is then done dynamically at run +time, depending on how long each primitive takes to finish and when the tokens activate the next +operations. + +The benefits of this approach over static scheduling is that the latency of these circuits is +normally significantly lower than the latency of static scheduled circuits, because they can take +advantage of runtime information of the circuit. However, because of the signalling required to +perform the runtime scheduling, the area of these circuits is usually much larger than the area of +static scheduled circuits. In addition to that, much more analysis is needed to properly +parallelise loads and stores to prevent bugs, which requires the addition of buffers in certain +locations. 
An example of a dynamically scheduled synthesis tool is
Dynamatic~\cite{josipovic18_dynam_sched_high_level_synth}, which uses a load-store queue
(LSQ)~\cite{josipovic17_out_of_order_load_store} to order memory operations correctly even when
loops are pipelined and there are dependencies between iterations.  In addition, the performance of
the dynamically scheduled circuit is improved by careful buffer
placement~\cite{josipovic20_buffer_placem_sizin_high_perfor_dataf_circuit}, which allows for better
parallelisation and pipelining of loops.

\section{Proof: Verified HLS}

Work is being done to prove the equivalence between the generated hardware and the original
behavioural description in C.  An example of a tool that implements this is Mentor's
Catapult~\cite{mentor20_catap_high_level_synth}, which tries to match the states in the register
transfer level (RTL) description to states in the original C code after an unverified translation.
This technique is called translation validation~\cite{pnueli98_trans}, whereby the translation that
the HLS tool performed is proven to have been correct for that particular run, by showing that the
input and the output behave in the same way for all possible inputs.  Translation validation is
quite effective for proving complex optimisations such as
scheduling~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth}
or code
motion~\cite{banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops}
correct; however, the validation has to be run every time the high-level synthesis is performed.
In addition, the proofs are often neither mechanised nor directly related to the actual
implementation, meaning the verifying algorithm itself might be wrong and could give false
positives or false negatives.
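The shape of translation validation can be conveyed by a toy checker that compares the behaviour of
a program before and after an unverified transformation.  Real validators establish equivalence
symbolically; this sketch merely tests on random inputs, and all function names are invented.

```python
import random

random.seed(0)  # deterministic runs for this illustration

def validate(original, transformed, n_inputs, trials=100):
    """Search for an input on which the two programs disagree.  A True
    result only certifies this particular translation, which is why
    translation validation must be re-run on every synthesis run."""
    for _ in range(trials):
        args = [random.randint(-1000, 1000) for _ in range(n_inputs)]
        if original(*args) != transformed(*args):
            return False         # counterexample found: reject the output
    return True                  # no counterexample found

source = lambda a, b: (a + b) * 2
good   = lambda a, b: 2 * a + 2 * b   # a correct rewrite of `source`
bad    = lambda a, b: a + 2 * b       # a miscompiled version

print(validate(source, good, 2))  # True
print(validate(source, bad, 2))   # False
```

Note that a True result from such a checker is only a failure to find a counterexample; an
incorrect verification algorithm can therefore accept a wrong translation, which is exactly the
false-positive risk mentioned above.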
More examples of translation validation for HLS
algorithms~\cite{karfa06_formal_verif_method_sched_high_synth,karfa07_hand_verif_high_synth,kundu07_autom,karfa08_equiv_check_method_sched_verif,kundu08_valid_high_level_synth,karfa10_verif_datap_contr_gener_phase,karfa12_formal_verif_code_motion_techn,chouksey19_trans_valid_code_motion_trans_invol_loops,chouksey20_verif_sched_condit_behav_high_level_synth}
are performed using an HLS tool called SPARK~\cite{gupta03_spark}.  These translation validation
algorithms can check the correctness of complicated optimisations such as code motion or loop
inversions.  However, even though the correctness of the verifier is proven in the papers, the
proof does not translate directly to the algorithm that was implemented for the verifier.  It is
therefore possible that an output is accepted even though it is not equivalent to the input.  In
addition, these papers reason about the correctness of the algorithms in terms of the intermediate
language of SPARK; the reasoning extends neither to the high-level input language that SPARK takes
in, nor to the hardware description language that SPARK outputs.

Finally, there has also been work on proving HLS correct without using translation validation, by
directly showing that the translation is correct.  The first instance of this is a proof that the
BEDROC~\cite{chapman92_verif_bedroc} HLS tool is correct.  This HLS tool converts a high-level
description of an algorithm, supporting loops and conditional statements, into a netlist, and the
output is proven correct.  It works in two stages: first, a DFG is generated from the input
language, HardwarePal; the DFG is then optimised to improve the routing of the design and the
scheduling of operations.  Finally, the netlist is generated from the DFG by placing all the
operations that do not depend on each other into the same clock cycle.
Datapath and register allocation is performed by an unverified clique partitioning algorithm.  The
equivalence proof between the DFG and HardwarePal is a proof by simulation, where it is proven
that, given a valid input configuration, applying a translation or optimisation rule results in a
valid DFG with the same behaviour as the input.

There has also been work on proving the translation from Occam to gates~\cite{page91_compil_occam}
correct using algebraic proofs~\cite{jifeng93_towar}.  This translation resembles dynamic
scheduling, as tokens are used to start the next operations.  However, Occam is timed by design and
executes assignments sequentially.  To take advantage of the parallel nature of hardware, Occam
uses explicit parallel constructs with channels to share state.  Handel-C, a version of Occam with
many more features, such as memory and pointers, has also been used to prove HLS correct, and is
described further in Section~\ref{sec:wire-to-wire}.

\section{Mechanised: Compiler Proofs}

Even though a correctness proof for an algorithm might exist, this does not guarantee that the
implementation of the algorithm behaves in the same way as the algorithm assumed in the proof.  The
implementation is separate from the algorithm that was proven correct, meaning it could contain
various implementation bugs that make it behave incorrectly.  C compilers are a good example of
this: many optimisations performed by the compilers have been proven correct, yet these proofs are
not linked directly to the actual implementations of the optimisations in GCC or Clang.  Yang et
al.~\cite{yang11_findin_under_bugs_c_compil} found more than 300 bugs in GCC and Clang, many of
them in the optimisation phases of the compilers.
One way to link the proofs to the actual implementations in these compilers is to write the
compiler in a language that allows a theorem prover to check properties about the algorithms.  The
proofs are then directly tied to the algorithms, ensuring that the actual implementations are
proven correct.  Notably, Yang et al.~\cite{yang11_findin_under_bugs_c_compil} found that CompCert,
a formally verified C compiler, only had five bugs, all in the unverified parts of the compiler,
suggesting that this method of proving algorithms correct is effective for building a correct
compiler.

\subsection{CompCert}

CompCert~\cite{leroy09_formal_verif_compil_back_end} is a formally verified C compiler written in
Coq~\cite{bertot04_inter_theor_provin_progr_devel}.  Coq is a theorem prover, meaning algorithms
written in Coq can be reasoned about in Coq itself by proving various properties about them.  To
run these algorithms that have been proven correct, they can be extracted directly to OCaml code
and then executed, as there is a straightforward correspondence between Coq and OCaml code.  During
this extraction, all the proofs are erased, as they are not needed during the execution of the
compiler; they are only needed when the correctness of the compiler is being checked.  With this
process, one obtains a compiler that satisfies various correctness properties and is therefore
proven to preserve the behaviour of the code.

CompCert contains eleven intermediate languages, which are used to gradually translate C code into
assembly with the same behaviour.  Proving the translation directly, without going through the
intermediate languages, would be infeasible, especially with the many optimisations that are
performed during the translation, as assembly is very different to the abstract C code the compiler
receives.
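The structure of such a behaviour-preservation argument can be conveyed by a toy analogue:
compiling arithmetic expressions to a stack machine and checking that evaluation is preserved.  In
CompCert this preservation is a machine-checked theorem covering every intermediate language; in
this sketch it is merely asserted on an example, and all names are invented.

```python
def compile_expr(e):
    """Compile an expression, given as nested tuples ("const", n) or
    ("add", e1, e2), to stack-machine code."""
    if e[0] == "const":
        return [("push", e[1])]
    return compile_expr(e[1]) + compile_expr(e[2]) + [("add",)]

def eval_expr(e):
    """Reference semantics of the source language."""
    return e[1] if e[0] == "const" else eval_expr(e[1]) + eval_expr(e[2])

def run(code):
    """Semantics of the target stack machine."""
    stack = []
    for instr in code:
        if instr[0] == "push":
            stack.append(instr[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack[-1]

# The correctness statement, checked on one input rather than proven:
# running the compiled code yields the value of the source expression.
e = ("add", ("const", 2), ("add", ("const", 3), ("const", 4)))
assert run(compile_expr(e)) == eval_expr(e) == 9
```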
The first three intermediate languages (C\#minor, Cminor, CminorSel) are used to transform Clight
into a more assembly-like language called register transfer language (RTL).  This language consists
of a control-flow graph of instructions, and is therefore well suited to various compiler
optimisations such as constant propagation, dead-code elimination or function
inlining~\cite{tristan08_formal_verif_trans_valid}.  After RTL, each intermediate language gets
closer to the assembly language of the architecture, performing operations such as register
allocation and making sure that the stack variables are correctly aligned.

\subsection{Isabelle HLS}

Martin Ellis' work on correct synthesis~\cite{ellis08} is the first example of a mechanically
verified high-level synthesis tool.  Such tools are also called hardware/software compilers, as
they produce software as well as hardware for special functional units that should be hardware
accelerated.  The main goal of the thesis is to provide a framework for proving hardware/software
compilers correct, by defining semantics for an intermediate language (IR) which supports
partitioning code into hardware and software parts, as well as a custom netlist format which
describes the hardware parts and is the final target of the hardware/software compiler.  The proof
of correctness then says that the hardware/software design should have the same behaviour as if the
design had been implemented in software only.  The framework used to prove the correctness of the
compilation from the IR to the netlist format is written in Isabelle, a theorem prover comparable
to Coq; any proofs in the framework are therefore automatically checked by Isabelle for
correctness.

This work first defines a static single assignment (SSA) IR, with the ability to mark sections that
should be hardware accelerated and sections that should be executed on the CPU.
This language supports \emph{hyperblocks}, which are lists of basic blocks with one entry node and
multiple exit nodes, and are well suited to various HLS optimisations and better scheduling.
However, as this is quite a low-level intermediate language, the first thing that would be required
is a tool that can actually translate a high-level language to this intermediate language.  In
addition, a netlist format is described in which the hardware-accelerated code is represented.

As both the input IR and the output netlist format have been designed from scratch, they are not
very useful for real-world applications, as they would require a new back end to be implemented in
existing compilers.  In addition, this would most likely mean that the translation from a
higher-level language to the IR is unverified and could therefore contain bugs.  As the resulting
netlist also uses a custom format, it cannot be fed directly to tools that translate it into a
bitstream to place onto an FPGA.  The reason for designing a custom IR and netlist format was to
make the two compatible with each other, making proofs of equivalence between them simpler.

Finally, it is unclear whether a translation algorithm from the IR to the netlist format was
actually implemented, as the only example in the thesis seems to have been compiled by hand to
explain the proof.  There are also no benchmarks on real input programs showing the efficiency of
the translation algorithm, so it is unclear whether the framework could prove correct the more
complicated optimisations that a compiler might perform on the source code.  The thesis seems to
describe the correctness proofs by assuming that a compiler exists which outputs various properties
needed by the equivalence proof, such as a mapping between variables and netlist wires and
registers.
Even though this Isabelle framework provides an equivalence relation between an IR and a netlist,
and describes how the translation would be proven correct by matching states to one another, it is
not usable in practice.  First of all, the IR is only represented in Isabelle, and there is no
parser or compiler that can target it.  In addition, the netlist format cannot be passed to any
existing tool to then be placed onto an FPGA, meaning an additional, unverified translation would
have to take place.  This further reduces the effectiveness of the correctness proof, as there are
various stages that are not addressed by the proofs and therefore have to be assumed correct.

\subsection[sec:wire-to-wire]{Mechanised Handel-C to Netlist Translation}

Handel-C~\cite{bowen98_handel_c_languag_refer_manual} is a C-like language for hardware
development.  It supports many C features such as assignments, if-statements, loops, pointers and
functions.  In addition to these constructs, Handel-C also supports explicit timing constructs,
such as explicit parallelism, as well as sequential or parallel assignment, similar to blocking and
nonblocking assignments in Verilog.  Perna et
al.~\cite{perna12_mechan_wire_wise_verif_handel_c_synth} developed a mechanically verified Handel-C
to netlist translation written in HOL.  The translation is based on previous work describing a
translation from Occam to gates by Page et al.~\cite{page91_compil_occam}, which was proven correct
by Jifeng et al.~\cite{jifeng93_towar} using algebraic proofs.  As Handel-C is mostly an extension
of Occam with C-like operations, the translation from Handel-C to gates can proceed in a similar
way.

Perna et al. verify the compilation of a subset of Handel-C to gates, which does not include
memory, arrays or function calls.
In addition to the constructs presented by Page et al., the prioritised choice construct is also
added to the supported Handel-C subset.  The verification proceeds by first defining the algorithm
that performs the compilation, chaining operations together with start and end signals that
determine the next construct to be executed.  The circuits themselves are treated as black boxes by
the compilation algorithm and are chosen based on the current Handel-C statement being translated.
The compilation algorithm has to correctly link the black boxes to one another using the start and
end signals that the circuits expose.

The verification of the compilation is done by proving that the control signal is propagated
through the circuits correctly, and that each circuit is activated at the right time.  It is also
proven that the internal circuits of the constructs propagate the control signal in the correct way
and are active at the right clock cycle.  However, it is not proven that the circuits have the same
functionality as the Handel-C constructs, only that the control signals are propagated in the
correct manner.  In addition, the assumption is made that the semantics for time are the same in
Handel-C and in the generated netlist format, which could be proven if the constructs were shown to
behave in exactly the same way as the Handel-C constructs.

\stopcomponent
-- cgit