From 2a716db2fb59f448590b189f7b953a80d02bc5ee Mon Sep 17 00:00:00 2001 From: Yann Herklotz Date: Fri, 15 Apr 2022 11:46:10 +0100 Subject: Add changes --- chapters/background.tex | 306 ++++++++++++++++++- chapters/conclusion.tex | 7 + chapters/dynamic.tex | 9 +- chapters/hls.tex | 5 +- chapters/pipelining.tex | 9 +- chapters/schedule.tex | 118 +++++++- chapters/scheduling.tex | 779 ++++++++++++++++++++---------------------------- 7 files changed, 768 insertions(+), 465 deletions(-) (limited to 'chapters') diff --git a/chapters/background.tex b/chapters/background.tex index 4636918..81ca653 100644 --- a/chapters/background.tex +++ b/chapters/background.tex @@ -1,7 +1,309 @@ +\environment fonts_env +\environment lsr_env + \startcomponent background -\chapter{Background} +\chapter[sec:background]{Background} + +%\begin{table} +% \centering +% \begin{tabular}{lccccc}\toprule +% \textbf{Tool} & \textbf{Implementation} & \textbf{Mainstream} & \textbf{Proof} & \textbf{Mechanised} & \textbf{HLS}\\ +% \midrule +% Catapult C~\cite{mentor20_catap_high_level_synth} & \cmark{} & \cmark{} & \cmark{} & \xmark{} & \cmark{}\\ +% Chapman \textit{et al.}~\cite{chapman92_verif_bedroc} & \cmark{} & \xmark{} & \cmark{} & \cmark{} & \cmark{}\\ +% Clarke \textit{et al.}~\cite{clarke03_behav_c_veril} & \cmark{} & \cmark{} & \cmark{} & \xmark{} & \cmark{}\\ +% Ellis~\cite{ellis08} & \xmark{} & \xmark{} & \cmark{} & \cmark{} & \cmark{}\\ +% Karfa \textit{et al.}~\cite{karfa06_formal_verif_method_sched_high_synth} & \cmark{} & \xmark{} & \cmark{} & \xmark{} & \cmark{}\\ +% Kundu \textit{et al.}~\cite{kundu08_valid_high_level_synth} & \cmark{} & \xmark{} & \cmark{} & \xmark{} & \cmark\\ +% LegUp~\cite{canis13_legup} & \cmark{} & \cmark{} & \xmark{} & \xmark{} & \cmark{}\\ +% Lööw \textit{et al.}~\cite{loow19_verif_compil_verif_proces} & \cmark{} & \xmark{} & \cmark{} & \cmark{} & \xmark{}\\ +% OpenCL SDK~\cite{intel20_sdk_openc_applic} & \cmark{} & \xmark{} & \xmark{} & \xmark{} & \cmark{}\\ +% 
Perna \textit{et al.}~\cite{perna12_mechan_wire_wise_verif_handel_c_synth} & \cmark{} & \xmark{} & \cmark{} & \cmark{} & \xmark{}\\ +% Vivado HLS~\cite{xilinx20_vivad_high_synth} & \cmark{} & \cmark{} & \xmark{} & \xmark{} & \cmark{}\\ +% \bottomrule +% \end{tabular} +% \caption{Table comparing the various related works.}\label{tab:existing_tools} +%\end{table} + +This section describes the relevant literature around high-level synthesis, verified high-level +synthesis, and finally mechanised proofs of HLS. Table~\ref{tab:existing_tools} classifies +existing tools in the area of formal verification and high-level synthesis. The next section covers +the main categories into which the tools in Table~\ref{tab:existing_tools} fall: tools that cover +\emph{implementation}, tools that cover \emph{proof}, and finally tools that are \emph{mechanically + verified}. + +\section{Implementation: High-level Synthesis} + +High-level synthesis (HLS) is the transformation of software directly into hardware. There are many +different types of HLS, which can vary greatly in which languages they accept; however, they often +share similar steps in how the translation is performed, as they all go from a higher-level, +behavioural description of the algorithm to a timed hardware description. The main steps performed +in the translation are the following~\cite{coussy09_introd_to_high_level_synth,canis13_legup}: + +\desc{Compilation of specification} First, the input specification has to be compiled into a +representation that can be used to apply all the necessary optimisations. This can be an +intermediate language such as the low-level virtual machine intermediate representation (LLVM IR), +or a custom representation of the code. + +\desc{Hardware resource allocation} It is possible to select a device to target, which gives +information about how many LUTs can be used to implement the logic.
However, tools often assume +that an unlimited amount of resources is available instead. This is because resource sharing can be too +expensive for adders, as the logic for sharing the adder would have approximately the same cost as +adding a separate adder. However, multiplier and divider blocks should be shared, especially as +divider blocks are often implemented in LUTs and are therefore quite expensive. + +\desc{Operation scheduling} The operations then need to be scheduled into a specific clock cycle, as +the input specification language is normally an untimed behavioural representation. This is where +the spatial parallelism of the hardware can be taken advantage of, meaning operations that are not +dependent on each other can run in parallel. There are various possible scheduling methods that +could be used, such as static or dynamic scheduling, which are described further in +sections~\ref{sec:static_scheduling} and~\ref{sec:dynamic_scheduling}. + +\desc{Operation and variable binding} After the operations have been scheduled, the operations and +the variables need to be bound to hardware units and registers respectively. It is often not +practical to share many hardware units though, as this requires adding multiplexers, which are often +the same size as the hardware units themselves. It is therefore more practical to only share +operations that take up a lot of space, such as modulo or divide circuits, as well as multipliers if +no more are available on the FPGA. + +\desc{Hardware description generation} Finally, the hardware description is generated from the code +that was described in the intermediate language. This is often a direct translation of the +instructions into equivalent hardware constructs.
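The final step can be illustrated with a small sketch. The following hypothetical Python fragment (not taken from any existing HLS tool; the function name and instruction format are invented for illustration) shows how a single intermediate-language instruction might map almost directly to an equivalent hardware construct, here rendered as a Verilog nonblocking assignment:

```python
# Hypothetical sketch of direct instruction-to-hardware translation:
# each IR instruction (op, destination, operand, operand) becomes one
# nonblocking assignment inside a clocked always-block of the design.

def emit_verilog(instr):
    op, dst, a, b = instr
    verilog_op = {"add": "+", "sub": "-", "mul": "*"}[op]
    return f"{dst} <= {a} {verilog_op} {b};"

print(emit_verilog(("add", "r3", "r1", "r2")))  # r3 <= r1 + r2;
```

Real HLS tools of course interleave this step with scheduling and binding, but the one-to-one flavour of the translation is the same.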
+ +There are many examples of existing high-level synthesis tools, the most popular ones being +LegUp~\cite{canis13_legup}, Vivado HLS~\cite{xilinx20_vivad_high_synth}, Catapult +C~\cite{mentor20_catap_high_level_synth} and Intel's OpenCL SDK~\cite{intel20_sdk_openc_applic}. +These HLS tools all accept general programming languages such as C or OpenCL; however, some HLS +tools take in languages that were designed for hardware, such as +Handel-C~\cite{aubury96_handel_c_languag_refer_guide}, where time is encoded as part of the +semantics. + +\subsection[sec:static_scheduling]{Static Scheduling} + +Static scheduling is used by the majority of synthesis +tools~\cite{canis13_legup,xilinx20_vivad_high_synth,mentor20_catap_high_level_synth,intel20_sdk_openc_applic} +and means that the time at which each operation will execute is known at compile time. Static +analysis is used to gather all the data dependencies between the operations to determine in +which clock cycle the operations should be placed, or whether these operations can be parallelised. Once +the data-flow analysis has been performed, various scheduling schemes can be used to +place operations in specific clock cycles. Some common static-scheduling schemes are the following: + +\desc{As soon as possible (ASAP)} scheduling will place operations into the first possible clock +cycle in which they can be executed. + +\desc{As late as possible (ALAP)} scheduling places operations into the last possible clock cycle, +right before they are used by later operations. + +\desc{List scheduling} uses priorities associated with each operation and chooses operations from +each priority level to be placed into the current clock cycle if all the data-dependencies are +met. This is done until all the resources for the current clock cycle are used up.
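The simplest of the schemes above can be sketched concretely. The following is a minimal, hypothetical Python implementation of ASAP scheduling over a data-dependency graph, assuming every operation takes one clock cycle and that the graph is acyclic; real schedulers additionally have to respect resource constraints:

```python
# A minimal ASAP scheduling sketch (hypothetical helper, not from any
# specific HLS tool): each operation is placed in the earliest cycle in
# which all of its data dependencies have already produced a result.

def asap_schedule(ops, deps):
    """ops: list of operation names.
    deps: dict mapping an op to the ops it depends on.
    Returns a dict mapping each op to its clock cycle (0-based),
    assuming every operation takes one cycle."""
    cycle = {}
    remaining = list(ops)
    while remaining:
        for op in list(remaining):
            if all(d in cycle for d in deps.get(op, [])):
                # Earliest legal cycle: one after the latest dependency.
                preds = [cycle[d] for d in deps.get(op, [])]
                cycle[op] = max(preds) + 1 if preds else 0
                remaining.remove(op)
    return cycle

# t1 and t2 are independent, so they share cycle 0; t3 must wait for both.
schedule = asap_schedule(["t1", "t2", "t3"], {"t3": ["t1", "t2"]})
print(schedule)  # {'t1': 0, 't2': 0, 't3': 1}
```

ALAP works the same way in reverse, starting from the operations that nothing depends on, and list scheduling adds a priority queue plus per-cycle resource limits on top of this basic dependency check.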
+ +Static scheduling can normally produce extremely small circuits; however, it is often not possible +to guarantee that the circuits will have the best +throughput~\cite{cheng20_combin_dynam_static_sched_high_level_synth}, as this requires extensive +control-flow analysis and complex optimisations. Especially for loops, finding the optimal +initiation interval (II) can be tricky if there are loads and stores in the loop or if the loops are +nested. + +\subsection[sec:dynamic_scheduling]{Dynamic Scheduling} + +On the other hand, dynamic scheduling~\cite{josipovic18_dynam_sched_high_level_synth} does not +require the schedule to be known at compile time; instead it generates circuits that use tokens to +schedule the operations in parallel at run time. Whenever the data for an operation is available, +it sends a token to the next operation, signalling that the data is ready to be read. The next +operation does not start until all the required inputs to the operation are available, and once that +is the case, it computes the result and then sends another token declaring that the result of that +operation is also ready. The benefit of this approach is that only basic data-flow analysis is +needed to connect the tokens correctly; the scheduling itself is then performed dynamically at run +time, depending on how long each primitive takes to finish and when the tokens activate the next +operations. + +The benefit of this approach over static scheduling is that the latency of these circuits is +normally significantly lower than the latency of statically scheduled circuits, because they can take +advantage of runtime information about the circuit. However, because of the signalling required to +perform the runtime scheduling, the area of these circuits is usually much larger than the area of +statically scheduled circuits.
In addition to that, much more analysis is needed to properly +parallelise loads and stores to prevent bugs, which requires the addition of buffers in certain +locations. + +An example of a dynamically scheduled synthesis tool is +Dynamatic~\cite{josipovic18_dynam_sched_high_level_synth}, which uses a load-store queue +(LSQ)~\cite{josipovic17_out_of_order_load_store} to order memory operations correctly even when +loops are pipelined and there are dependencies between iterations. In addition to that, performance +of the dynamically scheduled code is improved by careful buffer +placement~\cite{josipovic20_buffer_placem_sizin_high_perfor_dataf_circuit}, which allows for better +parallelisation and pipelining of loops. + +\section{Proof: Verified HLS} + +Work is being done to prove the equivalence between the generated hardware and the original +behavioural description in C. An example of a tool that implements this is Mentor's +Catapult~\cite{mentor20_catap_high_level_synth}, which tries to match the states in the register +transfer level (RTL) description to states in the original C code after an unverified translation. +This technique is called translation validation~\cite{pnueli98_trans}, whereby the translation that +the HLS tool performed is proven to have been correct for that input, by showing that the input +and output behave in the same way for all possible inputs. Using translation validation is quite effective for proving +complex optimisations such as +scheduling~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth} +or code +motion~\cite{banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops}; +however, the validation has to be run every time the high-level synthesis is performed.
In addition +to that, the proofs are often not mechanised or directly related to the actual implementation, +meaning the verifying algorithm might be wrong and could give false positives or false negatives. + +More examples of translation validation for proofs about HLS +algorithms~\cite{karfa06_formal_verif_method_sched_high_synth,karfa07_hand_verif_high_synth,kundu07_autom,karfa08_equiv_check_method_sched_verif,kundu08_valid_high_level_synth,karfa10_verif_datap_contr_gener_phase,karfa12_formal_verif_code_motion_techn,chouksey19_trans_valid_code_motion_trans_invol_loops,chouksey20_verif_sched_condit_behav_high_level_synth} +are performed using an HLS tool called SPARK~\cite{gupta03_spark}. These translation validation +algorithms can check the correctness of complicated optimisations such as code motion or loop +inversions. However, even though the correctness of the verifier is proven in the papers, the proof +does not translate directly to the algorithm that was implemented for the verifier. It is therefore +possible that an output is accepted even though it is not equivalent to the input. In addition to +that, these papers reason about the correctness of the algorithms in terms of the intermediate +language of SPARK, and do not extend to the high-level input language that SPARK takes in, or the +hardware description language that SPARK outputs. + +Finally, there has also been work proving HLS correct without using translation validation, but by +directly showing that the translation is correct. The first instance of this is the proof that the +BEDROC~\cite{chapman92_verif_bedroc} HLS tool is correct. This HLS tool converts a high-level +description of an algorithm, supporting loops and conditional statements, to a netlist and proves +that the output is correct. It works in two stages, first generating a DFG from the input language, +HardwarePal. It then optimises the DFG to improve the routing of the design and also to improve the +scheduling of operations.
Finally, the netlist is generated from the DFG by placing all the +operations that do not depend on each other into the same clock cycle. Datapath and register +allocation is performed by an unverified clique partitioning algorithm. The equivalence proof +between the DFG and HardwarePal is done by proof by simulation, where it is proven that, given a +valid input configuration, applying a translation or optimisation rule will result in a valid +DFG with the same behaviour as the input. + +There has also been work on proving the translation from Occam to gates~\cite{page91_compil_occam} +correct using algebraic proofs~\cite{jifeng93_towar}. This translation resembles dynamic scheduling, +as tokens are used to start the next operations. However, Occam is timed by design, and will +execute assignments sequentially. To take advantage of the parallel nature of hardware, Occam uses +explicit parallel constructs with channels to share state. Handel-C, a version of Occam with many +more features such as memory and pointers, has also been used to prove HLS correct, and is described +further in Section~\ref{sec:wire-to-wire}. + +\section{Mechanised: Compiler Proofs} + +Even though a proof for the correctness of an algorithm might exist, this does not guarantee that +the implementation behaves in the same way as the algorithm assumed in the proof. The +implementation is separate from the algorithm that was proven correct, meaning there could be +various implementation bugs that cause it to behave incorrectly. C +compilers are a good example of this, where many optimisations performed by the compilers have been +proven correct; however, these proofs are not linked directly to the actual implementations of these +algorithms in GCC or Clang. Yang et al.~\cite{yang11_findin_under_bugs_c_compil} found +more than 300 bugs in GCC and Clang, many of them being in the optimisation phases of the compiler.
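The translation-validation style of checking discussed in this section can be illustrated with a toy sketch. The Python fragment below is a hypothetical illustration (not how Catapult, SPARK, or BEDROC actually work): it compares the behaviour of an original and a scheduled straight-line program on a concrete state, whereas real validators establish equivalence symbolically for all inputs.

```python
# Toy translation validation: rather than proving the scheduler correct
# once and for all, check after each run that input and output programs
# behave identically. Here a "program" is a list of assignments and its
# "behaviour" is the final state it produces.

def run(program, state):
    # Each instruction is (dst, fn, srcs); fn combines the source values.
    for dst, fn, srcs in program:
        state[dst] = fn(*(state[s] for s in srcs))
    return state

def validate(original, scheduled, test_state):
    before = run(original, dict(test_state))
    after = run(scheduled, dict(test_state))
    return before == after

orig = [("x", lambda a, b: a + b, ("a", "b")),
        ("y", lambda a: a * 2, ("a",))]
# The scheduler swapped the two independent instructions.
sched = [orig[1], orig[0]]
print(validate(orig, sched, {"a": 1, "b": 2}))  # True
```

The appeal of the approach is clear from the sketch: `validate` knows nothing about how `sched` was produced, so the scheduler itself can be arbitrarily clever and unverified.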
+One way to link the proofs to the actual implementations in these compilers is to write the compiler +in a language which allows a theorem prover to check properties about the algorithms. This +allows the proofs to be directly linked to the algorithms, ensuring that the actual +implementations are proven correct. Yang et al.~\cite{yang11_findin_under_bugs_c_compil} +found that CompCert, a formally verified C compiler, only had five bugs, all in the unverified parts +of the compiler, suggesting that this method of proving algorithms correct is highly effective at producing a correct compiler. + +\subsection{CompCert} + +CompCert~\cite{leroy09_formal_verif_compil_back_end} is a formally verified C compiler written in +Coq~\cite{bertot04_inter_theor_provin_progr_devel}. Coq is a theorem prover, meaning algorithms +written in Coq can be reasoned about in Coq itself by proving various properties about them. To +then run these algorithms that have been proven correct, they can be extracted +directly to OCaml code and then executed, as there is a straightforward correspondence between Coq +and OCaml code. During this translation, all the proofs are erased, as they are only needed when +the correctness of the compiler is being checked, not during its execution. With this process, one +can have a compiler that satisfies various correctness +properties and can therefore be proven to preserve the behaviour of the code. + +CompCert contains eleven intermediate languages, which are used to gradually translate C code into +assembly that has the same behaviour. Proving the translation directly without going through the +intermediate languages would be infeasible, especially with the many optimisations that are +performed during the translation, as assembly is very different from the abstract C code the compiler receives.
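The reason the pass-by-pass structure works can be summarised by the following simplified statement (glossing over CompCert's actual formulation in terms of behaviour refinement for safe source programs): if every pass only refines the observable behaviours of its input program, then so does the composition of all the passes.

```latex
% Simplified compositionality of behaviour preservation: if each compiler
% pass C_i only refines behaviours, the whole pipeline does too.
\left( \forall i,\, q.\;\; \mathit{Beh}\bigl(\mathcal{C}_i(q)\bigr) \subseteq \mathit{Beh}(q) \right)
\;\Longrightarrow\;
\mathit{Beh}\bigl((\mathcal{C}_n \circ \cdots \circ \mathcal{C}_1)(p)\bigr) \subseteq \mathit{Beh}(p)
```

Each pass therefore only needs a simulation proof against its immediate neighbour, never against C or assembly directly.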
+The first three intermediate languages (C\#minor, Cminor, CminorSel) are used to transform Clight +into a more assembly-like language called register transfer language (RTL). This language consists +of a control-flow graph of instructions, and is therefore well suited for various compiler +optimisations such as constant propagation, dead-code elimination or function +inlining~\cite{tristan08_formal_verif_trans_valid}. After RTL, each intermediate language is used +to get closer to the assembly language of the architecture, performing operations such as register +allocation and making sure that the stack variables are correctly aligned. + +\subsection{Isabelle HLS} + +Martin Ellis' work on correct synthesis~\cite{ellis08} is the first example of a mechanically +verified high-level synthesis tool, also called a hardware/software compiler, as it produces +software as well as hardware for special functional units that should be hardware accelerated. The +main goal of the thesis is to provide a framework to prove hardware/software compilers correct, by +defining semantics for an intermediate language (IR) which supports partitioning of code into +hardware and software parts, as well as a custom netlist format which is used to describe the +hardware parts and is the final target of the hardware/software compiler. The proof of correctness +then says that the hardware/software design should have the same behaviour as if the design had only +been implemented in software. The framework used to prove the correctness of the compilation from +the IR to the netlist format is written in Isabelle, which is a theorem prover comparable to Coq. +Any proofs in the framework are therefore automatically checked by Isabelle for correctness. + +This work first defines a static single assignment (SSA) IR, with the capability of defining +sections that should be hardware accelerated and sections that should be executed on the CPU.
This +language supports \emph{hyperblocks}, which are lists of basic blocks with one entry node and +multiple exit nodes, making them well suited for various HLS optimisations and better scheduling. +However, as this is quite a low-level intermediate language, the first thing that would be required +is a tool that can actually translate a high-level language to this intermediate language. +In addition to that, they also describe a netlist format in which the hardware-accelerated code will +be represented. + +As both the input IR and output netlist format have been designed from scratch, they are not very +useful for real-world applications, as they require a different back end to be implemented for +existing compilers. In addition to that, it would most likely mean that the translation from a +higher-level language to the IR is unverified and could therefore contain bugs. As the resulting +netlist also uses a custom format, it cannot be fed directly to tools that can then translate it to a +bitstream to place onto an FPGA. The reason for designing a custom IR and netlist was so that +these were compatible with each other, making proofs of equivalence between the two simpler. + +Finally, it is unclear whether or not a translation algorithm from the IR to the netlist format was +actually implemented, as the only example in the thesis seems to be compiled by hand to explain the +proof. There are also no benchmarks on real input programs showing the efficiency of the +translation algorithm, and it is therefore unclear whether the framework would be able to prove more +complicated optimisations that a compiler might perform on the source code. The thesis seems to +describe the correctness proofs by assuming a compiler exists which outputs various properties that +are needed by the equivalence proof, such as a mapping between variables and netlist wires and +registers.
+ +Even though this Isabelle framework does provide an equivalence relation between an IR and a +netlist and describes how the translation would be proven correct by matching states to one another, +it is not actually usable in practice. First of all, the IR is only represented in Isabelle and +does not have a parser or a compiler which can target it. In addition to that, the netlist +format cannot be passed to any existing tool to then be placed onto an FPGA, meaning an additional, +unverified translation would have to take place. This further reduces the effectiveness of the +correctness proof, as there are various stages that are not addressed by the proofs and therefore +have to be assumed to be correct. + +\subsection[sec:wire-to-wire]{Mechanised Handel-C to Netlist Translation} + +Handel-C~\cite{bowen98_handel_c_languag_refer_manual} is a C-like language for hardware development. +It supports many C features such as assignments, if-statements, loops, pointers and functions. In +addition to these constructs, Handel-C also supports explicit timing constructs, such as explicit +parallelism as well as sequential or parallel assignment, similar to blocking and nonblocking +assignments in Verilog. Perna et al.~\cite{perna12_mechan_wire_wise_verif_handel_c_synth} developed +a mechanically verified Handel-C to netlist translation written in HOL. The translation is based +on previous work describing the translation from Occam to gates by Page et +al.~\cite{page91_compil_occam}, which was proven correct by Jifeng et al.~\cite{jifeng93_towar} +using algebraic proofs. As Handel-C is mostly an extension of Occam with C-like operations, the +translation from Handel-C to gates can proceed in a similar way. + +Perna et al. verify the compilation of a subset of Handel-C to gates, which does not include memory, +arrays or function calls.
In addition to the constructs presented in Page et al., the prioritised +choice construct is also added to the Handel-C subset that is supported. The verification proceeds +by first defining the algorithm to perform the compilation, chaining operations together with start +and end signals that determine the next construct which will be executed. The circuits themselves +are treated as black boxes by the compilation algorithm and are chosen based on the current +statement in Handel-C which is being translated. The compilation algorithm has to correctly link +each black box to the next using the start and end signals that the circuits expose in order to +compile the Handel-C code correctly. -aroistenaoirstenoiaresntoien +The verification of the compilation is done by proving that the control signal is propagated through +the circuits correctly, and that each circuit is activated at the right time. It is also proven +that the internal circuits of the constructs propagate the control signal in the correct way, +and will be active at the right clock cycle. However, it is not proven that the circuits have the +same functionality as the Handel-C constructs, only that the control signals are propagated in the +correct manner. In addition to that, the assumption is made that the semantics for time are the +same in Handel-C as well as in the netlist format that is generated, which could be proven if the +constructs are shown to behave in the exact same way as the Handel-C constructs.
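The start/end-signal chaining described above can be sketched as a small simulation. The following hypothetical Python fragment (an illustration, not the actual HOL development) treats each compiled construct as a black box that is activated by a start signal and raises an end signal after its latency, with sequential composition wiring each end signal to the next start signal:

```python
# Sketch of start/end-signal chaining: each compiled Handel-C construct
# is a black box activated by a start signal; it raises its end signal
# `latency` cycles later, which in turn starts the next black box.

class Box:
    def __init__(self, name, latency):
        self.name, self.latency = name, latency

    def run(self, start_cycle):
        # Active from start_cycle; end signal raised `latency` cycles on.
        end_cycle = start_cycle + self.latency
        print(f"{self.name}: start={start_cycle} end={end_cycle}")
        return end_cycle

def chain(boxes, start_cycle=0):
    # Sequential composition: each end signal becomes the next start.
    for box in boxes:
        start_cycle = box.run(start_cycle)
    return start_cycle

# Three chained constructs finish after the sum of their latencies.
total = chain([Box("assign", 1), Box("if", 2), Box("while", 3)])
print(total)  # 6
```

Note that this sketch only models the control-signal timing, which mirrors the limitation discussed above: the proofs establish when each circuit is active, not that each black box computes the same value as the Handel-C construct it replaces.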
\stopcomponent diff --git a/chapters/conclusion.tex b/chapters/conclusion.tex index 6422396..91ca14f 100644 --- a/chapters/conclusion.tex +++ b/chapters/conclusion.tex @@ -1 +1,8 @@ +\environment fonts_env +\environment lsr_env + +\startcomponent conclusion + \chapter{Conclusion} + +\stopcomponent diff --git a/chapters/dynamic.tex b/chapters/dynamic.tex index a1744e8..5edd812 100644 --- a/chapters/dynamic.tex +++ b/chapters/dynamic.tex @@ -1 +1,8 @@ -\chapter{\ymhfw{} Dynamic Scheduling} +\environment fonts_env +\environment lsr_env + +\startcomponent dynamic + +\chapter[sec:dynamic]{Dynamic Scheduling} + +\stopcomponent diff --git a/chapters/hls.tex b/chapters/hls.tex index 362309f..8ff7143 100644 --- a/chapters/hls.tex +++ b/chapters/hls.tex @@ -1,8 +1,5 @@ -\environment fonts_env -\environment lsr_env - \startcomponent hls -\product main +\product ../main.tex \def\slowdownOrig{27} \def\slowdownDiv{2} diff --git a/chapters/pipelining.tex b/chapters/pipelining.tex index 4308188..5e07bd7 100644 --- a/chapters/pipelining.tex +++ b/chapters/pipelining.tex @@ -1 +1,8 @@ -\chapter{\ymhfw{} Loop Pipelining} +\environment fonts_env +\environment lsr_env + +\startcomponent pipelining + +\chapter[sec:pipelining]{Loop Pipelining} + +\stopcomponent diff --git a/chapters/schedule.tex b/chapters/schedule.tex index 9dc8d38..1711108 100644 --- a/chapters/schedule.tex +++ b/chapters/schedule.tex @@ -1 +1,117 @@ -\chapter{Schedule} +\environment fonts_env +\environment lsr_env + +\startcomponent schedule + +\definecolor[hyperblockcolour][x=66C2A5] +\definecolor[dynamiccolour][x=FC8D62] +\definecolor[modulocolour][x=8DA0CB] +\definecolor[thesiscolour][x=E78AC3] + +\chapter[sec:schedule]{Schedule} + +The structure of the thesis has been described in the previous chapters.
Out of these chapters, the +work for \in{Chapter}[sec:hls] has been completed, the work for \in{Chapter}[sec:scheduling] is +still ongoing, and finally the work in \in{Chapter}[sec:pipelining] and \in{Chapter}[sec:dynamic] has +not been started yet. This chapter describes how the rest of the work will be completed. The +current plan is to finish all the work, including the thesis, by June 2023, which is when the funding +will run out. + +\section{Timeline} + +The main timeline for the rest of the PhD is shown in \in{Figure}[fig:future-gantt]. It is split +into four main sections: + +\desc{Hyperblock scheduling} The first section covers the ongoing work on +hyperblock scheduling, which is described in \in{Chapter}[sec:scheduling]. Currently, the main proofs +need to be completed, as well as submitting a paper about the work to FMCAD/POPL/PLDI. In general, +this work was supposed to be finished in November 2021; however, due to complications with the +actual formulation of the semantics, as well as the number of proofs that need to be completed, the +work is still ongoing. + +\desc{Modulo scheduling} The second piece of work is described in \in{Chapter}[sec:pipelining] and +describes the implementation of modulo scheduling to perform scheduling on loops. This work will be +based on existing work by \cite[alternative=authoryears][tristan10_simpl_verif_valid_softw_pipel]. +However, it is worth noting that the proof by +\cite[alternative=author][tristan10_simpl_verif_valid_softw_pipel] does include parts that are not +completely verified in Coq, meaning there may be various improvements that could be made to the +proof. + +\desc{Dynamic scheduling} Dynamic scheduling is another route that could be explored by using +Vericert to pass through \SSA, as well as GSA, and finally generate dynamic HTL (or some other +similar hardware language).
During my visit at Inria Rennes, I worked with Sandrine Blazy and +Delphine Demange to extend CompCertSSA~\cite[barthe14_formal_verif_ssa_based_middl_end_compc] with +\infull{GSA} (\GSA)~\cite[ottenstein90_progr_depen_web]. Not only should this allow for global +optimisations of \SSA\ code, but it should also allow for a more straightforward translation to +hardware as the GSA code can just be viewed as a \DFG. Instead of working on modulo scheduling, it +could therefore be interesting to work on a formally verified dynamic scheduling translation. + +\startplacefigure[ + title={Gantt chart describing the future plan for 2022 and 2023. The main funding runs out + mid-2023, so the plan is to finish the thesis writing by then as well.}, + reference={fig:future-gantt}, + location=top, +] + \hbox{\startgantt[xunitlength=17.5mm]{15}{6} + \startganttitle + \numtitle{2022}{1}{2022}{4} + \numtitle{2023}{1}{2023}{2} + \stopganttitle + \startganttitle + \numtitle{1}{1}{4}{1} + \numtitle{1}{1}{2}{1} + \stopganttitle + \ganttgroup{\rmx Hyperblock Scheduling}{0}{3} + \ganttbar[color=hyperblockcolour]{\rmx Proof}{0}{2} + \ganttbarcon[color=hyperblockcolour]{\rmx Paper Submission}{2}{1} + \ganttgroup{\rmx Modulo Scheduling}{2}{3} + \ganttbar[color=modulocolour]{\rmx Implementation}{2}{1} + \ganttbarcon[color=modulocolour]{\rmx Proof}{3}{2} + \ganttgroup{\rmx Dynamic Scheduling}{0}{5} + \ganttbar[color=dynamiccolour]{\rmx RTL $\rightarrow$ GSA}{0}{1} + \ganttbarcon[color=dynamiccolour]{\rmx GSA $\rightarrow$ RTL}{1}{1} + \ganttbar[color=dynamiccolour]{\rmx GSA $\rightarrow$ HTL}{2}{3} + \ganttcon{1}{9}{2}{11} + \ganttgroup{\rmx Thesis}{4}{2} + \ganttbar[color=thesiscolour]{\rmx Thesis Writing}{4}{2} + \stopgantt} +\stopplacefigure + +\section{Detailed Work Plan} + +This section describes a more detailed work plan for finalising the thesis. 
First, I will describe +the current plan to finish the short-term work of proving hyperblock scheduling, followed by +a plan for tackling either dynamic scheduling or modulo scheduling. + +\subsection{Finishing Work on Hyperblock Scheduling} + +As described in \in{Chapter}[sec:scheduling], the current milestone that needs to be achieved is +proving hyperblock scheduling. This requires proofs of the following translations: + +\desc{\sc RTL $\rightarrow$ RTLBlock} This translation generates basic blocks to ease the scheduling +proof as well as simplify the scheduling algorithm. + +\desc{\sc IfConv} The {\sc IfConv} transformation introduces predicated instructions by converting +conditional statements. This is necessary to generate hyperblocks and therefore improve the +schedule by passing larger blocks to the scheduler. + +\desc{\sc RTLBlock $\rightarrow$ RTLPar} The proof of the scheduling algorithm itself, which is +performed by validating the schedule and verifying that the validator is sound. + +\desc{\sc RTLPar $\rightarrow$ HTL} This translation generates the hardware from the parallel +description of the schedule, which can then be translated to Verilog using an existing translation. +As a translation from RTL to HTL has already been proven correct, it should be possible to reuse +many proofs from that translation and only have to modify the top-level proofs that deal with the +slightly different structure of the program. + +\subsection{Starting Work on Modulo Scheduling} + +After the proofs about hyperblock scheduling are finished, a natural extension is to work on +implementing modulo scheduling, which entails rearranging instructions in loops. + +\subsection{Possible Implementation for Dynamic Scheduling} + +Another plausible extension to Vericert is adding dynamic scheduling, either as an alternative to or +in addition to static scheduling.
+ +\stopcomponent diff --git a/chapters/scheduling.tex b/chapters/scheduling.tex index 5edad2e..2201f72 100644 --- a/chapters/scheduling.tex +++ b/chapters/scheduling.tex @@ -1,320 +1,235 @@ +\environment fonts_env +\environment lsr_env + \startcomponent scheduling -\project main -\chapter{Static Scheduling} +\chapter[sec:scheduling]{Static Scheduling} \section[introduction]{Introduction} -The use of multicore processors has been increasing drastically, thereby -making parallelising compilers ever more important. In addition to that, -the need for custom hardware accelerators is also increasing, as -traditional processors are not always the best choice for all -applications. Compilers that support optimisations which can -automatically parallelise it's input programs are therefore also -becoming more important, as these back ends all benefit from it. For -processors, one wants to optimise the pipeline usage as much as -possible, whereas in hardware one can create custom pipelines, as well -as making more use out of the spacial properties that hardware offers, -making instruction parallelism important as well as speculative -execution. However, with more powerful optimisations there is a greater -chance to make mistakes in the compiler, making it possible to generate -programs that behave differently to the input program. An important step -towards making compilers more reliable is to formally verify these -optimisations so that it can be guaranteed that bugs are not introduced. - -A popular optimisation for processors increasing the instruction -parallelism of the resulting program is to use scheduling. This is an -optimisation that is especially important for processors that have deep -instruction pipelines or multiple functional units that can be used to -execute instructions in parallel. However, this optimisation is also -critical for high-level synthesis tools, which generate application -specific hardware designs based on a software input. 
Simple scheduling -algorithms often work on basic blocks, which are sequential lists of -instructions without any branches. However, this limits what kind of -reordering the scheduling algorithm can perform. It can never move -instructions past a branching instruction, even when this might -drastically improve the performance. Trace scheduling~ is an alternative -that addresses this issue, by creating paths that might cross branching -instructions, and scheduling the instructions in that path. However, the -main problem with trace scheduling is that it is often infeasible on -large programs, and because of the large design space, it's sometimes -hard to find an optimal schedule. Superblock~ and hyperblock~ scheduling -are two subsets of trace scheduling, which compromise on the flexibility -of trace scheduling to instead build more tractable algorithms. -Superblock scheduling produces more efficient code for processors with a -low issue rate, whereas hyperblock scheduling produces more performant -code for processors with a high issue rate~. - -CompCert~ is a compiler that is formally verified in Coq. In addition to -that, have also added a formally verified hardware back end to CompCert, -allowing it to also generate provably correct hardware designs. A -hardware back end is an ideal target for hyperblock scheduling, as the -issue rate is practically infinite. This paper describes the -implementation of a hyperblock instruction scheduler in CompCert so that -it can benefit existing CompCert back ends, as well as support more +The use of multicore processors has been increasing drastically, thereby making parallelising +compilers ever more important. In addition to that, the need for custom hardware accelerators is +also increasing, as traditional processors are not always the best choice for all +applications. 
Compilers that support optimisations which can automatically parallelise its input
+programs are therefore also becoming more important, as these back ends all benefit from it. For
+processors, one wants to optimise the pipeline usage as much as possible, whereas in hardware one
+can create custom pipelines, as well as making more use out of the spatial properties that hardware
+offers, making instruction parallelism important as well as speculative execution. However, with
+more powerful optimisations there is a greater chance to make mistakes in the compiler, making it
+possible to generate programs that behave differently to the input program. An important step
+towards making compilers more reliable is to formally verify these optimisations so that it can be
+guaranteed that bugs are not introduced.
+
+A popular optimisation for processors that increases the instruction parallelism of the resulting
+program is scheduling. This is an optimisation that is especially important for processors
+that have deep instruction pipelines or multiple functional units that can be used to execute
+instructions in parallel. However, this optimisation is also critical for high-level synthesis
+tools, which generate application-specific hardware designs based on a software input. Simple
+scheduling algorithms often work on basic blocks, which are sequential lists of instructions without
+any branches. However, this limits what kind of reordering the scheduling algorithm can perform. It
+can never move instructions past a branching instruction, even when this might drastically improve
+the performance. Trace scheduling~ is an alternative that addresses this issue, by creating paths
+that might cross branching instructions, and scheduling the instructions in that path. However, the
+main problem with trace scheduling is that it is often infeasible on large programs, and because of
+the large design space, it is sometimes hard to find an optimal schedule. 
Superblock~ and hyperblock~
+scheduling are two subsets of trace scheduling, which compromise on the flexibility of trace
+scheduling to instead build more tractable algorithms. Superblock scheduling produces more
+efficient code for processors with a low issue rate, whereas hyperblock scheduling produces more
+performant code for processors with a high issue rate~.
+
+CompCert~ is a compiler that is formally verified in Coq. In addition to that, we have also added a
+formally verified hardware back end to CompCert, allowing it to also generate provably correct
+hardware designs. A hardware back end is an ideal target for hyperblock scheduling, as the issue
+rate is practically infinite. This chapter describes the implementation of a hyperblock instruction
+scheduler in CompCert so that it can benefit existing CompCert back ends, as well as support more
 general targets such as scheduling instructions for custom hardware.
 
 \subsection[key-points]{Key Points}
 
-The goal of this paper is to introduce hyperblock scheduling into
-CompCert to allow for advanced and flexible scheduling operations that
-could be beneficial to various different back ends. To this end, the key
-points of this paper are the following:
+The goal of this chapter is to introduce hyperblock scheduling into CompCert to allow for advanced
+and flexible scheduling operations that could be beneficial to various different back ends. To this
+end, the key points of this chapter are the following:
 
 \startitemize
 \item
-  Description of how hyperblocks are introduced for improved scheduling
-  of instructions, and how these can be built and translated back to
-  instructions without predicates.
+  Description of how hyperblocks are introduced for improved scheduling of instructions, and how
+  these can be built and translated back to instructions without predicates.
\item - Description of the implementation and correctness proof of a system of - difference constraints (SDC) scheduling algorithm, which can be - adapted to various different scheduling strategies. {\bf {[}{[}JW:} - What's the link between hyperblocks and SDC-based scheduling? Are - these orthogonal?{\bf {]}{]}}{\bf {[}{[}YH:} Well the original SDC - papers only talk about basic blocks, and not SDC hyperblock - scheduling. But it's easy enough to adapt. SDC just requires some kind - of blocks without control-flow.{\bf {]}{]}} {\bf {[}{[}JW:} Ok cool, - so you can have SDC scheduling without hyperblocks. Can you have - hyperblocks without SDC scheduling? (I guess so, but just wanted to be - completely sure.){\bf {]}{]}}{\bf {[}{[}YH:} Yes exactly, you can - perform any type of basic block scheduling on hyperblocks as well I - think, as long as you mark the dependencies correctly. But in the - worst case you could even just see hyperblocks as basic blocks and - ignore predicates. It would still be correct but just less - efficient.{\bf {]}{]}} - - {\bf {[}{[}JW:} It would be interesting to make the case for SDC-based - scheduling. Is it superior to other scheduling mechanisms? Is it so - much superior that it's worth the additional complexity, compared to, - say, list scheduling? {\bf {]}{]}}{\bf {[}{[}YH:} SDC scheduling is - more flexible with the type of constraint to optimise for (latency / - throughput). I don't think it's too meaningful for compilers, where - there isn't as much flexibility.{\bf {]}{]}} {\bf {[}{[}JW:} Thanks. - Does Vericert have any idea that the scheduling is being performed - using SDC? Or is the SDC stuff entirely internal to your unverified - OCaml code? 
That is, could you swap your scheduler with a different
-  one that isn't based on SDC, and not have to change anything in
-  Vericert?{\bf {]}{]}}{\bf {[}{[}YH:} The verified part doesn't know
-  about SDC at all, so it could be changed to any other scheduler that
-  acts on hyperblocks.{\bf {]}{]}}
+  Description of the implementation and correctness proof of a system of difference constraints
+  (SDC) scheduling algorithm, which can be adapted to various different scheduling strategies.
 \stopitemize
 
 \section[sec:scheduling]{Implementation of Hyperblocks in CompCert}
 
 This section describes the structure of hyperblocks in
-Section~\goto{2.1}[scheduling:hyperblocks]. Then, the structure of two
-extra intermediate languages that were added to CompCert to implement
-hyperblocks are also shown in Section~\goto{2.2}[sec:rtlblockdef].
+Section~\goto{2.1}[scheduling:hyperblocks]. Then, the structure of two extra intermediate languages
+that were added to CompCert to implement hyperblocks is also shown in
+Section~\goto{2.2}[sec:rtlblockdef].
 
 \subsection[scheduling:hyperblocks]{Hyperblocks as a Generalisation of Basic Blocks}
 
-Basic blocks are popular in intermediate languages because they describe
-a list of sequential instructions that are not separated by
-control-flow. This information can be used by various analysis passes to
-perform optimisations, as these sequences of instructions can be
-reordered in any way as long as their data dependencies are met. One
-such optimisation is scheduling, which reorders instructions to
-eliminate of any delay slots caused by the CPU pipeline.
-
-However, due to control-flow not being allowed in basic blocks, the
-number of instructions that can be scheduled at any one time is limited.
-The boundaries to other basic blocks act as hard boundary to the
-scheduling algorithm. 
hyperblocks~ allow instructions to be predicated,
-thereby introducing some control-flow information into the blocks, which
-can also be taken into account by the scheduler. For example, any data
-dependency between instructions that have mutually exclusive predicates
-can be removed, as these instructions will never be activate at the same
-time.
-
-\subsection[sec:rtlblockdef]{RTLBlock and RTLPar Intermediate Language Definition}
+Basic blocks are popular in intermediate languages because they describe a list of sequential
+instructions that are not separated by control-flow. This information can be used by various
+analysis passes to perform optimisations, as these sequences of instructions can be reordered in any
+way as long as their data dependencies are met. One such optimisation is scheduling, which reorders
+instructions to eliminate any delay slots caused by the CPU pipeline.
 
-Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the
-intermediate languages introduced to implement hyperblocks in CompCert.
-This consists of new RTLBlock and RTLPar intermediate languages, which
-implement the sequential and parallel semantics of basic blocks
-respectively. The semantics of RTLBlock and RTLPar are based on the
-CompCert RTL intermediate language. However, instead of mapping from
-states to instructions, RTLBlock maps from states to hyperblocks, and
-RTLPar maps from states to parallel hyperblocks, which will be described
-in the next sections.
+However, due to control-flow not being allowed in basic blocks, the number of instructions that can
+be scheduled at any one time is limited. The boundaries to other basic blocks act as a hard boundary
+to the scheduling algorithm. Hyperblocks~ allow instructions to be predicated, thereby introducing
+some control-flow information into the blocks, which can also be taken into account by the
+scheduler. 
For example, any data dependency between instructions that have mutually exclusive
+predicates can be removed, as these instructions will never be active at the same time.
 
-%\startformula \begin{aligned}
-% i\ ::=&\ \ \mono{RBnop} \\
-% &|\ \mono{RBop}\ p?\ \mathit{op}\ \vec{r}\ d \\
-% &|\ \mono{RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \\
-% &|\ \mono{RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \\
-% &|\ \mono{RBsetpred}\ p?\ c\ \vec{r}\ d \\
-% \end{aligned} \stopformula
-%
-%\startformula \begin{aligned}
-% i_{\mathit{cf}}\ ::=&\ \ \mono{RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \\
-% &|\ \mono{RBtailcall}\ \mathit{sig}\ f\ \vec{r} \\
-% &|\ \mono{RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \\
-% &|\ \mono{RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \\
-% &|\ \mono{RBjumptable}\ r\ \vec{n} \\
-% &|\ \mono{RBreturn}\ r? \\
-% &|\ \mono{RBgoto}\ n \\
-% &|\ \mono{RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \\
-% \end{aligned} \stopformula
+\subsection[sec:rtlblockdef]{RTLBlock and RTLPar Intermediate Language Definition}
 
-RTLBlock instructions are split into two types of instructions, standard
-instructions and control-flow instructions. The standard instructions
-are the instructions that can be placed into hyperblocks, whereas
-control-flow instructions are the instructions that can end hyperblocks.
-The standard instructions are shown in
-Figure~\goto{{[}fig:standard_instruction{]}}[fig:standard_instruction],
-whereas the control-flow instructions are shown in
-Figure~\goto{{[}fig:control_flow_instruction{]}}[fig:control_flow_instruction].
-Most instructions are quite similar to their RTL counterparts, however,
-there are some instructions that have been added. \type{RBsetpred} is a
-standard instruction which sets a predicate equal to an evaluated
-condition. This instruction is used during if-conversion to set the
-predicate to the value of the condition in the conditional statement
-instruction. 
To support proper predicated instructions, each instruction -therefore also contains an optional predicate which has been set -beforehand, which produces the hyperblock. In additon to that, there is -also an extra control-flow instruction called \type{RBpred_cf}, which -can branch on a predicate an nests more control-flow instructions inside -of it. This is also necessary for if-conversion, when converting an -already converted conditional statement, as depending on the predicate, -a different control-flow instruction may be necessary. {\bf {[}{[}JW:} I -think there'd be a lot of value in having a picture showing op-chaining -in action. See tex source for the kind of thing I mean. Probably best -done as a running example throughout the paper. {\bf {]}{]}} - -These instructions are use in RTLBlock as well as in RTLPar. The main -difference between these two languages is how these instructions are -arranged within the hyperblock and the execution semantics of the -hyperblock in the languages. Let $\mathit{hb}_{b}$ be the definition of -a hyperblock for RTLBlock, and let $\mathit{hb}_{p}$ be the definition -of a hyperblock for RTLPar, then $\mathit{hb}_{b}$ and $\mathit{hb}_{p}$ -be defined as in Figure~\goto{{[}fig:hb_def{]}}[fig:hb_def]. The Figure -shows the different nesting levels of the basic blocks in RTLBlock as -well as RTLPar, and where the parallel semantics of RTLPar come into -play. RTLBlock is made of a list of instructions and a control-flow -instruction that ends the hyperblock. Each instruction in RTLBlock is -executed sequentially. RTLPar is made of three separate lists. The outer -list behaves like the list in RTLBlock, and executes each list inside of -it sequentially. RTLPar then has a list which executes it's contents in -parallel, meaning each instruction uses the same state of the registers -and memory while executing. 
Finally, instead of directly containing -instructions, there is an additional layer of sequentially executing -lists of instructions, to be able to execute small groups of -instructions in sequence, but still in parallel to the other -instructions in the current parallel group. - -RTLPar is structured in this way so that the scheduler has more -flexibility, so that it can not only parallelise operations in the basic -blocks, but that it can also define sequential operations in these -parallel operations. This is especially useful for the hardware back -end, as it means that sequential operations that do not quite fill one -clock cycle can be executed sequentially, but can still be executed in -parallel with other independent operations. This optimisation is called -operation chaining, and is critical to get the most performance out of -the hardware. +Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the intermediate languages +introduced to implement hyperblocks in CompCert. This consists of new RTLBlock and RTLPar +intermediate languages, which implement the sequential and parallel semantics of basic blocks +respectively. The semantics of RTLBlock and RTLPar are based on the CompCert RTL intermediate +language. However, instead of mapping from states to instructions, RTLBlock maps from states to +hyperblocks, and RTLPar maps from states to parallel hyperblocks, which will be described in the +next sections. 
+
+\placeformula \startformula \startmathalignment
+\NC i\ \ \colon\colon= \NC \ \ \mono{RBnop} \NR
+\NC \NC |\ \ \mono{RBop}\ p?\ \mathit{op}\ \vec{r}\ d \NR
+\NC \NC |\ \ \mono{RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \NR
+\NC \NC |\ \ \mono{RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \NR
+\NC \NC |\ \ \mono{RBsetpred}\ p?\ c\ \vec{r}\ d \NR
+\stopmathalignment \stopformula
+
+\placeformula \startformula \startmathalignment
+\NC i_{\mathit{cf}}\ \ \colon\colon= \NC \ \ \mono{RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \NR
+\NC \NC |\ \ \mono{RBtailcall}\ \mathit{sig}\ f\ \vec{r} \NR
+\NC \NC |\ \ \mono{RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \NR
+\NC \NC |\ \ \mono{RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \NR
+\NC \NC |\ \ \mono{RBjumptable}\ r\ \vec{n} \NR
+\NC \NC |\ \ \mono{RBreturn}\ r? \NR
+\NC \NC |\ \ \mono{RBgoto}\ n \NR
+\NC \NC |\ \ \mono{RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \NR
+\stopmathalignment \stopformula
+
+RTLBlock instructions are split into two types of instructions: standard instructions and
+control-flow instructions. The standard instructions are the instructions that can be placed into
+hyperblocks, whereas control-flow instructions are the instructions that can end hyperblocks. The
+standard instructions are shown in
+Figure~\goto{{[}fig:standard_instruction{]}}[fig:standard_instruction], whereas the control-flow
+instructions are shown in
+Figure~\goto{{[}fig:control_flow_instruction{]}}[fig:control_flow_instruction]. Most instructions
+are quite similar to their RTL counterparts; however, there are some instructions that have been
+added. \type{RBsetpred} is a standard instruction which sets a predicate equal to an evaluated
+condition. This instruction is used during if-conversion to set the predicate to the value of the
+condition in the conditional statement instruction. 
To support proper predicated instructions, each
+instruction therefore also contains an optional predicate which has been set beforehand, which
+produces the hyperblock. In addition to that, there is also an extra control-flow instruction called
+\type{RBpred_cf}, which can branch on a predicate and nest more control-flow instructions inside of
+it. This is also necessary for if-conversion, when converting an already converted conditional
+statement, as depending on the predicate, a different control-flow instruction may be
+necessary.
+
+These instructions are used in RTLBlock as well as in RTLPar. The main difference between these two
+languages is how these instructions are arranged within the hyperblock and the execution semantics
+of the hyperblock in the languages. Let $\mathit{hb}_{b}$ be the definition of a hyperblock for
+RTLBlock, and let $\mathit{hb}_{p}$ be the definition of a hyperblock for RTLPar; then
+$\mathit{hb}_{b}$ and $\mathit{hb}_{p}$ are defined as in
+Figure~\goto{{[}fig:hb_def{]}}[fig:hb_def]. The figure shows the different nesting levels of the
+basic blocks in RTLBlock as well as RTLPar, and where the parallel semantics of RTLPar come into
+play. RTLBlock is made of a list of instructions and a control-flow instruction that ends the
+hyperblock. Each instruction in RTLBlock is executed sequentially. RTLPar is made of three separate
+lists. The outer list behaves like the list in RTLBlock, and executes each list inside of it
+sequentially. RTLPar then has a list which executes its contents in parallel, meaning each
+instruction uses the same state of the registers and memory while executing. Finally, instead of
+directly containing instructions, there is an additional layer of sequentially executing lists of
+instructions, to be able to execute small groups of instructions in sequence, but still in parallel
+to the other instructions in the current parallel group. 
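The nesting just described can be sketched concretely. The following toy model uses an invented list encoding, not the actual Coq definitions: an RTLBlock hyperblock is a flat list, while an RTLPar hyperblock is a list of sequential steps, each containing parallel groups of sequentially executed chains.

```python
# Illustrative model of the two hyperblock shapes (encoding is hypothetical).

hb_b = ["i1", "i2", "i3", "i4"]  # RTLBlock: executed strictly in order

hb_p = [                 # outer list: steps executed sequentially (one per cycle)
    [["i1", "i2"],       # inner lists within a step run in parallel...
     ["i3"]],            # ...and each chain runs its own contents in sequence
    [["i4"]],
]

# Flattening the parallel block recovers one valid sequential ordering,
# which is the intuition behind checking it against the RTLBlock original.
flat = [i for step in hb_p for chain in step for i in chain]
print(flat)
```

The middle layer of chains is what enables operation chaining: `"i1"` and `"i2"` share a clock cycle sequentially while `"i3"` runs alongside them.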
+
+RTLPar is structured in this way to give the scheduler more flexibility, so that it can not only
+parallelise operations in the basic blocks, but can also define sequential operations within
+these parallel operations. This is especially useful for the hardware back end, as it means that
+sequential operations that do not quite fill one clock cycle can be executed sequentially, but can
+still be executed in parallel with other independent operations. This optimisation is called
+operation chaining, and is critical to get the most performance out of the hardware.
 
 \subsection[translating-rtlblock-to-rtlpar]{Translating RTLBlock to RTLPar}
 
-The translation from RTLBlock to RTLPar is performed using a scheduling
-algorithm, which will takes hyperblocks and rearranges their
-instructions to maximise parallelism based on the data dependencies. It
-then arranges the instructions in RTLPar by putting independent
-instructions into the same parallel block. Scheduling is an optimisation
-that is well suited to translation validation, as the algorithm itself
-can be quite complex, whereas the output is normally relatively easy to
-check for equivalence with the input.
-Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows that
-the scheduling step comes right after the if-conversion step which
-originally creates the hyperblocks.
-
-Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain] shows how the scheduling
-step transforms RTLBlock into RTLPar. The RTLBlock hyperblock being
-scheduled is shown in
-Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a], which contains five
-predicated operations, comprising two additions and three
-multiplications. The data dependencies of the instructions in the
-hyperblock are shown in
-Figure~\goto{{[}fig:op_chain_b{]}}[fig:op_chain_b]. 
Curiously, even
-though operations and are normally dependent on each other due to a
-write-after-write conflict, but because the predicates are independent
-the write-after-write conflict can effectively be removed by the
+The translation from RTLBlock to RTLPar is performed using a scheduling algorithm, which takes
+hyperblocks and rearranges their instructions to maximise parallelism based on the data
+dependencies. It then arranges the instructions in RTLPar by putting independent instructions into
+the same parallel block. Scheduling is an optimisation that is well suited to translation
+validation, as the algorithm itself can be quite complex, whereas the output is normally relatively
+easy to check for equivalence with the input.
+Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows that the scheduling step comes
+right after the if-conversion step which originally creates the hyperblocks.
+
+Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain] shows how the scheduling step transforms RTLBlock
+into RTLPar. The RTLBlock hyperblock being scheduled is shown in
+Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a], which contains five predicated operations,
+comprising two additions and three multiplications. The data dependencies of the instructions in the
+hyperblock are shown in Figure~\goto{{[}fig:op_chain_b{]}}[fig:op_chain_b]. Curiously, operations
+and are normally dependent on each other due to a write-after-write conflict, but because
+the predicates are independent, the write-after-write conflict can effectively be removed by the
 scheduler.
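Dropping the write-after-write dependency is justified only when the two predicates are mutually exclusive. The brute-force sketch below illustrates that check; it is a hypothetical stand-in for the verified SAT-based check, with an invented predicate representation.

```python
from itertools import product

# Predicates modelled as Python functions over an assignment of boolean
# predicate variables. Two predicated writes to the same register only
# conflict if some assignment satisfies both predicates at once.

def mutually_exclusive(p1, p2, variables):
    """True if no assignment makes both predicates hold simultaneously."""
    return all(not (p1(env) and p2(env))
               for env in (dict(zip(variables, vals))
                           for vals in product([False, True],
                                               repeat=len(variables))))

p     = lambda env: env["p"]          # guard of the first write, e.g. r3 := ... [p]
not_p = lambda env: not env["p"]      # guard of the second write, e.g. r3 := ... [!p]

print(mutually_exclusive(p, not_p, ["p"]))
```

Since `p` and `!p` can never hold together, only one of the two writes ever executes, so the scheduler may place them in the same cycle.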
Then, -the multiplication in can also be placed in the same clock cycle as it -does not have any data dependencies. Then, in the next clock cycle, -either the multiplication in can take place, or the multiplication in -will take place, meaning these two multiplications can also be placed in -the same clock cycle. This gives the final schedule that is shown in -Figure~\goto{{[}fig:op_chain_d{]}}[fig:op_chain_d]. +The scheduler transforms the RTLBlock hyperblock into the RTLPar hyperblock shown in +Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. Even though the addition in is dependent on , +they can still be put into the same cycle as the additions do not take as long to complete as the +multiplications. This optimisation is called operation chaining. Then, the multiplication in can +also be placed in the same clock cycle as it does not have any data dependencies. Then, in the next +clock cycle, either the multiplication in can take place, or the multiplication in will take place, +meaning these two multiplications can also be placed in the same clock cycle. This gives the final +schedule that is shown in Figure~\goto{{[}fig:op_chain_d{]}}[fig:op_chain_d]. \section[abstr_interp]{Abstract Interpretation of Hyperblocks} -The correctness of the scheduling algorithm is checked by comparing the -symbolic expressions of each register before and after scheduling. This -section describes how these symbolic expressions are constructed for -each hyperblock. +The correctness of the scheduling algorithm is checked by comparing the symbolic expressions of each +register before and after scheduling. This section describes how these symbolic expressions are +constructed for each hyperblock. 
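The checking idea can be illustrated with a much-simplified, unpredicated model: symbolically evaluate both instruction orderings and compare the per-register expression trees. The encoding below is invented for illustration and ignores predicates, memory, and control flow.

```python
# Symbolically evaluate a straight-line block: each register ends up mapped
# to an expression tree over the initial register values ("init", r).

def sym_eval(block):
    regs = {}
    def val(r):
        return regs.get(r, ("init", r))   # unwritten registers keep initial value
    for (op, dst, *srcs) in block:
        regs[dst] = (op,) + tuple(val(s) for s in srcs)
    return regs

a = [("add", "r1", "r2", "r3"), ("mul", "r4", "r5", "r6")]
b = [("mul", "r4", "r5", "r6"), ("add", "r1", "r2", "r3")]  # reordered schedule

# Independent instructions commute: both orderings give identical expressions.
print(sym_eval(a) == sym_eval(b))
```

This is the essence of translation validation here: the scheduler can be arbitrarily clever, as long as the symbolic states of input and output blocks agree.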
\subsection[abstr_language]{Construction of Abstract Language} -The main difficulty when constructing symbolic expressions for -hyperblocks is the presence of predicates, which can be used by the -scheduling algorithm to reorder instructions further, even if these -contain data dependencies. The scheduling algorithm will manipulate -predicates in the following ways, which the comparison function needs to -take into account. +The main difficulty when constructing symbolic expressions for hyperblocks is the presence of +predicates, which can be used by the scheduling algorithm to reorder instructions further, even if +these contain data dependencies. The scheduling algorithm will manipulate predicates in the +following ways, which the comparison function needs to take into account. \startitemize \item - Instructions with predicates that are equivalent to false can be - removed. + Instructions with predicates that are equivalent to false can be removed. \item - Instructions with predicates that are mutually exclusive can be - executed in parallel. + Instructions with predicates that are mutually exclusive can be executed in parallel. \item Any predicate can be replaced with an equivalent predicate. \stopitemize The symbolic expressions have the following form: -There are three main resources that are present in symbolic expressions: -registers $r$, which represent the value of the registers in the -program, predicates $p$, which represent the value of the predicates -throughout the program, and finally, memory \type{Mem}, representing the -state of the memory. For each basic instruction in the hyperblock, there -is an equivalent symbolic expression which either modifies the state of -the registers, predicates or memory. Finally, as each instruction in the -hyperblock can be predicates, each symbolic expression is also paired -with a predicate as well. 
-
-Assigning the same resource multiple times with instructions that
-contain different predicates, means that the symbolic expression must
-also express this conditional result based on the predicates. This is
-expressed by assigning lists of predicated symbolic expressions to each
-resource, where each predicate needs to be mutually exclusive from the
-other predicates in the list.
-
-The expressions are then constructed using a function which updates the
-symbolic expressions assigned for each resource. This is done using
-multiple primitives which act on predicated types, which are made up of
-a list of pairs of predicates and the element of that type. The first
-important primitive is multiplication of two predicated types, which is
-implemented as performing a cartesian multiplication between the
-predicated types in the two lists, anding the two predicates and joining
+There are three main resources that are present in symbolic expressions: registers $r$, which
+represent the value of the registers in the program, predicates $p$, which represent the value of
+the predicates throughout the program, and finally, memory \type{Mem}, representing the state of the
+memory. For each basic instruction in the hyperblock, there is an equivalent symbolic expression
+which either modifies the state of the registers, predicates or memory. Finally, as each instruction
+in the hyperblock can be predicated, each symbolic expression is also paired with a predicate as
+well.
+
+Assigning the same resource multiple times with instructions that contain different predicates
+means that the symbolic expression must also express this conditional result based on the
+predicates. This is expressed by assigning lists of predicated symbolic expressions to each
+resource, where each predicate needs to be mutually exclusive from the other predicates in the list.
+
+The expressions are then constructed using a function which updates the symbolic expressions
+assigned for each resource. 
This is done using multiple primitives which act on predicated types,
+which are made up of a list of pairs of predicates and the element of that type. The first important
+primitive is multiplication of two predicated types, which is implemented by performing a cartesian
+product between the predicated types in the two lists, conjoining the two predicates and joining
 the two types of each list using a function.
 
 %\startformula \label{eq:1}
 %  P_{1} \otimes_{f} P_{2} \equiv \mono{map } (\lambda ((p_{1}, e_{1}), (p_{2}, e_{2})) . (p_{1} \land p_{2}, f\ e_{1}\ e_{2})) P_{1} \times P_{2} \stopformula
 
-In addition to that, another primitive that is needed is the following
-append operation, which will negate the combination of all predicates in
-the second predicated type, and it to the first predicated type and
-append the first predicated type to the second:
+In addition to that, another primitive that is needed is the following append operation, which
+conjoins the negation of a given predicate with each predicate in the first predicated type,
+conjoins the predicate itself with each predicate in the second, and appends the two lists:
 
 %\startformula \label{eq:2}
 %  \mu(p, P) \equiv \mono{map } (\lambda (p', e') . 
(p \land p', e'))\ P \stopformula @@ -322,8 +237,7 @@ append the first predicated type to the second: %\startformula \label{eq:3} % P_{1} \oplus_{p} P_{2} \equiv \mu(\neg p, P_{1}) \mathbin{++} \mu(p, P_{2}) \stopformula -\subsection[title={Example of -translation},reference={example-of-translation}] +\subsection[example-of-translation]{Example of translation} %\startformula \begin{aligned} % \label{eq:4} @@ -342,113 +256,87 @@ translation},reference={example-of-translation}] % \end{cases} % \end{aligned} \stopformula -Using the example shown in -Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain], the RTLBlock hyperblock -shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a] is scheduled -into the hyperblock RTLPar shown in -Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. The first step is to -translate the input and output of the scheduler into the Abstr -intermediate language by calculating the symbolic values that will be -stored in each of the registers. Symbolically evaluating the RTLBlock -hyperblock results in the expressions shown in -Figure~\goto{{[}fig:symbolic_expressions{]}}[fig:symbolic_expressions] -for registers \type{r1}, \type{r2} and \type{r3}, where the state of the -register \type{r1} at the start of the hyperblock is denoted as -$\mono{r1}^{0}$. - -This symbolic expression can be generated by sequentially going through -the list of predicated expressions and applying the update function -defined in Section~\goto{3.1}[abstr_language]. The resulting predicated -expressions is guaranteed to have predicates which are mutually -exclusive, meaning if the predicate evaluates to true, all other -predicates must evaluate to false. - -\subsection[title={Linear (Flat) Predicated -Expressions},reference={linear-flat-predicated-expressions}] - -The update function for predicated expressions is quite a bit more -complex compared to standard update functions for symbolic execution, -such as the one used by . 
The main reason for this is the non-recursive -nature of the predicated expressions. As the predicates are only -available at the top-level of the expressions, combining multiple -expressions means that the conditions and multiple possibilities also -need to trickle up to the top. - -Instead, if the predicated instructions were recursive, one would not -have to change any predicates when evaluating the argument lists of -instructions, for example, as these can just stay predicated. The -evaluation semantics would also be much simpler, because a rule can be -added for evaluating predicates. - -\section[title={Correctness of the Abstract -Interpretation},reference={sec:correctness}] - -The correctness of the scheduling step is dependent on the correctness -of the equivalence check between the two abstract languages. As the -abstract languages contain predicates, this comparison check cannot just -be a syntactic comparison between the two languages, because the -predicates might be different but logically equivalent. Instead, a mixed -equivalence check is performed using a verified SAT solver and a -syntactic equivalence check of the actual symbolic expressions. - -\subsection[title={Comparing Symbolic Abstract -Expressions},reference={comparing-symbolic-abstract-expressions}] - -Abstract expressions comprise a predicate and a symbolic expression. The -translation from RTLBlock and RTLPar to Abstract ensures that each -register will contain a predicated symbolic expression that describes -the value that will be assigned to it after evaluating the basic block. -This means that if an expression always produces the same result as -another expression, that the register will always contain the same -value. If the RTLBlock program and RTLPar program translate to two -Abstract languages where all the register's symbolic expressions match, -then this means that both programs will always produce the same value. 
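To make the shape of this check concrete, the following Python sketch is purely illustrative (the actual development is in Coq, and all names here are hypothetical): registers' symbolic expressions are compared syntactically, while predicates are compared semantically, with a brute-force truth-table check standing in for the verified SAT solver.

```python
from itertools import product

def equivalent_preds(p, q, variables):
    """Brute-force stand-in for the verified SAT-based equivalence check:
    p and q are predicates modelled as functions from an environment
    (a dict of booleans) to bool."""
    envs = (dict(zip(variables, vals))
            for vals in product([False, True], repeat=len(variables)))
    return all(p(env) == q(env) for env in envs)

def registers_agree(map1, map2, variables):
    """Check that two maps from registers to lists of (predicate,
    expression) pairs agree: expressions are compared syntactically,
    predicates up to logical equivalence.  Assumes each expression
    appears at most once per list (duplicates would need their
    predicates disjoined first)."""
    if map1.keys() != map2.keys():
        return False
    for reg in map1:
        preds1 = {expr: p for p, expr in map1[reg]}
        preds2 = {expr: p for p, expr in map2[reg]}
        if preds1.keys() != preds2.keys():
            return False
        if not all(equivalent_preds(preds1[e], preds2[e], variables)
                   for e in preds1):
            return False
    return True
```

The split into a syntactic part and a semantic part mirrors the mixed equivalence check described above, where only the predicates require a solver.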
- -The main comparison between two symbolic expressions can be done using a -simple syntactic comparison, because this ensures that each of the -expressions always evaluate to the same value, and is also general -enough for the expressions that need to be compared. However, as all -expressions might be predicated, this simple syntactic comparison cannot -be performed for the predicates. This is because predicated instructions -can be executed in many different orders, and the predicates are used to -calculate dependence information and therefore also determine the order -of operations. - -Instead, comparing predicates can be done by checking logical -equivalence of the predicates using a formally verified SAT solver. This -ensures that the predicates will always evaluate to the same value, -thereby making it a fair comparison. - -\subsection[title={Verified Decidable DPLL SAT -Solver},reference={verified_sat}] - -A SAT solver is needed to correctly and reliably compare predicated -expressions. This makes it possible to then compare many different -possible schedules, even if they remove expressions completely if these -are unnecessary. The SAT solver is needed in the verifier, which needs -to be proven correct in Coq to prove the scheduling translation correct. -The SAT solver therefore has to be verified, because it could otherwise -not be used in the proof of correctness. - -The DPLL algorithm is used to implement the SAT solver. This means that -the SAT solver takes in a conjunctive normal form (CNF) formula. The -conversion from a recursive predicate type into a CNF formula is done by -recursively expanding the predicate in equivalent CNF formulas. - -The SAT solver is implemented in Coq and proven to be sound and -complete. 
The soundness and completeness of an algorithm to check
-whether a predicate is satisfiable or not also directly implies that the
-check is decidable, simplifying many proofs in Coq by using a
-\type{Setoid} instance for equivalence between predicates. This makes
-rewriting predicates if they are equivalent much easier in proofs by
-reusing many of the existing Coq tactics.
-
-\subsection[title={Encoding Predicated Expressions as SAT
-Formulas},reference={encoding-predicated-expressions-as-sat-formulas}]
-
-Given the following two predicated expressions which should be tested
-for equivalence, one can construct a SAT expression which will ensure
-that both expressions will always result in the same value.
+Using the example shown in Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain], the RTLBlock hyperblock
+shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a] is scheduled into the RTLPar hyperblock
+shown in Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. The first step is to translate the
+input and output of the scheduler into the Abstr intermediate language by calculating the symbolic
+values that will be stored in each of the registers. Symbolically evaluating the RTLBlock hyperblock
+results in the expressions shown in
+Figure~\goto{{[}fig:symbolic_expressions{]}}[fig:symbolic_expressions] for registers \type{r1},
+\type{r2} and \type{r3}, where the state of the register \type{r1} at the start of the hyperblock is
+denoted as $\mono{r1}^{0}$.
+
+These symbolic expressions can be generated by sequentially going through the list of predicated
+expressions and applying the update function defined in Section~\goto{3.1}[abstr_language]. The
+resulting predicated expressions are guaranteed to have mutually exclusive predicates, meaning that
+if one predicate evaluates to true, all other predicates must evaluate to false. 
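For intuition, the primitives that the update function chains together can be sketched in Python. This is an illustrative model only, not the Coq definitions: predicated values are lists of (predicate, element) pairs, and predicates are represented as small syntax trees.

```python
def pand(p, q):
    """Conjunction of two predicate syntax trees."""
    return ('and', p, q)

def pnot(p):
    """Negation of a predicate syntax tree."""
    return ('not', p)

def pmul(P1, P2, f):
    """Multiplication of predicated types: Cartesian product of the two
    lists, conjoining the predicates and combining the elements with f
    (cf. eq. 1)."""
    return [(pand(p1, p2), f(e1, e2)) for (p1, e1) in P1 for (p2, e2) in P2]

def mu(p, P):
    """mu(p, P): conjoin the predicate p onto every entry of P (cf. eq. 2)."""
    return [(pand(p, q), e) for (q, e) in P]

def papp(p, P1, P2):
    """Append of predicated types: entries of P1 are guarded by the
    negation of p, entries of P2 by p itself, then the lists are
    appended (cf. eq. 3)."""
    return mu(pnot(p), P1) + mu(p, P2)
```

For example, `papp('c', [('a', 'x')], [('b', 'y')])` keeps `x` under the predicate `¬c ∧ a` and `y` under `c ∧ b`, which is how two predicated values can be merged under a condition.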
+
+\subsection[linear-flat-predicated-expressions]{Linear (Flat) Predicated Expressions}
+
+The update function for predicated expressions is considerably more complex than standard update
+functions for symbolic execution, such as the one used by . The main reason for this is the
+non-recursive nature of the predicated expressions. As the predicates are only available at the
+top level of the expressions, combining multiple expressions means that the conditions and multiple
+possibilities also need to trickle up to the top.
+
+Instead, if the predicated instructions were recursive, one would not have to change any predicates
+when evaluating the argument lists of instructions, for example, as these could just stay
+predicated. The evaluation semantics would also be much simpler, because a rule could be added for
+evaluating predicates.
+
+\section[sec:correctness]{Correctness of the Abstract Interpretation}
+
+The correctness of the scheduling step depends on the correctness of the equivalence check between
+the two abstract languages. As the abstract languages contain predicates, this comparison cannot
+just be a syntactic comparison between the two languages, because the predicates might be different
+but logically equivalent. Instead, a mixed equivalence check is performed, using a verified SAT
+solver and a syntactic equivalence check of the actual symbolic expressions.
+
+\subsection[comparing-symbolic-abstract-expressions]{Comparing Symbolic Abstract Expressions}
+
+Abstract expressions comprise a predicate and a symbolic expression. The translation from RTLBlock
+and RTLPar to Abstract ensures that each register will contain a predicated symbolic expression that
+describes the value that will be assigned to it after evaluating the basic block. This means that
+if an expression always produces the same result as another expression, the register will
+always contain the same value. 
If the RTLBlock program and RTLPar program translate to two Abstract
+languages where all the registers' symbolic expressions match, then both programs
+will always produce the same value.
+
+The main comparison between two symbolic expressions can be done using a simple syntactic
+comparison, because this ensures that each of the expressions always evaluates to the same value,
+and is also general enough for the expressions that need to be compared. However, as all expressions
+might be predicated, this simple syntactic comparison cannot be performed for the predicates. This
+is because predicated instructions can be executed in many different orders, and the predicates are
+used to calculate dependence information and therefore also determine the order of operations.
+
+Instead, predicates can be compared by checking their logical equivalence using a
+formally verified SAT solver. This ensures that the predicates will always evaluate to the same
+value, thereby making it a fair comparison.
+
+\subsection[verified_sat]{Verified Decidable DPLL SAT Solver}
+
+A SAT solver is needed to correctly and reliably compare predicated expressions. This then makes it
+possible to compare many different possible schedules, even ones that remove unnecessary
+expressions entirely. The SAT solver is needed in the verifier, which needs to be
+proven correct in Coq to prove the scheduling translation correct. The SAT solver therefore has to
+be verified, because it could otherwise not be used in the proof of correctness.
+
+The DPLL algorithm is used to implement the SAT solver. This means that the SAT solver takes in a
+conjunctive normal form (CNF) formula. The conversion from a recursive predicate type into a CNF
+formula is done by recursively expanding the predicate into equivalent CNF formulas.
+
+The SAT solver is implemented in Coq and proven to be sound and complete. 
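The core of such a solver is small. The following Python sketch of DPLL on a CNF formula (DIMACS-style integer literals) is illustrative only; it mirrors the general structure of a DPLL solver, not the verified Coq implementation.

```python
def dpll(clauses, assignment=None):
    """Decide satisfiability of a CNF formula by unit propagation plus
    case splitting.  Clauses are lists of non-zero ints, where a
    negative literal is a negated variable.  Returns a satisfying
    assignment as a dict, or None if the formula is unsatisfiable."""
    if assignment is None:
        assignment = {}
    changed = True
    while changed:  # unit propagation to a fixed point
        changed = False
        remaining = []
        for clause in clauses:
            if any(assignment.get(abs(l)) == (l > 0) for l in clause):
                continue  # clause already satisfied
            unassigned = [l for l in clause if abs(l) not in assignment]
            if not unassigned:
                return None  # conflict: clause is falsified
            if len(unassigned) == 1:
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
            else:
                remaining.append(unassigned)
        clauses = remaining
    if not clauses:
        return assignment
    var = abs(clauses[0][0])  # branch on an unassigned variable
    for value in (True, False):
        result = dpll(clauses, {**assignment, var: value})
        if result is not None:
            return result
    return None
```

Two predicates can then be checked for equivalence by converting the negation of their biconditional to CNF and asserting that `dpll` returns `None` on it.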
The soundness and +completeness of an algorithm to check whether a predicate is satisfiable or not also directly +implies that the check is decidable, simplifying many proofs in Coq by using a \type{Setoid} +instance for equivalence between predicates. This makes rewriting predicates if they are equivalent +much easier in proofs by reusing many of the existing Coq tactics. + +\subsection[encoding-predicated-expressions-as-sat-formulas]{Encoding Predicated Expressions as SAT + Formulas} + +Given the following two predicated expressions which should be tested for equivalence, one can +construct a SAT expression which will ensure that both expressions will always result in the same +value. %\startformula \begin{aligned} % \bgroup \left.\begin{aligned} @@ -465,33 +353,28 @@ that both expressions will always result in the same value. % e_{m}',\quad &p_{m}'\\ % \end{cases}\end{aligned} \stopformula -The SAT query that should determine if two predicated expressions are -equivalent is the following, where $\mathit{he}_{i}$ is the hashed -expression of $e_{i}$: +The SAT query that should determine if two predicated expressions are equivalent is the following, +where $\mathit{he}_{i}$ is the hashed expression of $e_{i}$: %\startformula \begin{gathered} % (p_{1} \implies \mathit{he}_{1}) \land (p_{2} \implies \mathit{he}_{2}) \land \cdots \land (p_{n} \implies \mathit{he}_{n}) \\ % \Longleftrightarrow\\ % (p_{1}' \implies \mathit{he}_{1}') \land (p_{2}' \implies \mathit{he}_{2}') \land \cdots \land (p_{m}' \implies \mathit{he}_{m}') \\\end{gathered} \stopformula -However, reasoning about the equivalence of individual expressions -within the greater list of predicated expressions is not quite -straightforward. The evaluation semantics of predicated expression list -will evaluate each predicate expression sequentially, and pick the -expression whose predicate evaluates to true. 
However, proving that both
-predicated expression lists will return the same value assuming that the
-SAT query above is true, means that for each predicate that is true, One
-needs to be able to show that there is a predicate in the other list
-that will also be true, which will return the same value. However, there
-is no guarantee that there is a unique predicated expression that will
-match the current predicate, as there may be duplicate expressions in
-the list with different predicates.
-
-This means that it is quite difficult to prove that the SAT query
-implies the semantic preservation of the scheduling transformation. The
-SAT query can therefore be broken down into a slightly more direct
-comparison that is more useful for the proof itself. The steps to
-perform the more fine grained comparison are the following:
+However, reasoning about the equivalence of individual expressions within the greater list of
+predicated expressions is not quite straightforward. The evaluation semantics of a predicated
+expression list evaluates each predicated expression sequentially, and picks the expression whose
+predicate evaluates to true. Proving that both predicated expression lists will return the
+same value, assuming that the SAT query above is true, means that for each predicate that is true,
+one needs to be able to show that there is a predicate in the other list that will also be true
+and will return the same value. However, there is no guarantee that there is a unique predicated
+expression that will match the current predicate, as there may be duplicate expressions in the list
+with different predicates.
+
+This means that it is quite difficult to prove that the SAT query implies the semantic preservation
+of the scheduling transformation. The SAT query can therefore be broken down into a slightly more
+direct comparison that is more useful for the proof itself. 
The steps to perform the more fine +grained comparison are the following: %\startenumerate[n] %\item @@ -507,81 +390,65 @@ perform the more fine grained comparison are the following: % one of the maps, then it's predicate should be equivalent to $\perp$. %\stopenumerate -This comparison is logically equivalent to performing the large SAT -query, however, it maps better to the proof because it is comparing the -predicates of the same expressions with each other. +This comparison is logically equivalent to performing the large SAT query, however, it maps better +to the proof because it is comparing the predicates of the same expressions with each other. \subsection[proof-of-equivalence-between-passes]{Proof of equivalence between passes} -Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the -four different passes that were added to CompCert to form hyperblocks. -This section will cover the proofs for each of the translations, -focusing mainly on the proof of translation of the scheduling pass. +Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the four different passes that +were added to CompCert to form hyperblocks. This section will cover the proofs for each of the +translations, focusing mainly on the proof of translation of the scheduling pass. \subsubsection[proof-of-correctness-of-the-if-conversion-pass]{Proof of Correctness of the If-Conversion Pass} -The if-conversion pass is quite simple to verify, as it is an -optimisation that transforms the code in the same language. +The if-conversion pass is quite simple to verify, as it is an optimisation that transforms the code +in the same language. \subsubsection[proof-of-correctness-of-scheduling]{Proof of Correctness of Scheduling} -The scheduling pass is formally verified using translation validation, -using the Abstr language as a point of comparison between RTLBlock and -RTLPar. 
Using the algorithm defined in Section~\goto{3}[abstr_interp],
-the formally verified SAT solver described in
-Section~\goto{4.2}[verified_sat] can be used prove that if the two
-predicated expression lists are equivalent, that they will always
-evaluate to the same value.
+The scheduling pass is formally verified using translation validation, using the Abstr language as a
+point of comparison between RTLBlock and RTLPar. Using the algorithm defined in
+Section~\goto{3}[abstr_interp], the formally verified SAT solver described in
+Section~\goto{4.2}[verified_sat] can be used to prove that if the two predicated expression lists
+are equivalent, then they will always evaluate to the same value.

\section[implementation-details-of-static-sdc-scheduling]{Implementation Details of Static SDC
  Scheduling}

-This section describes the implementation details of the static SDC
-scheduling algorithm used, and also describes some of the details of the
-basic block generation as well as the if-conversion pass which generates
-the predicated instructions.
+This section describes the implementation details of the static SDC scheduling algorithm used, and
+also covers some of the details of the basic block generation as well as the if-conversion pass
+which generates the predicated instructions.

\subsection[implementation-of-if-conversion]{Implementation of If-Conversion}

-If-conversion is the conversion that introduces predicated instructions
-which can make use of the hyperblocks. It converts conditional
-statements into predicated instructions. This transformation has a few
-limitations on the kind of conditional statements that it can translate.
-First, only conditional statements without loops can be translated,
-therefore, one must identify cycles in the control-flow before
-performing the if-conversion. Secondly, if-conversion will not always
-result in more efficient code, so it should not be applied to any
-conditional statements. 
Instead, it is best applied to conditional
-statements where each branch will take a similar amount of time to
-execute.
-
-For back ends that do not support predicates, it is also important to
-eliminate the predicates again using a process called reverse
-if-conversion, which creates branches out of the predicates and groups
-these together again. This allows for hyperblock scheduling to also be
-used with back ends that do not support predicates, which are most the
-default back ends included in CompCert.
-
-\subsection[title={Static SDC
-Scheduling},reference={static-sdc-scheduling}]
-
-System of difference constraints (SDC) scheduling~ is a flexible
-algorithm to perform scheduling of instructions, especially if the
-target is hardware directly. It allows for flexible definitions of
-optimisation functions in terms of latencies and dependencies of the
-operations in the basic block.
-
-\section[title={Performance
-Comparison},reference={performance-comparison}]
-
-\section[title={Related Work},reference={sec:related-work}]
-
-\section[title={Conclusion},reference={conclusion}]
-
-This material is based upon work supported by the under Grant No.~ and
-Grant No.~. Any opinions, findings, and conclusions or recommendations
-expressed in this material are those of the author and do not
-necessarily reflect the views of the National Science Foundation.
+If-conversion is the transformation that introduces the predicated instructions which make
+hyperblocks possible. It converts conditional statements into predicated instructions. This
+transformation has a few limitations on the kind of conditional statements that it can translate.
+First, only conditional statements without loops can be translated; therefore, one must identify
+cycles in the control flow before performing the if-conversion. Secondly, if-conversion will not
+always result in more efficient code, so it should not be applied indiscriminately to every
+conditional statement. 
Instead, it is best
+applied to conditional statements where each branch will take a similar amount of time to execute.
+
+For back ends that do not support predicates, it is also important to eliminate the predicates again
+using a process called reverse if-conversion, which creates branches out of the predicates and
+groups these together again. This allows hyperblock scheduling to also be used with back ends
+that do not support predicates, which are most of the default back ends included in CompCert.
+
+\subsection[static-sdc-scheduling]{Static SDC Scheduling}
+
+System of difference constraints (SDC) scheduling~ is a flexible algorithm to perform scheduling of
+instructions, especially when the target is hardware directly. It allows for flexible definitions of
+optimisation functions in terms of latencies and dependencies of the operations in the basic block.
+
+\section[performance-comparison]{Performance Comparison}
+
+\section[sec:related-work]{Related Work}
+
+\section[conclusion]{Conclusion}
+
+This material is based upon work supported by the under Grant No.~ and Grant No.~. Any opinions,
+findings, and conclusions or recommendations expressed in this material are those of the author and
+do not necessarily reflect the views of the National Science Foundation.

\stopcomponent
-- cgit