\environment fonts_env \environment lsr_env \startcomponent scheduling \chapter[sec:scheduling]{Static Scheduling} \section[introduction]{Introduction} \oindex{static scheduling}The use of multicore processors has been increasing drastically, thereby making parallelising compilers ever more important. In addition to that, the need for custom hardware accelerators is also increasing, as traditional processors are not always the best choice for all applications. Compilers that support optimisations which can automatically parallelise their input programs are therefore also becoming more important, as back ends for both of these targets benefit from them. For processors, one wants to optimise the pipeline usage as much as possible, whereas in hardware one can create custom pipelines, as well as making more use of the spatial properties that hardware offers, making instruction parallelism as well as speculative execution important. However, with more powerful optimisations there is a greater chance of making mistakes in the compiler, making it possible to generate programs that behave differently from the input program. An important step towards making compilers more reliable is to formally verify these optimisations, so that it can be guaranteed that bugs are not introduced. A popular optimisation for processors that increases the instruction parallelism of the resulting program is scheduling. This is an optimisation that is especially important for processors that have deep instruction pipelines or multiple functional units that can be used to execute instructions in parallel. However, this optimisation is also critical for high-level synthesis tools, which generate application-specific hardware designs based on a software input. Simple scheduling algorithms often work on basic blocks, which are sequential lists of instructions without any branches. However, this limits the kind of reordering the scheduling algorithm can perform. It can never move instructions past a branching instruction, even when this might drastically improve the performance. Trace scheduling~\cite[fisher81_trace_sched] is an alternative that addresses this issue by creating paths that may cross branching instructions, and scheduling the instructions in each path. However, the main problem with trace scheduling is that it is often infeasible on large programs, and because of the large design space, it is sometimes hard to find an optimal schedule. \index{superblock}Superblock~\cite[hwu93_super] and \index{hyperblock}hyperblock~\cite[mahlke92_effec_compil_suppor_predic_execut_using_hyper] scheduling are two subsets of trace scheduling, which compromise on the flexibility of trace scheduling to instead build more tractable algorithms. Superblock scheduling produces more efficient code for processors with a low issue rate, whereas hyperblock scheduling produces more performant code for processors with a high issue rate~\cite[mahlke92_effec_compil_suppor_predic_execut_using_hyper,aiken16_trace_sched]. A hardware back end is an ideal target for hyperblock scheduling, as the issue rate is practically infinite. This chapter describes the implementation of a hyperblock instruction scheduler in CompCert, so that it can benefit existing CompCert back ends as well as support more general targets such as scheduling instructions for custom hardware.
\subsection[key-points]{Key Points} The goal of this chapter is to introduce hyperblock scheduling into CompCert to allow for advanced and flexible scheduling operations that could be beneficial to various back ends. To this end, the key points of this chapter are the following: \startitemize \item Description of how hyperblocks are introduced for improved scheduling of instructions, and how these can be built and translated to hardware. \item Description of the implementation and correctness proof of an \SDC\ scheduling algorithm. The correctness proof does not rely on the algorithm itself, so the algorithm can safely be replaced by a different scheduling algorithm with similar properties. \stopitemize \section{Progress Summary} \startplacemarginfigure[reference={fig:compcert_interm},title={New intermediate languages introduced into CompCert. Progress bars are shown for each transformation pass, where green represents the estimated portion of the work that is completed, red the remaining work to be done, and orange a pass that is currently not used in the translation but could be added back in the future.}] \hbox{\starttikzpicture[shorten >=1pt,>=stealth] \node at (-3, 0) (dotsin) {$\cdots$}; \node at (10, 0) (dotsout) {$\cdots$}; \node[draw] at (-1, 0) (rtl) {\rtl}; \node[draw] at (2, 0) (rtlblock) {\rtlblock}; \node[draw] at (2, -2) (abstrblock) {\abstr}; \node[draw] at (6, 0) (rtlpar) {\rtlpar}; \node[draw] at (6, -2) (abstrpar) {\abstr}; \node[draw] at (8.5, 0) (htl) {\htl}; \draw[->] (rtl) -- node[above,midway,align=center,font=\small] {basic block \\[-0.3em] creation} node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=1.5cm] {} (rtlblock); \draw[->] (rtlblock) -- node[above,midway,font=\small] {scheduling} node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=1.5cm] {} (rtlpar); \draw[->] (rtlpar) -- node[below,yshift=-3mm,xshift=-5mm,anchor=west,fill=VenetianRed,minimum width=1cm] {} node[below,yshift=-3mm,xshift=-5mm,anchor=west,fill=ForestGreen,minimum width=2.5mm] {} (htl); \draw[->,dashed] (rtlblock) -- node[left,yshift=-5mm,xshift=-3mm,anchor=south,fill=VenetianRed,minimum height=1cm] {} node[left,yshift=-5mm,xshift=-3mm,anchor=south,fill=ForestGreen,minimum height=8mm] {} (abstrblock); \draw[->,dashed] (rtlpar) -- node[right,yshift=-5mm,xshift=3mm,anchor=south,fill=VenetianRed,minimum height=1cm] {} node[right,yshift=-5mm,xshift=3mm,anchor=south,fill=ForestGreen,minimum height=6mm] {} (abstrpar); \draw[dashed,shorten >=0pt] (abstrblock) -- node[above,midway] {$\sim$} node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=VenetianRed,minimum width=1.5cm] {} node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=1.2cm] {} (abstrpar); \draw[->] (rtlblock) to [out=130,in=50,loop,looseness=10] node[above,midway,font=\small] {if-conversion} node[below,yshift=6mm,xshift=-0.75cm,anchor=west,fill=VenetianRed,minimum width=1.5cm] {} node[below,yshift=6mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=1cm] {} (rtlblock); \draw[->] (rtlpar) to [out=130,in=50,loop,looseness=10] node[above,midway,align=center,font=\small] { operation \\[-0.3em] pipelining} node[below,yshift=1cm,xshift=-0.75cm,anchor=west,fill=TigersEye,minimum width=1.5cm] {} (rtlpar); \draw[->] (dotsin) -- (rtl); \draw[->] (htl) -- (dotsout); \stoptikzpicture} \stopplacemarginfigure This chapter describes work that has not yet been fully completed.
\in{Figure}[fig:compcert_interm] describes the current progress on implementing the scheduling pass, which relies heavily on the basic block generation, the if-conversion pass, as well as the \htl\ generation. The operation pipelining pass is not critical to scheduling, but allows for the support of pipelined operations such as dividers. In general, the implementation of all the passes is now complete, so current work focuses on completing the correctness proofs of all the passes. Starting with the basic block creation pass, which is further described in \in{Section}[sec:basic-block-generation], the proof of basic block generation is quite complex, but it has now been completed. Secondly, the if-conversion pass has been implemented but not yet verified. Even though the transformation is quite direct, it is still difficult to formulate the correct matching condition for the proof. However, this has nearly been completed. Next, for the scheduling pass, the verification of the pass that translates the input and the output of the scheduling algorithm into the \abstr\ language and then compares the two has mostly been completed. However, due to changes in the \rtlblock\ and \rtlpar\ language semantics, some of these proofs will have to be redone. Finally, for the generation of \htl\ from \rtlpar\ the proof has not been started yet. However, the difficult part of that proof is the translation of loads and stores into proper reads and writes from RAM, which has already been completely proven for the \rtl\ to \htl\ translation, and can be directly copied over. The operation pipelining pass has so far only been implemented; it will likely be removed from the current workflow if its proof turns out to take too long to complete. \section[sec:basic-block-generation]{Basic Block Generation} \index{basic block}The basic block generation proof is surprisingly difficult, as the semantics of \rtl\ and \rtlblock\ do not match nicely. This is because \rtl\ executes each of its instructions sequentially in a \CFG. \rtlblock, however, executes one basic block in one go, so it is not possible to refer to the execution of only one instruction. The traditional simulation diagram for proofs in CompCert looks like the following: \placefigure[here,none][]{}{\hbox{\starttikzpicture[shorten >=1pt,>=stealth] \node (s2) at (0,0) {$S_{2}$}; \node (s1) at (0,2) {$S_{1}$}; \node (s2p) at (5,0) {$S_{2}'$}; \node (s1p) at (5,2) {$S_{1}'$}; \draw (s1) -- node[above] {$\sim$} (s1p); \draw[dashed] (s2) -- node[above] {$\sim$} (s2p); \draw[->] (s1) -- (s2); \draw[->,dashed] (s1p) -- node[right] {$+$} (s2p); \stoptikzpicture}} This essentially says that, given a step in the input language which goes from state $S_{1}$ to $S_{2}$, there should be one or more steps that go from the matching state $S_{1}'$ in the translated language to a state $S_{2}'$ which matches $S_{2}$. Most of the proofs in CompCert follow this pattern, as the languages normally get refined over time and produce more instructions than before.
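Stated as a property over the transition relations, where $\rightarrow$ denotes a step in the input language and $\rightarrow_{+}$ denotes one or more steps in the translated language, the diagram corresponds to the following:

\startformula \forall S_{1}\ S_{2}\ S_{1}',\ \ S_{1} \rightarrow S_{2} \land S_{1} \sim S_{1}' \implies \exists S_{2}',\ S_{1}' \rightarrow_{+} S_{2}' \land S_{2} \sim S_{2}' \stopformula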
However, when generating basic blocks one wants a slightly different simulation diagram: \placefigure[here,none][]{}{\hbox{\starttikzpicture[shorten >=1pt,>=stealth] \node (s2) at (0,0) {$S_{2}$}; \node (s1) at (0,2) {$S_{1}$}; \node (s2p) at (5,0) {$S_{2}'$}; \node (s1p) at (5,2) {$S_{1}'$}; \draw (s1) -- node[above] {$\sim$} (s1p); \draw[dashed] (s2) -- node[above] {$\sim$} (s2p); \draw[->] (s1) -- node[left] {$+$} (s2); \draw[->,dashed] (s1p) -- (s2p); \stoptikzpicture}} This says that, assuming one has performed one or more steps in the input language from matching states $S_{1}$ and $S_{1}'$, there should be one step in the translated language that goes from $S_{1}'$ to $S_{2}'$, which should then match $S_{2}$. This matches the semantics of how basic blocks are executed, as multiple instructions will be grouped into one. However, in this form such a lemma cannot even be expressed in CompCert's framework, as the proof is inductive over the semantics of the input language (i.e. executing one instruction at a time). The main trick, which \cite[authoryears][six22_formal_verif_super_sched] also used to generate superblocks, is to separate the above simulation diagram into the following two diagrams, each of which only has one transition on the left-hand side. \startplacefigure[location={here,none}] \startfloatcombination[nx=2,distance=2cm] \startplacefigure[location=none] \hbox{\starttikzpicture[shorten >=1pt,>=stealth] \node (s2) at (0,0) {$S_{2}$}; \node (s1) at (0,2) {$S_{1}$}; \node (s2p) at (3,0) {$S_{2}'$}; \node (s1p) at (3,2) {$S_{1}'$}; \draw (s1) -- node[above] {$\sim_{i::\varnothing}$} (s1p); \draw[dashed] (s2) -- node[above] {$\exists b, \sim_{b}$} (s2p); \draw[->] (s1) -- (s2); \draw[->,dashed] (s1p) -- (s2p); \stoptikzpicture} \stopplacefigure \startplacefigure[location=none] \hbox{\starttikzpicture[shorten >=1pt,>=stealth] \node (s2) at (0,0) {$S_{2}$}; \node (s1) at (0,2) {$S_{1}$}; \node (s1p) at (3,2) {$S_{1}'$}; \draw (s1) -- node[above] {$\sim_{i::b}$} (s1p); \draw[->] (s1) -- (s2); \draw[dashed] (s2) -- node[below,xshift=3mm] {$\sim_{b}$} (s1p); \stoptikzpicture} \stopplacefigure \stopfloatcombination \stopplacefigure The diagram on the left represents a control-flow transition, which is performed in lock-step. This means a control-flow instruction has been encountered, which implies that the basic block has finished executing and that the next block needs to be retrieved. In addition, the matching relation is parametrised by the basic block that is currently executing; in the lock-step case, the instruction will always be the control-flow instruction at the end of the basic block, and the matching relation is therefore repopulated with the new basic block that is jumped to. The diagram on the right is the more interesting one, as it allows for multiple transitions in the input language. It requires matching on a predicate $\sim_{i::b}$ which matches the start of the basic block in \rtlblock\ with any state in the original \rtl\ language, as long as one has not finished executing the basic block and has not encountered a control-flow instruction. Then, one has to show that the block in the matching relation reduces at every step, to show termination, and one also has to keep track of the proof of simulation from the beginning of the basic block to the current point, so that the final control-flow step can also be shown to match.
This is done by the following two properties that are part of the simulation relation $S_{2} \sim_{b} S_{1}'$. The property in \in{Equation}[eq:sem_extrap] describes that executing the basic block $B$ that is in the code $c$ of the translated function in \rtlblock\ at the program counter location, starting from an initial state $S_{1}'$, will result in the same final state $S$ as executing the remaining list of instructions that the simulation relation is parametrised by (i.e. $b$) from $S_{2}$. \placeformula[eq:sem_extrap]\startformula \forall S,\ \ c[S_{1}'.\mathit{pc}] = \Some{B} \land (b, S_2) \Downarrow S \implies (B, S_{1}') \Downarrow S \stopformula \in{Equation}[eq:star] then also states that there exist zero or more small steps in \rtl\ that go from $S_{1}'$ to the current state $S_{2}$ in \rtl. \placeformula[eq:star]\startformula S_{1}' \rightarrow_* S_{2} \stopformula By also demonstrating that some metric stored in the $\sim$ relation is decreasing, one can then show that the simulation will not get stuck iterating forever, effectively obtaining the behaviour of executing multiple instructions in the input code for one big step in the output language. \in{Equation}[eq:star] also incrementally builds a proof of the execution from the start of the basic block to the current state, so that when a control-flow instruction is encountered, this proof can be used to show that one can jump over all the input instructions with that block. \subsection[scheduling:hyperblocks]{Hyperblocks as a Generalisation of Basic Blocks} Basic blocks are popular in intermediate languages because they describe a list of sequential instructions that are not separated by control-flow. This information can be used by various analysis passes to perform optimisations, as these sequences of instructions can be reordered in any way as long as their data dependencies are met. One such optimisation is scheduling, which reorders instructions to eliminate any delay slots caused by the CPU pipeline. However, as control-flow is not allowed inside basic blocks, the number of instructions that can be scheduled at any one time is limited. The boundaries to other basic blocks act as a hard boundary for the scheduling algorithm. Hyperblocks~\cite[mahlke92_effec_compil_suppor_predic_execut_using_hyper] allow instructions to be predicated, thereby introducing some control-flow information into the blocks, which can also be taken into account by the scheduler. For example, any data dependency between instructions that have mutually exclusive predicates can be removed, as these instructions will never be active at the same time. \section[implementation-of-if-conversion]{Implementation of If-Conversion} If-conversion is the transformation that introduces the predicated instructions which form the hyperblocks. It has a few limitations on the kind of conditional statements it can translate. First, only conditional statements without loops can be translated; therefore, one must identify cycles in the control-flow before performing the if-conversion. Secondly, if-conversion will not always result in more efficient code, so it should not be applied to every conditional statement. From an implementation point of view, if-conversion can be implemented naïvely; however, it can be improved by adding better heuristics to determine which conditional statements to transform into predicated instructions.
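As a small, hypothetical illustration of the transformation itself (the registers, predicate and condition are invented, and the notation follows the scheduling example later in this chapter), a conditional statement is flattened into a single hyperblock by materialising the condition with an \type{RBsetpred} and guarding the instructions of each branch with the resulting predicate:

\starthlC
// before if-conversion: a diamond of basic blocks
if (r1 > 0) { r2 = r3 + 1; } else { r2 = r3 - 1; }
r4 = r2 * r2;

// after if-conversion: a single hyperblock
p1 = (r1 > 0);
if(p1) r2 = r3 + 1;
if(!p1) r2 = r3 - 1;
r4 = r2 * r2;
\stophlC

The two assignments to \type{r2} now carry mutually exclusive predicates, so a later scheduling pass may treat them as independent.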
For hyperblocks, conditionals whose branches execute in approximately the same number of cycles are ideal candidates for the transformation, as this means that no time is lost executing extra predicated instructions whose predicates are false. However, as hyperblocks are an extension of superblocks, single paths through the control-flow graph can also be represented. In our implementation, we use simple static heuristics to pick these paths~\cite[ball93_branc_predic_free]. \subsection[proof-if-conversion]{Correctness of If-Conversion} The correctness of the if-conversion pass does not rely on which paths are chosen to be combined. The proof therefore does not have to cover the static heuristics, because these only affect performance. Even though the translation is quite simple, reasoning about its correctness does have its difficulties. We have a similar problem as before, where we sometimes need to execute multiple steps in the input program to match one step in the output program. The solution is to perform a similar style of proof as for the basic block generation. The main difference is the matching condition, as the hyperblock is no longer reduced instruction by instruction, but section by section, depending on whether a section was if-converted or not. This means that at each point, one is either at the head of a section or in the middle of an if-converted section. \section[sec:scheduling-impl]{Scheduling Implementation} This section discusses the implementation of hyperblock scheduling in Vericert, which is still a work in progress. The scheduling step consists of many smaller transformations, covering the initial basic block generation, hyperblock construction using if-conversion, then the scheduling itself and finally the hardware generation. Vericert uses \SDC\ scheduling~\cite[cong06_sdc], which is a flexible algorithm for scheduling instructions, especially when targeting hardware directly. It allows for flexible definitions of optimisation functions in terms of the latencies and dependencies of the operations in the basic block. \HLS\ tools often use \SDC\ scheduling performed on hyperblocks to find an optimal schedule. The main benefit of \SDC\ scheduling is that it generates a system of constraints together with an objective function that should be minimised, which can then be passed to a linear-programming solver. By changing the objective, different attributes of the hardware can be optimised. In addition, there have been extensions to the \SDC\ scheduling algorithm that include further constraints, so that loops can also be scheduled using modulo scheduling~\cite[zhang13_sdc]. \subsection[sec:rtlblockdef]{\rtlblock\ and \rtlpar\ Intermediate Language Definition} \lindex{\rtlblock}\lindex{\rtlpar}\in{Figure}[fig:compcert_interm] shows the intermediate languages introduced to implement hyperblocks in CompCert. These are the new \rtlblock\ and \rtlpar\ intermediate languages, which implement the sequential and parallel semantics of basic blocks respectively. The semantics of \rtlblock\ and \rtlpar\ are based on the CompCert \rtl\ intermediate language. However, instead of mapping from states to instructions, \rtlblock\ maps from states to hyperblocks, and \rtlpar\ maps from states to parallel hyperblocks, which will be described in the next sections. \startplacefigure[location={here,none}] \startfloatcombination[nx=2] \startplacefigure[reference={eq:standard},title={Syntax for instructions within a hyperblock.
There is a predicated exit instruction which serves as an early exit from the hyperblock, thereby making it single-entry and multiple-exit.}] \startframedtext[frame=off,offset=none,width={0.45\textwidth}] \startformula \startmathalignment \NC i\ \ \eqdef \NC \ \ \text{\tt RBnop} \NR \NC \NC \|\ \ \text{\tt RBop}\ p?\ \mathit{op}\ \vec{r}\ d \NR \NC \NC \|\ \ \text{\tt RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \NR \NC \NC \|\ \ \text{\tt RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \NR \NC \NC \|\ \ \text{\tt RBsetpred}\ p?\ c\ \vec{r}\ d \NR \NC \NC \|\ \ \text{\tt RBexit}\ p?\ i_{\mathit{cf}} \NR \stopmathalignment \stopformula \stopframedtext \stopplacefigure \startplacefigure[reference={eq:cf-instr},title={Control-flow instructions ending a hyperblock.}] \startframedtext[frame=off,offset=none,width={0.45\textwidth}] \startformula \startmathalignment \NC i_{\mathit{cf}}\ \ \eqdef \NC \ \ \text{\tt RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \NR \NC \NC \|\ \ \text{\tt RBtailcall}\ \mathit{sig}\ f\ \vec{r} \NR \NC \NC \|\ \ \text{\tt RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \NR \NC \NC \|\ \ \text{\tt RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \NR \NC \NC \|\ \ \text{\tt RBjumptable}\ r\ \vec{n} \NR \NC \NC \|\ \ \text{\tt RBreturn}\ r? \NR \NC \NC \|\ \ \text{\tt RBgoto}\ n \NR \stopmathalignment \stopformula \stopframedtext \stopplacefigure \stopfloatcombination \stopplacefigure \rtlblock\ instructions are split into two types: standard instructions and control-flow instructions. Standard instructions are the instructions that can be placed into hyperblocks, whereas control-flow instructions are the instructions that can end hyperblocks. The standard instructions are shown in \in{Figure}[eq:standard], whereas the control-flow instructions are shown in \in{Figure}[eq:cf-instr]. Most instructions are quite similar to their \rtl\ counterparts; however, some instructions have been added. \type{RBsetpred} is a standard instruction which sets a predicate equal to an evaluated condition. This instruction is used during if-conversion to set the predicate to the value of the condition of the conditional statement being converted. To support predicated execution, each instruction also contains an optional predicate which has been set beforehand; together, these predicated instructions form the hyperblock. In addition, there is a predicated exit instruction called \type{RBexit}, which serves as an early exit from the hyperblock. \placeformula[fig:hb_def]\startformula\startmathalignment[align={1:left}] \NC \blockbb \qquad \eqdef \qquad (\text{\tt slist}\ i) \times i_{\mathit{cf}} \NR \NC \parbb \qquad \eqdef \qquad (\text{\tt slist}\ (\text{\tt plist}\ (\text{\tt slist}\ i))) \times i_{\mathit{cf}}\NR \stopmathalignment\stopformula These instructions are used in \rtlblock\ as well as in \rtlpar. The main difference between the two languages is how the instructions are arranged within the hyperblock, and the execution semantics of the hyperblock in each language. Let $\blockbb$ be the definition of a hyperblock for \rtlblock, and let $\parbb$ be the definition of a hyperblock for \rtlpar; then $\blockbb$ and $\parbb$ are defined as in \in{Equation}[fig:hb_def]. The equation shows the different nesting levels of the basic blocks in \rtlblock\ as well as \rtlpar, and where the parallel semantics of \rtlpar\ come into play. \rtlblock\ is made of a list of instructions and a control-flow instruction that ends the hyperblock.
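Rendered as datatypes (a sketch with assumed names, not Vericert's actual Coq definitions), the two shapes of \in{Equation}[fig:hb_def] correspond to the following OCaml types:

\starttyping
(* Standard and control-flow instructions, kept abstract here. *)
type instr     (* RBnop, RBop, RBload, RBstore, RBsetpred, RBexit *)
type cf_instr  (* RBcall, RBcond, RBjumptable, RBreturn, RBgoto, ... *)

(* An RTLBlock hyperblock: a sequential list of instructions,
   terminated by one control-flow instruction. *)
type block_bb = instr list * cf_instr

(* An RTLPar hyperblock: a sequential list of parallel groups, each
   group being a list of sequentially executed instruction chains. *)
type par_bb = instr list list list * cf_instr
\stoptyping

The extra nesting in \type{par_bb} is what the parallel semantics of \rtlpar\ act on, as described next.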
Each instruction in \rtlblock\ is executed sequentially. \rtlpar\ is made of three nested lists. The outer list behaves like the list in \rtlblock, and executes each list inside of it sequentially. \rtlpar\ then has a list which executes its contents in parallel, meaning each instruction uses the same state of the registers and memory while executing. Finally, instead of directly containing instructions, there is an additional layer of sequentially executed lists of instructions, to be able to execute small groups of instructions in sequence, but still in parallel with the other instructions in the current parallel group. This last construct enables operation chaining to be expressed, without having to create new operations for each type of chaining that could occur. Operation chaining is critical to get the most performance out of the hardware, as it fills the latency slack that may be available within one clock cycle when instructions with larger latencies are also being scheduled. \subsection[translating-rtlblock-to-rtlpar]{Translating \rtlblock\ to \rtlpar} \definecolor[s1col][x=7FC97F] \definecolor[s2col][x=BEAED4] \definecolor[s3col][x=FDC086] \definecolor[s4col][x=FFFF99] \definecolor[s5col][x=386CB0] \define[2]\makedfgstate{\raisebox{-2pt}\hbox{\tikz{\node[circle,draw=black,minimum size=4mm,fill=#1,scale=0.5]{#2};}}} \define\sI{\makedfgstate{s1col}{1}} \define\sII{\makedfgstate{s2col}{2}} \define\sIII{\makedfgstate{s3col}{3}} \define\sIV{\makedfgstate{s4col}{4}} \define\sV{\makedfgstate{s5col}{5}} \startplacefigure[reference={fig:op_chain},title={Example of scheduling a hyperblock into multiple clock cycles.}] \startfloatcombination[ny=2,nx=2] \startplacesubfigure[title={RTLBlock hyperblock to be scheduled.}] \startframedtext[frame=off,offset=none,width={0.43\textwidth},style=\rmxx] \starthlC
/BTEX \sI/ETEX r2 = r1 + r4;
/BTEX \sII/ETEX if(p1) r1 = r2 + r4;
/BTEX \sIII/ETEX if(!p2 && !p1) r3 = r1 * r1;
/BTEX \sIV/ETEX if(p2) r3 = r1 * r4;
/BTEX \sV/ETEX if(!p2) r3 = r3 * r3;
\stophlC \stopframedtext \stopplacesubfigure \startplacesubfigure[title={Scheduled RTLPar hyperblock.}] \startframedtext[frame=off,offset=none,width={0.53\textwidth},style=\rmxx] \starthlC
1: [r2 = r1 + r4; if(p1) r1 = r2 + r4];
   [if(!p2 && !p1) r3 = r1 * r1]
2: [if(p2) r3 = r1 * r4]; [if(!p2) r3 = r3 * r3]
\stophlC \stopframedtext \stopplacesubfigure \startplacesubfigure[title={Data dependencies between each operation of the RTLBlock hyperblock.}] \startframedtext[frame=off,offset=none,width={0.43\textwidth}] \hbox{\starttikzpicture[>=stealth,shorten >=1pt,auto,node distance=1.3cm] \tikzstyle{every state}=[draw=black,thick,minimum size=4mm] \node[state,fill=s1col] (s1) {1}; \node[state,fill=s2col] (s2) [below left of=s1] {2}; \node[state,fill=s3col] (s3) [right of=s1] {3}; \node[state,fill=s4col] (s4) [below right of=s2] {4}; \node[state,fill=s5col] (s5) [below of=s3] {5}; \draw[->] (s1) -- (s2); \draw[->] (s2) -- (s4);
%\draw[->] (s3) -- (s4);
\draw[->] (s3) -- (s5); \stoptikzpicture} \stopframedtext \stopplacesubfigure \startplacesubfigure[title={Timing diagram showing how the operations in RTLPar will be executed.}] \externalfigure[figures/timing-3.pdf][width=0.5\textwidth] \stopplacesubfigure \stopfloatcombination \stopplacefigure The translation from \rtlblock\ to \rtlpar\ is performed using a scheduling algorithm, which takes hyperblocks and rearranges their instructions to maximise parallelism based on their data dependencies.
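As a brief sketch of the \SDC\ formulation (following \cite[cong06_sdc]; the exact constraints Vericert generates may differ), each operation $i$ in the hyperblock is assigned an integer start cycle $s_{i}$, and every data dependency from operation $i$, with latency $\delta_{i}$, to operation $j$ produces a difference constraint, while an objective such as the cycle of the final exit point is minimised:

\startformula s_{j} - s_{i} \geq \delta_{i} \quad \text{for each dependency}\ i \rightarrow j, \qquad \text{minimise}\ s_{\mathit{end}} \stopformula

Because each constraint involves a difference of only two variables, the linear-programming relaxation already produces integral solutions, which is what makes this formulation efficient to solve.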
The scheduler then arranges the instructions in \rtlpar\ by placing independent instructions into the same parallel block. Scheduling is an optimisation that is well suited to translation validation, as the algorithm itself can be quite complex, whereas the output is normally relatively easy to check for equivalence with the input. \in{Figure}[fig:compcert_interm] shows that the scheduling step comes right after the if-conversion step, which originally creates the hyperblocks. \in{Figure}[fig:op_chain] shows how the scheduling step transforms \rtlblock\ into \rtlpar. The \rtlblock\ hyperblock being scheduled is shown in \in{Figure}{a}[fig:op_chain], which contains five operations, comprising two additions and three multiplications. The data dependencies of the instructions in the hyperblock are shown in \in{Figure}{c}[fig:op_chain]. Notably, even though operations may normally be dependent on each other due to a data conflict, the dependency can be removed when their predicates are mutually exclusive. The scheduler transforms the \rtlblock\ hyperblock into the \rtlpar\ hyperblock shown in \in{Figure}{b}[fig:op_chain]. Even though the addition in \sII\ is dependent on \sI, they can still be put into the same cycle, as the additions do not take as long to complete as the multiplications. This optimisation is called operation chaining. The multiplication in \sIII\ can also be placed in the same clock cycle, as it has no incoming data dependencies. In the next clock cycle, either the multiplication in \sIV\ or the multiplication in \sV\ will take place, as their predicates are mutually exclusive, meaning these two multiplications can be placed in the same clock cycle. This gives the final schedule shown in \in{Figure}{d}[fig:op_chain]. \subsection[abstr_interp]{Abstract Interpretation of Hyperblocks} \index{symbolic execution}\seeindex{abstract interpretation}{symbolic execution}The correctness of the scheduling algorithm is checked by comparing the symbolic expressions of each register before and after scheduling. This section describes how these symbolic expressions are constructed for each hyperblock. \subsubsection[abstr_language]{Construction of the Abstract Language} The main difficulty when constructing \index{symbolic expression}symbolic expressions for hyperblocks is the presence of predicates, which can be used by the scheduling algorithm to reorder instructions further, even if these contain data dependencies. The scheduling algorithm will manipulate predicates in the following ways, which the comparison function needs to take into account. \startitemize \item Instructions with predicates that are equivalent to false can be removed. \item Instructions with predicates that are mutually exclusive can be executed in parallel. \item Any predicate can be replaced with an equivalent predicate.
\stopitemize \define\sep{\ |\ } \noindent The symbolic expressions have the following form: \noindent Resources: \startformula \rho\ \eqdef\ r \sep p \sep \text{\tt Mem} \stopformula Expressions: \startformula e\ \eqdef\ r^{0} \sep P \Rightarrow \text{\tt Op}(\mathit{op}, \vec{e}) \sep P \Rightarrow \text{\tt Load}(\mathit{chunk}, \mathit{mode}, \vec{e}, e_{m}) \stopformula Memory expressions: \startformula e_{m}\ \eqdef\ \text{\tt Mem}^{0} \sep P \Rightarrow \text{\tt Store}(\mathit{chunk}, \mathit{mode}, \vec{e}, e_{m}, e) \stopformula Predicate expressions: \startformula e_{p}\ \eqdef\ p^{0} \sep P \Rightarrow \text{\tt SetPred}(\mathit{cond}, \vec{e}, e_{p})\stopformula Code: \startformula m\ \eqdef\ \rho \mapsto (\vec{e} \sep \vec{e_{m}} \sep \vec{e_{p}}) \stopformula Predicates: \startformula P\ \eqdef\ (\text{\tt Bool} * p) \sep P \land P \sep P \lor P \sep \top\ \sep \perp \stopformula There are three main resources present in symbolic expressions: registers $r$, which represent the values of the registers in the program; predicates $p$, which represent the values of the predicates throughout the program; and finally memory \type{Mem}, representing the state of the memory. For each standard instruction in the hyperblock, there is an equivalent symbolic expression which either modifies the state of the registers, the predicates or the memory. Finally, as each instruction in the hyperblock can be predicated, each symbolic expression is also paired with a predicate. Assigning the same resource multiple times by instructions with different predicates means that the symbolic expression must also express this conditional result in terms of the predicates. This is expressed by assigning lists of predicated symbolic expressions to each resource, where each predicate needs to be mutually exclusive from the other predicates in the list. The expressions are then constructed using a function which updates the symbolic expression assigned to each resource. This is done using multiple primitives which act on predicated types, which are lists of pairs of a predicate and an element of the underlying type. The first important primitive is the multiplication of two predicated types, which is implemented as a cartesian product between the two lists, conjoining each pair of predicates and joining each pair of elements using a given function; a sketch of this primitive is shown below. \subsection[example-of-translation]{Example of translation} \placeformula[eq:4]\startformula \startmathalignment[n=1,align={1:left}] \NC \text{\tt r1} = \startcases \NC \text{\tt r1}^{0} + \text{\tt r4}^{0} + \text{\tt r4}^{0}, \quad \NC \text{\tt p1} \NR \NC \text{\tt r1}^{0}, \quad \NC \text{\tt !p1} \NR \stopcases \NR \NC \text{\tt r2} = \text{\tt r1}^{0} + \text{\tt r4}^{0}\NR \NC\text{\tt r3} = \startcases \NC \left( \text{\tt r1}^{0} \times \text{\tt r1}^{0} \right) \times \left( \text{\tt r1}^{0} \times \text{\tt r1}^{0} \right),\quad \NC \text{\tt !p2 \&\& !p1} \NR \NC \text{\tt r1}^{0} \times \text{\tt r4}^{0},\quad \NC \text{\tt p2 \&\& !p1} \NR \NC \left( \text{\tt r1}^{0} + \text{\tt r4}^{0} + \text{\tt r4}^{0} \right) \times \text{\tt r4}^{0},\quad \NC \text{\tt p2 \&\& p1}\NR \NC \text{\tt r3}^{0} \times \text{\tt r3}^{0},\quad \NC\text{\tt !p2 \&\& p1}\NR \stopcases\NR \stopmathalignment \stopformula Using the example shown in \in{Figure}[fig:op_chain], the \rtlblock\ hyperblock shown in \in{Figure}{a}[fig:op_chain] is scheduled into the \rtlpar\ hyperblock shown in \in{Figure}{b}[fig:op_chain].
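Before stepping through the example, the product primitive described in \in{Section}[abstr_language] can be sketched in OCaml as follows. This is only an illustration with assumed names, not Vericert's actual Coq implementation; the predicate type mirrors the grammar of $P$ given above.

\starttyping
(* Predicates, mirroring P ::= (Bool * p) | P and P | P or P | T | F. *)
type pred =
  | Ptrue
  | Pfalse
  | Plit of bool * int        (* possibly negated predicate variable *)
  | Pand of pred * pred
  | Por of pred * pred

(* A predicated value: a list of guarded alternatives. *)
type 'a predicated = (pred * 'a) list

(* Product of two predicated values: every pair of alternatives is
   joined with [f] and guarded by the conjunction of its predicates. *)
let prod (f : 'a -> 'b -> 'c) (xs : 'a predicated) (ys : 'b predicated)
    : 'c predicated =
  List.concat_map
    (fun (p, x) -> List.map (fun (q, y) -> (Pand (p, q), f x y)) ys)
    xs
\stoptyping

For instance, combining the predicated arguments of an operation with the operation's own guard amounts to repeated applications of this product.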
The first step is to translate the input and output of the scheduler into the \abstr\ intermediate language by calculating the symbolic values that will be stored in each of the registers. Symbolically evaluating the \rtlblock\ hyperblock results in the expressions shown in \in{Equation}[eq:4] for registers \type{r1}, \type{r2} and \type{r3}, where the state of the register \type{r1} at the start of the hyperblock is denoted as $\text{\tt r1}^{0}$. This symbolic expression can be generated by sequentially going through the list of predicated expressions and applying the update function defined in \in{Section}[abstr_language]. The resulting predicated expressions are guaranteed to have mutually exclusive predicates, meaning that if one predicate evaluates to true, all other predicates must evaluate to false. \subsubsection[linear-flat-predicated-expressions]{Linear (Flat) Predicated Expressions} \index{symbolic expression}The update function for predicated expressions is substantially more complex than the standard update functions for symbolic execution, such as the one used by \cite[entry][tristan08_formal_verif_trans_valid]. The main reason for this is the non-recursive nature of the predicated expressions. As the predicates are only available at the top level of the expressions, combining multiple expressions means that the conditions and the multiple possibilities also need to trickle up to the top. If the predicated expressions were recursive instead, one would not have to change any predicates when evaluating the argument lists of instructions, for example, as these could just stay predicated. The evaluation semantics would also be much simpler, because a rule could be added for evaluating predicates. \subsection[sec:correctness]{Correctness of the Abstract Interpretation} The correctness of the scheduling step depends on the correctness of the equivalence check between the two abstract languages. As the abstract languages contain predicates, this check cannot just be a syntactic comparison between the two languages, because the predicates might be different but still logically equivalent. Instead, a mixed equivalence check is performed, using a verified SAT solver and a syntactic equivalence check of the actual symbolic expressions. \subsubsection[comparing-symbolic-abstract-expressions]{Comparing Symbolic Abstract Expressions} Abstract expressions comprise a predicate and a symbolic expression. The translation from \rtlblock\ and \rtlpar\ to \abstr\ ensures that each register will contain a predicated symbolic expression that describes the value that will be assigned to it after evaluating the basic block. This means that if an expression always produces the same result as another expression, then the register will always contain the same value. If the \rtlblock\ program and the \rtlpar\ program translate to two \abstr\ representations in which all the registers' symbolic expressions match, then both programs will always produce the same values. The main comparison between two symbolic expressions can be done using a simple syntactic comparison, because this ensures that each pair of expressions always evaluates to the same value, and it is also general enough for the expressions that need to be compared. However, as all expressions might be predicated, this simple syntactic comparison cannot be performed for the predicates.
This is because predicated instructions can be executed in many different orders, and the predicates are used to calculate dependence information and therefore also determine the order of operations. Instead, predicates can be compared by checking their logical equivalence using a formally verified \SAT\ solver. This ensures that the predicates will always evaluate to the same value, thereby making the comparison sound. \subsubsection[verified_sat]{Verified Decidable DPLL SAT Solver} A SAT solver is needed to correctly and reliably compare predicated expressions. This makes it possible to compare many different possible schedules, even ones that remove unnecessary expressions entirely. The SAT solver is used by the verifier, which needs to be proven correct in Coq to prove the scheduling translation correct. The SAT solver therefore has to be verified itself, as it could otherwise not be used in the proof of correctness. The SAT solver is implemented using the DPLL algorithm, which means that it takes a \CNF\ formula as its input. The conversion from a recursive predicate type into a \CNF\ formula is done by recursively expanding the predicate into equivalent \CNF\ formulas. The SAT solver is implemented in Coq and proven to be sound and complete. The soundness and completeness of an algorithm that checks whether a predicate is satisfiable or not directly implies that the check is decidable. This simplifies many proofs in Coq, as a \type{Setoid} instance can be defined for equivalence between predicates, which makes rewriting with equivalent predicates much easier by reusing many of the existing Coq tactics. \subsubsection[encoding-predicated-expressions-as-sat-formulas]{Encoding Predicated Expressions as SAT Formulas} Given the following two predicated expressions, which should be tested for equivalence, one can construct a SAT expression which ensures that both expressions will always result in the same value. \startformula \bgroup \left.\startmatrix \NC e_{1},\quad \NC p_{1} \NR \NC e_{2},\quad \NC p_{2}\NR \NC \quad \vdots \NC \NR \NC e_{n},\quad \NC p_{n}\NR \stopmatrix\right\rbrace\egroup \Longleftrightarrow \startmathcases \NC e_{1}',\NC p_{1}' \NR \NC e_{2}',\NC p_{2}' \NR \NC \quad \vdots \NC \NR \NC e_{m}',\NC p_{m}'\NR \stopmathcases \stopformula The SAT query that should determine whether the two predicated expressions are equivalent is the following, where $\mathit{he}_{i}$ is the hashed expression of $e_{i}$: \startformula \startmathalignment[align={1:middle}] \NC (p_{1} \implies \mathit{he}_{1}) \land (p_{2} \implies \mathit{he}_{2}) \land \cdots \land (p_{n} \implies \mathit{he}_{n}) \NR \NC \Longleftrightarrow \NR \NC (p_{1}' \implies \mathit{he}_{1}') \land (p_{2}' \implies \mathit{he}_{2}') \land \cdots \land (p_{m}' \implies \mathit{he}_{m}') \NR \stopmathalignment \stopformula However, reasoning about the equivalence of individual expressions within the greater list of predicated expressions is not quite so straightforward. The evaluation semantics of a predicated expression list is the sequential evaluation of each predicated expression, picking the expression whose predicate evaluates to true. However, proving that two predicated expression lists return the same value whenever the SAT query above holds means that, for each predicate that is true, one needs to show that there is a predicate in the other list that is also true and returns the same value.
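The underlying predicate-equivalence check itself reduces to a single satisfiability query: two predicates are logically equivalent exactly when the negation of their bi-implication is unsatisfiable. A minimal OCaml sketch, reusing the predicate type from the earlier sketch and assuming a decision procedure \type{sat} extracted from the verified solver:

\starttyping
(* Negation, pushed inwards over the [pred] type defined earlier. *)
let rec pnot = function
  | Ptrue -> Pfalse
  | Pfalse -> Ptrue
  | Plit (b, v) -> Plit (not b, v)
  | Pand (a, b) -> Por (pnot a, pnot b)   (* De Morgan *)
  | Por (a, b) -> Pand (pnot a, pnot b)

let imp a b = Por (pnot a, b)

(* [a] and [b] are equivalent iff (a <-> b) is valid, i.e. iff the
   negation of the bi-implication is unsatisfiable. *)
let equiv (sat : pred -> bool) (a : pred) (b : pred) : bool =
  not (sat (pnot (Pand (imp a b, imp b a))))
\stoptyping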
A further complication is that there is no guarantee of a unique predicated expression that matches the current predicate, as there may be duplicate expressions in the list with different predicates. This makes it quite difficult to prove that the SAT query implies the semantic preservation of the scheduling transformation. The SAT query can therefore be broken down into a slightly more direct comparison that is more useful for the proof itself. The steps to perform this more fine-grained comparison are the following: \startitemize[n] \item Hash each expression into a unique literal. \item Create a map from expression literals to their corresponding predicates, thereby obtaining a unique predicate for each expression. If there are duplicates in the list that forms the map, the predicates are combined using a disjunction. \item Iterate through the expressions in both maps and compare the predicates using the SAT solver. If an expression is not present in one of the maps, then its predicate should be equivalent to $\perp$. \stopitemize This comparison is logically equivalent to performing the large SAT query; however, it maps better to the proof, because it compares the predicates of identical expressions with each other. \subsection[proof-of-correctness-of-scheduling]{Proof of Correctness of Scheduling} The scheduling pass is formally verified using translation validation, with the \abstr\ language as the point of comparison between \rtlblock\ and \rtlpar. Using the algorithm defined in \in{Section}[abstr_interp], the formally verified SAT solver described in \in{Section}[verified_sat] can be used to prove that if two predicated expression lists are equivalent, they will always evaluate to the same value. This follows directly from the proofs of correctness of the translation to \abstr. In addition, there is a proof that if the SAT-based comparison between two \abstr\ programs succeeds, then the two programs behave identically; this proof can then be composed with the proofs of correctness of the generation of \abstr. \section[scheduling:conclusion]{Conclusion} Overall, this proof of correctness for scheduling was the most challenging, because as the scheduling algorithm evolved, the language semantics of \rtlblock\ and \rtlpar\ evolved with it, which then required large changes to the correctness proofs of each step in the scheduling proof. In addition, the whole scheduling proof, from \rtlblock\ generation to hardware generation, required greatly different approaches to the correctness proof at each step. Even simple translations like the basic block generation and the if-conversion pass required quite a few insights into how the algorithms actually work and how the semantics behave, to come up with matching relations which are powerful enough to prove the final correctness result, but which are also provable in the first place. Simply formulating these relations requires a lot of experimentation. \startmode[chapter] \section{Bibliography} \placelistofpublications \stopmode \stopcomponent