\environment fonts_env
\environment lsr_env

\startcomponent scheduling

\chapter[sec:scheduling]{Static Scheduling}

\section[introduction]{Introduction}

\oindex{static scheduling}The use of multicore processors has been increasing drastically, making parallelising compilers ever more important. In addition, the need for custom hardware accelerators is also increasing, as traditional processors are not always the best choice for all applications. Compilers with optimisations that can automatically parallelise their input programs are therefore becoming more important as well, as all of these back ends benefit from them. For processors, one wants to optimise pipeline usage as much as possible, whereas for hardware one can create custom pipelines and exploit the spatial properties that hardware offers, making instruction parallelism as well as speculative execution important. However, more powerful optimisations also increase the chance of introducing mistakes into the compiler, making it possible to generate programs that behave differently to the input program. An important step towards making compilers more reliable is to formally verify these optimisations, so that it can be guaranteed that no bugs are introduced.

A popular optimisation that increases the instruction parallelism of the resulting program is scheduling. This optimisation is especially important for processors that have deep instruction pipelines or multiple functional units that can execute instructions in parallel. It is also critical for high-level synthesis tools, which generate application-specific hardware designs from a software input.

Simple scheduling algorithms often work on basic blocks, which are sequential lists of instructions without any branches. However, this limits the kind of reordering the scheduling algorithm can perform: it can never move instructions past a branching instruction, even when this might drastically improve performance. Trace scheduling~ is an alternative that addresses this issue by creating paths that may cross branching instructions, and scheduling the instructions within each path. However, the main problem with trace scheduling is that it is often infeasible on large programs, and because of the large design space, it is sometimes hard to find an optimal schedule. \index{superblock}Superblock~ and \index{hyperblock}hyperblock~ scheduling are two restricted forms of trace scheduling, which compromise on its flexibility to instead obtain more tractable algorithms. Superblock scheduling produces more efficient code for processors with a low issue rate, whereas hyperblock scheduling produces more performant code for processors with a high issue rate~.

CompCert~ is a compiler that is formally verified in Coq. In addition, we have added a formally verified hardware back end to CompCert, allowing it to also generate provably correct hardware designs. A hardware back end is an ideal target for hyperblock scheduling, as the issue rate is practically infinite. This paper describes the implementation of a hyperblock instruction scheduler in CompCert, so that it can benefit existing CompCert back ends as well as more general targets such as scheduling instructions for custom hardware.
\subsection[key-points]{Key Points}

The goal of this paper is to introduce hyperblock scheduling into CompCert, allowing for advanced and flexible scheduling operations that could benefit various different back ends. To this end, the key points of this paper are the following:

\startitemize
\item A description of how hyperblocks are introduced for improved scheduling of instructions, how they can be built, and how they can be translated back to instructions without predicates.
\item A description of the implementation and correctness proof of an \SDC\ scheduling algorithm, which can be adapted to various different scheduling strategies.
\stopitemize

\section[sec:scheduling]{Implementation of Hyperblocks in CompCert}

\index{hyperblock}This section describes the structure of hyperblocks in Section~\goto{2.1}[scheduling:hyperblocks]. Then, the structure of the two extra intermediate languages that were added to CompCert to implement hyperblocks is shown in Section~\goto{2.2}[sec:rtlblockdef].

\subsection[scheduling:hyperblocks]{Hyperblocks as a Generalisation of Basic Blocks}

Basic blocks are popular in intermediate languages because they describe a list of sequential instructions that are not separated by control flow. This information can be used by various analysis passes to perform optimisations, as these sequences of instructions can be reordered in any way as long as their data dependencies are met. One such optimisation is scheduling, which reorders instructions to eliminate any delay slots caused by the CPU pipeline.

However, because control flow is not allowed in basic blocks, the number of instructions that can be scheduled at any one time is limited. The boundaries to other basic blocks act as a hard boundary for the scheduling algorithm. Hyperblocks~ allow instructions to be predicated, thereby introducing some control-flow information into the blocks, which can also be taken into account by the scheduler. For example, any data dependency between instructions that have mutually exclusive predicates can be removed, as these instructions will never be active at the same time.

\subsection[sec:rtlblockdef]{RTLBlock and RTLPar Intermediate Language Definition}

\lindex{RTLBlock}\lindex{RTLPar}Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the intermediate languages introduced to implement hyperblocks in CompCert. These are the new RTLBlock and RTLPar intermediate languages, which implement the sequential and parallel semantics of basic blocks respectively. The semantics of RTLBlock and RTLPar are based on the CompCert RTL intermediate language. However, instead of mapping from states to instructions, RTLBlock maps from states to hyperblocks, and RTLPar maps from states to parallel hyperblocks, which are described in the next sections.
\placeformula
\startformula
\startmathalignment
\NC i\ \ \colon\colon= \NC \ \ \mono{RBnop} \NR
\NC \NC |\ \ \mono{RBop}\ p?\ \mathit{op}\ \vec{r}\ d \NR
\NC \NC |\ \ \mono{RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \NR
\NC \NC |\ \ \mono{RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \NR
\NC \NC |\ \ \mono{RBsetpred}\ p?\ c\ \vec{r}\ d \NR
\stopmathalignment
\stopformula

\placeformula
\startformula
\startmathalignment
\NC i_{\mathit{cf}}\ \ \colon\colon= \NC \ \ \mono{RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \NR
\NC \NC |\ \ \mono{RBtailcall}\ \mathit{sig}\ f\ \vec{r} \NR
\NC \NC |\ \ \mono{RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \NR
\NC \NC |\ \ \mono{RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \NR
\NC \NC |\ \ \mono{RBjumptable}\ r\ \vec{n} \NR
\NC \NC |\ \ \mono{RBreturn}\ r? \NR
\NC \NC |\ \ \mono{RBgoto}\ n \NR
\NC \NC |\ \ \mono{RBpred\_cf}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \NR
\stopmathalignment
\stopformula

RTLBlock instructions are split into two types: standard instructions and control-flow instructions. Standard instructions are the instructions that can be placed into hyperblocks, whereas control-flow instructions are the instructions that can end hyperblocks. The standard instructions are shown in Figure~\goto{{[}fig:standard_instruction{]}}[fig:standard_instruction], whereas the control-flow instructions are shown in Figure~\goto{{[}fig:control_flow_instruction{]}}[fig:control_flow_instruction]. Most instructions are quite similar to their RTL counterparts; however, some instructions have been added. \type{RBsetpred} is a standard instruction which sets a predicate to the result of evaluating a condition. This instruction is used during if-conversion to set the predicate to the value of the condition in the conditional statement being converted. To support properly predicated instructions, each standard instruction also carries an optional predicate that has been set beforehand; this is what turns a block into a hyperblock. In addition, there is an extra control-flow instruction called \type{RBpred_cf}, which branches on a predicate and nests two further control-flow instructions inside of it. This is also necessary for if-conversion, when converting an already converted conditional statement, as, depending on the predicate, a different control-flow instruction may be necessary.

These instructions are used in RTLBlock as well as in RTLPar. The main difference between the two languages is how the instructions are arranged within the hyperblock, and the execution semantics of the hyperblock in each language. Let $\mathit{hb}_{b}$ be the definition of a hyperblock for RTLBlock, and let $\mathit{hb}_{p}$ be the definition of a hyperblock for RTLPar; then $\mathit{hb}_{b}$ and $\mathit{hb}_{p}$ can be defined as in Figure~\goto{{[}fig:hb_def{]}}[fig:hb_def]. The figure shows the different nesting levels of the basic blocks in RTLBlock and RTLPar, and where the parallel semantics of RTLPar come into play. A hyperblock in RTLBlock is made of a list of instructions and a control-flow instruction that ends the hyperblock; each instruction is executed sequentially. A hyperblock in RTLPar is instead made of three nested lists. The outer list behaves like the list in RTLBlock, and executes each of its elements sequentially. Each of these elements is in turn a list that executes its contents in parallel, meaning each instruction uses the same state of the registers and memory while executing. Finally, instead of directly containing instructions, there is an additional layer of sequentially executed lists of instructions, making it possible to execute small groups of instructions in sequence, but still in parallel with the other instructions in the current parallel group. A sketch of these block structures is shown below.
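The following Coq sketch illustrates this nesting. The type and field names are hypothetical simplifications rather than the development's actual definitions.

\starttyping
(* Sketch of the two hyperblock structures. [instr] and [cf_instr]
   stand for the standard and control-flow instructions; all names
   here are hypothetical. *)
Parameter instr : Type.     (* RBnop, RBop, RBload, ... *)
Parameter cf_instr : Type.  (* RBcall, RBcond, RBgoto, ... *)

(* RTLBlock: a sequential list of instructions, ended by exactly one
   control-flow instruction. *)
Record hb_b := {
  b_body : list instr;
  b_exit : cf_instr
}.

(* RTLPar: the outer list executes sequentially, the middle list in
   parallel, and the innermost list again sequentially, which is
   what enables operation chaining. *)
Record hb_p := {
  p_body : list (list (list instr));
  p_exit : cf_instr
}.
\stoptyping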
RTLPar is structured in this way to give the scheduler more flexibility: not only can it parallelise operations in the basic blocks, but it can also define sequential operations within these parallel operations. This is especially useful for the hardware back end, as it means that sequential operations that do not quite fill one clock cycle can be executed sequentially, while still being executed in parallel with other independent operations. This optimisation is called operation chaining, and it is critical for getting the most performance out of the hardware.

\subsection[translating-rtlblock-to-rtlpar]{Translating RTLBlock to RTLPar}

The translation from RTLBlock to RTLPar is performed using a scheduling algorithm, which takes hyperblocks and rearranges their instructions to maximise parallelism based on their data dependencies. It then arranges the instructions in RTLPar by putting independent instructions into the same parallel block. Scheduling is an optimisation that is well suited to translation validation, as the algorithm itself can be quite complex, whereas the output is normally relatively easy to check for equivalence with the input. Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows that the scheduling step comes right after the if-conversion step, which originally creates the hyperblocks.

Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain] shows how the scheduling step transforms RTLBlock into RTLPar. The RTLBlock hyperblock being scheduled is shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a], which contains five predicated operations, comprising two additions and three multiplications. The data dependencies of the instructions in the hyperblock are shown in Figure~\goto{{[}fig:op_chain_b{]}}[fig:op_chain_b]. Interestingly, even though two of the operations would normally depend on each other due to a write-after-write conflict, the conflict can effectively be removed by the scheduler because the predicates of the two operations are independent.

The scheduler transforms the RTLBlock hyperblock into the RTLPar hyperblock shown in Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. Even though one of the additions depends on the other, they can still be placed into the same cycle, as the additions do not take as long to complete as the multiplications. This optimisation is called operation chaining. One of the multiplications can also be placed in the same clock cycle, as it does not have any data dependencies. In the next clock cycle, only one of the two remaining multiplications will actually take place, depending on the predicates, meaning these two multiplications can also be placed in the same clock cycle. This gives the final schedule shown in Figure~\goto{{[}fig:op_chain_d{]}}[fig:op_chain_d]. A sketch of the predicate-aware conflict check that enables this reordering is shown below.
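To illustrate how predication removes conflicts, the following Coq sketch shows a predicate-aware write-after-write check. The satisfiability check \type{sat} and all other names are assumptions made for the purpose of this sketch, not the development's actual interface.

\starttyping
(* Sketch: two predicated writes to the same register only conflict
   if both predicates can be true at the same time. *)
Parameter pred : Type.
Parameter Pand : pred -> pred -> pred.
Parameter sat : pred -> bool.  (* assumed satisfiability check *)
Definition reg := nat.

Definition waw_conflict (d1 d2 : reg) (p1 p2 : option pred) : bool :=
  Nat.eqb d1 d2 &&
  match p1, p2 with
  | Some q1, Some q2 => sat (Pand q1 q2) (* can both fire at once? *)
  | _, _ => true       (* an unpredicated write always conflicts *)
  end.
\stoptyping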
\section[abstr_interp]{Abstract Interpretation of Hyperblocks}

\index{abstract interpretation}The correctness of the scheduling algorithm is checked by comparing the symbolic expressions of each register before and after scheduling. This section describes how these symbolic expressions are constructed for each hyperblock.

\subsection[abstr_language]{Construction of Abstract Language}

The main difficulty when constructing \index{symbolic expression}symbolic expressions for hyperblocks is the presence of predicates, which can be used by the scheduling algorithm to reorder instructions further, even if these contain data dependencies. The scheduling algorithm will manipulate predicates in the following ways, which the comparison function needs to take into account.

\startitemize
\item Instructions with predicates that are equivalent to false can be removed.
\item Instructions with predicates that are mutually exclusive can be executed in parallel.
\item Any predicate can be replaced with an equivalent predicate.
\stopitemize

There are three main resources that are present in symbolic expressions: registers $r$, which represent the value of the registers in the program; predicates $p$, which represent the value of the predicates throughout the program; and finally memory \type{Mem}, representing the state of the memory. For each standard instruction in the hyperblock, there is an equivalent symbolic expression which modifies the state of the registers, the predicates or the memory. Finally, as each instruction in the hyperblock can be predicated, each symbolic expression is also paired with a predicate.

Assigning the same resource multiple times with instructions that have different predicates means that the symbolic expression must also express this conditional result based on the predicates. This is expressed by assigning a list of predicated symbolic expressions to each resource, where each predicate needs to be mutually exclusive from the other predicates in the list.

The expressions are then constructed using a function which updates the symbolic expression assigned to each resource. This is done using multiple primitives which act on predicated types, which are made up of a list of pairs of predicates and elements of that type. The first important primitive is the multiplication of two predicated types, which is implemented as a Cartesian product between the two lists, conjoining the two predicates and joining the two elements of each pair using a function $f$:

\placeformula[eq:1]
\startformula
P_{1} \otimes_{f} P_{2} \equiv \mono{map}\ (\lambda ((p_{1}, e_{1}), (p_{2}, e_{2}))\ .\ (p_{1} \land p_{2}, f\ e_{1}\ e_{2}))\ (P_{1} \times P_{2})
\stopformula

The other primitive that is needed is an append operation $\oplus_{p}$, which, given a predicate $p$, conjoins $\neg p$ with every predicate in the first predicated type and $p$ with every predicate in the second, before appending the two lists. It is defined using the helper $\mu$, which conjoins a predicate with every predicate in a predicated type:

\placeformula[eq:2]
\startformula
\mu(p, P) \equiv \mono{map}\ (\lambda (p', e')\ .\ (p \land p', e'))\ P
\stopformula

\placeformula[eq:3]
\startformula
P_{1} \oplus_{p} P_{2} \equiv \mu(\neg p, P_{1}) \mathbin{++} \mu(p, P_{2})
\stopformula

A functional sketch of these primitives is given below.
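The following Coq sketch gives one possible reading of these primitives over a simplified predicated type. The names and the abstract predicate interface are assumptions; the development's actual definitions may differ.

\starttyping
Require Import Coq.Lists.List.
Import ListNotations.

Parameter pred : Type.
Parameter Pand : pred -> pred -> pred.
Parameter Pnot : pred -> pred.

(* A predicated type: a list of predicate/element pairs. *)
Definition pred_t (A : Type) := list (pred * A).

(* Multiplication: Cartesian product, conjoining the predicates and
   combining the elements with f. *)
Definition pmul {A B C} (f : A -> B -> C)
    (P1 : pred_t A) (P2 : pred_t B) : pred_t C :=
  flat_map (fun '(p1, e1) =>
    map (fun '(p2, e2) => (Pand p1 p2, f e1 e2)) P2) P1.

(* mu: conjoin a predicate with every predicate in the list. *)
Definition pmu {A} (p : pred) (P : pred_t A) : pred_t A :=
  map (fun '(p', e') => (Pand p p', e')) P.

(* Append: guard P1 with the negated predicate and P2 with the
   predicate itself, then concatenate. *)
Definition papp {A} (p : pred) (P1 P2 : pred_t A) : pred_t A :=
  pmu (Pnot p) P1 ++ pmu p P2.
\stoptyping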
\subsection[example-of-translation]{Example of Translation}

\placeformula[eq:4]
\startformula
\startmathalignment[n=1]
\NC \mono{r1} = \startmathcases
  \NC \mono{r1}^{0} + \mono{r4}^{0} + \mono{r4}^{0}, \NC \mono{p1} \NR
  \NC \mono{r1}^{0}, \NC \neg\mono{p1} \NR
\stopmathcases \NR
\NC \mono{r2} = \mono{r1}^{0} + \mono{r4}^{0} \NR
\NC \mono{r3} = \startmathcases
  \NC \left( \mono{r1}^{0} \times \mono{r1}^{0} \right) \times \left( \mono{r1}^{0} \times \mono{r1}^{0} \right), \NC \neg\mono{p2} \land \neg\mono{p1} \NR
  \NC \mono{r1}^{0} \times \mono{r4}^{0}, \NC \mono{p2} \land \neg\mono{p1} \NR
  \NC \left( \mono{r1}^{0} + \mono{r4}^{0} + \mono{r4}^{0} \right) \times \mono{r4}^{0}, \NC \neg\mono{p2} \land \mono{p1} \NR
  \NC \mono{r3}^{0} \times \mono{r3}^{0}, \NC \mono{p2} \land \mono{p1} \NR
\stopmathcases \NR
\stopmathalignment
\stopformula

Using the example shown in Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain], the RTLBlock hyperblock shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a] is scheduled into the RTLPar hyperblock shown in Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. The first step is to translate the input and output of the scheduler into the Abstr intermediate language by calculating the symbolic values that will be stored in each of the registers. Symbolically evaluating the RTLBlock hyperblock results in the expressions shown in Figure~\goto{{[}fig:symbolic_expressions{]}}[fig:symbolic_expressions] for registers \type{r1}, \type{r2} and \type{r3}, where the state of the register \type{r1} at the start of the hyperblock is denoted as $\mono{r1}^{0}$. These symbolic expressions are generated by sequentially going through the list of predicated expressions and applying the update function defined in Section~\goto{3.1}[abstr_language]. The resulting predicated expressions are guaranteed to have mutually exclusive predicates, meaning that if one predicate evaluates to true, all other predicates must evaluate to false.

\subsection[linear-flat-predicated-expressions]{Linear (Flat) Predicated Expressions}

\index{symbolic expression}The update function for predicated expressions is quite a bit more complex than standard update functions for symbolic execution, such as the one used by~. The main reason for this is the non-recursive nature of the predicated expressions. As the predicates are only available at the top level of the expressions, combining multiple expressions means that the conditions, and the multiple possibilities they introduce, also need to trickle up to the top. If the predicated instructions were instead recursive, one would not have to change any predicates when evaluating the argument lists of instructions, for example, as these could just stay predicated. The evaluation semantics would also be much simpler, because a rule could be added for evaluating predicates directly. The difference between the two representations is sketched below.
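The following Coq sketch contrasts the two representations; both types are hypothetical simplifications.

\starttyping
Parameter pred : Type.
Parameter expression : Type.

(* Flat representation: predicates appear only at the top level, so
   combining expressions forces all guards up into the outer list. *)
Definition pred_expr := list (pred * expression).

(* Recursive alternative: a predicate can guard subexpressions in
   place, so combining expressions leaves existing guards
   untouched. *)
Inductive rpred_expr : Type :=
| Rbase : expression -> rpred_expr
| Rcond : pred -> rpred_expr -> rpred_expr -> rpred_expr.
\stoptyping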
\section[sec:correctness]{Correctness of the Abstract Interpretation}

The correctness of the scheduling step depends on the correctness of the equivalence check between the two abstract programs. As the abstract language contains predicates, this check cannot just be a syntactic comparison, because the predicates might be different but still logically equivalent. Instead, a mixed equivalence check is performed, using a verified SAT solver together with a syntactic equivalence check of the actual symbolic expressions.

\subsection[comparing-symbolic-abstract-expressions]{Comparing Symbolic Abstract Expressions}

Abstract expressions comprise a predicate and a symbolic expression. The translation from RTLBlock and RTLPar to Abstr ensures that each register contains a predicated symbolic expression that describes the value that will be assigned to it after evaluating the basic block. This means that if two expressions always produce the same result, the corresponding register will always contain the same value. If the RTLBlock program and the RTLPar program translate to abstract programs in which the symbolic expressions of all registers match, then both programs will always produce the same values.

The main comparison between two symbolic expressions can be done using a simple syntactic comparison, because this ensures that each of the expressions always evaluates to the same value, and it is also general enough for the expressions that need to be compared. However, as all expressions might be predicated, this simple syntactic comparison cannot be used for the predicates. This is because predicated instructions can be executed in many different orders, and the predicates are used to calculate dependence information and therefore also determine the order of operations. Instead, predicates are compared by checking their logical equivalence using a formally verified \SAT\ solver. This ensures that the predicates will always evaluate to the same value, thereby making the comparison sound.

\subsection[verified_sat]{Verified Decidable DPLL SAT Solver}

A SAT solver is needed to correctly and reliably compare predicated expressions. This makes it possible to compare many different possible schedules, even ones that remove unnecessary expressions entirely. The SAT solver is used by the verifier, which needs to be proven correct in Coq for the scheduling translation to be proven correct; the SAT solver itself therefore also has to be verified, as it could otherwise not be used in the correctness proof.

The SAT solver is implemented using the DPLL algorithm, which means that it takes a \CNF\ formula as input. The conversion from a recursive predicate type into a \CNF\ formula is done by recursively expanding the predicate into equivalent \CNF\ formulas. The SAT solver is implemented in Coq and proven to be sound and complete. The soundness and completeness of an algorithm that checks whether a predicate is satisfiable also directly implies that the check is decidable, which simplifies many proofs in Coq, as a \type{Setoid} instance can be declared for equivalence between predicates. This makes rewriting with equivalent predicates much easier in proofs, reusing many of the existing Coq tactics. A sketch of the predicate type and the equivalence the solver decides is shown below.
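The following Coq sketch shows a simplified predicate type together with its evaluation function and the equivalence relation that the solver decides. The constructor names are hypothetical, and the development differs in the details.

\starttyping
(* A recursive predicate type over boolean-valued predicate
   registers, identified here by [nat]. *)
Inductive pred : Type :=
| Plit : nat -> pred
| Pnot : pred -> pred
| Pand : pred -> pred -> pred
| Por  : pred -> pred -> pred.

Fixpoint eval (a : nat -> bool) (p : pred) : bool :=
  match p with
  | Plit n => a n
  | Pnot q => negb (eval a q)
  | Pand q r => eval a q && eval a r
  | Por q r => eval a q || eval a r
  end.

(* Equivalence: agreement under every assignment. A sound and
   complete satisfiability check decides this, for example by
   checking that the exclusive or of the two predicates is
   unsatisfiable. *)
Definition equiv (p q : pred) : Prop :=
  forall a, eval a p = eval a q.
\stoptyping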
\subsection[encoding-predicated-expressions-as-sat-formulas]{Encoding Predicated Expressions as SAT Formulas}

Given the following two predicated expression lists, which should be tested for equivalence, one can construct a SAT query which ensures that both lists will always result in the same value:

\placeformula
\startformula
\startmathcases
\NC e_{1}, \NC p_{1} \NR
\NC e_{2}, \NC p_{2} \NR
\NC \ \ \vdots \NC \NR
\NC e_{n}, \NC p_{n} \NR
\stopmathcases
\quad\Longleftrightarrow\quad
\startmathcases
\NC e_{1}', \NC p_{1}' \NR
\NC e_{2}', \NC p_{2}' \NR
\NC \ \ \vdots \NC \NR
\NC e_{m}', \NC p_{m}' \NR
\stopmathcases
\stopformula

The SAT query that should determine whether the two predicated expression lists are equivalent is the following, where $\mathit{he}_{i}$ is the hashed expression of $e_{i}$:

\placeformula
\startformula
\startmathalignment[n=1]
\NC (p_{1} \implies \mathit{he}_{1}) \land (p_{2} \implies \mathit{he}_{2}) \land \cdots \land (p_{n} \implies \mathit{he}_{n}) \NR
\NC \Longleftrightarrow \NR
\NC (p_{1}' \implies \mathit{he}_{1}') \land (p_{2}' \implies \mathit{he}_{2}') \land \cdots \land (p_{m}' \implies \mathit{he}_{m}') \NR
\stopmathalignment
\stopformula

However, reasoning about the equivalence of individual expressions within the greater list of predicated expressions is not quite as straightforward. The evaluation semantics of a predicated expression list evaluates each predicate sequentially and picks the expression whose predicate evaluates to true. Proving that both predicated expression lists return the same value, assuming that the SAT query above holds, therefore requires showing that for each predicate that is true, there is a predicate in the other list that is also true and whose expression returns the same value. However, there is no guarantee that there is a unique predicated expression matching the current predicate, as there may be duplicate expressions in the list with different predicates. This makes it quite difficult to prove that the SAT query implies the semantic preservation of the scheduling transformation.

The SAT query can therefore be broken down into a slightly more direct comparison that is more useful for the proof itself. The steps to perform this more fine-grained comparison are the following:

\startitemize[n]
\item Hash each of the expressions into a unique literal.
\item Create a map from expression literals to their corresponding predicates, thereby obtaining a unique predicate for each expression. If there are duplicate expressions in the list that forms the map, their predicates are combined using the or operation.
\item Iterate through the expressions in both maps and compare the predicates using the SAT solver. If an expression is not present in one of the maps, then its predicate must be equivalent to $\perp$.
\stopitemize

This comparison is logically equivalent to performing the large SAT query, but it maps better to the proof, because it compares the predicates of identical expressions with each other. A sketch of the normalisation step is shown below.
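The following Coq sketch illustrates the second step over a simple association list, with hashed expressions represented as \type{nat} literals. The names are hypothetical, and the development uses more efficient maps.

\starttyping
Require Import Coq.Lists.List.
Import ListNotations.

Parameter pred : Type.
Parameter Por : pred -> pred -> pred.

(* Insert one hashed expression: duplicates get their predicates
   combined with or. *)
Fixpoint add_expr (he : nat) (p : pred)
    (m : list (nat * pred)) : list (nat * pred) :=
  match m with
  | [] => [(he, p)]
  | (he', p') :: m' =>
      if Nat.eqb he he'
      then (he', Por p p') :: m'
      else (he', p') :: add_expr he p m'
  end.

(* Normalise a predicated expression list into a map from hashed
   expressions to a single predicate each. *)
Definition norm (l : list (pred * nat)) : list (nat * pred) :=
  fold_left (fun m '(p, he) => add_expr he p m) l [].
\stoptyping

The third step then walks both maps, using the SAT solver to check that the predicates of matching literals are equivalent, and that any literal missing from one map has a predicate equivalent to $\perp$.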
\subsection[proof-of-equivalence-between-passes]{Proof of Equivalence Between Passes}

Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the four different passes that were added to CompCert to implement hyperblocks. This section covers the proofs for each of the translations, focusing mainly on the proof of the scheduling pass.

\subsubsection[proof-of-correctness-of-the-if-conversion-pass]{Proof of Correctness of the If-Conversion Pass}

The if-conversion pass is quite simple to verify, as it is an optimisation that transforms the code within the same language.

\subsubsection[proof-of-correctness-of-scheduling]{Proof of Correctness of Scheduling}

The scheduling pass is formally verified using translation validation, with the Abstr language as the point of comparison between RTLBlock and RTLPar. Using the algorithm defined in Section~\goto{3}[abstr_interp], the formally verified SAT solver described in Section~\goto{4.2}[verified_sat] can be used to prove that if two predicated expression lists are equivalent, they will always evaluate to the same value.

\section[implementation-details-of-static-sdc-scheduling]{\SDC\ Scheduling Implementation Details}

This section describes the implementation details of the static \SDC\ scheduling algorithm that is used, as well as some details of the basic block generation and of the if-conversion pass, which generates the predicated instructions.

\subsection[implementation-of-if-conversion]{Implementation of If-Conversion}

If-conversion is the transformation that introduces the predicated instructions which make up hyperblocks; it converts conditional statements into predicated instructions. This transformation has a few limitations on the kind of conditional statements that it can translate. First, only conditional statements without loops can be translated, so cycles in the control flow must be identified before performing the if-conversion. Second, if-conversion does not always result in more efficient code, so it should not be applied to every conditional statement. Instead, it is best applied to conditional statements whose branches take a similar amount of time to execute.

For back ends that do not support predicates, it is also important to eliminate the predicates again, using a process called reverse if-conversion, which creates branches out of the predicates and groups the instructions together again. This allows hyperblock scheduling to also be used with back ends that do not support predicates, which includes most of the default back ends in CompCert.

\subsection[static-sdc-scheduling]{Static \SDC\ Scheduling}

\Word{\infull{SDC}} (\SDC) scheduling~\cite[cong06_sdc] is a flexible algorithm for scheduling instructions, especially when the target is hardware directly. It allows for flexible definitions of the optimisation function in terms of the latencies and dependencies of the operations in the basic block.

\section[performance-comparison]{Performance Comparison}

\section[sec:related-work]{Related Work}

\section[conclusion]{Conclusion}

This material is based upon work supported by the National Science Foundation under Grant No.~ and Grant No.~. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.

\stopcomponent