From 2a716db2fb59f448590b189f7b953a80d02bc5ee Mon Sep 17 00:00:00 2001 From: Yann Herklotz Date: Fri, 15 Apr 2022 11:46:10 +0100 Subject: Add changes --- chapters/scheduling.tex | 779 ++++++++++++++++++++---------------------------- 1 file changed, 323 insertions(+), 456 deletions(-) (limited to 'chapters/scheduling.tex') diff --git a/chapters/scheduling.tex b/chapters/scheduling.tex index 5edad2e..2201f72 100644 --- a/chapters/scheduling.tex +++ b/chapters/scheduling.tex @@ -1,320 +1,235 @@ +\environment fonts_env +\environment lsr_env + \startcomponent scheduling -\project main -\chapter{Static Scheduling} +\chapter[sec:scheduling]{Static Scheduling} \section[introduction]{Introduction} -The use of multicore processors has been increasing drastically, thereby -making parallelising compilers ever more important. In addition to that, -the need for custom hardware accelerators is also increasing, as -traditional processors are not always the best choice for all -applications. Compilers that support optimisations which can -automatically parallelise it's input programs are therefore also -becoming more important, as these back ends all benefit from it. For -processors, one wants to optimise the pipeline usage as much as -possible, whereas in hardware one can create custom pipelines, as well -as making more use out of the spacial properties that hardware offers, -making instruction parallelism important as well as speculative -execution. However, with more powerful optimisations there is a greater -chance to make mistakes in the compiler, making it possible to generate -programs that behave differently to the input program. An important step -towards making compilers more reliable is to formally verify these -optimisations so that it can be guaranteed that bugs are not introduced. - -A popular optimisation for processors increasing the instruction -parallelism of the resulting program is to use scheduling. 
This is an -optimisation that is especially important for processors that have deep -instruction pipelines or multiple functional units that can be used to -execute instructions in parallel. However, this optimisation is also -critical for high-level synthesis tools, which generate application -specific hardware designs based on a software input. Simple scheduling -algorithms often work on basic blocks, which are sequential lists of -instructions without any branches. However, this limits what kind of -reordering the scheduling algorithm can perform. It can never move -instructions past a branching instruction, even when this might -drastically improve the performance. Trace scheduling~ is an alternative -that addresses this issue, by creating paths that might cross branching -instructions, and scheduling the instructions in that path. However, the -main problem with trace scheduling is that it is often infeasible on -large programs, and because of the large design space, it's sometimes -hard to find an optimal schedule. Superblock~ and hyperblock~ scheduling -are two subsets of trace scheduling, which compromise on the flexibility -of trace scheduling to instead build more tractable algorithms. -Superblock scheduling produces more efficient code for processors with a -low issue rate, whereas hyperblock scheduling produces more performant -code for processors with a high issue rate~. - -CompCert~ is a compiler that is formally verified in Coq. In addition to -that, have also added a formally verified hardware back end to CompCert, -allowing it to also generate provably correct hardware designs. A -hardware back end is an ideal target for hyperblock scheduling, as the -issue rate is practically infinite. 
This paper describes the
-implementation of a hyperblock instruction scheduler in CompCert so that
-it can benefit existing CompCert back ends, as well as support more
+The use of multicore processors has been increasing drastically, thereby making parallelising
+compilers ever more important. In addition to that, the need for custom hardware accelerators is
+also increasing, as traditional processors are not always the best choice for all
+applications. Compilers that support optimisations which can automatically parallelise their input
+programs are therefore also becoming more important, as these back ends all benefit from it. For
+processors, one wants to optimise the pipeline usage as much as possible, whereas in hardware one
+can create custom pipelines, as well as make more use of the spatial properties that hardware
+offers, making instruction parallelism important as well as speculative execution. However, with
+more powerful optimisations there is a greater chance to make mistakes in the compiler, making it
+possible to generate programs that behave differently to the input program. An important step
+towards making compilers more reliable is to formally verify these optimisations so that it can be
+guaranteed that bugs are not introduced.
+
+A popular optimisation for processors that increases the instruction parallelism of the resulting
+program is scheduling. This is an optimisation that is especially important for processors that
+have deep instruction pipelines or multiple functional units that can be used to execute
+instructions in parallel. However, this optimisation is also critical for high-level synthesis
+tools, which generate application-specific hardware designs based on a software input. Simple
+scheduling algorithms often work on basic blocks, which are sequential lists of instructions without
+any branches. However, this limits what kind of reordering the scheduling algorithm can perform. 
It
+can never move instructions past a branching instruction, even when this might drastically improve
+the performance. Trace scheduling~ is an alternative that addresses this issue, by creating paths
+that might cross branching instructions, and scheduling the instructions in that path. However, the
+main problem with trace scheduling is that it is often infeasible on large programs, and because of
+the large design space, it is sometimes hard to find an optimal schedule. Superblock~ and hyperblock~
+scheduling are two subsets of trace scheduling, which compromise on the flexibility of trace
+scheduling to instead build more tractable algorithms. Superblock scheduling produces more
+efficient code for processors with a low issue rate, whereas hyperblock scheduling produces more
+performant code for processors with a high issue rate~.
+
+CompCert~ is a compiler that is formally verified in Coq. In addition to that, we have also added a
+formally verified hardware back end to CompCert, allowing it to also generate provably correct
+hardware designs. A hardware back end is an ideal target for hyperblock scheduling, as the issue
+rate is practically infinite. This paper describes the implementation of a hyperblock instruction
+scheduler in CompCert so that it can benefit existing CompCert back ends, as well as support more
general targets such as scheduling instructions for custom hardware.

\subsection[key-points]{Key Points}

-The goal of this paper is to introduce hyperblock scheduling into
-CompCert to allow for advanced and flexible scheduling operations that
-could be beneficial to various different back ends. To this end, the key
-points of this paper are the following:
+The goal of this paper is to introduce hyperblock scheduling into CompCert to allow for advanced and
+flexible scheduling operations that could be beneficial to various different back ends. 
To this end, +the key points of this paper are the following: \startitemize \item - Description of how hyperblocks are introduced for improved scheduling - of instructions, and how these can be built and translated back to - instructions without predicates. + Description of how hyperblocks are introduced for improved scheduling of instructions, and how + these can be built and translated back to instructions without predicates. \item - Description of the implementation and correctness proof of a system of - difference constraints (SDC) scheduling algorithm, which can be - adapted to various different scheduling strategies. {\bf {[}{[}JW:} - What's the link between hyperblocks and SDC-based scheduling? Are - these orthogonal?{\bf {]}{]}}{\bf {[}{[}YH:} Well the original SDC - papers only talk about basic blocks, and not SDC hyperblock - scheduling. But it's easy enough to adapt. SDC just requires some kind - of blocks without control-flow.{\bf {]}{]}} {\bf {[}{[}JW:} Ok cool, - so you can have SDC scheduling without hyperblocks. Can you have - hyperblocks without SDC scheduling? (I guess so, but just wanted to be - completely sure.){\bf {]}{]}}{\bf {[}{[}YH:} Yes exactly, you can - perform any type of basic block scheduling on hyperblocks as well I - think, as long as you mark the dependencies correctly. But in the - worst case you could even just see hyperblocks as basic blocks and - ignore predicates. It would still be correct but just less - efficient.{\bf {]}{]}} - - {\bf {[}{[}JW:} It would be interesting to make the case for SDC-based - scheduling. Is it superior to other scheduling mechanisms? Is it so - much superior that it's worth the additional complexity, compared to, - say, list scheduling? {\bf {]}{]}}{\bf {[}{[}YH:} SDC scheduling is - more flexible with the type of constraint to optimise for (latency / - throughput). I don't think it's too meaningful for compilers, where - there isn't as much flexibility.{\bf {]}{]}} {\bf {[}{[}JW:} Thanks. 
- Does Vericert have any idea that the scheduling is being performed - using SDC? Or is the SDC stuff entirely internal to your unverified - OCaml code? That is, could you swap your scheduler with a different - one that isn't based on SDC, and not have to change anything in - Vericert?{\bf {]}{]}}{\bf {[}{[}YH:} The verified part doesn't know - about SDC at all, so it could be changed to any other scheduler that - acts on hyperblocks.{\bf {]}{]}} + Description of the implementation and correctness proof of a system of difference constraints + (SDC) scheduling algorithm, which can be adapted to various different scheduling strategies. \stopitemize \section[sec:scheduling]{Implementation of Hyperblocks in CompCert} This section describes the structure of hyperblocks in -Section~\goto{2.1}[scheduling:hyperblocks]. Then, the structure of two -extra intermediate languages that were added to CompCert to implement -hyperblocks are also shown in Section~\goto{2.2}[sec:rtlblockdef]. +Section~\goto{2.1}[scheduling:hyperblocks]. Then, the structure of two extra intermediate languages +that were added to CompCert to implement hyperblocks are also shown in +Section~\goto{2.2}[sec:rtlblockdef]. \subsection[scheduling:hyperblocks]{Hyperblocks as a Generalisation of Basic Blocks} -Basic blocks are popular in intermediate languages because they describe -a list of sequential instructions that are not separated by -control-flow. This information can be used by various analysis passes to -perform optimisations, as these sequences of instructions can be -reordered in any way as long as their data dependencies are met. One -such optimisation is scheduling, which reorders instructions to -eliminate of any delay slots caused by the CPU pipeline. - -However, due to control-flow not being allowed in basic blocks, the -number of instructions that can be scheduled at any one time is limited. -The boundaries to other basic blocks act as hard boundary to the -scheduling algorithm. 
hyperblocks~ allow instructions to be predicated,
-thereby introducing some control-flow information into the blocks, which
-can also be taken into account by the scheduler. For example, any data
-dependency between instructions that have mutually exclusive predicates
-can be removed, as these instructions will never be activate at the same
-time.
-
-\subsection[sec:rtlblockdef]{RTLBlock and RTLPar Intermediate Language Definition}
+Basic blocks are popular in intermediate languages because they describe a list of sequential
+instructions that are not separated by control-flow. This information can be used by various
+analysis passes to perform optimisations, as these sequences of instructions can be reordered in any
+way as long as their data dependencies are met. One such optimisation is scheduling, which reorders
+instructions to eliminate any delay slots caused by the CPU pipeline.

-Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the
-intermediate languages introduced to implement hyperblocks in CompCert.
-This consists of new RTLBlock and RTLPar intermediate languages, which
-implement the sequential and parallel semantics of basic blocks
-respectively. The semantics of RTLBlock and RTLPar are based on the
-CompCert RTL intermediate language. However, instead of mapping from
-states to instructions, RTLBlock maps from states to hyperblocks, and
-RTLPar maps from states to parallel hyperblocks, which will be described
-in the next sections.
+However, due to control-flow not being allowed in basic blocks, the number of instructions that can
+be scheduled at any one time is limited. The boundaries to other basic blocks act as a hard boundary
+to the scheduling algorithm. Hyperblocks~ allow instructions to be predicated, thereby introducing
+some control-flow information into the blocks, which can also be taken into account by the
+scheduler. 
For example, any data dependency between instructions that have mutually exclusive
+predicates can be removed, as these instructions will never be active at the same time.

\subsection[sec:rtlblockdef]{RTLBlock and RTLPar Intermediate Language Definition}

%\startformula \begin{aligned}
% i\ ::=&\ \ \mono{RBnop} \\
% &|\ \mono{RBop}\ p?\ \mathit{op}\ \vec{r}\ d \\
% &|\ \mono{RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \\
% &|\ \mono{RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \\
% &|\ \mono{RBsetpred}\ p?\ c\ \vec{r}\ d \\
% \end{aligned} \stopformula
%
%\startformula \begin{aligned}
% i_{\mathit{cf}}\ ::=&\ \ \mono{RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \\
% &|\ \mono{RBtailcall}\ \mathit{sig}\ f\ \vec{r} \\
% &|\ \mono{RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \\
% &|\ \mono{RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \\
% &|\ \mono{RBjumptable}\ r\ \vec{n} \\
% &|\ \mono{RBreturn}\ r? \\
% &|\ \mono{RBgoto}\ n \\
% &|\ \mono{RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \\
% \end{aligned} \stopformula

-RTLBlock instructions are split into two types of instructions, standard
-instructions and control-flow instructions. The standard instructions
-are the instructions that can be placed into hyperblocks, whereas
-control-flow instructions are the instructions that can end hyperblocks.
-The standard instructions are shown in
-Figure~\goto{{[}fig:standard_instruction{]}}[fig:standard_instruction],
-whereas the control-flow instructions are shown in
-Figure~\goto{{[}fig:control_flow_instruction{]}}[fig:control_flow_instruction].
-Most instructions are quite similar to their RTL counterparts, however,
-there are some instructions that have been added. \type{RBsetpred} is a
-standard instruction which sets a predicate equal to an evaluated
-condition. This instruction is used during if-conversion to set the
-predicate to the value of the condition in the conditional statement
-instruction. 
To support proper predicated instructions, each instruction -therefore also contains an optional predicate which has been set -beforehand, which produces the hyperblock. In additon to that, there is -also an extra control-flow instruction called \type{RBpred_cf}, which -can branch on a predicate an nests more control-flow instructions inside -of it. This is also necessary for if-conversion, when converting an -already converted conditional statement, as depending on the predicate, -a different control-flow instruction may be necessary. {\bf {[}{[}JW:} I -think there'd be a lot of value in having a picture showing op-chaining -in action. See tex source for the kind of thing I mean. Probably best -done as a running example throughout the paper. {\bf {]}{]}} - -These instructions are use in RTLBlock as well as in RTLPar. The main -difference between these two languages is how these instructions are -arranged within the hyperblock and the execution semantics of the -hyperblock in the languages. Let $\mathit{hb}_{b}$ be the definition of -a hyperblock for RTLBlock, and let $\mathit{hb}_{p}$ be the definition -of a hyperblock for RTLPar, then $\mathit{hb}_{b}$ and $\mathit{hb}_{p}$ -be defined as in Figure~\goto{{[}fig:hb_def{]}}[fig:hb_def]. The Figure -shows the different nesting levels of the basic blocks in RTLBlock as -well as RTLPar, and where the parallel semantics of RTLPar come into -play. RTLBlock is made of a list of instructions and a control-flow -instruction that ends the hyperblock. Each instruction in RTLBlock is -executed sequentially. RTLPar is made of three separate lists. The outer -list behaves like the list in RTLBlock, and executes each list inside of -it sequentially. RTLPar then has a list which executes it's contents in -parallel, meaning each instruction uses the same state of the registers -and memory while executing. 
Finally, instead of directly containing -instructions, there is an additional layer of sequentially executing -lists of instructions, to be able to execute small groups of -instructions in sequence, but still in parallel to the other -instructions in the current parallel group. - -RTLPar is structured in this way so that the scheduler has more -flexibility, so that it can not only parallelise operations in the basic -blocks, but that it can also define sequential operations in these -parallel operations. This is especially useful for the hardware back -end, as it means that sequential operations that do not quite fill one -clock cycle can be executed sequentially, but can still be executed in -parallel with other independent operations. This optimisation is called -operation chaining, and is critical to get the most performance out of -the hardware. +Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the intermediate languages +introduced to implement hyperblocks in CompCert. This consists of new RTLBlock and RTLPar +intermediate languages, which implement the sequential and parallel semantics of basic blocks +respectively. The semantics of RTLBlock and RTLPar are based on the CompCert RTL intermediate +language. However, instead of mapping from states to instructions, RTLBlock maps from states to +hyperblocks, and RTLPar maps from states to parallel hyperblocks, which will be described in the +next sections. 
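To make the shape of these structures concrete, the following is a rough Python mirror of an RTLBlock hyperblock (the class and field names here are illustrative assumptions for exposition, not the actual Coq definitions):

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical mirror of the RTLBlock structures: a standard instruction
# carries an optional predicate p?, and a hyperblock is a list of such
# instructions terminated by a single control-flow instruction.

@dataclass
class RBop:
    pred: Optional[str]  # optional predicate, e.g. "p1", or None
    op: str              # operation name, e.g. "add" or "mul"
    args: List[str]      # argument registers
    dest: str            # destination register

@dataclass
class RBgoto:
    succ: int            # successor program point

@dataclass
class Hyperblock:
    body: List[RBop]     # predicated standard instructions
    term: RBgoto         # control-flow instruction ending the block

# An RTLBlock function maps program points to whole hyperblocks, where
# RTL would map each program point to a single instruction.
code = {
    1: Hyperblock([RBop("p1", "add", ["r2", "r3"], "r1"),
                   RBop(None, "mul", ["r1", "r1"], "r4")],
                  RBgoto(2)),
}
```

In this sketch the first instruction only takes effect when `p1` holds, which is exactly the extra control-flow information the scheduler can exploit.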
+ +\placeformula \startformula \startmathalignment +\NC i\ \ \colon\colon= \NC \ \ \mono{RBnop} \NR +\NC \NC |\ \ \mono{RBop}\ p?\ \mathit{op}\ \vec{r}\ d \NR +\NC \NC |\ \ \mono{RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \NR +\NC \NC |\ \ \mono{RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \NR +\NC \NC |\ \ \mono{RBsetpred}\ p?\ c\ \vec{r}\ d \NR +\stopmathalignment \stopformula + +\placeformula \startformula \startmathalignment +\NC i_{\mathit{cf}}\ \ \colon\colon= \NC \ \ \mono{RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \NR +\NC \NC |\ \ \mono{RBtailcall}\ \mathit{sig}\ f\ \vec{r} \NR +\NC \NC |\ \ \mono{RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \NR +\NC \NC |\ \ \mono{RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \NR +\NC \NC |\ \ \mono{RBjumptable}\ r\ \vec{n} \NR +\NC \NC |\ \ \mono{RBreturn}\ r? \NR +\NC \NC |\ \ \mono{RBgoto}\ n \NR +\NC \NC |\ \ \mono{RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \NR +\stopmathalignment \stopformula + +RTLBlock instructions are split into two types of instructions, standard instructions and +control-flow instructions. The standard instructions are the instructions that can be placed into +hyperblocks, whereas control-flow instructions are the instructions that can end hyperblocks. The +standard instructions are shown in +Figure~\goto{{[}fig:standard_instruction{]}}[fig:standard_instruction], whereas the control-flow +instructions are shown in +Figure~\goto{{[}fig:control_flow_instruction{]}}[fig:control_flow_instruction]. Most instructions +are quite similar to their RTL counterparts, however, there are some instructions that have been +added. \type{RBsetpred} is a standard instruction which sets a predicate equal to an evaluated +condition. This instruction is used during if-conversion to set the predicate to the value of the +condition in the conditional statement instruction. 
To support proper predicated instructions, each
+instruction therefore also contains an optional predicate which has been set beforehand, thereby
+producing the hyperblock. In addition to that, there is also an extra control-flow instruction
+called \type{RBpred_cf}, which can branch on a predicate and nests more control-flow instructions
+inside of it. This is also necessary for if-conversion, when converting an already converted
+conditional statement, as depending on the predicate, a different control-flow instruction may be
+necessary.
+
+These instructions are used in RTLBlock as well as in RTLPar. The main difference between these two
+languages is how these instructions are arranged within the hyperblock and the execution semantics
+of the hyperblock in the languages. Let $\mathit{hb}_{b}$ be the definition of a hyperblock for
+RTLBlock, and let $\mathit{hb}_{p}$ be the definition of a hyperblock for RTLPar; then
+$\mathit{hb}_{b}$ and $\mathit{hb}_{p}$ can be defined as in
+Figure~\goto{{[}fig:hb_def{]}}[fig:hb_def]. The figure shows the different nesting levels of the
+basic blocks in RTLBlock as well as RTLPar, and where the parallel semantics of RTLPar come into
+play. RTLBlock is made of a list of instructions and a control-flow instruction that ends the
+hyperblock. Each instruction in RTLBlock is executed sequentially. RTLPar is made of three separate
+lists. The outer list behaves like the list in RTLBlock, and executes each list inside of it
+sequentially. RTLPar then has a list which executes its contents in parallel, meaning each
+instruction uses the same state of the registers and memory while executing. Finally, instead of
+directly containing instructions, there is an additional layer of sequentially executing lists of
+instructions, to be able to execute small groups of instructions in sequence, but still in parallel
+to the other instructions in the current parallel group. 
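The three nesting levels just described can be sketched as a small interpreter (the helper names are hypothetical; the real semantics are defined in Coq): the outer list runs sequentially, every sequential sub-list inside a parallel group reads the same pre-group state, and the groups' writes are merged afterwards.

```python
# Sketch of RTLPar's nested execution order, under assumed data shapes:
# an instruction is a pair (dest, f) where f computes the new value of
# dest from the current register state.

def run_seq(instrs, state):
    """Run one inner sequential list; return the registers it writes."""
    st = dict(state)
    writes = {}
    for dest, f in instrs:
        st[dest] = writes[dest] = f(st)
    return writes

def run_par_group(group, state):
    """Every sequential sub-list of a parallel group sees the same
    pre-group state; their writes are merged at the end."""
    writes = {}
    for seq in group:
        writes.update(run_seq(seq, state))
    return {**state, **writes}

def run_block(groups, state):
    """The outer list of parallel groups executes sequentially."""
    for group in groups:
        state = run_par_group(group, state)
    return state

# One parallel group: the first sub-list chains r4 off the freshly
# computed r1 (operation chaining), while r5 is computed in parallel and
# only sees the pre-group values of r2 and r3.
groups = [[
    [("r1", lambda s: s["r2"] + s["r3"]), ("r4", lambda s: s["r1"] * 2)],
    [("r5", lambda s: s["r2"] * s["r3"])],
]]
```

Running `run_block(groups, {"r2": 2, "r3": 3})` computes `r1` and `r4` in sequence inside one group while `r5` is evaluated in parallel from the same starting state.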
+
+RTLPar is structured in this way so that the scheduler has more flexibility: it can not only
+parallelise operations in the basic blocks, but also define sequential operations within these
+parallel operations. This is especially useful for the hardware back end, as it means that
+sequential operations that do not quite fill one clock cycle can be executed sequentially, but can
+still be executed in parallel with other independent operations. This optimisation is called
+operation chaining, and is critical to get the most performance out of the hardware.

\subsection[translating-rtlblock-to-rtlpar]{Translating RTLBlock to RTLPar}

-The translation from RTLBlock to RTLPar is performed using a scheduling
-algorithm, which will takes hyperblocks and rearranges their
-instructions to maximise parallelism based on the data dependencies. It
-then arranges the instructions in RTLPar by putting independent
-instructions into the same parallel block. Scheduling is an optimisation
-that is well suited to translation validation, as the algorithm itself
-can be quite complex, whereas the output is normally relatively easy to
-check for equivalence with the input.
-Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows that
-the scheduling step comes right after the if-conversion step which
-originally creates the hyperblocks.
-
-Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain] shows how the scheduling
-step transforms RTLBlock into RTLPar. The RTLBlock hyperblock being
-scheduled is shown in
-Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a], which contains five
-predicated operations, comprising two additions and three
-multiplications. The data dependencies of the instructions in the
-hyperblock are shown in
-Figure~\goto{{[}fig:op_chain_b{]}}[fig:op_chain_b]. 
Curiously, even
-though operations and are normally dependent on each other due to a
-write-after-write conflict, but because the predicates are independent
-the write-after-write conflict can effectively be removed by the
+The translation from RTLBlock to RTLPar is performed using a scheduling algorithm, which takes
+hyperblocks and rearranges their instructions to maximise parallelism based on the data
+dependencies. It then arranges the instructions in RTLPar by putting independent instructions into
+the same parallel block. Scheduling is an optimisation that is well suited to translation
+validation, as the algorithm itself can be quite complex, whereas the output is normally relatively
+easy to check for equivalence with the input.
+Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows that the scheduling step comes
+right after the if-conversion step which originally creates the hyperblocks.
+
+Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain] shows how the scheduling step transforms RTLBlock
+into RTLPar. The RTLBlock hyperblock being scheduled is shown in
+Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a], which contains five predicated operations,
+comprising two additions and three multiplications. The data dependencies of the instructions in the
+hyperblock are shown in Figure~\goto{{[}fig:op_chain_b{]}}[fig:op_chain_b]. Curiously, even though
+operations and are normally dependent on each other due to a write-after-write conflict, because
+the predicates are independent the write-after-write conflict can effectively be removed by the
scheduler.

-The scheduler transforms the RTLBlock hyperblock into the RTLPar
-hyperblock shown in Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c].
-Even though the addition in is dependent on , they can still be put into
-the same cycle as the additions do not take as long to complete as the
-multiplications. This optimisation is called operation chaining. 
Then,
-the multiplication in can also be placed in the same clock cycle as it
-does not have any data dependencies. Then, in the next clock cycle,
-either the multiplication in can take place, or the multiplication in
-will take place, meaning these two multiplications can also be placed in
-the same clock cycle. This gives the final schedule that is shown in
-Figure~\goto{{[}fig:op_chain_d{]}}[fig:op_chain_d].
+The scheduler transforms the RTLBlock hyperblock into the RTLPar hyperblock shown in
+Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. Even though the addition in is dependent on ,
+they can still be put into the same cycle as the additions do not take as long to complete as the
+multiplications. This optimisation is called operation chaining. The multiplication in can
+also be placed in the same clock cycle as it does not have any data dependencies. In the next
+clock cycle, either the multiplication in can take place, or the multiplication in will take place,
+meaning these two multiplications can also be placed in the same clock cycle. This gives the final
+schedule that is shown in Figure~\goto{{[}fig:op_chain_d{]}}[fig:op_chain_d].

\section[abstr_interp]{Abstract Interpretation of Hyperblocks}

-The correctness of the scheduling algorithm is checked by comparing the
-symbolic expressions of each register before and after scheduling. This
-section describes how these symbolic expressions are constructed for
-each hyperblock.
+The correctness of the scheduling algorithm is checked by comparing the symbolic expressions of each
+register before and after scheduling. This section describes how these symbolic expressions are
+constructed for each hyperblock. 
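The translation-validation idea behind this check can be illustrated with a toy symbolic evaluator (an assumed representation that ignores predicates and memory): both the input block and the scheduled block are evaluated to maps from registers to expression trees over the initial register values, and the maps are compared.

```python
def symbolic_eval(block, regs):
    """Evaluate a straight-line block to symbolic expressions.

    block: list of (dest, op, args) triples; every register starts out
    as the symbol ("init", r), standing for its value on block entry.
    """
    env = {r: ("init", r) for r in regs}
    for dest, op, args in block:
        env[dest] = (op,) + tuple(env[a] for a in args)
    return env

regs = ["r1", "r2", "r3", "r4"]
before = [("r1", "add", ["r2", "r3"]), ("r4", "mul", ["r2", "r2"])]
# The scheduler may swap the two independent instructions...
after = [("r4", "mul", ["r2", "r2"]), ("r1", "add", ["r2", "r3"])]
# ...and the symbolic maps of both blocks still agree:
assert symbolic_eval(before, regs) == symbolic_eval(after, regs)
```

Reordering two *dependent* instructions would instead change the expression tree of the later destination, so the comparison would reject that schedule.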
\subsection[abstr_language]{Construction of Abstract Language} -The main difficulty when constructing symbolic expressions for -hyperblocks is the presence of predicates, which can be used by the -scheduling algorithm to reorder instructions further, even if these -contain data dependencies. The scheduling algorithm will manipulate -predicates in the following ways, which the comparison function needs to -take into account. +The main difficulty when constructing symbolic expressions for hyperblocks is the presence of +predicates, which can be used by the scheduling algorithm to reorder instructions further, even if +these contain data dependencies. The scheduling algorithm will manipulate predicates in the +following ways, which the comparison function needs to take into account. \startitemize \item - Instructions with predicates that are equivalent to false can be - removed. + Instructions with predicates that are equivalent to false can be removed. \item - Instructions with predicates that are mutually exclusive can be - executed in parallel. + Instructions with predicates that are mutually exclusive can be executed in parallel. \item Any predicate can be replaced with an equivalent predicate. \stopitemize The symbolic expressions have the following form: -There are three main resources that are present in symbolic expressions: -registers $r$, which represent the value of the registers in the -program, predicates $p$, which represent the value of the predicates -throughout the program, and finally, memory \type{Mem}, representing the -state of the memory. For each basic instruction in the hyperblock, there -is an equivalent symbolic expression which either modifies the state of -the registers, predicates or memory. Finally, as each instruction in the -hyperblock can be predicates, each symbolic expression is also paired -with a predicate as well. 
Assigning the same resource multiple times with instructions that
-contain different predicates, means that the symbolic expression must
-also express this conditional result based on the predicates. This is
-expressed by assigning lists of predicated symbolic expressions to each
-resource, where each predicate needs to be mutually exclusive from the
-other predicates in the list.
-
-The expressions are then constructed using a function which updates the
-symbolic expressions assigned for each resource. This is done using
-multiple primitives which act on predicated types, which are made up of
-a list of pairs of predicates and the element of that type. The first
-important primitive is multiplication of two predicated types, which is
-implemented as performing a cartesian multiplication between the
-predicated types in the two lists, anding the two predicates and joining
+There are three main resources that are present in symbolic expressions: registers $r$, which
+represent the value of the registers in the program, predicates $p$, which represent the value of
+the predicates throughout the program, and finally, memory \type{Mem}, representing the state of the
+memory. For each basic instruction in the hyperblock, there is an equivalent symbolic expression
+which either modifies the state of the registers, predicates or memory. Finally, as each instruction
+in the hyperblock can be predicated, each symbolic expression is also paired with a predicate as
+well.
+
+Assigning the same resource multiple times with instructions that contain different predicates
+means that the symbolic expression must also express this conditional result based on the
+predicates. This is expressed by assigning lists of predicated symbolic expressions to each
+resource, where each predicate needs to be mutually exclusive from the other predicates in the list.
+
+The expressions are then constructed using a function which updates the symbolic expressions
+assigned for each resource. 
This is done using multiple primitives which act on predicated types,
+which are made up of a list of pairs of predicates and the element of that type. The first important
+primitive is multiplication of two predicated types, which is implemented as performing a cartesian
+multiplication between the predicated types in the two lists, anding the two predicates and joining
the two types of each list using a function.

%\startformula \label{eq:1}
% P_{1} \otimes_{f} P_{2} \equiv \mono{map } (\lambda ((p_{1}, e_{1}), (p_{2}, e_{2})) . (p_{1} \land p_{2}, f\ e_{1}\ e_{2})) P_{1} \times P_{2} \stopformula

-In addition to that, another primitive that is needed is the following
-append operation, which will negate the combination of all predicates in
-the second predicated type, and it to the first predicated type and
-append the first predicated type to the second:
+In addition to that, another primitive that is needed is the following append operation, which ands
+the negated predicate onto the first predicated type and the predicate itself onto the second, and
+then appends the first predicated type to the second:

%\startformula \label{eq:2}
% \mu(p, P) \equiv \mono{map } (\lambda (p', e') . 
(p \land p', e'))\ P \stopformula @@ -322,8 +237,7 @@ append the first predicated type to the second: %\startformula \label{eq:3} % P_{1} \oplus_{p} P_{2} \equiv \mu(\neg p, P_{1}) \mathbin{++} \mu(p, P_{2}) \stopformula -\subsection[title={Example of -translation},reference={example-of-translation}] +\subsection[example-of-translation]{Example of translation} %\startformula \begin{aligned} % \label{eq:4} @@ -342,113 +256,87 @@ translation},reference={example-of-translation}] % \end{cases} % \end{aligned} \stopformula -Using the example shown in -Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain], the RTLBlock hyperblock -shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a] is scheduled -into the hyperblock RTLPar shown in -Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. The first step is to -translate the input and output of the scheduler into the Abstr -intermediate language by calculating the symbolic values that will be -stored in each of the registers. Symbolically evaluating the RTLBlock -hyperblock results in the expressions shown in -Figure~\goto{{[}fig:symbolic_expressions{]}}[fig:symbolic_expressions] -for registers \type{r1}, \type{r2} and \type{r3}, where the state of the -register \type{r1} at the start of the hyperblock is denoted as -$\mono{r1}^{0}$. - -This symbolic expression can be generated by sequentially going through -the list of predicated expressions and applying the update function -defined in Section~\goto{3.1}[abstr_language]. The resulting predicated -expressions is guaranteed to have predicates which are mutually -exclusive, meaning if the predicate evaluates to true, all other -predicates must evaluate to false. - -\subsection[title={Linear (Flat) Predicated -Expressions},reference={linear-flat-predicated-expressions}] - -The update function for predicated expressions is quite a bit more -complex compared to standard update functions for symbolic execution, -such as the one used by . 
The main reason for this is the non-recursive -nature of the predicated expressions. As the predicates are only -available at the top-level of the expressions, combining multiple -expressions means that the conditions and multiple possibilities also -need to trickle up to the top. - -Instead, if the predicated instructions were recursive, one would not -have to change any predicates when evaluating the argument lists of -instructions, for example, as these can just stay predicated. The -evaluation semantics would also be much simpler, because a rule can be -added for evaluating predicates. - -\section[title={Correctness of the Abstract -Interpretation},reference={sec:correctness}] - -The correctness of the scheduling step is dependent on the correctness -of the equivalence check between the two abstract languages. As the -abstract languages contain predicates, this comparison check cannot just -be a syntactic comparison between the two languages, because the -predicates might be different but logically equivalent. Instead, a mixed -equivalence check is performed using a verified SAT solver and a -syntactic equivalence check of the actual symbolic expressions. - -\subsection[title={Comparing Symbolic Abstract -Expressions},reference={comparing-symbolic-abstract-expressions}] - -Abstract expressions comprise a predicate and a symbolic expression. The -translation from RTLBlock and RTLPar to Abstract ensures that each -register will contain a predicated symbolic expression that describes -the value that will be assigned to it after evaluating the basic block. -This means that if an expression always produces the same result as -another expression, that the register will always contain the same -value. If the RTLBlock program and RTLPar program translate to two -Abstract languages where all the register's symbolic expressions match, -then this means that both programs will always produce the same value. 
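The mixed equivalence check described here, a syntactic comparison of the expressions combined with a semantic comparison of the predicates, can be sketched in a few lines. This is an illustrative model only, not the Coq development: the tuple encoding of predicates, the function names, and the brute-force enumeration (standing in for the verified DPLL SAT solver) are all assumptions made for the example.

```python
# Illustrative sketch (not the Coq development) of the mixed equivalence
# check: expressions are compared syntactically, predicates semantically.
# Predicates are nested tuples: ('lit', v), ('not', p), ('and', p, q),
# ('or', p, q).  The brute-force check stands in for the verified SAT solver.

from itertools import product

def pred_vars(p):
    """Collect the variables occurring in a predicate."""
    tag = p[0]
    if tag == 'lit':
        return {p[1]}
    if tag == 'not':
        return pred_vars(p[1])
    return pred_vars(p[1]) | pred_vars(p[2])

def eval_pred(p, asgn):
    """Evaluate a predicate under an assignment of variables to booleans."""
    tag = p[0]
    if tag == 'lit':
        return asgn[p[1]]
    if tag == 'not':
        return not eval_pred(p[1], asgn)
    if tag == 'and':
        return eval_pred(p[1], asgn) and eval_pred(p[2], asgn)
    return eval_pred(p[1], asgn) or eval_pred(p[2], asgn)

def equiv(p, q):
    """Semantic equivalence by enumerating all assignments; the verified
    development uses a DPLL SAT solver instead of enumeration."""
    vs = sorted(pred_vars(p) | pred_vars(q))
    return all(
        eval_pred(p, dict(zip(vs, bs))) == eval_pred(q, dict(zip(vs, bs)))
        for bs in product([False, True], repeat=len(vs))
    )

def pexpr_lists_equiv(l1, l2):
    """Mixed check on two predicated expression lists: every expression
    (compared syntactically) must be guarded by logically equivalent
    predicates in both lists."""
    def covered(a, b):
        return all(any(ea == eb and equiv(pa, pb) for (pb, eb) in b)
                   for (pa, ea) in a)
    return covered(l1, l2) and covered(l2, l1)
```

For instance, `equiv(('and', ('lit', 'a'), ('lit', 'b')), ('and', ('lit', 'b'), ('lit', 'a')))` holds even though the two predicates differ syntactically, which is exactly the case a purely syntactic comparison would miss.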
- -The main comparison between two symbolic expressions can be done using a -simple syntactic comparison, because this ensures that each of the -expressions always evaluate to the same value, and is also general -enough for the expressions that need to be compared. However, as all -expressions might be predicated, this simple syntactic comparison cannot -be performed for the predicates. This is because predicated instructions -can be executed in many different orders, and the predicates are used to -calculate dependence information and therefore also determine the order -of operations. - -Instead, comparing predicates can be done by checking logical -equivalence of the predicates using a formally verified SAT solver. This -ensures that the predicates will always evaluate to the same value, -thereby making it a fair comparison. - -\subsection[title={Verified Decidable DPLL SAT -Solver},reference={verified_sat}] - -A SAT solver is needed to correctly and reliably compare predicated -expressions. This makes it possible to then compare many different -possible schedules, even if they remove expressions completely if these -are unnecessary. The SAT solver is needed in the verifier, which needs -to be proven correct in Coq to prove the scheduling translation correct. -The SAT solver therefore has to be verified, because it could otherwise -not be used in the proof of correctness. - -The DPLL algorithm is used to implement the SAT solver. This means that -the SAT solver takes in a conjunctive normal form (CNF) formula. The -conversion from a recursive predicate type into a CNF formula is done by -recursively expanding the predicate in equivalent CNF formulas. - -The SAT solver is implemented in Coq and proven to be sound and -complete. 
The soundness and completeness of an algorithm to check
whether a predicate is satisfiable or not also directly implies that the
check is decidable, simplifying many proofs in Coq by using a
\type{Setoid} instance for equivalence between predicates. This makes
rewriting predicates if they are equivalent much easier in proofs by
reusing many of the existing Coq tactics.

\subsection[title={Encoding Predicated Expressions as SAT
Formulas},reference={encoding-predicated-expressions-as-sat-formulas}]

Given the following two predicated expressions which should be tested
for equivalence, one can construct a SAT expression which will ensure
that both expressions will always result in the same value.
Using the example shown in Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain], the RTLBlock hyperblock
shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a] is scheduled into the RTLPar hyperblock
shown in Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. The first step is to translate the
input and output of the scheduler into the Abstr intermediate language by calculating the symbolic
values that will be stored in each of the registers. Symbolically evaluating the RTLBlock hyperblock
results in the expressions shown in
Figure~\goto{{[}fig:symbolic_expressions{]}}[fig:symbolic_expressions] for registers \type{r1},
\type{r2}, and \type{r3}, where the state of register \type{r1} at the start of the hyperblock is
denoted as $\mono{r1}^{0}$.

These symbolic expressions can be generated by sequentially going through the list of predicated
expressions and applying the update function defined in Section~\goto{3.1}[abstr_language]. The
resulting predicated expressions are guaranteed to have predicates which are mutually exclusive,
meaning that if one predicate evaluates to true, all other predicates must evaluate to false. 
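The two primitives on predicated types used by the update function (the multiplication of eq. 1 and the append of eqs. 2 and 3) can be modelled directly, representing a predicated type as a list of (predicate, value) pairs. This is a small executable sketch under that assumption; the function names are illustrative and predicates are kept as abstract nested tuples.

```python
# Sketch of the predicated-type primitives from eqs. 1-3, with a
# predicated type modelled as a list of (predicate, value) pairs and
# predicates as nested tuples such as ('lit', 'p') or ('not', p).

def pt_mul(f, p1, p2):
    """Multiplication (eq. 1): Cartesian product of the two lists,
    conjoining the predicates pairwise and joining the values with f."""
    return [(('and', pa, pb), f(ea, eb)) for (pa, ea) in p1 for (pb, eb) in p2]

def pt_guard(p, pt):
    """mu(p, P) (eq. 2): conjoin p with every predicate in P."""
    return [(('and', p, q), e) for (q, e) in pt]

def pt_append(p, p1, p2):
    """P1 (+)_p P2 (eq. 3): guard P1 with the negation of p and P2 with p,
    then concatenate the two lists."""
    return pt_guard(('not', p), p1) + pt_guard(p, p2)
```

Note how `pt_append` preserves mutual exclusivity: if the predicates within each input list are mutually exclusive, then guarding one list with `p` and the other with its negation keeps every pair of predicates in the result mutually exclusive as well.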
+

\subsection[linear-flat-predicated-expressions]{Linear (Flat) Predicated Expressions}

The update function for predicated expressions is considerably more complex than standard update
functions for symbolic execution, such as the one used by . The main reason for this is the
non-recursive nature of the predicated expressions. As the predicates are only available at the
top level of the expressions, combining multiple expressions means that the conditions and multiple
possibilities also need to trickle up to the top.

If the predicated expressions were instead recursive, one would not have to change any predicates
when evaluating the argument lists of instructions, for example, as these could just stay
predicated. The evaluation semantics would also be much simpler, because a rule could be added for
evaluating predicates.

\section[sec:correctness]{Correctness of the Abstract Interpretation}

The correctness of the scheduling step depends on the correctness of the equivalence check between
the two abstract languages. As the abstract languages contain predicates, this comparison cannot
just be a syntactic comparison between the two languages, because the predicates might be different
but logically equivalent. Instead, a mixed equivalence check is performed using a verified SAT
solver and a syntactic equivalence check of the actual symbolic expressions.

\subsection[comparing-symbolic-abstract-expressions]{Comparing Symbolic Abstract Expressions}

Abstract expressions comprise a predicate and a symbolic expression. The translation from RTLBlock
and RTLPar to Abstr ensures that each register will contain a predicated symbolic expression that
describes the value that will be assigned to it after evaluating the basic block. This means that
if an expression always produces the same result as another expression, the register will always
contain the same value. 
If the RTLBlock program and RTLPar program translate to two Abstract
languages where all the registers' symbolic expressions match, then this means that both programs
will always produce the same value.

The main comparison between two symbolic expressions can be done using a simple syntactic
comparison, because this ensures that each of the expressions always evaluates to the same value,
and it is also general enough for the expressions that need to be compared. However, as all
expressions might be predicated, this simple syntactic comparison cannot be performed for the
predicates. This is because predicated instructions can be executed in many different orders, and
the predicates are used to calculate dependence information and therefore also determine the order
of operations.

Instead, comparing predicates can be done by checking logical equivalence of the predicates using a
formally verified SAT solver. This ensures that the predicates will always evaluate to the same
value, thereby making it a sound comparison.

\subsection[verified_sat]{Verified Decidable DPLL SAT Solver}

A SAT solver is needed to correctly and reliably compare predicated expressions. This makes it
possible to compare many different possible schedules, even ones that remove unnecessary
expressions completely. The SAT solver is needed in the verifier, which needs to be proven correct
in Coq to prove the scheduling translation correct. The SAT solver therefore has to be verified,
because it could otherwise not be used in the proof of correctness.

The DPLL algorithm is used to implement the SAT solver. This means that the SAT solver takes in a
conjunctive normal form (CNF) formula. The conversion from a recursive predicate type into a CNF
formula is done by recursively expanding the predicate into equivalent CNF formulas.

The SAT solver is implemented in Coq and proven to be sound and complete. 
The soundness and +completeness of an algorithm to check whether a predicate is satisfiable or not also directly +implies that the check is decidable, simplifying many proofs in Coq by using a \type{Setoid} +instance for equivalence between predicates. This makes rewriting predicates if they are equivalent +much easier in proofs by reusing many of the existing Coq tactics. + +\subsection[encoding-predicated-expressions-as-sat-formulas]{Encoding Predicated Expressions as SAT + Formulas} + +Given the following two predicated expressions which should be tested for equivalence, one can +construct a SAT expression which will ensure that both expressions will always result in the same +value. %\startformula \begin{aligned} % \bgroup \left.\begin{aligned} @@ -465,33 +353,28 @@ that both expressions will always result in the same value. % e_{m}',\quad &p_{m}'\\ % \end{cases}\end{aligned} \stopformula -The SAT query that should determine if two predicated expressions are -equivalent is the following, where $\mathit{he}_{i}$ is the hashed -expression of $e_{i}$: +The SAT query that should determine if two predicated expressions are equivalent is the following, +where $\mathit{he}_{i}$ is the hashed expression of $e_{i}$: %\startformula \begin{gathered} % (p_{1} \implies \mathit{he}_{1}) \land (p_{2} \implies \mathit{he}_{2}) \land \cdots \land (p_{n} \implies \mathit{he}_{n}) \\ % \Longleftrightarrow\\ % (p_{1}' \implies \mathit{he}_{1}') \land (p_{2}' \implies \mathit{he}_{2}') \land \cdots \land (p_{m}' \implies \mathit{he}_{m}') \\\end{gathered} \stopformula -However, reasoning about the equivalence of individual expressions -within the greater list of predicated expressions is not quite -straightforward. The evaluation semantics of predicated expression list -will evaluate each predicate expression sequentially, and pick the -expression whose predicate evaluates to true. 
However, proving that both
predicated expression lists will return the same value assuming that the
SAT query above is true, means that for each predicate that is true, One
needs to be able to show that there is a predicate in the other list
that will also be true, which will return the same value. However, there
is no guarantee that there is a unique predicated expression that will
match the current predicate, as there may be duplicate expressions in
the list with different predicates.

This means that it is quite difficult to prove that the SAT query
implies the semantic preservation of the scheduling transformation. The
SAT query can therefore be broken down into a slightly more direct
comparison that is more useful for the proof itself. The steps to
perform the more fine grained comparison are the following:
However, reasoning about the equivalence of individual expressions within the greater list of
predicated expressions is not quite straightforward. The evaluation semantics of a predicated
expression list evaluates each predicated expression sequentially and picks the expression whose
predicate evaluates to true. Proving that both predicated expression lists will return the same
value, assuming that the SAT query above holds, means that for each predicate that is true, one
needs to be able to show that there is a predicate in the other list that will also be true and
return the same value. However, there is no guarantee that there is a unique predicated expression
matching the current predicate, as there may be duplicate expressions in the list with different
predicates.

This means that it is quite difficult to prove that the SAT query implies the semantic preservation
of the scheduling transformation. The SAT query can therefore be broken down into a slightly more
direct comparison that is more useful for the proof itself. 
The steps to perform the more
fine-grained comparison are the following:

%\startenumerate[n]
%\item
@@ -507,81 +390,65 @@ perform the more fine grained comparison are the following:
% one of the maps, then its predicate should be equivalent to $\perp$.
%\stopenumerate

This comparison is logically equivalent to performing the large SAT
query, however, it maps better to the proof because it is comparing the
predicates of the same expressions with each other.
This comparison is logically equivalent to performing the large SAT query; however, it maps better
to the proof because it compares the predicates of the same expressions with each other.

\subsection[proof-of-equivalence-between-passes]{Proof of equivalence between passes}

Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the
four different passes that were added to CompCert to form hyperblocks.
This section will cover the proofs for each of the translations,
focusing mainly on the proof of translation of the scheduling pass.
Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the four different passes that
were added to CompCert to form hyperblocks. This section will cover the proofs for each of the
translations, focusing mainly on the proof of translation of the scheduling pass.

\subsubsection[proof-of-correctness-of-the-if-conversion-pass]{Proof of Correctness of the
  If-Conversion Pass}

The if-conversion pass is quite simple to verify, as it is an
optimisation that transforms the code in the same language.
The if-conversion pass is quite simple to verify, as it is an optimisation that transforms the code
within the same language.

\subsubsection[proof-of-correctness-of-scheduling]{Proof of Correctness of Scheduling}

The scheduling pass is formally verified using translation validation,
using the Abstr language as a point of comparison between RTLBlock and
RTLPar. 
Using the algorithm defined in Section~\goto{3}[abstr_interp],
the formally verified SAT solver described in
Section~\goto{4.2}[verified_sat] can be used prove that if the two
predicated expression lists are equivalent, that they will always
evaluate to the same value.
The scheduling pass is formally verified using translation validation, using the Abstr language as a
point of comparison between RTLBlock and RTLPar. Using the algorithm defined in
Section~\goto{3}[abstr_interp], the formally verified SAT solver described in
Section~\goto{4.2}[verified_sat] can be used to prove that if the two predicated expression lists
are equivalent, they will always evaluate to the same value.

\section[implementation-details-of-static-sdc-scheduling]{Implementation Details of Static SDC
  Scheduling}

This section describes the implementation details of the static SDC
scheduling algorithm used, and also describes some of the details of the
basic block generation as well as the if-conversion pass which generates
the predicated instructions.
This section describes the implementation details of the static SDC scheduling algorithm used, as
well as some of the details of the basic block generation and the if-conversion pass which
generates the predicated instructions.

\subsection[implementation-of-if-conversion]{Implementation of If-Conversion}

If-conversion is the conversion that introduces predicated instructions
which can make use of the hyperblocks. It converts conditional
statements into predicated instructions. This transformation has a few
limitations on the kind of conditional statements that it can translate.
First, only conditional statements without loops can be translated,
therefore, one must identify cycles in the control-flow before
performing the if-conversion. Secondly, if-conversion will not always
result in more efficient code, so it should not be applied to any
conditional statements. 
Instead, it is best applied to conditional
statements where each branch will take a similar amount of time to
execute.

For back ends that do not support predicates, it is also important to
eliminate the predicates again using a process called reverse
if-conversion, which creates branches out of the predicates and groups
these together again. This allows for hyperblock scheduling to also be
used with back ends that do not support predicates, which are most the
default back ends included in CompCert.

\subsection[title={Static SDC
Scheduling},reference={static-sdc-scheduling}]

System of difference constraints (SDC) scheduling~ is a flexible
algorithm to perform scheduling of instructions, especially if the
target is hardware directly. It allows for flexible definitions of
optimisation functions in terms of latencies and dependencies of the
operations in the basic block.

\section[title={Performance
Comparison},reference={performance-comparison}]

\section[title={Related Work},reference={sec:related-work}]

\section[title={Conclusion},reference={conclusion}]

This material is based upon work supported by the under Grant No.~ and
Grant No.~. Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the author and do not
necessarily reflect the views of the National Science Foundation.
If-conversion is the transformation that introduces the predicated instructions which make up
hyperblocks. It converts conditional statements into predicated instructions. This transformation
has a few limitations on the kind of conditional statements that it can translate. First, only
conditional statements without loops can be translated; therefore, one must identify cycles in the
control flow before performing the if-conversion. Secondly, if-conversion will not always result in
more efficient code, so it should not be applied to every conditional statement. 
Instead, it is best
applied to conditional statements where each branch will take a similar amount of time to execute.

For back ends that do not support predicates, it is also important to eliminate the predicates again
using a process called reverse if-conversion, which creates branches out of the predicates and
groups these together again. This allows hyperblock scheduling to also be used with back ends that
do not support predicates, which includes most of the default back ends in CompCert.

\subsection[static-sdc-scheduling]{Static SDC Scheduling}

System of difference constraints (SDC) scheduling~ is a flexible algorithm to perform scheduling of
instructions, especially when targeting hardware directly. It allows for flexible definitions of
optimisation functions in terms of the latencies and dependencies of the operations in the basic
block.

\section[performance-comparison]{Performance Comparison}

\section[sec:related-work]{Related Work}

\section[conclusion]{Conclusion}

This material is based upon work supported by the under Grant No.~ and Grant No.~. Any opinions,
findings, and conclusions or recommendations expressed in this material are those of the author and
do not necessarily reflect the views of the National Science Foundation.

\stopcomponent
-- cgit
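The core of the SDC formulation mentioned in the Static SDC Scheduling subsection can be sketched generically: dependence and latency constraints are encoded as difference constraints of the form $s_u - s_v \leq c$, and a feasible schedule is read off from single-source shortest paths on the constraint graph. This is a textbook illustration under assumed encodings (Bellman-Ford on a virtual source), not the implementation described in this chapter, and all names are hypothetical.

```python
# Generic sketch of SDC-style scheduling: each constraint (u, v, c) means
# s_u - s_v <= c (e.g. a dependence "v before u with latency l" is encoded
# as (v, u, -l)).  Feasibility and a schedule follow from Bellman-Ford
# shortest paths on the constraint graph with a virtual source.

def sdc_schedule(n_ops, constraints):
    """Return start times for n_ops operations satisfying all difference
    constraints, or None if the constraints are infeasible."""
    # Constraint s_u <= s_v + c becomes an edge v -> u with weight c;
    # a virtual source (node n_ops) reaches every node with weight 0.
    edges = [(v, u, c) for (u, v, c) in constraints]
    edges += [(n_ops, u, 0) for u in range(n_ops)]
    dist = [float('inf')] * (n_ops + 1)
    dist[n_ops] = 0
    for _ in range(n_ops + 1):
        changed = False
        for (a, b, w) in edges:
            if dist[a] + w < dist[b]:
                dist[b] = dist[a] + w
                changed = True
        if not changed:
            break
    else:
        # A change in the extra iteration means a negative cycle:
        # the difference constraints are contradictory.
        return None
    # Shift so the earliest operation starts at time 0.
    base = -min(dist[:n_ops])
    return [d + base for d in dist[:n_ops]]
```

For example, a single dependence with latency 1 between two operations, encoded as `sdc_schedule(2, [(0, 1, -1)])`, yields the schedule `[0, 1]`; adding the contradictory constraint `(1, 0, 0)` makes the system infeasible. Real SDC schedulers additionally optimise an objective (e.g. total latency) over these constraints with an LP solver rather than just finding a feasible point.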