From 2a716db2fb59f448590b189f7b953a80d02bc5ee Mon Sep 17 00:00:00 2001 From: Yann Herklotz Date: Fri, 15 Apr 2022 11:46:10 +0100 Subject: Add changes --- chapters/scheduling.tex | 779 ++++++++++++++++++++---------------------------- 1 file changed, 323 insertions(+), 456 deletions(-) (limited to 'chapters/scheduling.tex') diff --git a/chapters/scheduling.tex b/chapters/scheduling.tex index 5edad2e..2201f72 100644 --- a/chapters/scheduling.tex +++ b/chapters/scheduling.tex @@ -1,320 +1,235 @@ +\environment fonts_env +\environment lsr_env + \startcomponent scheduling -\project main -\chapter{Static Scheduling} +\chapter[sec:scheduling]{Static Scheduling} \section[introduction]{Introduction} -The use of multicore processors has been increasing drastically, thereby -making parallelising compilers ever more important. In addition to that, -the need for custom hardware accelerators is also increasing, as -traditional processors are not always the best choice for all -applications. Compilers that support optimisations which can -automatically parallelise it's input programs are therefore also -becoming more important, as these back ends all benefit from it. For -processors, one wants to optimise the pipeline usage as much as -possible, whereas in hardware one can create custom pipelines, as well -as making more use out of the spacial properties that hardware offers, -making instruction parallelism important as well as speculative -execution. However, with more powerful optimisations there is a greater -chance to make mistakes in the compiler, making it possible to generate -programs that behave differently to the input program. An important step -towards making compilers more reliable is to formally verify these -optimisations so that it can be guaranteed that bugs are not introduced. - -A popular optimisation for processors increasing the instruction -parallelism of the resulting program is to use scheduling. 
This is an -optimisation that is especially important for processors that have deep -instruction pipelines or multiple functional units that can be used to -execute instructions in parallel. However, this optimisation is also -critical for high-level synthesis tools, which generate application -specific hardware designs based on a software input. Simple scheduling -algorithms often work on basic blocks, which are sequential lists of -instructions without any branches. However, this limits what kind of -reordering the scheduling algorithm can perform. It can never move -instructions past a branching instruction, even when this might -drastically improve the performance. Trace scheduling~ is an alternative -that addresses this issue, by creating paths that might cross branching -instructions, and scheduling the instructions in that path. However, the -main problem with trace scheduling is that it is often infeasible on -large programs, and because of the large design space, it's sometimes -hard to find an optimal schedule. Superblock~ and hyperblock~ scheduling -are two subsets of trace scheduling, which compromise on the flexibility -of trace scheduling to instead build more tractable algorithms. -Superblock scheduling produces more efficient code for processors with a -low issue rate, whereas hyperblock scheduling produces more performant -code for processors with a high issue rate~. - -CompCert~ is a compiler that is formally verified in Coq. In addition to -that, have also added a formally verified hardware back end to CompCert, -allowing it to also generate provably correct hardware designs. A -hardware back end is an ideal target for hyperblock scheduling, as the -issue rate is practically infinite. 
This paper describes the
-implementation of a hyperblock instruction scheduler in CompCert so that
-it can benefit existing CompCert back ends, as well as support more
+The use of multicore processors has been increasing drastically, thereby making parallelising
+compilers ever more important. In addition to that, the need for custom hardware accelerators is
+also increasing, as traditional processors are not always the best choice for all
+applications. Compilers that support optimisations which can automatically parallelise their input
+programs are therefore also becoming more important, as these back ends all benefit from it. For
+processors, one wants to optimise the pipeline usage as much as possible, whereas in hardware one
+can create custom pipelines, as well as make more use of the spatial properties that hardware
+offers, making instruction parallelism important as well as speculative execution. However, with
+more powerful optimisations there is a greater chance to make mistakes in the compiler, making it
+possible to generate programs that behave differently to the input program. An important step
+towards making compilers more reliable is to formally verify these optimisations so that it can be
+guaranteed that bugs are not introduced.
+
+A popular optimisation for processors that increases the instruction parallelism of the resulting
+program is scheduling. This is an optimisation that is especially important for processors that
+have deep instruction pipelines or multiple functional units that can be used to execute
+instructions in parallel. However, this optimisation is also critical for high-level synthesis
+tools, which generate application-specific hardware designs based on a software input. Simple
+scheduling algorithms often work on basic blocks, which are sequential lists of instructions without
+any branches. However, this limits what kind of reordering the scheduling algorithm can perform. 
It
+can never move instructions past a branching instruction, even when this might drastically improve
+the performance. Trace scheduling~ is an alternative that addresses this issue, by creating paths
+that might cross branching instructions, and scheduling the instructions in that path. However, the
+main problem with trace scheduling is that it is often infeasible on large programs, and because of
+the large design space, it is sometimes hard to find an optimal schedule. Superblock~ and hyperblock~
+scheduling are two subsets of trace scheduling, which compromise on the flexibility of trace
+scheduling to instead build more tractable algorithms. Superblock scheduling produces more
+efficient code for processors with a low issue rate, whereas hyperblock scheduling produces more
+performant code for processors with a high issue rate~.
+
+CompCert~ is a compiler that is formally verified in Coq. In addition to that, we have also added a
+formally verified hardware back end to CompCert, allowing it to also generate provably correct
+hardware designs. A hardware back end is an ideal target for hyperblock scheduling, as the issue
+rate is practically infinite. This paper describes the implementation of a hyperblock instruction
+scheduler in CompCert so that it can benefit existing CompCert back ends, as well as support more
general targets such as scheduling instructions for custom hardware.

\subsection[key-points]{Key Points}

-The goal of this paper is to introduce hyperblock scheduling into
-CompCert to allow for advanced and flexible scheduling operations that
-could be beneficial to various different back ends. To this end, the key
-points of this paper are the following:
+The goal of this paper is to introduce hyperblock scheduling into CompCert to allow for advanced and
+flexible scheduling operations that could be beneficial to various different back ends. 
To this end, +the key points of this paper are the following: \startitemize \item - Description of how hyperblocks are introduced for improved scheduling - of instructions, and how these can be built and translated back to - instructions without predicates. + Description of how hyperblocks are introduced for improved scheduling of instructions, and how + these can be built and translated back to instructions without predicates. \item - Description of the implementation and correctness proof of a system of - difference constraints (SDC) scheduling algorithm, which can be - adapted to various different scheduling strategies. {\bf {[}{[}JW:} - What's the link between hyperblocks and SDC-based scheduling? Are - these orthogonal?{\bf {]}{]}}{\bf {[}{[}YH:} Well the original SDC - papers only talk about basic blocks, and not SDC hyperblock - scheduling. But it's easy enough to adapt. SDC just requires some kind - of blocks without control-flow.{\bf {]}{]}} {\bf {[}{[}JW:} Ok cool, - so you can have SDC scheduling without hyperblocks. Can you have - hyperblocks without SDC scheduling? (I guess so, but just wanted to be - completely sure.){\bf {]}{]}}{\bf {[}{[}YH:} Yes exactly, you can - perform any type of basic block scheduling on hyperblocks as well I - think, as long as you mark the dependencies correctly. But in the - worst case you could even just see hyperblocks as basic blocks and - ignore predicates. It would still be correct but just less - efficient.{\bf {]}{]}} - - {\bf {[}{[}JW:} It would be interesting to make the case for SDC-based - scheduling. Is it superior to other scheduling mechanisms? Is it so - much superior that it's worth the additional complexity, compared to, - say, list scheduling? {\bf {]}{]}}{\bf {[}{[}YH:} SDC scheduling is - more flexible with the type of constraint to optimise for (latency / - throughput). I don't think it's too meaningful for compilers, where - there isn't as much flexibility.{\bf {]}{]}} {\bf {[}{[}JW:} Thanks. 
- Does Vericert have any idea that the scheduling is being performed - using SDC? Or is the SDC stuff entirely internal to your unverified - OCaml code? That is, could you swap your scheduler with a different - one that isn't based on SDC, and not have to change anything in - Vericert?{\bf {]}{]}}{\bf {[}{[}YH:} The verified part doesn't know - about SDC at all, so it could be changed to any other scheduler that - acts on hyperblocks.{\bf {]}{]}} + Description of the implementation and correctness proof of a system of difference constraints + (SDC) scheduling algorithm, which can be adapted to various different scheduling strategies. \stopitemize \section[sec:scheduling]{Implementation of Hyperblocks in CompCert} This section describes the structure of hyperblocks in -Section~\goto{2.1}[scheduling:hyperblocks]. Then, the structure of two -extra intermediate languages that were added to CompCert to implement -hyperblocks are also shown in Section~\goto{2.2}[sec:rtlblockdef]. +Section~\goto{2.1}[scheduling:hyperblocks]. Then, the structure of two extra intermediate languages +that were added to CompCert to implement hyperblocks are also shown in +Section~\goto{2.2}[sec:rtlblockdef]. \subsection[scheduling:hyperblocks]{Hyperblocks as a Generalisation of Basic Blocks} -Basic blocks are popular in intermediate languages because they describe -a list of sequential instructions that are not separated by -control-flow. This information can be used by various analysis passes to -perform optimisations, as these sequences of instructions can be -reordered in any way as long as their data dependencies are met. One -such optimisation is scheduling, which reorders instructions to -eliminate of any delay slots caused by the CPU pipeline. - -However, due to control-flow not being allowed in basic blocks, the -number of instructions that can be scheduled at any one time is limited. -The boundaries to other basic blocks act as hard boundary to the -scheduling algorithm. 
hyperblocks~ allow instructions to be predicated,
-thereby introducing some control-flow information into the blocks, which
-can also be taken into account by the scheduler. For example, any data
-dependency between instructions that have mutually exclusive predicates
-can be removed, as these instructions will never be activate at the same
-time.
-
-\subsection[sec:rtlblockdef]{RTLBlock and RTLPar Intermediate Language Definition}
+Basic blocks are popular in intermediate languages because they describe a list of sequential
+instructions that are not separated by control-flow. This information can be used by various
+analysis passes to perform optimisations, as these sequences of instructions can be reordered in any
+way as long as their data dependencies are met. One such optimisation is scheduling, which reorders
+instructions to eliminate any delay slots caused by the CPU pipeline.

-Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the
-intermediate languages introduced to implement hyperblocks in CompCert.
-This consists of new RTLBlock and RTLPar intermediate languages, which
-implement the sequential and parallel semantics of basic blocks
-respectively. The semantics of RTLBlock and RTLPar are based on the
-CompCert RTL intermediate language. However, instead of mapping from
-states to instructions, RTLBlock maps from states to hyperblocks, and
-RTLPar maps from states to parallel hyperblocks, which will be described
-in the next sections.
+However, due to control-flow not being allowed in basic blocks, the number of instructions that can
+be scheduled at any one time is limited. The boundaries to other basic blocks act as a hard boundary
+to the scheduling algorithm. Hyperblocks~ allow instructions to be predicated, thereby introducing
+some control-flow information into the blocks, which can also be taken into account by the
+scheduler. 
For example, any data dependency between instructions that have mutually exclusive
+predicates can be removed, as these instructions will never be active at the same time.

\subsection[sec:rtlblockdef]{RTLBlock and RTLPar Intermediate Language Definition}

%\startformula \begin{aligned}
% i\ ::=&\ \ \mono{RBnop} \\
% &|\ \mono{RBop}\ p?\ \mathit{op}\ \vec{r}\ d \\
% &|\ \mono{RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \\
% &|\ \mono{RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \\
% &|\ \mono{RBsetpred}\ p?\ c\ \vec{r}\ d \\
% \end{aligned} \stopformula
%
%\startformula \begin{aligned}
% i_{\mathit{cf}}\ ::=&\ \ \mono{RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \\
% &|\ \mono{RBtailcall}\ \mathit{sig}\ f\ \vec{r} \\
% &|\ \mono{RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \\
% &|\ \mono{RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \\
% &|\ \mono{RBjumptable}\ r\ \vec{n} \\
% &|\ \mono{RBreturn}\ r? \\
% &|\ \mono{RBgoto}\ n \\
% &|\ \mono{RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \\
% \end{aligned} \stopformula

-RTLBlock instructions are split into two types of instructions, standard
-instructions and control-flow instructions. The standard instructions
-are the instructions that can be placed into hyperblocks, whereas
-control-flow instructions are the instructions that can end hyperblocks.
-The standard instructions are shown in
-Figure~\goto{{[}fig:standard_instruction{]}}[fig:standard_instruction],
-whereas the control-flow instructions are shown in
-Figure~\goto{{[}fig:control_flow_instruction{]}}[fig:control_flow_instruction].
-Most instructions are quite similar to their RTL counterparts, however,
-there are some instructions that have been added. \type{RBsetpred} is a
-standard instruction which sets a predicate equal to an evaluated
-condition. This instruction is used during if-conversion to set the
-predicate to the value of the condition in the conditional statement
-instruction. 
To support proper predicated instructions, each instruction -therefore also contains an optional predicate which has been set -beforehand, which produces the hyperblock. In additon to that, there is -also an extra control-flow instruction called \type{RBpred_cf}, which -can branch on a predicate an nests more control-flow instructions inside -of it. This is also necessary for if-conversion, when converting an -already converted conditional statement, as depending on the predicate, -a different control-flow instruction may be necessary. {\bf {[}{[}JW:} I -think there'd be a lot of value in having a picture showing op-chaining -in action. See tex source for the kind of thing I mean. Probably best -done as a running example throughout the paper. {\bf {]}{]}} - -These instructions are use in RTLBlock as well as in RTLPar. The main -difference between these two languages is how these instructions are -arranged within the hyperblock and the execution semantics of the -hyperblock in the languages. Let $\mathit{hb}_{b}$ be the definition of -a hyperblock for RTLBlock, and let $\mathit{hb}_{p}$ be the definition -of a hyperblock for RTLPar, then $\mathit{hb}_{b}$ and $\mathit{hb}_{p}$ -be defined as in Figure~\goto{{[}fig:hb_def{]}}[fig:hb_def]. The Figure -shows the different nesting levels of the basic blocks in RTLBlock as -well as RTLPar, and where the parallel semantics of RTLPar come into -play. RTLBlock is made of a list of instructions and a control-flow -instruction that ends the hyperblock. Each instruction in RTLBlock is -executed sequentially. RTLPar is made of three separate lists. The outer -list behaves like the list in RTLBlock, and executes each list inside of -it sequentially. RTLPar then has a list which executes it's contents in -parallel, meaning each instruction uses the same state of the registers -and memory while executing. 
Finally, instead of directly containing -instructions, there is an additional layer of sequentially executing -lists of instructions, to be able to execute small groups of -instructions in sequence, but still in parallel to the other -instructions in the current parallel group. - -RTLPar is structured in this way so that the scheduler has more -flexibility, so that it can not only parallelise operations in the basic -blocks, but that it can also define sequential operations in these -parallel operations. This is especially useful for the hardware back -end, as it means that sequential operations that do not quite fill one -clock cycle can be executed sequentially, but can still be executed in -parallel with other independent operations. This optimisation is called -operation chaining, and is critical to get the most performance out of -the hardware. +Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the intermediate languages +introduced to implement hyperblocks in CompCert. This consists of new RTLBlock and RTLPar +intermediate languages, which implement the sequential and parallel semantics of basic blocks +respectively. The semantics of RTLBlock and RTLPar are based on the CompCert RTL intermediate +language. However, instead of mapping from states to instructions, RTLBlock maps from states to +hyperblocks, and RTLPar maps from states to parallel hyperblocks, which will be described in the +next sections. 
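To make the shape of these structures concrete, the following is a rough Python mirror of an RTLBlock hyperblock (the class and field names here are illustrative assumptions for exposition, not the actual Coq definitions):

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical mirror of the RTLBlock structures: a standard instruction
# carries an optional predicate p?, and a hyperblock is a list of such
# instructions terminated by a single control-flow instruction.

@dataclass
class RBop:
    pred: Optional[str]  # optional predicate, e.g. "p1", or None
    op: str              # operation name, e.g. "add" or "mul"
    args: List[str]      # argument registers
    dest: str            # destination register

@dataclass
class RBgoto:
    succ: int            # successor program point

@dataclass
class Hyperblock:
    body: List[RBop]     # predicated standard instructions
    term: RBgoto         # control-flow instruction ending the block

# An RTLBlock function maps program points to whole hyperblocks, where
# RTL would map each program point to a single instruction.
code = {
    1: Hyperblock([RBop("p1", "add", ["r2", "r3"], "r1"),
                   RBop(None, "mul", ["r1", "r1"], "r4")],
                  RBgoto(2)),
}
```

In this sketch the first instruction only takes effect when `p1` holds, which is exactly the extra control-flow information the scheduler can exploit.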
+ +\placeformula \startformula \startmathalignment +\NC i\ \ \colon\colon= \NC \ \ \mono{RBnop} \NR +\NC \NC |\ \ \mono{RBop}\ p?\ \mathit{op}\ \vec{r}\ d \NR +\NC \NC |\ \ \mono{RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \NR +\NC \NC |\ \ \mono{RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \NR +\NC \NC |\ \ \mono{RBsetpred}\ p?\ c\ \vec{r}\ d \NR +\stopmathalignment \stopformula + +\placeformula \startformula \startmathalignment +\NC i_{\mathit{cf}}\ \ \colon\colon= \NC \ \ \mono{RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \NR +\NC \NC |\ \ \mono{RBtailcall}\ \mathit{sig}\ f\ \vec{r} \NR +\NC \NC |\ \ \mono{RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \NR +\NC \NC |\ \ \mono{RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \NR +\NC \NC |\ \ \mono{RBjumptable}\ r\ \vec{n} \NR +\NC \NC |\ \ \mono{RBreturn}\ r? \NR +\NC \NC |\ \ \mono{RBgoto}\ n \NR +\NC \NC |\ \ \mono{RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \NR +\stopmathalignment \stopformula + +RTLBlock instructions are split into two types of instructions, standard instructions and +control-flow instructions. The standard instructions are the instructions that can be placed into +hyperblocks, whereas control-flow instructions are the instructions that can end hyperblocks. The +standard instructions are shown in +Figure~\goto{{[}fig:standard_instruction{]}}[fig:standard_instruction], whereas the control-flow +instructions are shown in +Figure~\goto{{[}fig:control_flow_instruction{]}}[fig:control_flow_instruction]. Most instructions +are quite similar to their RTL counterparts, however, there are some instructions that have been +added. \type{RBsetpred} is a standard instruction which sets a predicate equal to an evaluated +condition. This instruction is used during if-conversion to set the predicate to the value of the +condition in the conditional statement instruction. 
To support proper predicated instructions, each
+instruction therefore also contains an optional predicate which has been set beforehand, thereby
+producing the hyperblock. In addition to that, there is also an extra control-flow instruction
+called \type{RBpred_cf}, which can branch on a predicate and nests more control-flow instructions
+inside of it. This is also necessary for if-conversion, when converting an already converted
+conditional statement, as depending on the predicate, a different control-flow instruction may be
+necessary.
+
+These instructions are used in RTLBlock as well as in RTLPar. The main difference between these two
+languages is how these instructions are arranged within the hyperblock and the execution semantics
+of the hyperblock in the languages. Let $\mathit{hb}_{b}$ be the definition of a hyperblock for
+RTLBlock, and let $\mathit{hb}_{p}$ be the definition of a hyperblock for RTLPar; then
+$\mathit{hb}_{b}$ and $\mathit{hb}_{p}$ can be defined as in
+Figure~\goto{{[}fig:hb_def{]}}[fig:hb_def]. The figure shows the different nesting levels of the
+basic blocks in RTLBlock as well as RTLPar, and where the parallel semantics of RTLPar come into
+play. RTLBlock is made of a list of instructions and a control-flow instruction that ends the
+hyperblock. Each instruction in RTLBlock is executed sequentially. RTLPar is made of three separate
+lists. The outer list behaves like the list in RTLBlock, and executes each list inside of it
+sequentially. RTLPar then has a list which executes its contents in parallel, meaning each
+instruction uses the same state of the registers and memory while executing. Finally, instead of
+directly containing instructions, there is an additional layer of sequentially executing lists of
+instructions, to be able to execute small groups of instructions in sequence, but still in parallel
+to the other instructions in the current parallel group. 
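The three nesting levels just described can be sketched as a small interpreter (the helper names are hypothetical; the real semantics are defined in Coq): the outer list runs sequentially, every sequential sub-list inside a parallel group reads the same pre-group state, and the groups' writes are merged afterwards.

```python
# Sketch of RTLPar's nested execution order, under assumed data shapes:
# an instruction is a pair (dest, f) where f computes the new value of
# dest from the current register state.

def run_seq(instrs, state):
    """Run one inner sequential list; return the registers it writes."""
    st = dict(state)
    writes = {}
    for dest, f in instrs:
        st[dest] = writes[dest] = f(st)
    return writes

def run_par_group(group, state):
    """Every sequential sub-list of a parallel group sees the same
    pre-group state; their writes are merged at the end."""
    writes = {}
    for seq in group:
        writes.update(run_seq(seq, state))
    return {**state, **writes}

def run_block(groups, state):
    """The outer list of parallel groups executes sequentially."""
    for group in groups:
        state = run_par_group(group, state)
    return state

# One parallel group: the first sub-list chains r4 off the freshly
# computed r1 (operation chaining), while r5 is computed in parallel and
# only sees the pre-group values of r2 and r3.
groups = [[
    [("r1", lambda s: s["r2"] + s["r3"]), ("r4", lambda s: s["r1"] * 2)],
    [("r5", lambda s: s["r2"] * s["r3"])],
]]
```

Running `run_block(groups, {"r2": 2, "r3": 3})` computes `r1` and `r4` in sequence inside one group while `r5` is evaluated in parallel from the same starting state.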
+
+RTLPar is structured in this way so that the scheduler has more flexibility: it can not only
+parallelise operations in the basic blocks, but also define sequential operations within these
+parallel operations. This is especially useful for the hardware back end, as it means that
+sequential operations that do not quite fill one clock cycle can be executed sequentially, but can
+still be executed in parallel with other independent operations. This optimisation is called
+operation chaining, and is critical to get the most performance out of the hardware.

\subsection[translating-rtlblock-to-rtlpar]{Translating RTLBlock to RTLPar}

-The translation from RTLBlock to RTLPar is performed using a scheduling
-algorithm, which will takes hyperblocks and rearranges their
-instructions to maximise parallelism based on the data dependencies. It
-then arranges the instructions in RTLPar by putting independent
-instructions into the same parallel block. Scheduling is an optimisation
-that is well suited to translation validation, as the algorithm itself
-can be quite complex, whereas the output is normally relatively easy to
-check for equivalence with the input.
-Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows that
-the scheduling step comes right after the if-conversion step which
-originally creates the hyperblocks.
-
-Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain] shows how the scheduling
-step transforms RTLBlock into RTLPar. The RTLBlock hyperblock being
-scheduled is shown in
-Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a], which contains five
-predicated operations, comprising two additions and three
-multiplications. The data dependencies of the instructions in the
-hyperblock are shown in
-Figure~\goto{{[}fig:op_chain_b{]}}[fig:op_chain_b]. 
Curiously, even
-though operations and are normally dependent on each other due to a
-write-after-write conflict, but because the predicates are independent
-the write-after-write conflict can effectively be removed by the
+The translation from RTLBlock to RTLPar is performed using a scheduling algorithm, which takes
+hyperblocks and rearranges their instructions to maximise parallelism based on the data
+dependencies. It then arranges the instructions in RTLPar by putting independent instructions into
+the same parallel block. Scheduling is an optimisation that is well suited to translation
+validation, as the algorithm itself can be quite complex, whereas the output is normally relatively
+easy to check for equivalence with the input.
+Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows that the scheduling step comes
+right after the if-conversion step which originally creates the hyperblocks.
+
+Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain] shows how the scheduling step transforms RTLBlock
+into RTLPar. The RTLBlock hyperblock being scheduled is shown in
+Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a], which contains five predicated operations,
+comprising two additions and three multiplications. The data dependencies of the instructions in the
+hyperblock are shown in Figure~\goto{{[}fig:op_chain_b{]}}[fig:op_chain_b]. Curiously, even though
+operations and are normally dependent on each other due to a write-after-write conflict, because
+the predicates are independent the write-after-write conflict can effectively be removed by the
scheduler.

-The scheduler transforms the RTLBlock hyperblock into the RTLPar
-hyperblock shown in Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c].
-Even though the addition in is dependent on , they can still be put into
-the same cycle as the additions do not take as long to complete as the
-multiplications. This optimisation is called operation chaining. 
Then,
-the multiplication in can also be placed in the same clock cycle as it
-does not have any data dependencies. Then, in the next clock cycle,
-either the multiplication in can take place, or the multiplication in
-will take place, meaning these two multiplications can also be placed in
-the same clock cycle. This gives the final schedule that is shown in
-Figure~\goto{{[}fig:op_chain_d{]}}[fig:op_chain_d].
+The scheduler transforms the RTLBlock hyperblock into the RTLPar hyperblock shown in
+Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. Even though the addition in is dependent on ,
+they can still be put into the same cycle as the additions do not take as long to complete as the
+multiplications. This optimisation is called operation chaining. The multiplication in can
+also be placed in the same clock cycle as it does not have any data dependencies. In the next
+clock cycle, either the multiplication in can take place, or the multiplication in will take place,
+meaning these two multiplications can also be placed in the same clock cycle. This gives the final
+schedule that is shown in Figure~\goto{{[}fig:op_chain_d{]}}[fig:op_chain_d].

\section[abstr_interp]{Abstract Interpretation of Hyperblocks}

-The correctness of the scheduling algorithm is checked by comparing the
-symbolic expressions of each register before and after scheduling. This
-section describes how these symbolic expressions are constructed for
-each hyperblock.
+The correctness of the scheduling algorithm is checked by comparing the symbolic expressions of each
+register before and after scheduling. This section describes how these symbolic expressions are
+constructed for each hyperblock. 
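The translation-validation idea behind this check can be illustrated with a toy symbolic evaluator (an assumed representation that ignores predicates and memory): both the input block and the scheduled block are evaluated to maps from registers to expression trees over the initial register values, and the maps are compared.

```python
def symbolic_eval(block, regs):
    """Evaluate a straight-line block to symbolic expressions.

    block: list of (dest, op, args) triples; every register starts out
    as the symbol ("init", r), standing for its value on block entry.
    """
    env = {r: ("init", r) for r in regs}
    for dest, op, args in block:
        env[dest] = (op,) + tuple(env[a] for a in args)
    return env

regs = ["r1", "r2", "r3", "r4"]
before = [("r1", "add", ["r2", "r3"]), ("r4", "mul", ["r2", "r2"])]
# The scheduler may swap the two independent instructions...
after = [("r4", "mul", ["r2", "r2"]), ("r1", "add", ["r2", "r3"])]
# ...and the symbolic maps of both blocks still agree:
assert symbolic_eval(before, regs) == symbolic_eval(after, regs)
```

Reordering two *dependent* instructions would instead change the expression tree of the later destination, so the comparison would reject that schedule.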
\subsection[abstr_language]{Construction of Abstract Language} -The main difficulty when constructing symbolic expressions for -hyperblocks is the presence of predicates, which can be used by the -scheduling algorithm to reorder instructions further, even if these -contain data dependencies. The scheduling algorithm will manipulate -predicates in the following ways, which the comparison function needs to -take into account. +The main difficulty when constructing symbolic expressions for hyperblocks is the presence of +predicates, which can be used by the scheduling algorithm to reorder instructions further, even if +these contain data dependencies. The scheduling algorithm will manipulate predicates in the +following ways, which the comparison function needs to take into account. \startitemize \item - Instructions with predicates that are equivalent to false can be - removed. + Instructions with predicates that are equivalent to false can be removed. \item - Instructions with predicates that are mutually exclusive can be - executed in parallel. + Instructions with predicates that are mutually exclusive can be executed in parallel. \item Any predicate can be replaced with an equivalent predicate. \stopitemize The symbolic expressions have the following form: -There are three main resources that are present in symbolic expressions: -registers $r$, which represent the value of the registers in the -program, predicates $p$, which represent the value of the predicates -throughout the program, and finally, memory \type{Mem}, representing the -state of the memory. For each basic instruction in the hyperblock, there -is an equivalent symbolic expression which either modifies the state of -the registers, predicates or memory. Finally, as each instruction in the -hyperblock can be predicates, each symbolic expression is also paired -with a predicate as well. 
Assigning the same resource multiple times with instructions that
-contain different predicates, means that the symbolic expression must
-also express this conditional result based on the predicates. This is
-expressed by assigning lists of predicated symbolic expressions to each
-resource, where each predicate needs to be mutually exclusive from the
-other predicates in the list.
-
-The expressions are then constructed using a function which updates the
-symbolic expressions assigned for each resource. This is done using
-multiple primitives which act on predicated types, which are made up of
-a list of pairs of predicates and the element of that type. The first
-important primitive is multiplication of two predicated types, which is
-implemented as performing a cartesian multiplication between the
-predicated types in the two lists, anding the two predicates and joining
+There are three main resources that are present in symbolic expressions: registers $r$, which
+represent the value of the registers in the program, predicates $p$, which represent the value of
+the predicates throughout the program, and finally, memory \type{Mem}, representing the state of the
+memory. For each basic instruction in the hyperblock, there is an equivalent symbolic expression
+which either modifies the state of the registers, predicates or memory. Finally, as each instruction
+in the hyperblock can be predicated, each symbolic expression is also paired with a predicate as
+well.
+
+Assigning the same resource multiple times with instructions that contain different predicates
+means that the symbolic expression must also express this conditional result based on the
+predicates. This is expressed by assigning lists of predicated symbolic expressions to each
+resource, where each predicate needs to be mutually exclusive from the other predicates in the list.
+
+The expressions are then constructed using a function which updates the symbolic expressions
+assigned for each resource. 
This is done using multiple primitives which act on predicated types,
+which are made up of a list of pairs of predicates and the element of that type. The first important
+primitive is multiplication of two predicated types, which is implemented as performing a cartesian
+multiplication between the predicated types in the two lists, anding the two predicates and joining
the two types of each list using a function.

%\startformula \label{eq:1}
% P_{1} \otimes_{f} P_{2} \equiv \mono{map } (\lambda ((p_{1}, e_{1}), (p_{2}, e_{2})) . (p_{1} \land p_{2}, f\ e_{1}\ e_{2})) P_{1} \times P_{2} \stopformula

-In addition to that, another primitive that is needed is the following
-append operation, which will negate the combination of all predicates in
-the second predicated type, and it to the first predicated type and
-append the first predicated type to the second:
+In addition to that, another primitive that is needed is the following append operation, which ands
+the negated predicate onto the first predicated type and the predicate itself onto the second, and
+then appends the first predicated type to the second:

%\startformula \label{eq:2}
% \mu(p, P) \equiv \mono{map } (\lambda (p', e') . 
(p \land p', e'))\ P \stopformula @@ -322,8 +237,7 @@ append the first predicated type to the second: %\startformula \label{eq:3} % P_{1} \oplus_{p} P_{2} \equiv \mu(\neg p, P_{1}) \mathbin{++} \mu(p, P_{2}) \stopformula -\subsection[title={Example of -translation},reference={example-of-translation}] +\subsection[example-of-translation]{Example of translation} %\startformula \begin{aligned} % \label{eq:4} @@ -342,113 +256,87 @@ translation},reference={example-of-translation}] % \end{cases} % \end{aligned} \stopformula -Using the example shown in -Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain], the RTLBlock hyperblock -shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a] is scheduled -into the hyperblock RTLPar shown in -Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. The first step is to -translate the input and output of the scheduler into the Abstr -intermediate language by calculating the symbolic values that will be -stored in each of the registers. Symbolically evaluating the RTLBlock -hyperblock results in the expressions shown in -Figure~\goto{{[}fig:symbolic_expressions{]}}[fig:symbolic_expressions] -for registers \type{r1}, \type{r2} and \type{r3}, where the state of the -register \type{r1} at the start of the hyperblock is denoted as -$\mono{r1}^{0}$. - -This symbolic expression can be generated by sequentially going through -the list of predicated expressions and applying the update function -defined in Section~\goto{3.1}[abstr_language]. The resulting predicated -expressions is guaranteed to have predicates which are mutually -exclusive, meaning if the predicate evaluates to true, all other -predicates must evaluate to false. - -\subsection[title={Linear (Flat) Predicated -Expressions},reference={linear-flat-predicated-expressions}] - -The update function for predicated expressions is quite a bit more -complex compared to standard update functions for symbolic execution, -such as the one used by . 
The main reason for this is the non-recursive -nature of the predicated expressions. As the predicates are only -available at the top-level of the expressions, combining multiple -expressions means that the conditions and multiple possibilities also -need to trickle up to the top. - -Instead, if the predicated instructions were recursive, one would not -have to change any predicates when evaluating the argument lists of -instructions, for example, as these can just stay predicated. The -evaluation semantics would also be much simpler, because a rule can be -added for evaluating predicates. - -\section[title={Correctness of the Abstract -Interpretation},reference={sec:correctness}] - -The correctness of the scheduling step is dependent on the correctness -of the equivalence check between the two abstract languages. As the -abstract languages contain predicates, this comparison check cannot just -be a syntactic comparison between the two languages, because the -predicates might be different but logically equivalent. Instead, a mixed -equivalence check is performed using a verified SAT solver and a -syntactic equivalence check of the actual symbolic expressions. - -\subsection[title={Comparing Symbolic Abstract -Expressions},reference={comparing-symbolic-abstract-expressions}] - -Abstract expressions comprise a predicate and a symbolic expression. The -translation from RTLBlock and RTLPar to Abstract ensures that each -register will contain a predicated symbolic expression that describes -the value that will be assigned to it after evaluating the basic block. -This means that if an expression always produces the same result as -another expression, that the register will always contain the same -value. If the RTLBlock program and RTLPar program translate to two -Abstract languages where all the register's symbolic expressions match, -then this means that both programs will always produce the same value. 
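The mixed equivalence check described here, a syntactic comparison of the expressions combined with a semantic comparison of the predicates, can be sketched in a few lines. This is an illustrative model only, not the Coq development: the tuple encoding of predicates, the function names, and the brute-force enumeration (standing in for the verified DPLL SAT solver) are all assumptions made for the example.

```python
# Illustrative sketch (not the Coq development) of the mixed equivalence
# check: expressions are compared syntactically, predicates semantically.
# Predicates are nested tuples: ('lit', v), ('not', p), ('and', p, q),
# ('or', p, q).  The brute-force check stands in for the verified SAT solver.

from itertools import product

def pred_vars(p):
    """Collect the variables occurring in a predicate."""
    tag = p[0]
    if tag == 'lit':
        return {p[1]}
    if tag == 'not':
        return pred_vars(p[1])
    return pred_vars(p[1]) | pred_vars(p[2])

def eval_pred(p, asgn):
    """Evaluate a predicate under an assignment of variables to booleans."""
    tag = p[0]
    if tag == 'lit':
        return asgn[p[1]]
    if tag == 'not':
        return not eval_pred(p[1], asgn)
    if tag == 'and':
        return eval_pred(p[1], asgn) and eval_pred(p[2], asgn)
    return eval_pred(p[1], asgn) or eval_pred(p[2], asgn)

def equiv(p, q):
    """Semantic equivalence by enumerating all assignments; the verified
    development uses a DPLL SAT solver instead of enumeration."""
    vs = sorted(pred_vars(p) | pred_vars(q))
    return all(
        eval_pred(p, dict(zip(vs, bs))) == eval_pred(q, dict(zip(vs, bs)))
        for bs in product([False, True], repeat=len(vs))
    )

def pexpr_lists_equiv(l1, l2):
    """Mixed check on two predicated expression lists: every expression
    (compared syntactically) must be guarded by logically equivalent
    predicates in both lists."""
    def covered(a, b):
        return all(any(ea == eb and equiv(pa, pb) for (pb, eb) in b)
                   for (pa, ea) in a)
    return covered(l1, l2) and covered(l2, l1)
```

For instance, `equiv(('and', ('lit', 'a'), ('lit', 'b')), ('and', ('lit', 'b'), ('lit', 'a')))` holds even though the two predicates differ syntactically, which is exactly the case a purely syntactic comparison would miss.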
- -The main comparison between two symbolic expressions can be done using a -simple syntactic comparison, because this ensures that each of the -expressions always evaluate to the same value, and is also general -enough for the expressions that need to be compared. However, as all -expressions might be predicated, this simple syntactic comparison cannot -be performed for the predicates. This is because predicated instructions -can be executed in many different orders, and the predicates are used to -calculate dependence information and therefore also determine the order -of operations. - -Instead, comparing predicates can be done by checking logical -equivalence of the predicates using a formally verified SAT solver. This -ensures that the predicates will always evaluate to the same value, -thereby making it a fair comparison. - -\subsection[title={Verified Decidable DPLL SAT -Solver},reference={verified_sat}] - -A SAT solver is needed to correctly and reliably compare predicated -expressions. This makes it possible to then compare many different -possible schedules, even if they remove expressions completely if these -are unnecessary. The SAT solver is needed in the verifier, which needs -to be proven correct in Coq to prove the scheduling translation correct. -The SAT solver therefore has to be verified, because it could otherwise -not be used in the proof of correctness. - -The DPLL algorithm is used to implement the SAT solver. This means that -the SAT solver takes in a conjunctive normal form (CNF) formula. The -conversion from a recursive predicate type into a CNF formula is done by -recursively expanding the predicate in equivalent CNF formulas. - -The SAT solver is implemented in Coq and proven to be sound and -complete. 
The soundness and completeness of an algorithm to check
whether a predicate is satisfiable or not also directly implies that the
check is decidable, simplifying many proofs in Coq by using a
\type{Setoid} instance for equivalence between predicates. This makes
rewriting predicates if they are equivalent much easier in proofs by
reusing many of the existing Coq tactics.

\subsection[title={Encoding Predicated Expressions as SAT
Formulas},reference={encoding-predicated-expressions-as-sat-formulas}]

Given the following two predicated expressions which should be tested
for equivalence, one can construct a SAT expression which will ensure
that both expressions will always result in the same value.
Using the example shown in Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain], the RTLBlock hyperblock
shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a] is scheduled into the RTLPar hyperblock
shown in Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. The first step is to translate the
input and output of the scheduler into the Abstr intermediate language by calculating the symbolic
values that will be stored in each of the registers. Symbolically evaluating the RTLBlock hyperblock
results in the expressions shown in
Figure~\goto{{[}fig:symbolic_expressions{]}}[fig:symbolic_expressions] for registers \type{r1},
\type{r2}, and \type{r3}, where the state of register \type{r1} at the start of the hyperblock is
denoted as $\mono{r1}^{0}$.

These symbolic expressions can be generated by sequentially going through the list of predicated
expressions and applying the update function defined in Section~\goto{3.1}[abstr_language]. The
resulting predicated expressions are guaranteed to have predicates which are mutually exclusive,
meaning that if one predicate evaluates to true, all other predicates must evaluate to false. 
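The two primitives on predicated types used by the update function (the multiplication of eq. 1 and the append of eqs. 2 and 3) can be modelled directly, representing a predicated type as a list of (predicate, value) pairs. This is a small executable sketch under that assumption; the function names are illustrative and predicates are kept as abstract nested tuples.

```python
# Sketch of the predicated-type primitives from eqs. 1-3, with a
# predicated type modelled as a list of (predicate, value) pairs and
# predicates as nested tuples such as ('lit', 'p') or ('not', p).

def pt_mul(f, p1, p2):
    """Multiplication (eq. 1): Cartesian product of the two lists,
    conjoining the predicates pairwise and joining the values with f."""
    return [(('and', pa, pb), f(ea, eb)) for (pa, ea) in p1 for (pb, eb) in p2]

def pt_guard(p, pt):
    """mu(p, P) (eq. 2): conjoin p with every predicate in P."""
    return [(('and', p, q), e) for (q, e) in pt]

def pt_append(p, p1, p2):
    """P1 (+)_p P2 (eq. 3): guard P1 with the negation of p and P2 with p,
    then concatenate the two lists."""
    return pt_guard(('not', p), p1) + pt_guard(p, p2)
```

Note how `pt_append` preserves mutual exclusivity: if the predicates within each input list are mutually exclusive, then guarding one list with `p` and the other with its negation keeps every pair of predicates in the result mutually exclusive as well.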
+

\subsection[linear-flat-predicated-expressions]{Linear (Flat) Predicated Expressions}

The update function for predicated expressions is considerably more complex than standard update
functions for symbolic execution, such as the one used by . The main reason for this is the
non-recursive nature of the predicated expressions. As the predicates are only available at the
top level of the expressions, combining multiple expressions means that the conditions and multiple
possibilities also need to trickle up to the top.

If the predicated expressions were instead recursive, one would not have to change any predicates
when evaluating the argument lists of instructions, for example, as these could just stay
predicated. The evaluation semantics would also be much simpler, because a rule could be added for
evaluating predicates.

\section[sec:correctness]{Correctness of the Abstract Interpretation}

The correctness of the scheduling step depends on the correctness of the equivalence check between
the two abstract languages. As the abstract languages contain predicates, this comparison cannot
just be a syntactic comparison between the two languages, because the predicates might be different
but logically equivalent. Instead, a mixed equivalence check is performed using a verified SAT
solver and a syntactic equivalence check of the actual symbolic expressions.

\subsection[comparing-symbolic-abstract-expressions]{Comparing Symbolic Abstract Expressions}

Abstract expressions comprise a predicate and a symbolic expression. The translation from RTLBlock
and RTLPar to Abstr ensures that each register will contain a predicated symbolic expression that
describes the value that will be assigned to it after evaluating the basic block. This means that
if an expression always produces the same result as another expression, the register will always
contain the same value. 
If the RTLBlock program and RTLPar program translate to two Abstract
languages where all the registers' symbolic expressions match, then this means that both programs
will always produce the same value.

The main comparison between two symbolic expressions can be done using a simple syntactic
comparison, because this ensures that each of the expressions always evaluates to the same value,
and it is also general enough for the expressions that need to be compared. However, as all
expressions might be predicated, this simple syntactic comparison cannot be performed for the
predicates. This is because predicated instructions can be executed in many different orders, and
the predicates are used to calculate dependence information and therefore also determine the order
of operations.

Instead, comparing predicates can be done by checking logical equivalence of the predicates using a
formally verified SAT solver. This ensures that the predicates will always evaluate to the same
value, thereby making it a sound comparison.

\subsection[verified_sat]{Verified Decidable DPLL SAT Solver}

A SAT solver is needed to correctly and reliably compare predicated expressions. This makes it
possible to compare many different possible schedules, even ones that remove unnecessary
expressions completely. The SAT solver is needed in the verifier, which needs to be proven correct
in Coq to prove the scheduling translation correct. The SAT solver therefore has to be verified,
because it could otherwise not be used in the proof of correctness.

The DPLL algorithm is used to implement the SAT solver. This means that the SAT solver takes in a
conjunctive normal form (CNF) formula. The conversion from a recursive predicate type into a CNF
formula is done by recursively expanding the predicate into equivalent CNF formulas.

The SAT solver is implemented in Coq and proven to be sound and complete. 
The soundness and +completeness of an algorithm to check whether a predicate is satisfiable or not also directly +implies that the check is decidable, simplifying many proofs in Coq by using a \type{Setoid} +instance for equivalence between predicates. This makes rewriting predicates if they are equivalent +much easier in proofs by reusing many of the existing Coq tactics. + +\subsection[encoding-predicated-expressions-as-sat-formulas]{Encoding Predicated Expressions as SAT + Formulas} + +Given the following two predicated expressions which should be tested for equivalence, one can +construct a SAT expression which will ensure that both expressions will always result in the same +value. %\startformula \begin{aligned} % \bgroup \left.\begin{aligned} @@ -465,33 +353,28 @@ that both expressions will always result in the same value. % e_{m}',\quad &p_{m}'\\ % \end{cases}\end{aligned} \stopformula -The SAT query that should determine if two predicated expressions are -equivalent is the following, where $\mathit{he}_{i}$ is the hashed -expression of $e_{i}$: +The SAT query that should determine if two predicated expressions are equivalent is the following, +where $\mathit{he}_{i}$ is the hashed expression of $e_{i}$: %\startformula \begin{gathered} % (p_{1} \implies \mathit{he}_{1}) \land (p_{2} \implies \mathit{he}_{2}) \land \cdots \land (p_{n} \implies \mathit{he}_{n}) \\ % \Longleftrightarrow\\ % (p_{1}' \implies \mathit{he}_{1}') \land (p_{2}' \implies \mathit{he}_{2}') \land \cdots \land (p_{m}' \implies \mathit{he}_{m}') \\\end{gathered} \stopformula -However, reasoning about the equivalence of individual expressions -within the greater list of predicated expressions is not quite -straightforward. The evaluation semantics of predicated expression list -will evaluate each predicate expression sequentially, and pick the -expression whose predicate evaluates to true. 
However, proving that both
predicated expression lists will return the same value assuming that the
SAT query above is true, means that for each predicate that is true, One
needs to be able to show that there is a predicate in the other list
that will also be true, which will return the same value. However, there
is no guarantee that there is a unique predicated expression that will
match the current predicate, as there may be duplicate expressions in
the list with different predicates.

This means that it is quite difficult to prove that the SAT query
implies the semantic preservation of the scheduling transformation. The
SAT query can therefore be broken down into a slightly more direct
comparison that is more useful for the proof itself. The steps to
perform the more fine grained comparison are the following:
However, reasoning about the equivalence of individual expressions within the greater list of
predicated expressions is not quite straightforward. The evaluation semantics of a predicated
expression list evaluates each predicated expression sequentially and picks the expression whose
predicate evaluates to true. Proving that both predicated expression lists will return the same
value, assuming that the SAT query above holds, means that for each predicate that is true, one
needs to be able to show that there is a predicate in the other list that will also be true and
return the same value. However, there is no guarantee that there is a unique predicated expression
matching the current predicate, as there may be duplicate expressions in the list with different
predicates.

This means that it is quite difficult to prove that the SAT query implies the semantic preservation
of the scheduling transformation. The SAT query can therefore be broken down into a slightly more
direct comparison that is more useful for the proof itself. 
The steps to perform the more
fine-grained comparison are the following:

%\startenumerate[n]
%\item
@@ -507,81 +390,65 @@ perform the more fine grained comparison are the following:
% one of the maps, then its predicate should be equivalent to $\perp$.
%\stopenumerate

This comparison is logically equivalent to performing the large SAT
query, however, it maps better to the proof because it is comparing the
predicates of the same expressions with each other.
This comparison is logically equivalent to performing the large SAT query; however, it maps better
to the proof because it compares the predicates of the same expressions with each other.

\subsection[proof-of-equivalence-between-passes]{Proof of equivalence between passes}

Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the
four different passes that were added to CompCert to form hyperblocks.
This section will cover the proofs for each of the translations,
focusing mainly on the proof of translation of the scheduling pass.
Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the four different passes that
were added to CompCert to form hyperblocks. This section will cover the proofs for each of the
translations, focusing mainly on the proof of translation of the scheduling pass.

\subsubsection[proof-of-correctness-of-the-if-conversion-pass]{Proof of Correctness of the
  If-Conversion Pass}

The if-conversion pass is quite simple to verify, as it is an
optimisation that transforms the code in the same language.
The if-conversion pass is quite simple to verify, as it is an optimisation that transforms the code
within the same language.

\subsubsection[proof-of-correctness-of-scheduling]{Proof of Correctness of Scheduling}

The scheduling pass is formally verified using translation validation,
using the Abstr language as a point of comparison between RTLBlock and
RTLPar. 
Using the algorithm defined in Section~\goto{3}[abstr_interp],
the formally verified SAT solver described in
Section~\goto{4.2}[verified_sat] can be used prove that if the two
predicated expression lists are equivalent, that they will always
evaluate to the same value.
The scheduling pass is formally verified using translation validation, using the Abstr language as a
point of comparison between RTLBlock and RTLPar. Using the algorithm defined in
Section~\goto{3}[abstr_interp], the formally verified SAT solver described in
Section~\goto{4.2}[verified_sat] can be used to prove that if the two predicated expression lists
are equivalent, they will always evaluate to the same value.

\section[implementation-details-of-static-sdc-scheduling]{Implementation Details of Static SDC
  Scheduling}

This section describes the implementation details of the static SDC
scheduling algorithm used, and also describes some of the details of the
basic block generation as well as the if-conversion pass which generates
the predicated instructions.
This section describes the implementation details of the static SDC scheduling algorithm used, as
well as some of the details of the basic block generation and the if-conversion pass which
generates the predicated instructions.

\subsection[implementation-of-if-conversion]{Implementation of If-Conversion}

If-conversion is the conversion that introduces predicated instructions
which can make use of the hyperblocks. It converts conditional
statements into predicated instructions. This transformation has a few
limitations on the kind of conditional statements that it can translate.
First, only conditional statements without loops can be translated,
therefore, one must identify cycles in the control-flow before
performing the if-conversion. Secondly, if-conversion will not always
result in more efficient code, so it should not be applied to any
conditional statements. 
Instead, it is best applied to conditional
statements where each branch will take a similar amount of time to
execute.

For back ends that do not support predicates, it is also important to
eliminate the predicates again using a process called reverse
if-conversion, which creates branches out of the predicates and groups
these together again. This allows for hyperblock scheduling to also be
used with back ends that do not support predicates, which are most the
default back ends included in CompCert.

\subsection[title={Static SDC
Scheduling},reference={static-sdc-scheduling}]

System of difference constraints (SDC) scheduling~ is a flexible
algorithm to perform scheduling of instructions, especially if the
target is hardware directly. It allows for flexible definitions of
optimisation functions in terms of latencies and dependencies of the
operations in the basic block.

\section[title={Performance
Comparison},reference={performance-comparison}]

\section[title={Related Work},reference={sec:related-work}]

\section[title={Conclusion},reference={conclusion}]

This material is based upon work supported by the under Grant No.~ and
Grant No.~. Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the author and do not
necessarily reflect the views of the National Science Foundation.
If-conversion is the transformation that introduces the predicated instructions which make up
hyperblocks. It converts conditional statements into predicated instructions. This transformation
has a few limitations on the kind of conditional statements that it can translate. First, only
conditional statements without loops can be translated; therefore, one must identify cycles in the
control flow before performing the if-conversion. Secondly, if-conversion will not always result in
more efficient code, so it should not be applied to every conditional statement. 
Instead, it is best
applied to conditional statements where each branch will take a similar amount of time to execute.

For back ends that do not support predicates, it is also important to eliminate the predicates again
using a process called reverse if-conversion, which creates branches out of the predicates and
groups these together again. This allows hyperblock scheduling to also be used with back ends that
do not support predicates, which includes most of the default back ends in CompCert.

\subsection[static-sdc-scheduling]{Static SDC Scheduling}

System of difference constraints (SDC) scheduling~ is a flexible algorithm to perform scheduling of
instructions, especially when targeting hardware directly. It allows for flexible definitions of
optimisation functions in terms of the latencies and dependencies of the operations in the basic
block.

\section[performance-comparison]{Performance Comparison}

\section[sec:related-work]{Related Work}

\section[conclusion]{Conclusion}

This material is based upon work supported by the under Grant No.~ and Grant No.~. Any opinions,
findings, and conclusions or recommendations expressed in this material are those of the author and
do not necessarily reflect the views of the National Science Foundation.

\stopcomponent
-- cgit
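The core of the SDC formulation mentioned in the Static SDC Scheduling subsection can be sketched generically: dependence and latency constraints are encoded as difference constraints of the form $s_u - s_v \leq c$, and a feasible schedule is read off from single-source shortest paths on the constraint graph. This is a textbook illustration under assumed encodings (Bellman-Ford on a virtual source), not the implementation described in this chapter, and all names are hypothetical.

```python
# Generic sketch of SDC-style scheduling: each constraint (u, v, c) means
# s_u - s_v <= c (e.g. a dependence "v before u with latency l" is encoded
# as (v, u, -l)).  Feasibility and a schedule follow from Bellman-Ford
# shortest paths on the constraint graph with a virtual source.

def sdc_schedule(n_ops, constraints):
    """Return start times for n_ops operations satisfying all difference
    constraints, or None if the constraints are infeasible."""
    # Constraint s_u <= s_v + c becomes an edge v -> u with weight c;
    # a virtual source (node n_ops) reaches every node with weight 0.
    edges = [(v, u, c) for (u, v, c) in constraints]
    edges += [(n_ops, u, 0) for u in range(n_ops)]
    dist = [float('inf')] * (n_ops + 1)
    dist[n_ops] = 0
    for _ in range(n_ops + 1):
        changed = False
        for (a, b, w) in edges:
            if dist[a] + w < dist[b]:
                dist[b] = dist[a] + w
                changed = True
        if not changed:
            break
    else:
        # A change in the extra iteration means a negative cycle:
        # the difference constraints are contradictory.
        return None
    # Shift so the earliest operation starts at time 0.
    base = -min(dist[:n_ops])
    return [d + base for d in dist[:n_ops]]
```

For example, a single dependence with latency 1 between two operations, encoded as `sdc_schedule(2, [(0, 1, -1)])`, yields the schedule `[0, 1]`; adding the contradictory constraint `(1, 0, 0)` makes the system infeasible. Real SDC schedulers additionally optimise an objective (e.g. total latency) over these constraints with an LP solver rather than just finding a feasible point.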