Diffstat (limited to 'chapters/scheduling.tex')
-rw-r--r--   chapters/scheduling.tex   243
1 file changed, 137 insertions(+), 106 deletions(-)
diff --git a/chapters/scheduling.tex b/chapters/scheduling.tex
index 2463099..75ae1fd 100644
--- a/chapters/scheduling.tex
+++ b/chapters/scheduling.tex
@@ -32,18 +32,18 @@ the performance. Trace scheduling~ is an alternative that addresses this issue,
that might cross branching instructions, and scheduling the instructions in that path. However, the
main problem with trace scheduling is that it is often infeasible on large programs, and because of
the large design space, it's sometimes hard to find an optimal
-schedule. \index{superblock}Superblock~ and \index{hyperblock}hyperblock~ scheduling are two subsets
-of trace scheduling, which compromise on the flexibility of trace scheduling to instead build more
-tractable algorithms. Superblock scheduling produces more efficient code for processors with a low
-issue rate, whereas hyperblock scheduling produces more performant code for processors with a high
-issue rate~.
-
-CompCert~ is a compiler that is formally verified in Coq. In addition to that, have also added a
-formally verified hardware back end to CompCert, allowing it to also generate provably correct
-hardware designs. A hardware back end is an ideal target for hyperblock scheduling, as the issue
-rate is practically infinite. This paper describes the implementation of a hyperblock instruction
-scheduler in CompCert so that it can benefit existing CompCert back ends, as well as support more
-general targets such as scheduling instructions for custom hardware.
+schedule. \index{superblock}Superblock~\cite[hwu93_super] and
+\index{hyperblock}hyperblock~\cite[mahlke92_effec_compil_suppor_predic_execut_using_hyper]
+scheduling are two subsets of trace scheduling, which compromise on the flexibility of trace
+scheduling to instead build more tractable algorithms. Superblock scheduling produces more
+efficient code for processors with a low issue rate, whereas hyperblock scheduling produces more
+performant code for processors with a high issue
+rate~\cite[mahlke92_effec_compil_suppor_predic_execut_using_hyper,aiken16_trace_sched].
+
+A hardware back end is an ideal target for hyperblock scheduling, as the issue rate is practically
+infinite. This paper describes the implementation of a hyperblock instruction scheduler in CompCert
+so that it can benefit existing CompCert back ends, as well as support more general targets such as
+scheduling instructions for custom hardware.
\subsection[key-points]{Key Points}
@@ -61,9 +61,36 @@ the key points of this paper are the following:
\section[sec:scheduling]{Implementation of Hyperblocks in CompCert}
\index{hyperblock}This section describes the structure of hyperblocks in
-Section~\goto{2.1}[scheduling:hyperblocks]. Then, the structure of two extra intermediate languages
-that were added to CompCert to implement hyperblocks are also shown in
-Section~\goto{2.2}[sec:rtlblockdef].
+\in{Section}[scheduling:hyperblocks]. Then, the structure of two extra intermediate languages that
+were added to CompCert to implement hyperblocks are also shown in \in{Section}[sec:rtlblockdef].
+
+\startplacemarginfigure[reference={fig:compcert_interm},title={New intermediate languages introduced into
+ CompCert.}]
+ \hbox{\starttikzpicture[shorten >=1pt,>=stealth]
+ \node at (-3, 0) (dotsin) {$\cdots$};
+ \node at (10, 0) (dotsout) {$\cdots$};
+ \node[draw] at (-1, 0) (rtl) {\tt RTL};
+ \node[draw] at (2, 0) (rtlblock) {\tt RTLBlock};
+ \node[draw] at (2, -2) (abstrblock) {\tt Abstr};
+ \node[draw] at (6, 0) (rtlpar) {\tt RTLPar};
+ \node[draw] at (6, -2) (abstrpar) {\tt Abstr};
+ \node[draw] at (8, 0) (htl) {\tt HTL};
+
+ \draw[->] (rtl) -- node[above,midway,align=center,font=\small]
+ {basic-block \\[-0.3em] creation} (rtlblock);
+ \draw[->] (rtlblock) -- node[above,midway,font=\small] {scheduling} (rtlpar);
+ \draw[->] (rtlpar) -- (htl);
+ \draw[->,dashed] (rtlblock) -- (abstrblock);
+ \draw[->,dashed] (rtlpar) -- (abstrpar);
+ \draw[dashed,shorten >=0pt] (abstrblock) -- node[above,midway] {$\sim$} (abstrpar);
+ \draw[->] (rtlblock) to [out=130,in=50,loop,looseness=10]
+ node[above,midway,font=\small] {if-conversion} (rtlblock);
+ \draw[->] (rtlpar) to [out=130,in=50,loop,looseness=10]
+ node[above,midway,align=center,font=\small] { operation \\[-0.3em] pipelining} (rtlpar);
+ \draw[->] (dotsin) -- (rtl);
+ \draw[->] (htl) -- (dotsout);
+ \stoptikzpicture}
+\stopplacemarginfigure
\subsection[scheduling:hyperblocks]{Hyperblocks as a Generalisation of Basic Blocks}
@@ -75,48 +102,48 @@ instructions to eliminate of any delay slots caused by the CPU pipeline.
However, due to control-flow not being allowed in basic blocks, the number of instructions that can
be scheduled at any one time is limited. The boundaries to other basic blocks act as a hard boundary
-to the scheduling algorithm. hyperblocks~ allow instructions to be predicated, thereby introducing
-some control-flow information into the blocks, which can also be taken into account by the
-scheduler. For example, any data dependency between instructions that have mutually exclusive
-predicates can be removed, as these instructions will never be activate at the same time.
+to the scheduling
+algorithm. Hyperblocks~\cite[mahlke92_effec_compil_suppor_predic_execut_using_hyper] allow
+instructions to be predicated, thereby introducing some control-flow information into the blocks,
+which can also be taken into account by the scheduler. For example, any data dependency between
+instructions that have mutually exclusive predicates can be removed, as these instructions will
+never be active at the same time.
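
One way to detect such mutually exclusive predicates is to check that the conjunction of the two
predicates is unsatisfiable. The OCaml sketch below only illustrates this idea; the \type{pred}
representation and the \type{sat} check are hypothetical stand-ins (the latter could, for instance,
be provided by the verified SAT solver described later).

\starttyping
(* Minimal sketch: predicates as boolean formulae over predicate registers,
   mirroring the informal notation used in the text. *)
type pred =
  | Pvar of int             (* predicate register *)
  | Pnot of pred
  | Pand of pred * pred
  | Por  of pred * pred

(* Two predicated instructions can never be active at the same time if the
   conjunction of their predicates is unsatisfiable, so any dependency
   between them can be dropped by the scheduler.  [sat] is an assumed
   satisfiability check. *)
let mutually_exclusive (sat : pred -> bool) (p1 : pred) (p2 : pred) : bool =
  not (sat (Pand (p1, p2)))
\stoptyping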
\subsection[sec:rtlblockdef]{RTLBlock and RTLPar Intermediate Language Definition}
-\lindex{RTLBlock}\lindex{RTLPar}Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows
-the intermediate languages introduced to implement hyperblocks in CompCert. This consists of new
-RTLBlock and RTLPar intermediate languages, which implement the sequential and parallel semantics of
-basic blocks respectively. The semantics of RTLBlock and RTLPar are based on the CompCert RTL
-intermediate language. However, instead of mapping from states to instructions, RTLBlock maps from
-states to hyperblocks, and RTLPar maps from states to parallel hyperblocks, which will be described
-in the next sections.
+\lindex{RTLBlock}\lindex{RTLPar}\in{Figure}[fig:compcert_interm] shows the intermediate languages
+introduced to implement hyperblocks in CompCert. This consists of new RTLBlock and RTLPar
+intermediate languages, which implement the sequential and parallel semantics of basic blocks
+respectively. The semantics of RTLBlock and RTLPar are based on the CompCert RTL intermediate
+language. However, instead of mapping from states to instructions, RTLBlock maps from states to
+hyperblocks, and RTLPar maps from states to parallel hyperblocks, which will be described in the
+next sections.
\placeformula \startformula \startmathalignment
-\NC i\ \ \colon\colon= \NC \ \ \mono{RBnop} \NR
-\NC \NC |\ \ \mono{RBop}\ p?\ \mathit{op}\ \vec{r}\ d \NR
-\NC \NC |\ \ \mono{RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \NR
-\NC \NC |\ \ \mono{RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \NR
-\NC \NC |\ \ \mono{RBsetpred}\ p?\ c\ \vec{r}\ d \NR
+\NC i\ \ \colon\colon= \NC \ \ \text{\tt RBnop} \NR
+\NC \NC |\ \ \text{\tt RBop}\ p?\ \mathit{op}\ \vec{r}\ d \NR
+\NC \NC |\ \ \text{\tt RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \NR
+\NC \NC |\ \ \text{\tt RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \NR
+\NC \NC |\ \ \text{\tt RBsetpred}\ p?\ c\ \vec{r}\ d \NR
\stopmathalignment \stopformula
\placeformula \startformula \startmathalignment
-\NC i_{\mathit{cf}}\ \ \colon\colon= \NC \ \ \mono{RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \NR
-\NC \NC |\ \ \mono{RBtailcall}\ \mathit{sig}\ f\ \vec{r} \NR
-\NC \NC |\ \ \mono{RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \NR
-\NC \NC |\ \ \mono{RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \NR
-\NC \NC |\ \ \mono{RBjumptable}\ r\ \vec{n} \NR
-\NC \NC |\ \ \mono{RBreturn}\ r? \NR
-\NC \NC |\ \ \mono{RBgoto}\ n \NR
-\NC \NC |\ \ \mono{RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \NR
+\NC i_{\mathit{cf}}\ \ \colon\colon= \NC \ \ \text{\tt RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \NR
+\NC \NC |\ \ \text{\tt RBtailcall}\ \mathit{sig}\ f\ \vec{r} \NR
+\NC \NC |\ \ \text{\tt RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \NR
+\NC \NC |\ \ \text{\tt RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \NR
+\NC \NC |\ \ \text{\tt RBjumptable}\ r\ \vec{n} \NR
+\NC \NC |\ \ \text{\tt RBreturn}\ r? \NR
+\NC \NC |\ \ \text{\tt RBgoto}\ n \NR
+\NC \NC |\ \ \text{\tt RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \NR
\stopmathalignment \stopformula
RTLBlock instructions are split into two types: standard instructions and
control-flow instructions. The standard instructions are the instructions that can be placed into
hyperblocks, whereas control-flow instructions are the instructions that can end hyperblocks. The
-standard instructions are shown in
-Figure~\goto{{[}fig:standard_instruction{]}}[fig:standard_instruction], whereas the control-flow
-instructions are shown in
-Figure~\goto{{[}fig:control_flow_instruction{]}}[fig:control_flow_instruction]. Most instructions
-are quite similar to their RTL counterparts, however, there are some instructions that have been
+standard instructions are shown in \in{Figure}[fig:standard_instruction], whereas the control-flow
+instructions are shown in \in{Figure}[fig:control_flow_instruction]. Most instructions are quite
+similar to their RTL counterparts; however, some instructions have been
added. \type{RBsetpred} is a standard instruction which sets a predicate equal to an evaluated
condition. This instruction is used during if-conversion to set the predicate to the value of the
condition in the conditional statement instruction. To support proper predicated instructions, each
@@ -124,16 +151,19 @@ instruction therefore also contains an optional predicate which has been set bef
produces the hyperblock. In addition to that, there is also an extra control-flow instruction called
\type{RBpred_cf}, which can branch on a predicate and nest more control-flow instructions inside of
it. This is also necessary for if-conversion, when converting an already converted conditional
-statement, as depending on the predicate, a different control-flow instruction may be
-necessary.
+statement, as depending on the predicate, a different control-flow instruction may be necessary.
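+
+As a rough, informal rendering of the two grammars above, the instructions could be written as the
+following ML-style datatypes. This is only a sketch: the placeholder types stand in for CompCert's
+own definitions of operations, addressing modes, conditions, signatures and so on, and the real Coq
+definitions may differ in detail.
+
+\starttyping
+(* Placeholder types standing in for CompCert's own definitions. *)
+type reg = int  and pred_reg = int  and node = int
+type operation
+type addressing
+type mem_chunk
+type condition
+type signature
+type ident
+type external_function
+type pred
+
+(* Standard instructions: each carries an optional predicate. *)
+type instr =
+  | RBnop
+  | RBop      of pred option * operation * reg list * reg
+  | RBload    of pred option * mem_chunk * addressing * reg list * reg
+  | RBstore   of pred option * mem_chunk * addressing * reg list * reg
+  | RBsetpred of pred option * condition * reg list * pred_reg
+
+(* Control-flow instructions: these may only end a hyperblock. *)
+type cf_instr =
+  | RBcall      of signature * ident * reg list * reg * node
+  | RBtailcall  of signature * ident * reg list
+  | RBbuiltin   of external_function * reg list * reg * node
+  | RBcond      of condition * reg list * node * node
+  | RBjumptable of reg * node list
+  | RBreturn    of reg option
+  | RBgoto      of node
+  | RBpred_cf   of pred * cf_instr * cf_instr
+\stoptyping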
+
+\placeformula[fig:hb_def]\startformula\startalign[align={1:left}]
+\NC \blockbb \qquad \eqdef \qquad (\text{\tt list}\ i) \times i_{\mathit{cf}} \NR
+\NC \parbb \qquad \eqdef \qquad (\text{\tt list}\ \text{\tt list}\ \text{\tt list}\ i) \times i_{\mathit{cf}}\NR
+\stopalign\stopformula
These instructions are used in RTLBlock as well as in RTLPar. The main difference between these two
languages is how these instructions are arranged within the hyperblock and the execution semantics
-of the hyperblock in the languages. Let $\mathit{hb}_{b}$ be the definition of a hyperblock for
-RTLBlock, and let $\mathit{hb}_{p}$ be the definition of a hyperblock for RTLPar, then
-$\mathit{hb}_{b}$ and $\mathit{hb}_{p}$ be defined as in
-Figure~\goto{{[}fig:hb_def{]}}[fig:hb_def]. The Figure shows the different nesting levels of the
-basic blocks in RTLBlock as well as RTLPar, and where the parallel semantics of RTLPar come into
+of the hyperblock in the languages. Let $\blockbb$ be the definition of a hyperblock for RTLBlock,
+and let $\parbb$ be the definition of a hyperblock for RTLPar; then $\blockbb$ and $\parbb$ are
+defined as in \in{Equation}[fig:hb_def]. The equation shows the different nesting levels of the basic
+blocks in RTLBlock as well as RTLPar, and where the parallel semantics of RTLPar come into
play. RTLBlock is made of a list of instructions and a control-flow instruction that ends the
hyperblock. Each instruction in RTLBlock is executed sequentially. RTLPar is made of three separate
lists. The outer list behaves like the list in RTLBlock, and executes each list inside of it
@@ -157,27 +187,25 @@ hyperblocks and rearranges their instructions to maximise parallelism based on t
dependencies. It then arranges the instructions in RTLPar by putting independent instructions into
the same parallel block. Scheduling is an optimisation that is well suited to translation
validation, as the algorithm itself can be quite complex, whereas the output is normally relatively
-easy to check for equivalence with the input.
-Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows that the scheduling step comes
-right after the if-conversion step which originally creates the hyperblocks.
-
-Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain] shows how the scheduling step transforms RTLBlock
-into RTLPar. The RTLBlock hyperblock being scheduled is shown in
-Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a], which contains five predicated operations,
-comprising two additions and three multiplications. The data dependencies of the instructions in the
-hyperblock are shown in Figure~\goto{{[}fig:op_chain_b{]}}[fig:op_chain_b]. Curiously, even though
+easy to check for equivalence with the input. \in{Figure}[fig:compcert_interm] shows that the
+scheduling step comes right after the if-conversion step which originally creates the hyperblocks.
+
+\in{Figure}[fig:op_chain] shows how the scheduling step transforms RTLBlock into RTLPar. The
+RTLBlock hyperblock being scheduled is shown in \in{Figure}[fig:op_chain_a], which contains five
+predicated operations, comprising two additions and three multiplications. The data dependencies of
+the instructions in the hyperblock are shown in \in{Figure}[fig:op_chain_b]. Curiously, even though
operations and are normally dependent on each other due to a write-after-write conflict, the
predicates are independent, so the conflict can effectively be removed by the
scheduler.
The scheduler transforms the RTLBlock hyperblock into the RTLPar hyperblock shown in
-Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. Even though the addition in is dependent on ,
-they can still be put into the same cycle as the additions do not take as long to complete as the
-multiplications. This optimisation is called operation chaining. Then, the multiplication in can
-also be placed in the same clock cycle as it does not have any data dependencies. Then, in the next
-clock cycle, either the multiplication in can take place, or the multiplication in will take place,
-meaning these two multiplications can also be placed in the same clock cycle. This gives the final
-schedule that is shown in Figure~\goto{{[}fig:op_chain_d{]}}[fig:op_chain_d].
+\in{Figure}[fig:op_chain_c]. Even though the addition in is dependent on , it can still be placed
+in the same cycle, as the additions do not take as long to complete as the multiplications. This
+optimisation is called operation chaining. The multiplication in can also be placed in the same
+clock cycle, as it does not have any data dependencies. In the next clock cycle, either the
+multiplication in or the multiplication in can take place, meaning these two multiplications can
+also be placed in the same clock cycle. This gives the final schedule that is shown in
+\in{Figure}[fig:op_chain_d].
\section[abstr_interp]{Abstract Interpretation of Hyperblocks}
@@ -225,24 +253,25 @@ primitive is multiplication of two predicated types, which is implemented as per
multiplication between the predicated types in the two lists, anding the two predicates and joining
the two types of each list using a function.
-%\startformula \label{eq:1}
-% P_{1} \otimes_{f} P_{2} \equiv \mono{map } (\lambda ((p_{1}, e_{1}), (p_{2}, e_{2})) . (p_{1} \land p_{2}, f\ e_{1}\ e_{2})) P_{1} \times P_{2} \stopformula
+\placeformula[eq:1]\startformula
+  P_{1} \otimes_{f} P_{2} \equiv \text{\tt map } (\lambda ((p_{1}, e_{1}), (p_{2}, e_{2})) . (p_{1}
+  \land p_{2}, f\ e_{1}\ e_{2}))\ (P_{1} \times P_{2}) \stopformula
In addition to that, two further primitives are needed: a strengthening operation, which ands a
given predicate onto every predicate in a predicated type, and an append operation, which
strengthens the first predicated type with the negation of the predicate, strengthens the second
with the predicate itself, and appends the first to the second:
-%\startformula \label{eq:2}
-% \mu(p, P) \equiv \mono{map } (\lambda (p', e') . (p \land p', e'))\ P \stopformula
+\placeformula[eq:2]\startformula
+ \mu(p, P) \equiv \text{\tt map } (\lambda (p', e') . (p \land p', e'))\ P \stopformula
-%\startformula \label{eq:3}
-% P_{1} \oplus_{p} P_{2} \equiv \mu(\neg p, P_{1}) \mathbin{++} \mu(p, P_{2}) \stopformula
+\placeformula[eq:3]\startformula
+  P_{1} \oplus_{p} P_{2} \equiv \mu(\neg p, P_{1}) \mathbin{++} \mu(p, P_{2}) \stopformula
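+
+A possible OCaml rendering of these three primitives, with a predicated value represented as a list
+of guard/value pairs, is sketched below. The names and the representation are illustrative and do
+not necessarily match the Coq development.
+
+\starttyping
+type pred = Pvar of int | Pnot of pred | Pand of pred * pred
+
+(* A predicated value: a list of (guard, value) pairs. *)
+type 'a pred_val = (pred * 'a) list
+
+(* Product of two predicated values: combine every pair of entries,
+   anding the guards and joining the values with [f]. *)
+let otimes (f : 'a -> 'b -> 'c) (p1 : 'a pred_val) (p2 : 'b pred_val) : 'c pred_val =
+  List.concat_map
+    (fun (g1, e1) -> List.map (fun (g2, e2) -> (Pand (g1, g2), f e1 e2)) p2)
+    p1
+
+(* Strengthen every guard in [pv] with the predicate [p]. *)
+let mu (p : pred) (pv : 'a pred_val) : 'a pred_val =
+  List.map (fun (g, e) -> (Pand (p, g), e)) pv
+
+(* Choose between two predicated values depending on [p]: entries of
+   [p1] are kept under [not p], entries of [p2] under [p]. *)
+let oplus (p : pred) (p1 : 'a pred_val) (p2 : 'a pred_val) : 'a pred_val =
+  mu (Pnot p) p1 @ mu p p2
+\stoptyping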
\subsection[example-of-translation]{Example of translation}
%\startformula \begin{aligned}
% \label{eq:4}
-% &\mono{r1} =
+% &\text{\tt r1} =
% \begin{cases}
% \mono{r1}^{0} + \mono{r4}^{0} + \mono{r4}^{0}, \quad &\mono{p1} \\
% \mono{r1}^{0}, \quad &\mono{!p1} \\
@@ -257,20 +286,18 @@ predicated type and append the first predicated type to the second:
% \end{cases}
% \end{aligned} \stopformula
-Using the example shown in Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain], the RTLBlock hyperblock
-shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a] is scheduled into the hyperblock RTLPar
-shown in Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. The first step is to translate the
-input and output of the scheduler into the Abstr intermediate language by calculating the symbolic
-values that will be stored in each of the registers. Symbolically evaluating the RTLBlock hyperblock
-results in the expressions shown in
-Figure~\goto{{[}fig:symbolic_expressions{]}}[fig:symbolic_expressions] for registers \type{r1},
-\type{r2} and \type{r3}, where the state of the register \type{r1} at the start of the hyperblock is
-denoted as $\mono{r1}^{0}$.
+Using the example shown in \in{Figure}[fig:op_chain], the RTLBlock hyperblock shown in
+\in{Figure}[fig:op_chain_a] is scheduled into the RTLPar hyperblock shown in
+\in{Figure}[fig:op_chain_c]. The first step is to translate the input and output of the scheduler
+into the Abstr intermediate language by calculating the symbolic values that will be stored in each
+of the registers. Symbolically evaluating the RTLBlock hyperblock results in the expressions shown
+in \in{Figure}[fig:symbolic_expressions] for registers \type{r1}, \type{r2} and \type{r3}, where the
+state of the register \type{r1} at the start of the hyperblock is denoted as $\text{\tt r1}^{0}$.
This symbolic expression can be generated by sequentially going through the list of predicated
-expressions and applying the update function defined in Section~\goto{3.1}[abstr_language]. The
-resulting predicated expressions is guaranteed to have predicates which are mutually exclusive,
-meaning if the predicate evaluates to true, all other predicates must evaluate to false.
+expressions and applying the update function defined in \in{Section}[abstr_language]. The resulting
+predicated expression is guaranteed to have predicates which are mutually exclusive, meaning that if
+one predicate evaluates to true, all other predicates must evaluate to false.
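+
+In other words, the symbolic evaluation is a left fold of the update function over the instructions
+of the hyperblock; the sketch below uses hypothetical names for the abstract state and the transfer
+function.
+
+\starttyping
+type forest   (* abstract state: registers mapped to predicated expressions *)
+type instr    (* RTLBlock standard instruction *)
+
+(* Sequentially apply the update (transfer) function to every instruction
+   of a hyperblock, starting from an initial state.  The update function is
+   expected to keep the guards within each register's predicated
+   expression mutually exclusive. *)
+let abstract_sequence (update : forest -> instr -> forest)
+    (init : forest) (block : instr list) : forest =
+  List.fold_left update init block
+\stoptyping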
\subsection[linear-flat-predicated-expressions]{Linear (Flat) Predicated Expressions}
@@ -311,7 +338,7 @@ is because predicated instructions can be executed in many different orders, and
used to calculate dependence information and therefore also determine the order of operations.
Instead, comparing predicates can be done by checking logical equivalence of the predicates using a
-formally verified \SAT solver. This ensures that the predicates will always evaluate to the same
+formally verified \SAT\ solver. This ensures that the predicates will always evaluate to the same
value, thereby making it a fair comparison.
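
Logical equivalence of two predicates reduces to a single unsatisfiability query: the predicates are
equivalent exactly when their exclusive or is unsatisfiable. A minimal sketch, assuming a \type{sat}
check supplied by the verified solver, is the following.

\starttyping
type pred = Pvar of int | Pnot of pred | Pand of pred * pred | Por of pred * pred

(* p1 and p2 are logically equivalent iff (p1 /\ ~p2) \/ (~p1 /\ p2)
   has no satisfying assignment. *)
let equivalent (sat : pred -> bool) (p1 : pred) (p2 : pred) : bool =
  not (sat (Por (Pand (p1, Pnot p2), Pand (Pnot p1, p2))))
\stoptyping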
\subsection[verified_sat]{Verified Decidable DPLL SAT Solver}
@@ -377,28 +404,27 @@ of the scheduling transformation. The SAT query can therefore be broken down int
direct comparison that is more useful for the proof itself. The steps to perform this finer-grained
comparison are the following:
-%\startenumerate[n]
-%\item
-% Hash all of the expressions into a unique literal.
-%\item
-% Create a map from expression literals to their corresponding
-% predicate, thereby having a unique predicate for each expressions. If
-% there are duplicates in the list that forms the map, the predicates
-% are combined using the or operation.
-%\item
-% Iterate through the expressions in both maps and compare the
-% predicates using the SAT solver. If an expression is not present in
-% one of the maps, then it's predicate should be equivalent to $\perp$.
-%\stopenumerate
+\startitemize[n]
+\item
+ Hash all of the expressions into a unique literal.
+\item
+ Create a map from expression literals to their corresponding predicate, thereby having a unique
+  predicate for each expression. If there are duplicates in the list that forms the map, the
+ predicates are combined using the or operation.
+\item
+ Iterate through the expressions in both maps and compare the predicates using the SAT solver. If
+  an expression is not present in one of the maps, then its predicate should be equivalent to
+ $\perp$.
+\stopitemize
This comparison is logically equivalent to performing the large SAT query; however, it maps better
to the proof because it compares the predicates of the same expressions with each other.
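
One possible shape of this finer-grained comparison is sketched below in OCaml; the expression hash
and the predicate equivalence check are assumed to be supplied externally (the latter by the
verified SAT solver).

\starttyping
module IntMap = Map.Make (Int)

type pred = Pvar of int | Pnot of pred | Pand of pred * pred | Por of pred * pred

(* A predicate standing in for "false" (bottom). *)
let pfalse = Pand (Pvar 0, Pnot (Pvar 0))

(* Steps 1 and 2: hash every expression to a literal and build a map from
   literals to predicates, or-ing the predicates of duplicates. *)
let build_map (hash : 'e -> int) (pes : (pred * 'e) list) : pred IntMap.t =
  List.fold_left
    (fun m (p, e) ->
       IntMap.update (hash e)
         (function None -> Some p | Some q -> Some (Por (q, p)))
         m)
    IntMap.empty pes

(* Step 3: the predicated expressions agree when, for every literal, the
   corresponding predicates are equivalent; a literal missing from one side
   must have a predicate equivalent to false. *)
let compare_maps (equiv : pred -> pred -> bool)
    (m1 : pred IntMap.t) (m2 : pred IntMap.t) : bool =
  IntMap.for_all
    (fun l p1 ->
       match IntMap.find_opt l m2 with
       | Some p2 -> equiv p1 p2
       | None    -> equiv p1 pfalse)
    m1
  && IntMap.for_all (fun l p2 -> IntMap.mem l m1 || equiv p2 pfalse) m2
\stoptyping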
\subsection[proof-of-equivalence-between-passes]{Proof of equivalence between passes}
-Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the four different passes that
-were added to CompCert to form hyperblocks. This section will cover the proofs for each of the
-translations, focusing mainly on the proof of translation of the scheduling pass.
+\in{Figure}[fig:compcert_interm] shows the four different passes that were added to CompCert to form
+hyperblocks. This section will cover the proofs for each of the translations, focusing mainly on
+the proof of translation of the scheduling pass.
\subsubsection[proof-of-correctness-of-the-if-conversion-pass]{Proof of Correctness of the
If-Conversion Pass}
@@ -410,9 +436,9 @@ in the same language.
The scheduling pass is formally verified using translation validation, with the Abstr language as a
point of comparison between RTLBlock and RTLPar. Using the algorithm defined in
-Section~\goto{3}[abstr_interp], the formally verified SAT solver described in
-Section~\goto{4.2}[verified_sat] can be used prove that if the two predicated expression lists are
-equivalent, that they will always evaluate to the same value.
+\in{Section}[abstr_interp], the formally verified SAT solver described in \in{Section}[verified_sat]
+can be used to prove that if the two predicated expression lists are equivalent, they will always
+evaluate to the same value.
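+
+Put together, the validation of the scheduling pass could be structured as sketched below: the
+untrusted scheduler proposes an RTLPar hyperblock, both sides are abstracted into Abstr, and
+compilation only succeeds if the checker shows the two abstract states equivalent. All names here
+are illustrative.
+
+\starttyping
+type rtlblock_bb   (* hyperblock in RTLBlock *)
+type rtlpar_bb     (* hyperblock in RTLPar *)
+type forest        (* Abstr representation of a hyperblock *)
+
+(* Translation validation: the scheduler itself is untrusted; only the
+   equivalence checker needs to be formally verified. *)
+let schedule_and_validate
+    (schedule    : rtlblock_bb -> rtlpar_bb)
+    (abstr_block : rtlblock_bb -> forest)
+    (abstr_par   : rtlpar_bb -> forest)
+    (check_equiv : forest -> forest -> bool)
+    (bb : rtlblock_bb) : (rtlpar_bb, string) result =
+  let bp = schedule bb in
+  if check_equiv (abstr_block bb) (abstr_par bp)
+  then Ok bp
+  else Error "scheduling could not be validated"
+\stoptyping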
\section[implementation-details-of-static-sdc-scheduling]{\SDC\ Scheduling Implementation Details}
@@ -452,4 +478,9 @@ This material is based upon work supported by the under Grant No.~ and Grant No.
findings, and conclusions or recommendations expressed in this material are those of the author and
do not necessarily reflect the views of the National Science Foundation.
+\startmode[section]
+ \section{Bibliography}
+ \placelistofpublications
+\stopmode
+
\stopcomponent