\startcomponent scheduling
\project main

\chapter{Static Scheduling}

\section[introduction]{Introduction}

The use of multicore processors has been increasing drastically, making
parallelising compilers ever more important. In addition, the need for
custom hardware accelerators is also growing, as traditional processors
are not always the best choice for every application. Compilers with
optimisations that can automatically parallelise their input programs
are therefore becoming more important, as both kinds of back end benefit
from them. For processors, one wants to optimise the pipeline usage as
much as possible, whereas for hardware one can create custom pipelines
and exploit the spatial properties that hardware offers, making
instruction-level parallelism as well as speculative execution
important. However, more powerful optimisations leave more room for
mistakes in the compiler, making it possible to generate programs that
behave differently to the input program. An important step towards
making compilers more reliable is to formally verify these
optimisations, so that it can be guaranteed that they do not introduce
bugs.

A popular optimisation for processors that increases the
instruction-level parallelism of the resulting program is scheduling.
This optimisation is especially important for processors that have deep
instruction pipelines or multiple functional units that can execute
instructions in parallel. However, it is also critical for high-level
synthesis tools, which generate application-specific hardware designs
from a software input. Simple scheduling algorithms often work on basic
blocks, which are sequential lists of instructions without any branches.
However, this limits what kind of reordering the scheduling algorithm
can perform: it can never move instructions past a branching
instruction, even when this might drastically improve the performance.
Trace scheduling~ is an alternative that addresses this issue by
creating paths that may cross branching instructions and scheduling the
instructions along each path. However, the main problem with trace
scheduling is that it is often infeasible on large programs, and because
of the large design space, it is sometimes hard to find an optimal
schedule. Superblock~ and hyperblock~ scheduling are two restrictions of
trace scheduling, which trade some of its flexibility for more tractable
algorithms. Superblock scheduling produces more efficient code for
processors with a low issue rate, whereas hyperblock scheduling produces
more performant code for processors with a high issue rate~.

CompCert~ is a compiler that is formally verified in Coq. In addition,
we have added a formally verified hardware back end to CompCert,
allowing it to also generate provably correct hardware designs. A
hardware back end is an ideal target for hyperblock scheduling, as the
issue rate is practically infinite. This paper describes the
implementation of a hyperblock instruction scheduler in CompCert so that
it can benefit existing CompCert back ends, as well as support more
general targets such as scheduling instructions for custom hardware.

\subsection[key-points]{Key Points}

The goal of this paper is to introduce hyperblock scheduling into
CompCert to allow for advanced and flexible scheduling operations that
could be beneficial to various different back ends. To this end, the key
points of this paper are the following:

\startitemize
\item
  Description of how hyperblocks are introduced for improved scheduling
  of instructions, and how these can be built and translated back to
  instructions without predicates.
\item
  Description of the implementation and correctness proof of a system of
  difference constraints (SDC) scheduling algorithm, which can be
  adapted to various different scheduling strategies. {\bf {[}{[}JW:}
  What's the link between hyperblocks and SDC-based scheduling? Are
  these orthogonal?{\bf {]}{]}}{\bf {[}{[}YH:} Well the original SDC
  papers only talk about basic blocks, and not SDC hyperblock
  scheduling. But it's easy enough to adapt. SDC just requires some kind
  of blocks without control-flow.{\bf {]}{]}} {\bf {[}{[}JW:} Ok cool,
  so you can have SDC scheduling without hyperblocks. Can you have
  hyperblocks without SDC scheduling? (I guess so, but just wanted to be
  completely sure.){\bf {]}{]}}{\bf {[}{[}YH:} Yes exactly, you can
  perform any type of basic block scheduling on hyperblocks as well I
  think, as long as you mark the dependencies correctly. But in the
  worst case you could even just see hyperblocks as basic blocks and
  ignore predicates. It would still be correct but just less
  efficient.{\bf {]}{]}}

  {\bf {[}{[}JW:} It would be interesting to make the case for SDC-based
  scheduling. Is it superior to other scheduling mechanisms? Is it so
  much superior that it's worth the additional complexity, compared to,
  say, list scheduling? {\bf {]}{]}}{\bf {[}{[}YH:} SDC scheduling is
  more flexible with the type of constraint to optimise for (latency /
  throughput). I don't think it's too meaningful for compilers, where
  there isn't as much flexibility.{\bf {]}{]}} {\bf {[}{[}JW:} Thanks.
  Does Vericert have any idea that the scheduling is being performed
  using SDC? Or is the SDC stuff entirely internal to your unverified
  OCaml code? That is, could you swap your scheduler with a different
  one that isn't based on SDC, and not have to change anything in
  Vericert?{\bf {]}{]}}{\bf {[}{[}YH:} The verified part doesn't know
  about SDC at all, so it could be changed to any other scheduler that
  acts on hyperblocks.{\bf {]}{]}}
\stopitemize

\section[sec:scheduling]{Implementation of Hyperblocks in CompCert}

This section describes the structure of hyperblocks in
Section~\goto{2.1}[scheduling:hyperblocks]. Then,
Section~\goto{2.2}[sec:rtlblockdef] describes the two extra intermediate
languages that were added to CompCert to implement hyperblocks.

\subsection[scheduling:hyperblocks]{Hyperblocks as a Generalisation of Basic Blocks}

Basic blocks are popular in intermediate languages because they describe
a list of sequential instructions that are not separated by
control-flow. This information can be used by various analysis passes to
perform optimisations, as these sequences of instructions can be
reordered in any way as long as their data dependencies are met. One
such optimisation is scheduling, which reorders instructions to
eliminate any delay slots caused by the CPU pipeline.

However, due to control-flow not being allowed in basic blocks, the
number of instructions that can be scheduled at any one time is limited.
The boundaries to other basic blocks act as a hard boundary for the
scheduling algorithm. Hyperblocks~ allow instructions to be predicated,
thereby introducing some control-flow information into the blocks, which
can also be taken into account by the scheduler. For example, any data
dependency between instructions that have mutually exclusive predicates
can be removed, as these instructions will never be active at the same
time.
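To make this condition concrete, the following Python sketch (not part of the verified development; names and encodings are illustrative) checks mutual exclusivity of two predicates by enumeration, and shows that two writes to the same register guarded by $p$ and $\neg p$ carry no write-after-write dependency:

```python
from itertools import product

# Predicates as functions from an assignment (a tuple of booleans, one
# per predicate variable) to a boolean. Illustrative only.
def mutually_exclusive(p1, p2, nvars):
    """True if p1 and p2 never hold under the same assignment."""
    return all(not (p1(env) and p2(env))
               for env in product([False, True], repeat=nvars))

# Two writes to the same register, one guarded by p (env[0]) and one
# by !p: the guards never overlap, so the write-after-write dependency
# between the two instructions can be dropped by the scheduler.
guard_a = lambda env: env[0]         # predicate  p
guard_b = lambda env: not env[0]     # predicate !p
print(mutually_exclusive(guard_a, guard_b, 1))  # True
```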

\subsection[sec:rtlblockdef]{RTLBlock and RTLPar Intermediate Language Definition}

Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the
intermediate languages introduced to implement hyperblocks in CompCert.
This consists of new RTLBlock and RTLPar intermediate languages, which
implement the sequential and parallel semantics of basic blocks
respectively. The semantics of RTLBlock and RTLPar are based on the
CompCert RTL intermediate language. However, instead of mapping from
states to instructions, RTLBlock maps from states to hyperblocks, and
RTLPar maps from states to parallel hyperblocks, which will be described
in the next sections.

%\startformula \begin{aligned}
%    i\ ::=&\ \ \mono{RBnop} \\
%          &|\  \mono{RBop}\ p?\ \mathit{op}\ \vec{r}\ d \\
%          &|\  \mono{RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \\
%          &|\  \mono{RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \\
%          &|\  \mono{RBsetpred}\ p?\ c\ \vec{r}\ d \\
%  \end{aligned} \stopformula
%
%\startformula \begin{aligned}
%    i_{\mathit{cf}}\ ::=&\ \ \mono{RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \\
%                        &|\  \mono{RBtailcall}\ \mathit{sig}\ f\ \vec{r} \\
%                        &|\  \mono{RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \\
%                        &|\  \mono{RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \\
%                        &|\  \mono{RBjumptable}\ r\ \vec{n} \\
%                        &|\  \mono{RBreturn}\ r? \\
%                        &|\  \mono{RBgoto}\ n \\
%                        &|\  \mono{RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \\
%  \end{aligned} \stopformula

RTLBlock instructions are split into two types of instructions, standard
instructions and control-flow instructions. The standard instructions
are the instructions that can be placed into hyperblocks, whereas
control-flow instructions are the instructions that can end hyperblocks.
The standard instructions are shown in
Figure~\goto{{[}fig:standard_instruction{]}}[fig:standard_instruction],
whereas the control-flow instructions are shown in
Figure~\goto{{[}fig:control_flow_instruction{]}}[fig:control_flow_instruction].
Most instructions are quite similar to their RTL counterparts; however,
some instructions have been added. \type{RBsetpred} is a standard
instruction which sets a predicate equal to an evaluated condition. It
is used during if-conversion to set the predicate to the value of the
condition of the conditional statement being converted. To support
predicated instructions properly, each instruction also carries an
optional predicate, set beforehand, which is what forms the hyperblock.
In addition to that, there is an extra control-flow instruction called
\type{RBpred_cf}, which can branch on a predicate and nests further
control-flow instructions inside of it. This is also necessary for
if-conversion when converting an already converted conditional
statement, as depending on the predicate, a different control-flow
instruction may be necessary. {\bf {[}{[}JW:} I
think there'd be a lot of value in having a picture showing op-chaining
in action. See tex source for the kind of thing I mean. Probably best
done as a running example throughout the paper. {\bf {]}{]}}

These instructions are used in RTLBlock as well as in RTLPar. The main
difference between the two languages is how the instructions are
arranged within the hyperblock and the execution semantics of the
hyperblock in each language. Let $\mathit{hb}_{b}$ be the definition of
a hyperblock for RTLBlock, and let $\mathit{hb}_{p}$ be the definition
of a hyperblock for RTLPar; then $\mathit{hb}_{b}$ and $\mathit{hb}_{p}$
are defined as in Figure~\goto{{[}fig:hb_def{]}}[fig:hb_def]. The figure
shows the different nesting levels of the blocks in RTLBlock as well as
RTLPar, and where the parallel semantics of RTLPar come into play. An
RTLBlock hyperblock is a list of instructions followed by a control-flow
instruction that ends the hyperblock, and each instruction is executed
sequentially. An RTLPar hyperblock consists of three nested lists. The
outer list behaves like the list in RTLBlock and executes each list
inside of it sequentially. RTLPar then has a list that executes its
contents in parallel, meaning each instruction uses the same state of
the registers and memory while executing. Finally, instead of directly
containing instructions, there is an additional layer of sequentially
executed lists of instructions, so that small groups of instructions can
be executed in sequence while still running in parallel with the other
instructions in the current parallel group.
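The nesting described above can be sketched with plain lists, instructions abbreviated to strings (an illustration, not the Coq definition):

```python
# RTLBlock hyperblock: a flat, sequentially executed instruction list.
rtlblock_hb = ["i1", "i2", "i3", "i4"]

# RTLPar hyperblock: the outer list runs sequentially; each middle list
# is one parallel group (its members all read the same initial state);
# each inner list is a short chain executed in sequence within the group.
rtlpar_hb = [
    [["i1", "i2"], ["i3"]],  # group 1: chain i1;i2 in parallel with i3
    [["i4"]],                # group 2
]

# Flattening RTLPar in order recovers a sequential instruction list.
flat = [i for group in rtlpar_hb for chain in group for i in chain]
print(flat)  # ['i1', 'i2', 'i3', 'i4']
```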

RTLPar is structured in this way to give the scheduler more flexibility:
it can not only parallelise operations in the blocks, but also nest
sequential operations inside these parallel operations. This is
especially useful for the hardware back end, as it means that sequential
operations that do not quite fill one clock cycle can be executed
sequentially, yet still in parallel with other independent operations.
This optimisation is called operation chaining, and it is critical to
get the most performance out of the hardware.

\subsection[translating-rtlblock-to-rtlpar]{Translating RTLBlock to RTLPar}

The translation from RTLBlock to RTLPar is performed by a scheduling
algorithm, which takes hyperblocks and rearranges their instructions to
maximise parallelism based on the data dependencies. It
then arranges the instructions in RTLPar by putting independent
instructions into the same parallel block. Scheduling is an optimisation
that is well suited to translation validation, as the algorithm itself
can be quite complex, whereas the output is normally relatively easy to
check for equivalence with the input.
Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows that
the scheduling step comes right after the if-conversion step which
originally creates the hyperblocks.

Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain] shows how the scheduling
step transforms RTLBlock into RTLPar. The RTLBlock hyperblock being
scheduled is shown in
Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a], which contains five
predicated operations, comprising two additions and three
multiplications. The data dependencies of the instructions in the
hyperblock are shown in
Figure~\goto{{[}fig:op_chain_b{]}}[fig:op_chain_b]. Curiously, even
though operations and would normally depend on each other due to a
write-after-write conflict, because the predicates are independent the
write-after-write conflict can effectively be removed by the scheduler.

The scheduler transforms the RTLBlock hyperblock into the RTLPar
hyperblock shown in Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c].
Even though the addition in is dependent on , they can still be placed
in the same cycle, as additions do not take as long to complete as
multiplications. This optimisation is called operation chaining. The
multiplication in can also be placed in the same clock cycle, as it does
not have any data dependencies. In the next clock cycle, either the
multiplication in or the multiplication in can take place, meaning these
two multiplications can also be placed in the same clock cycle. This
gives the final schedule shown in
Figure~\goto{{[}fig:op_chain_d{]}}[fig:op_chain_d].

\section[abstr_interp]{Abstract Interpretation of Hyperblocks}

The correctness of the scheduling algorithm is checked by comparing the
symbolic expressions of each register before and after scheduling. This
section describes how these symbolic expressions are constructed for
each hyperblock.

\subsection[abstr_language]{Construction of Abstract Language}

The main difficulty when constructing symbolic expressions for
hyperblocks is the presence of predicates, which can be used by the
scheduling algorithm to reorder instructions further, even if these
contain data dependencies. The scheduling algorithm will manipulate
predicates in the following ways, which the comparison function needs to
take into account.

\startitemize
\item
  Instructions with predicates that are equivalent to false can be
  removed.
\item
  Instructions with predicates that are mutually exclusive can be
  executed in parallel.
\item
  Any predicate can be replaced with an equivalent predicate.
\stopitemize

The symbolic expressions have the following form:

There are three main resources that are present in symbolic expressions:
registers $r$, which represent the value of the registers in the
program, predicates $p$, which represent the value of the predicates
throughout the program, and finally, memory \type{Mem}, representing the
state of the memory. For each basic instruction in the hyperblock, there
is an equivalent symbolic expression which modifies the state of either
the registers, the predicates or the memory. Finally, as each
instruction in the hyperblock can be predicated, each symbolic
expression is also paired with a predicate.

When the same resource is assigned multiple times by instructions with
different predicates, the symbolic expression must express this
conditional result in terms of the predicates. This is done by assigning
a list of predicated symbolic expressions to each resource, where each
predicate must be mutually exclusive with the other predicates in the
list.

The expressions are then constructed using a function which updates the
symbolic expression assigned to each resource. This is done using
multiple primitives which act on predicated types, which are lists of
pairs of a predicate and an element of that type. The first important
primitive is the multiplication of two predicated types, implemented as
a Cartesian product of the two lists, anding the predicates of each pair
and joining the two elements using a function.

%\startformula \label{eq:1}
%  P_{1} \otimes_{f} P_{2} \equiv \mono{map } (\lambda ((p_{1}, e_{1}), (p_{2}, e_{2})) . (p_{1} \land p_{2}, f\ e_{1}\ e_{2})) P_{1} \times P_{2} \stopformula
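As an illustration, this multiplication primitive can be sketched in Python (predicates simplified to sets of literal names read as conjunctions; this is not the Coq implementation):

```python
from itertools import product

# A predicated type is a list of (predicate, element) pairs. Predicates
# are simplified here to frozensets of literal names read as a
# conjunction, so "anding" two predicates is just set union.
def pmul(f, P1, P2):
    """Cartesian multiplication of two predicated types: join the
    elements with f and conjoin the guarding predicates."""
    return [(p1 | p2, f(e1, e2))
            for (p1, e1), (p2, e2) in product(P1, P2)]

# Two predicated values joined with addition.
P1 = [(frozenset({"p1"}), 1), (frozenset({"!p1"}), 2)]
P2 = [(frozenset({"p2"}), 10)]
result = pmul(lambda a, b: a + b, P1, P2)
# Each pair of entries contributes one entry with the anded predicate.
assert len(result) == 2
```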

In addition to that, another primitive that is needed is the following
append operation, which conjoins the negation of a predicate onto the
first predicated type, conjoins the predicate itself onto the second
predicated type, and then appends the two:

%\startformula \label{eq:2}
%  \mu(p, P) \equiv \mono{map } (\lambda (p', e') . (p \land p', e'))\ P \stopformula

%\startformula \label{eq:3}
%  P_{1} \oplus_{p} P_{2} \equiv \mu(\neg p, P_{1}) \mathbin{++} \mu(p, P_{2}) \stopformula
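This append operation can be sketched similarly (predicates encoded as nested tuples; illustrative only):

```python
# Predicates as nested tuples: a name, ("not", p), or ("and", p, q).
def mu(p, P):
    """Conjoin predicate p onto every entry of predicated type P."""
    return [(("and", p, q), e) for (q, e) in P]

def papp(p, P1, P2):
    """P1 appended to P2 under p: entries of P1 are kept when p is
    false, entries of P2 when p is true."""
    return mu(("not", p), P1) + mu(p, P2)

P1 = [("q1", "old")]
P2 = [("q2", "new")]
print(papp("p", P1, P2))
# [(('and', ('not', 'p'), 'q1'), 'old'), (('and', 'p', 'q2'), 'new')]
```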

\subsection[example-of-translation]{Example of Translation}

%\startformula \begin{aligned}
%    \label{eq:4}
%    &\mono{r1} =
%      \begin{cases}
%        \mono{r1}^{0} + \mono{r4}^{0} + \mono{r4}^{0}, \quad &\mono{p1} \\
%        \mono{r1}^{0}, \quad &\mono{!p1} \\
%      \end{cases}\\
%    &\mono{r2} = \mono{r1}^{0} + \mono{r4}^{0}\\
%    &\mono{r3} =
%      \begin{cases}
%        \left( \mono{r1}^{0} \times \mono{r1}^{0} \right) \times \left( \mono{r1}^{0} \times \mono{r1}^{0} \right),\quad &\mono{!p2 \&\& !p1}\\
%        \mono{r1}^{0} \times \mono{r4}^{0},\quad &\mono{p2 \&\& !p1}\\
%        \left( \mono{r1}^{0} + \mono{r4}^{0} + \mono{r4}^{0} \right) \times \mono{r4}^{0},\quad &\mono{!p2 \&\& p1}\\
%        \mono{r3}^{0} \times \mono{r3}^{0},\quad &\mono{p2 \&\& p1}
%      \end{cases}
%  \end{aligned} \stopformula

Using the example shown in
Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain], the RTLBlock hyperblock
shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a] is scheduled
into the RTLPar hyperblock shown in
Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. The first step is to
translate the input and output of the scheduler into the Abstr
intermediate language by calculating the symbolic values that will be
stored in each of the registers. Symbolically evaluating the RTLBlock
hyperblock results in the expressions shown in
Figure~\goto{{[}fig:symbolic_expressions{]}}[fig:symbolic_expressions]
for registers \type{r1}, \type{r2} and \type{r3}, where the state of the
register \type{r1} at the start of the hyperblock is denoted as
$\mono{r1}^{0}$.

These symbolic expressions can be generated by sequentially going
through the list of predicated expressions and applying the update
function defined in Section~\goto{3.1}[abstr_language]. The resulting
predicated expressions are guaranteed to have mutually exclusive
predicates, meaning that if one predicate evaluates to true, all the
others must evaluate to false.

\subsection[linear-flat-predicated-expressions]{Linear (Flat) Predicated Expressions}

The update function for predicated expressions is quite a bit more
complex than standard update functions for symbolic execution, such as
the one used by . The main reason for this is the non-recursive nature
of the predicated expressions. As the predicates are only available at
the top level of the expressions, combining multiple expressions means
that the conditions and multiple possibilities also need to trickle up
to the top.

Instead, if the predicated instructions were recursive, one would not
have to change any predicates when evaluating the argument lists of
instructions, for example, as these can just stay predicated. The
evaluation semantics would also be much simpler, because a rule can be
added for evaluating predicates.

\section[sec:correctness]{Correctness of the Abstract Interpretation}

The correctness of the scheduling step is dependent on the correctness
of the equivalence check between the two abstract languages. As the
abstract languages contain predicates, this comparison check cannot just
be a syntactic comparison between the two languages, because the
predicates might be different but logically equivalent. Instead, a mixed
equivalence check is performed using a verified SAT solver and a
syntactic equivalence check of the actual symbolic expressions.

\subsection[comparing-symbolic-abstract-expressions]{Comparing Symbolic Abstract Expressions}

Abstract expressions comprise a predicate and a symbolic expression. The
translation from RTLBlock and RTLPar to Abstr ensures that each register
will contain a predicated symbolic expression that describes the value
that will be assigned to it after evaluating the basic block. This means
that if an expression always produces the same result as another
expression, then the register will always contain the same value. If the
RTLBlock program and the RTLPar program translate to two Abstr programs
in which all the registers' symbolic expressions match, then both
programs will always produce the same values.

The main comparison between two symbolic expressions can be done using a
simple syntactic comparison, because this ensures that the two
expressions always evaluate to the same value, and such a comparison is
also general enough for the expressions that need to be compared.
However, as all expressions may be predicated, this simple syntactic
comparison cannot be performed for the predicates. This is because
predicated instructions can be executed in many different orders, and
the predicates are used to calculate dependence information and
therefore also determine the order of operations.

Instead, predicates can be compared by checking their logical
equivalence using a formally verified SAT solver. This ensures that the
predicates will always evaluate to the same value, making the comparison
sound.
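As an illustration, the logical equivalence of two predicates can be checked by exhaustive enumeration; the sketch below is a brute-force stand-in for the verified SAT solver, with an ad-hoc tuple encoding of predicates:

```python
from itertools import product

def ev(p, env):
    """Evaluate a predicate (a variable name or a ('not'/'and'/'or', ...)
    tuple) under an assignment of booleans to variable names."""
    if isinstance(p, str):
        return env[p]
    op, *args = p
    if op == "not":
        return not ev(args[0], env)
    if op == "and":
        return ev(args[0], env) and ev(args[1], env)
    return ev(args[0], env) or ev(args[1], env)   # "or"

def equivalent(p1, p2, names):
    """Brute-force logical equivalence: identical truth value under
    every assignment. A stand-in for the verified SAT-based check."""
    return all(ev(p1, env) == ev(p2, env)
               for values in product([False, True], repeat=len(names))
               for env in [dict(zip(names, values))])

# p && q vs !(!p || !q): logically equal (De Morgan) but syntactically
# different -- exactly what a purely syntactic comparison would miss.
print(equivalent(("and", "p", "q"),
                 ("not", ("or", ("not", "p"), ("not", "q"))),
                 ["p", "q"]))  # True
```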

\subsection[verified_sat]{Verified Decidable DPLL SAT Solver}

A SAT solver is needed to correctly and reliably compare predicated
expressions. This makes it possible to compare many different possible
schedules, even ones that remove unnecessary expressions completely. The
SAT solver is used inside the verifier, which must itself be proven
correct in Coq to prove the scheduling translation correct; the SAT
solver therefore has to be verified as well, as it could otherwise not
be used in the proof of correctness.

The DPLL algorithm is used to implement the SAT solver. This means that
the SAT solver takes in a conjunctive normal form (CNF) formula. The
conversion from a recursive predicate type into a CNF formula is done by
recursively expanding the predicate into equivalent CNF formulas.
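A sketch of such a recursive expansion, assuming negations have already been pushed down to the literals (the real conversion must also handle negation of compound predicates):

```python
def to_cnf(p):
    """Recursively expand a predicate into CNF: a list of clauses,
    each clause a list of (name, polarity) literals."""
    op = p[0]
    if op == "lit":                     # ("lit", name, polarity)
        return [[(p[1], p[2])]]
    if op == "and":                     # CNF(a) ++ CNF(b)
        return to_cnf(p[1]) + to_cnf(p[2])
    if op == "or":                      # distribute or over clauses
        return [c1 + c2
                for c1 in to_cnf(p[1])
                for c2 in to_cnf(p[2])]
    raise ValueError(op)

# (p && q) || r  expands to  (p || r) && (q || r)
print(to_cnf(("or", ("and", ("lit", "p", True), ("lit", "q", True)),
              ("lit", "r", True))))
# [[('p', True), ('r', True)], [('q', True), ('r', True)]]
```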

The SAT solver is implemented in Coq and proven to be sound and
complete. The soundness and completeness of an algorithm to check
whether a predicate is satisfiable or not also directly implies that the
check is decidable, simplifying many proofs in Coq by using a
\type{Setoid} instance for equivalence between predicates. This makes it
much easier to rewrite equivalent predicates in proofs, as many of the
existing Coq tactics can be reused.

\subsection[encoding-predicated-expressions-as-sat-formulas]{Encoding Predicated Expressions as SAT Formulas}

Given the following two predicated expressions which should be tested
for equivalence, one can construct a SAT expression which will ensure
that both expressions will always result in the same value.

%\startformula \begin{aligned}
%  \bgroup \left.\begin{aligned}
%    e_{1},\qquad &p_{1}\\
%    e_{2},\qquad &p_{2}\\
%    \vdots\quad& \\
%    e_{n},\qquad &p_{n}\\
%  \end{aligned}\right\rbrace\egroup
%  \Longleftrightarrow
%  \begin{cases}
%    e_{1}',\quad &p_{1}'\\
%    e_{2}',\quad &p_{2}'\\
%    \;\;\;\quad\vdots \notag \\
%    e_{m}',\quad &p_{m}'\\
%  \end{cases}\end{aligned} \stopformula

The SAT query that should determine if two predicated expressions are
equivalent is the following, where $\mathit{he}_{i}$ is the hashed
expression of $e_{i}$:

%\startformula \begin{gathered}
%  (p_{1} \implies \mathit{he}_{1}) \land (p_{2} \implies \mathit{he}_{2}) \land \cdots \land (p_{n} \implies \mathit{he}_{n}) \\
%  \Longleftrightarrow\\
%  (p_{1}' \implies \mathit{he}_{1}') \land (p_{2}' \implies \mathit{he}_{2}') \land \cdots \land (p_{m}' \implies \mathit{he}_{m}') \\\end{gathered} \stopformula

However, reasoning about the equivalence of individual expressions
within the greater list of predicated expressions is not quite as
straightforward. The evaluation semantics of a predicated expression
list evaluates each predicated expression sequentially and picks the
expression whose predicate evaluates to true. Proving that both
predicated expression lists return the same value, assuming the SAT
query above holds, means that for each predicate that is true, one needs
to show that there is a predicate in the other list that is also true
and returns the same value. However, there is no guarantee that there is
a unique predicated expression matching the current predicate, as there
may be duplicate expressions in the list with different predicates.

This means that it is quite difficult to prove that the SAT query
implies the semantic preservation of the scheduling transformation. The
SAT query is therefore broken down into a slightly more direct
comparison that is more useful for the proof itself. The finer-grained
comparison proceeds as follows:

\startitemize[n]
\item
  Hash all of the expressions into unique literals.
\item
  Create a map from expression literals to their corresponding
  predicates, thereby obtaining a unique predicate for each expression.
  If there are duplicates in the list that forms the map, the predicates
  are combined using the or operation.
\item
  Iterate through the expressions in both maps and compare the
  predicates using the SAT solver. If an expression is not present in
  one of the maps, then its predicate should be equivalent to $\perp$.
\stopitemize
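These steps can be sketched as follows (expressions stand in for their own hashes, and a brute-force enumeration stands in for the verified SAT solver; the encoding is illustrative):

```python
from itertools import product

def ev(p, env):
    """Evaluate a tuple-encoded predicate under an assignment."""
    if isinstance(p, str):
        return env[p]
    op, *a = p
    if op == "not":
        return not ev(a[0], env)
    if op == "and":
        return ev(a[0], env) and ev(a[1], env)
    return ev(a[0], env) or ev(a[1], env)

def equivalent(p1, p2, names):
    """Brute-force stand-in for the verified SAT-based equivalence."""
    return all(ev(p1, dict(zip(names, vs))) == ev(p2, dict(zip(names, vs)))
               for vs in product([False, True], repeat=len(names)))

def compare(lst1, lst2, names):
    """Map each expression to a single predicate (or-ing duplicates),
    then compare predicates pointwise; a missing expression must have
    a predicate equivalent to False."""
    def to_map(lst):
        m = {}
        for p, e in lst:
            m[e] = ("or", m[e], p) if e in m else p
        return m
    m1, m2 = to_map(lst1), to_map(lst2)
    bot = ("and", names[0], ("not", names[0]))   # always-false predicate
    return all(equivalent(m1.get(e, bot), m2.get(e, bot), names)
               for e in set(m1) | set(m2))

# One side contains "x+y" twice, under p and !p; the other contains it
# once, unconditionally: the or-combined predicates are equivalent.
lst1 = [("p", "x+y"), (("not", "p"), "x+y")]
lst2 = [(("or", "p", ("not", "p")), "x+y")]
print(compare(lst1, lst2, ["p"]))  # True
```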

This comparison is logically equivalent to performing the large SAT
query; however, it maps better to the proof because it compares the
predicates of the same expressions with each other.

\subsection[proof-of-equivalence-between-passes]{Proof of Equivalence Between Passes}

Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the
four different passes that were added to CompCert to form hyperblocks.
This section will cover the proofs for each of the translations,
focusing mainly on the proof of translation of the scheduling pass.

\subsubsection[proof-of-correctness-of-the-if-conversion-pass]{Proof of Correctness of the
  If-Conversion Pass}

The if-conversion pass is quite simple to verify, as it is an
optimisation that transforms the code within the same language.

\subsubsection[proof-of-correctness-of-scheduling]{Proof of Correctness of Scheduling}

The scheduling pass is formally verified using translation validation,
using the Abstr language as a point of comparison between RTLBlock and
RTLPar. Using the algorithm defined in Section~\goto{3}[abstr_interp],
the formally verified SAT solver described in
Section~\goto{4.2}[verified_sat] can be used to prove that if the two
predicated expression lists are equivalent, then they will always
evaluate to the same value.

\section[implementation-details-of-static-sdc-scheduling]{Implementation Details of Static SDC
  Scheduling}

This section describes the implementation details of the static SDC
scheduling algorithm used, and also describes some of the details of the
basic block generation as well as the if-conversion pass which generates
the predicated instructions.

\subsection[implementation-of-if-conversion]{Implementation of If-Conversion}

If-conversion is the transformation that introduces the predicated
instructions that hyperblocks are built from. It converts conditional
statements into predicated instructions. This transformation has a few
limitations on the kind of conditional statements that it can translate.
First, only conditional statements without loops can be translated;
therefore, cycles in the control-flow must be identified before
performing the if-conversion. Secondly, if-conversion will not always
result in more efficient code, so it should not be applied to every
conditional statement. Instead, it is best applied to conditional
statements where each branch takes a similar amount of time to execute.
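On a toy IR, the core of the transformation can be sketched as follows (the instruction and predicate encodings are illustrative, not CompCert's):

```python
# Toy if-conversion: a conditional with two straight-line branches
# becomes a single hyperblock of predicated instructions.
def if_convert(cond, then_instrs, else_instrs, pred="p1"):
    block = [("RBsetpred", pred, cond)]                  # p1 := cond
    block += [(pred, i) for i in then_instrs]            # guarded by  p1
    block += [(("not", pred), i) for i in else_instrs]   # guarded by !p1
    return block

hb = if_convert("r1 < r2", ["r3 := r1 + 1"], ["r3 := r2 + 1"])
for entry in hb:
    print(entry)
```

The branch bodies end up in one straight-line block, which is what lets the scheduler consider both sides of the conditional at once.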

For back ends that do not support predicates, it is also important to
eliminate the predicates again using a process called reverse
if-conversion, which creates branches from the predicates and groups the
instructions back together. This allows hyperblock scheduling to also be
used with back ends that do not support predicates, which is the case
for most of the default back ends included in CompCert.

\subsection[static-sdc-scheduling]{Static SDC Scheduling}

System of difference constraints (SDC) scheduling~ is a flexible
algorithm for scheduling instructions, especially when the target is
hardware directly. It allows for flexible definitions of optimisation
functions in terms of the latencies and dependencies of the operations
in the basic block.
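As an illustration of the underlying idea, each dependency constraint $s_{v} - s_{u} \geq \mathit{lat}$ between start times is a difference constraint, and a system of such constraints can be solved with single-source shortest paths (Bellman-Ford). The sketch below (illustrative, not the implementation used here) omits the optimisation objective and any negative-cycle (infeasibility) check:

```python
# Each dependency "v starts at least lat cycles after u" is the
# difference constraint s_u - s_v <= -lat, i.e. an edge v -> u with
# weight -lat; Bellman-Ford distances from a virtual source (distance
# 0 to every op) then satisfy all constraints.
def sdc_schedule(ops, deps):
    dist = {op: 0 for op in ops}
    for _ in range(len(ops)):            # |V| relaxation rounds suffice
        for u, v, lat in deps:
            if dist[v] - lat < dist[u]:
                dist[u] = dist[v] - lat
    base = min(dist.values())            # shift so cycles start at 0
    return {op: d - base for op, d in dist.items()}

# mul2 and add2 each need mul1's result one cycle after mul1 starts.
sched = sdc_schedule(["mul1", "mul2", "add2"],
                     [("mul1", "mul2", 1), ("mul1", "add2", 1)])
print(sched)  # {'mul1': 0, 'mul2': 1, 'add2': 1}
```

A real SDC scheduler would additionally minimise an objective (such as total latency or resource usage) with an LP solver over these constraints; the shortest-path solution above only demonstrates feasibility.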

\section[performance-comparison]{Performance Comparison}

\section[sec:related-work]{Related Work}

\section[conclusion]{Conclusion}

This material is based upon work supported by the under Grant No.~ and
Grant No.~. Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the author and do not
necessarily reflect the views of the National Science Foundation.

\stopcomponent