\environment fonts_env
\environment lsr_env

\startcomponent scheduling

\chapter[sec:scheduling]{Static Scheduling}

\section[introduction]{Introduction}

\oindex{static scheduling}The use of multicore processors has been increasing drastically, making
parallelising compilers ever more important. In addition, the need for custom hardware accelerators
is also increasing, as traditional processors are not always the best choice for all applications.
Compilers that support optimisations which can automatically parallelise their input programs are
therefore also becoming more important, as all of these back ends benefit from them. For processors,
one wants to optimise the pipeline usage as much as possible, whereas in hardware one can create
custom pipelines, as well as make more use of the spatial properties that hardware offers, making
instruction parallelism as well as speculative execution important. However, with more powerful
optimisations there is a greater chance of making mistakes in the compiler, making it possible to
generate programs that behave differently from the input program. An important step towards making
compilers more reliable is to formally verify these optimisations so that it can be guaranteed that
bugs are not introduced.

A popular optimisation for processors that increases the instruction parallelism of the resulting
program is scheduling. This optimisation is especially important for processors that have deep
instruction pipelines or multiple functional units that can execute instructions in parallel.
However, it is also critical for high-level synthesis tools, which generate application-specific
hardware designs from a software input. Simple scheduling algorithms often work on basic blocks,
which are sequential lists of instructions without any branches. However, this limits what kind of
reordering the scheduling algorithm can perform: it can never move instructions past a branching
instruction, even when this might drastically improve performance. Trace
scheduling~\cite[fisher81_trace_sched] is an alternative that addresses this issue by creating paths
that may cross branching instructions, and scheduling the instructions in each path. However, the
main problem with trace scheduling is that it is often infeasible on large programs, and because of
the large design space, it is sometimes hard to find an optimal
schedule. \index{superblock}Superblock~\cite[hwu93_super] and
\index{hyperblock}hyperblock~\cite[mahlke92_effec_compil_suppor_predic_execut_using_hyper]
scheduling are two subsets of trace scheduling, which compromise on the flexibility of trace
scheduling to instead build more tractable algorithms.  Superblock scheduling produces more
efficient code for processors with a low issue rate, whereas hyperblock scheduling produces more
performant code for processors with a high issue
rate~\cite[mahlke92_effec_compil_suppor_predic_execut_using_hyper,aiken16_trace_sched].

A hardware back end is an ideal target for hyperblock scheduling, as the issue rate is practically
infinite. This chapter describes the implementation of a hyperblock instruction scheduler in
CompCert, so that it can benefit existing CompCert back ends as well as support more general
targets, such as scheduling instructions for custom hardware.

\subsection[key-points]{Key Points}

The goal of this chapter is to introduce hyperblock scheduling into CompCert to allow for advanced
and flexible scheduling operations that could be beneficial to various different back ends. To this
end, the key points of this chapter are the following:

\startitemize
\item Description of how hyperblocks are introduced for improved scheduling of instructions, and how
  these can be built and translated to hardware.
\item Description of the implementation and correctness proof of a \SDC\ scheduling algorithm.  The
  correctness proof does not rely on the algorithm itself and so can safely be replaced by a
  different scheduling algorithm with similar properties.
\stopitemize

\section{Progress Summary}

\startplacemarginfigure[reference={fig:compcert_interm},title={New intermediate languages introduced into
    CompCert.  Progress bars are shown for each transformation pass, where green represents the
    estimated part of the work that is completed, and red stands for the remaining work to be done.}]
  \hbox{\starttikzpicture[shorten >=1pt,>=stealth]
    \node       at (-3, 0) (dotsin) {$\cdots$};
    \node       at (10, 0) (dotsout) {$\cdots$};
    \node[draw] at (-1, 0) (rtl) {\rtl};
    \node[draw] at (2,  0) (rtlblock) {\rtlblock};
    \node[draw] at (2, -2) (abstrblock) {\abstr};
    \node[draw] at (6,  0) (rtlpar) {\rtlpar};
    \node[draw] at (6, -2) (abstrpar) {\abstr};
    \node[draw] at (8.5,  0) (htl) {\htl};

    \draw[->] (rtl) --
         node[above,midway,align=center,font=\small] {basic block \\[-0.3em] creation}
         node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=VenetianRed,minimum width=1.5cm] {}
         node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=0.6cm] {}
         (rtlblock);
    \draw[->] (rtlblock) --
         node[above,midway,font=\small] {scheduling}
         node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=1.5cm] {}
         (rtlpar);
    \draw[->] (rtlpar) --
         node[below,yshift=-3mm,xshift=-5mm,anchor=west,fill=VenetianRed,minimum width=1cm] {}
         node[below,yshift=-3mm,xshift=-5mm,anchor=west,fill=ForestGreen,minimum width=2.5mm] {}
         (htl);
    \draw[->,dashed] (rtlblock) --
         node[left,yshift=-5mm,xshift=-3mm,anchor=south,fill=VenetianRed,minimum height=1cm] {}
         node[left,yshift=-5mm,xshift=-3mm,anchor=south,fill=ForestGreen,minimum height=8mm] {}
         (abstrblock);
    \draw[->,dashed] (rtlpar) --
         node[right,yshift=-5mm,xshift=3mm,anchor=south,fill=VenetianRed,minimum height=1cm] {}
         node[right,yshift=-5mm,xshift=3mm,anchor=south,fill=ForestGreen,minimum height=6mm] {}
         (abstrpar);
    \draw[dashed,shorten >=0pt] (abstrblock) --
         node[above,midway] {$\sim$}
         node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=VenetianRed,minimum width=1.5cm] {}
         node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=1.2cm] {}
         (abstrpar);
    \draw[->] (rtlblock) to [out=130,in=50,loop,looseness=10]
         node[above,midway,font=\small] {if-conversion}
         node[below,yshift=6mm,xshift=-0.75cm,anchor=west,fill=VenetianRed,minimum width=1.5cm] {}
         node[below,yshift=6mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=0.2cm] {}
         (rtlblock);
    \draw[->] (rtlpar) to [out=130,in=50,loop,looseness=10]
         node[above,midway,align=center,font=\small] { operation \\[-0.3em] pipelining}
         node[below,yshift=1cm,xshift=-0.75cm,anchor=west,fill=VenetianRed,minimum width=1.5cm] {}
         node[below,yshift=1cm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=0.2cm] {}
         (rtlpar);
    \draw[->] (dotsin) -- (rtl);
    \draw[->] (htl) -- (dotsout);
  \stoptikzpicture}
\stopplacemarginfigure

This chapter describes work that has not been fully completed yet.
\in{Figure}[fig:compcert_interm] shows the current progress on implementing the scheduling pass,
which relies heavily on the basic block generation, the if-conversion pass, and the \htl\
generation.  The operation pipelining pass is not critical to scheduling, but allows for the
support of pipelined operations such as dividers.

In general, the implementation of all the passes is now complete, so current work focuses on
completing the correctness proofs of all the passes.  Starting with the basic block creation pass,
which is further described in \in{Section}[sec:basic-block-generation], the proof of basic block
generation is quite complex.  However, the main idea of how to complete the proof is now
understood, so it should just be a matter of time until it is implemented.  Secondly, the
if-conversion pass has only been implemented and has not been verified yet.  However, it should be
straightforward to verify, as the transformation is quite direct.  Next, for the scheduling pass,
the verification, which translates the input and the output of the scheduling algorithm into the
\abstr\ language and then compares the two, has mainly been completed.  However, due to changes in
the \rtlblock\ and \rtlpar\ language semantics, some of these proofs will have to be redone, which
should not take too long.  Finally, for the generation of \htl\ from \rtlpar\ the proof has not
been started yet.  However, the difficult parts of the proof concern the translation of loads and
stores into proper reads and writes from RAM, which have already been completely proven for the
\rtl\ to \htl\ translation.

The operation pipelining pass has so far only been implemented; however, it will likely be removed
from the current workflow if it proves to take too long to complete.

\section[sec:basic-block-generation]{Basic Block Generation}

\index{basic block}The basic block generation proof is surprisingly difficult, as the semantics of
\rtl\ and \rtlblock\ do not match nicely.  This is because \rtl\ executes each of its instructions
sequentially in a \CFG.  \rtlblock, however, executes one basic block in one go, so it is not
possible to refer to the execution of only one instruction.  The traditional simulation diagram for
proofs in CompCert looks like the following:

\placefigure[here,none][]{}{\hbox{\starttikzpicture[shorten >=1pt,>=stealth]
  \node (s2) at (0,0) {$S_{2}$};
  \node (s1) at (0,2) {$S_{1}$};
  \node (s2p) at (5,0) {$S_{2}'$};
  \node (s1p) at (5,2) {$S_{1}'$};
  \draw (s1) -- node[above] {$\sim$} (s1p);
  \draw[dashed] (s2) -- node[above] {$\sim$} (s2p);
  \draw[->] (s1) -- (s2);
  \draw[->,dashed] (s1p) -- node[right] {$+$} (s2p);
\stoptikzpicture}}

This diagram essentially says that, given a step in the input language from state $S_{1}$ to
$S_{2}$, there should be one or more steps in the translated language from the matching state
$S_{1}'$ to a state $S_{2}'$ which matches $S_{2}$.  Most of the proofs in CompCert follow this
pattern, as the languages normally get refined over time and produce more instructions than
before.

However, when generating basic blocks one wants a slightly different simulation diagram:

\placefigure[here,none][]{}{\hbox{\starttikzpicture[shorten >=1pt,>=stealth]
  \node (s2) at (0,0) {$S_{2}$};
  \node (s1) at (0,2) {$S_{1}$};
  \node (s2p) at (5,0) {$S_{2}'$};
  \node (s1p) at (5,2) {$S_{1}'$};
  \draw (s1) -- node[above] {$\sim$} (s1p);
  \draw[dashed] (s2) -- node[above] {$\sim$} (s2p);
  \draw[->] (s1) -- node[left] {$+$} (s2);
  \draw[->,dashed] (s1p) -- (s2p);
\stoptikzpicture}}

This says that, assuming one has performed one or more steps in the input language from matching
states $S_{1}$ and $S_{1}'$, there should be a single step in the translated language from $S_{1}'$
to a state $S_{2}'$ which then matches $S_{2}$.  This matches the semantics of how basic blocks are
executed, as multiple instructions are grouped into one block.  However, in that form such a lemma
cannot even be expressed in CompCert's framework, as the proof is inductive over the semantics of
the input language (i.e.\ executing one instruction at a time).

The main trick that needed to be found, and which
\cite[authoryears][six22_formal_verif_super_sched] also used to generate superblocks, is to
separate the above simulation diagram into the following two diagrams, where each diagram now has
only one transition on the left-hand side.

\startplacefigure[location={here,none}]
  \startfloatcombination[nx=2,distance=2cm]
    \startplacefigure[location=none]
      \hbox{\starttikzpicture[shorten >=1pt,>=stealth]
        \node (s2) at (0,0) {$S_{2}$};
        \node (s1) at (0,2) {$S_{1}$};
        \node (s2p) at (3,0) {$S_{2}'$};
        \node (s1p) at (3,2) {$S_{1}'$};
        \draw (s1) -- node[above] {$\sim$} (s1p);
        \draw[dashed] (s2) -- node[above] {$\sim$} (s2p);
        \draw[->] (s1) -- (s2);
        \draw[->,dashed] (s1p) -- (s2p);
      \stoptikzpicture}
    \stopplacefigure
    \startplacefigure[location=none]
      \hbox{\starttikzpicture[shorten >=1pt,>=stealth]
        \node (s2) at (0,0) {$S_{2}$};
        \node (s1) at (0,2) {$S_{1}$};
        \node (s1p) at (3,2) {$S_{1}'$};
        \draw (s1) -- node[above] {$\sim$} (s1p);
        \draw[->] (s1) -- (s2);
        \draw[dashed] (s2) -- node[below,xshift=1mm] {$\sim$} (s1p);
      \stoptikzpicture}
    \stopplacefigure
  \stopfloatcombination
\stopplacefigure

The diagram on the left represents a control-flow transition, which is performed in lock-step.  This
means that a control-flow instruction has been encountered, which also implies that the basic block
has finished executing and that the next block needs to be retrieved.  The diagram on the right is
the interesting one, which allows for the multiple transitions in the input language.  It requires
writing a matching predicate $\sim$ which can match the start of the basic block in \rtlblock\ with
any state in the original \rtl\ language, as long as one has not finished executing the basic block
and has not encountered a control-flow instruction.  By also showing that some metric stored in the
$\sim$ relation is decreasing, one can show that the proof will not get stuck iterating forever,
effectively obtaining the behaviour of executing multiple instructions in the input code for one
big step in the output language.  The other trick is that at each of these steps, one also
incrementally builds a proof of the execution from the start of the basic block to the current
state, so that when a control-flow instruction is encountered, this proof can be used to show that
one can jump over all the input instructions in that block.
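
Spelled out, and leaving CompCert's event traces implicit, the two diagrams together amount to a
property of roughly the following shape, where $|\cdot|$ denotes the decreasing measure stored in
the $\sim$ relation (a paraphrase of the lemma's shape, not the exact Coq statement):

\startformula \startmathalignment[align={1:left}]
\NC S_{1} \rightarrow S_{2}\ \land\ S_{1} \sim S_{1}'\ \Rightarrow \NR
\NC \qquad \left(\exists S_{2}',\ S_{1}' \rightarrow S_{2}'\ \land\ S_{2} \sim S_{2}'\right)\ \lor\ \left(|S_{2}| < |S_{1}|\ \land\ S_{2} \sim S_{1}'\right) \NR
\stopmathalignment \stopformula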

\subsection[scheduling:hyperblocks]{Hyperblocks as a Generalisation of Basic Blocks}

Basic blocks are popular in intermediate languages because they describe a list of sequential
instructions that are not separated by control-flow. This information can be used by various
analysis passes to perform optimisations, as these sequences of instructions can be reordered in any
way as long as their data dependencies are met. One such optimisation is scheduling, which reorders
instructions to eliminate any delay slots caused by the CPU pipeline.

However, as control-flow is not allowed in basic blocks, the number of instructions that can be
scheduled at any one time is limited.  The boundaries to other basic blocks act as a hard boundary
for the scheduling
algorithm. Hyperblocks~\cite[mahlke92_effec_compil_suppor_predic_execut_using_hyper] allow
instructions to be predicated, thereby introducing some control-flow information into the blocks,
which can also be taken into account by the scheduler. For example, any data dependency between
instructions that have mutually exclusive predicates can be removed, as these instructions will
never be active at the same time.

\section[implementation-of-if-conversion]{Implementation of If-Conversion}

If-conversion is the transformation that introduces the predicated instructions which make
hyperblocks possible. It converts conditional statements into predicated instructions. This
transformation has a few limitations on the kind of conditional statements that it can translate.
First, only conditional statements without loops can be translated; therefore, one must identify
cycles in the control-flow before performing the if-conversion. Secondly, if-conversion will not
always result in more efficient code, so it should not be applied to every conditional
statement. Instead, it is best applied to conditional statements where each branch takes a similar
amount of time to execute.
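
As a hypothetical example, written in the style of the listings later in this chapter, a simple
conditional would be converted roughly as follows, where \type{p1} is a fresh predicate set from
the condition:

\starthlC
// before if-conversion        // after if-conversion
if (c) r1 = r2 + r3;           p1 = c;
else   r1 = r2 - r3;           if(p1)  r1 = r2 + r3;
                               if(!p1) r1 = r2 - r3;
\stophlC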

For back ends that do not support predicates, it is also important to eliminate the predicates again
using a process called reverse if-conversion, which creates branches out of the predicates and
groups these together again. This allows hyperblock scheduling to be used even with back ends that
do not support predicates, which is the case for most of the default back ends included in CompCert.

From an implementation point of view, the two steps are relatively simple to implement naïvely.
However, the if-conversion transformation in particular could be improved by adding better
heuristics for deciding which conditional statements to transform into predicated instructions.  For
hyperblocks, conditionals where each branch executes in approximately the same number of cycles are
ideal candidates for the transformation, as no time is lost executing predicated instructions whose
predicates are false.  However, if one additionally has an estimate of the probability of taking
each branch, one can introduce more interesting heuristics, such as also transforming conditionals
with disproportionate branches, where the longer branch is much more likely to be executed.

However, from a proof perspective the two steps are not the same.  In this case, the basic block
generation is especially tricky due to the mismatch between the block-level and instruction-level
semantics described in \in{Section}[sec:basic-block-generation].

\section[sec:scheduling-impl]{Scheduling Implementation}

\subsection[sec:rtlblockdef]{\rtlblock\ and \rtlpar\ Intermediate Language Definition}

\lindex{\rtlblock}\lindex{\rtlpar}\in{Figure}[fig:compcert_interm] shows the intermediate languages
introduced to implement hyperblocks in CompCert.  This consists of new \rtlblock\ and \rtlpar\
intermediate languages, which implement the sequential and parallel semantics of basic blocks
respectively. The semantics of \rtlblock\ and \rtlpar\ are based on the CompCert \rtl\ intermediate
language. However, instead of mapping from states to instructions, \rtlblock\ maps from states to
hyperblocks, and \rtlpar\ maps from states to parallel hyperblocks, which will be described in the
next sections.

\placeformula[eq:standard] \startformula \startmathalignment
\NC i\ \ \colon\colon= \NC \ \ \text{\tt RBnop}                                                   \NR
\NC                  \NC |\ \ \text{\tt RBop}\ p?\ \mathit{op}\ \vec{r}\ d                      \NR
\NC                  \NC |\ \ \text{\tt RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d  \NR
\NC                  \NC |\ \ \text{\tt RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \NR
\NC                  \NC |\ \ \text{\tt RBsetpred}\ p?\ c\ \vec{r}\ d                           \NR
\stopmathalignment \stopformula

\placeformula[eq:cf-instr] \startformula \startmathalignment
\NC i_{\mathit{cf}}\ \ \colon\colon= \NC \ \ \text{\tt RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \NR
\NC                      \NC |\ \ \text{\tt RBtailcall}\ \mathit{sig}\ f\ \vec{r}                      \NR
\NC                      \NC |\ \ \text{\tt RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n                \NR
\NC                      \NC |\ \ \text{\tt RBcond}\ c\ \vec{r}\ n_{1}\ n_{2}                          \NR
\NC                      \NC |\ \ \text{\tt RBjumptable}\ r\ \vec{n}                                   \NR
\NC                      \NC |\ \ \text{\tt RBreturn}\ r?                                              \NR
\NC                      \NC |\ \ \text{\tt RBgoto}\ n                                                 \NR
\NC                      \NC |\ \ \text{\tt RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \NR
\stopmathalignment \stopformula

\rtlblock\ instructions are split into two types: standard instructions and control-flow
instructions. The standard instructions are the instructions that can be placed into hyperblocks,
whereas control-flow instructions are the instructions that can end hyperblocks.  The standard
instructions are shown in \in{Equation}[eq:standard], whereas the control-flow instructions are
shown in \in{Equation}[eq:cf-instr].  Most instructions are quite similar to their \rtl\
counterparts; however, some instructions have been added. \type{RBsetpred} is a standard
instruction which sets a predicate equal to an evaluated condition. This instruction is used during
if-conversion to set the predicate to the value of the condition in the conditional statement being
converted. To support proper predicated instructions, each instruction also carries an optional
predicate which has been set beforehand; this is what turns a plain block into a hyperblock. In
addition to that, there is also an extra control-flow instruction called \type{RBpred_cf}, which
branches on a predicate and nests further control-flow instructions inside of it. This is necessary
for if-conversion when converting an already converted conditional statement, as depending on the
predicate, a different control-flow instruction may need to be executed.

\placeformula[fig:hb_def]\startformula\startalign[align={1:left}]
\NC \blockbb \qquad \eqdef \qquad (\text{\tt slist}\ i) \times i_{\mathit{cf}} \NR \NC \parbb \qquad
\eqdef \qquad (\text{\tt slist}\ (\text{\tt plist}\ (\text{\tt slist}\ i))) \times
i_{\mathit{cf}}\NR
\stopalign\stopformula

These instructions are used in \rtlblock\ as well as in \rtlpar. The main difference between the
two languages is how the instructions are arranged within the hyperblock and the execution
semantics of the hyperblock. Let $\blockbb$ be the definition of a hyperblock for \rtlblock, and
let $\parbb$ be the definition of a hyperblock for \rtlpar; then $\blockbb$ and $\parbb$ are
defined as in \in{Equation}[fig:hb_def]. The definition shows the different nesting levels of the
blocks in \rtlblock\ as well as \rtlpar, and where the parallel semantics of \rtlpar\ come into
play. An \rtlblock\ hyperblock is made of a list of instructions and a control-flow instruction
that ends the hyperblock. Each instruction in \rtlblock\ is executed sequentially. An \rtlpar\
hyperblock is made of three nested lists. The outer list behaves like the list in \rtlblock, and
executes each list inside of it sequentially. \rtlpar\ then has a list which executes its contents
in parallel, meaning each instruction uses the same state of the registers and memory while
executing. Finally, instead of directly containing instructions, there is an additional layer of
sequentially executed lists of instructions, making it possible to execute small groups of
instructions in sequence, but still in parallel with the other instructions in the current parallel
group.

\rtlpar\ is structured in this way to give the scheduler more flexibility: it can not only
parallelise operations in the basic blocks, but can also define sequential operations within these
parallel operations. This is especially useful for the hardware back end, as it means that
sequential operations that do not quite fill one clock cycle can be chained together, while still
executing in parallel with other independent operations. This optimisation is called operation
chaining, and is critical to get the most performance out of the hardware.
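
As a rough sketch, and using plain lists in place of the \type{slist} and \type{plist} wrappers
from \in{Equation}[fig:hb_def], the two hyperblock shapes correspond to the following OCaml types
(the names are illustrative, not those of the actual development):

\starttyping
(* Standard and control-flow instructions; bodies elided. *)
type instr
type cf_instr

(* An RTLBlock hyperblock: a sequential list of standard
   instructions ended by one control-flow instruction. *)
type block_bb = instr list * cf_instr

(* An RTLPar hyperblock: the outer list runs sequentially, the
   middle list runs in parallel within one step, and the inner
   list runs sequentially again, enabling operation chaining. *)
type par_bb = instr list list list * cf_instr
\stoptyping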

\subsection[translating-rtlblock-to-rtlpar]{Translating \rtlblock\ to \rtlpar}

\definecolor[s1col][x=7FC97F]
\definecolor[s2col][x=BEAED4]
\definecolor[s3col][x=FDC086]
\definecolor[s4col][x=FFFF99]
\definecolor[s5col][x=386CB0]

\define[2]\makedfgstate{\raisebox{-2pt}\hbox{\tikz{\node[circle,draw=black,minimum
      size=4mm,fill=#1,scale=0.5]{#2};}}}
\define\sI{\makedfgstate{s1col}{1}}
\define\sII{\makedfgstate{s2col}{2}}
\define\sIII{\makedfgstate{s3col}{3}}
\define\sIV{\makedfgstate{s4col}{4}}
\define\sV{\makedfgstate{s5col}{5}}

\startplacefigure[reference={fig:op_chain},title={Example of scheduling a hyperblock into multiple
    clock cycles.}]
  \startfloatcombination[ny=2,nx=2]
    \startplacesubfigure[title={RTLBlock hyperblock to be scheduled.}]
      \startframedtext[frame=off,offset=none,width={0.43\textwidth},style=\rmx]
        \starthlC
          /BTEX \sI/ETEX                r2 = r1 + r4;
          /BTEX \sII/ETEX if(p1)         r1 = r2 + r4;
          /BTEX \sIII/ETEX if(!p2 && !p1) r3 = r1 * r1;
          /BTEX \sIV/ETEX if(p2)         r3 = r1 * r4;
          /BTEX \sV/ETEX if(!p2)        r3 = r3 * r3;
        \stophlC
      \stopframedtext
    \stopplacesubfigure
    \startplacesubfigure[title={Scheduled RTLPar hyperblock.}]
      \startframedtext[frame=off,offset=none,width={0.53\textwidth},style=\rmx]
        \starthlC
          1: [r2 = r1 + r4; if(p1) r1 = r2 + r4];
          [if(!p2 && !p1) r3 = r1 * r1]
          2: [if(p2) r3 = r1 * r4]; [if(!p2) r3 = r3 * r3]
        \stophlC
      \stopframedtext
    \stopplacesubfigure
    \startplacesubfigure[title={Data dependencies between each operation of the RTLBlock
        hyperblock.}]
      \startframedtext[frame=off,offset=none,width={0.43\textwidth}]
        \hbox{\starttikzpicture[>=stealth,shorten >=1pt,auto,node distance=1.3cm]
          \tikzstyle{every state}=[draw=black,thick,minimum size=4mm]
          \node[state,fill=s1col] (s1) {1};
          \node[state,fill=s2col] (s2) [below left of=s1] {2};
          \node[state,fill=s3col] (s3) [right of=s1] {3};
          \node[state,fill=s4col] (s4) [below right of=s2] {4};
          \node[state,fill=s5col] (s5) [below of=s3] {5};

          \draw[->] (s1) -- (s2);
          \draw[->] (s2) -- (s4);
          %\draw[->] (s3) -- (s4);
          \draw[->] (s3) -- (s5);
        \stoptikzpicture}
      \stopframedtext
    \stopplacesubfigure
    \startplacesubfigure[title={Timing diagram showing how the operations in RTLPar will be executed.}]
      \externalfigure[figures/timing-3.pdf][width=0.5\textwidth]
    \stopplacesubfigure
  \stopfloatcombination
\stopplacefigure

The translation from \rtlblock\ to \rtlpar\ is performed by a scheduling algorithm, which takes
hyperblocks and rearranges their instructions to maximise parallelism based on their data
dependencies. It then arranges the instructions in \rtlpar\ by putting independent instructions into
the same parallel block. Scheduling is an optimisation that is well suited to translation
validation, as the algorithm itself can be quite complex, whereas the output is normally relatively
easy to check for equivalence with the input.  \in{Figure}[fig:compcert_interm] shows that the
scheduling step comes right after the if-conversion step which originally creates the hyperblocks.

\in{Figure}[fig:op_chain] shows how the scheduling step transforms \rtlblock\ into \rtlpar. The
\rtlblock\ hyperblock being scheduled is shown in \in{Figure}{a}[fig:op_chain], and contains five
operations, comprising two additions and three multiplications. The data dependencies of the
instructions in the hyperblock are shown in \in{Figure}{c}[fig:op_chain]. Interestingly, even
though operations \sIV\ and \sV\ both write to the same register and would normally be dependent on
each other due to a write-after-write conflict, the conflict can effectively be removed by the
scheduler because their predicates are mutually exclusive.

The scheduler transforms the \rtlblock\ hyperblock into the \rtlpar\ hyperblock shown in
\in{Figure}{b}[fig:op_chain].  Even though the addition in \sII\ is dependent on \sI, they can still
be put into the same cycle, as the additions do not take as long to complete as the
multiplications. This optimisation is called operation chaining. The multiplication in \sIII\ can
also be placed in the same clock cycle, as it does not have any data dependencies. In the next
clock cycle, either the multiplication in \sIV\ or the multiplication in \sV\ will take place, so
these two multiplications can also be placed in the same clock cycle. This gives the final schedule
that is shown in \in{Figure}{d}[fig:op_chain].

\section[abstr_interp]{Abstract Interpretation of Hyperblocks}

\index{symbolic execution}\seeindex{abstract interpretation}{symbolic execution}The correctness of
the scheduling algorithm is checked by comparing the symbolic expressions of each register before
and after scheduling. This section describes how these symbolic expressions are constructed for each
hyperblock.

\subsection[abstr_language]{Construction of Abstract Language}

The main difficulty when constructing \index{symbolic expression}symbolic expressions for
hyperblocks is the presence of predicates, which can be used by the scheduling algorithm to reorder
instructions further, even if these contain data dependencies. The scheduling algorithm will
manipulate predicates in the following ways, which the comparison function needs to take into
account.

\startitemize
\item
  Instructions with predicates that are equivalent to false can be removed.
\item
  Instructions with predicates that are mutually exclusive can be executed in parallel.
\item
  Any predicate can be replaced with an equivalent predicate.
\stopitemize

The symbolic expressions have the following form:

\define\sep{\ |\ }

Resources:

\startformula \rho\ \eqdef\ r \sep p \sep \text{\tt Mem} \stopformula

Expressions:

\startformula e\ \eqdef\ r^{0} \sep P \Rightarrow \text{\tt Op}(\mathit{op}, \vec{e}) \sep P \Rightarrow
  \text{\tt Load}(\mathit{chunk}, \mathit{mode}, \vec{e}, e_{m}) \stopformula

Memory expressions:

\startformula e_{m}\ \eqdef\ \text{\tt Mem}^{0} \sep P \Rightarrow \text{\tt Store}(\mathit{chunk},
  \mathit{mode}, \vec{e}, e_{m}, e) \stopformula

Predicate expressions:

\startformula e_{p}\ \eqdef\ p^{0} \sep P \Rightarrow \text{\tt SetPred}(\mathit{cond}, \vec{e}, e_{p})\stopformula

Code:

\startformula m\ \eqdef\ \rho \mapsto (\vec{e} \sep \vec{e_{m}} \sep \vec{e_{p}}) \stopformula

Predicates:

\startformula P\ \eqdef\ (\text{\tt Bool} * p) \sep P \land P \sep P \lor P \sep \top\ \sep \perp \stopformula

There are three main resources present in symbolic expressions: registers $r$, which represent the
values of the registers in the program; predicates $p$, which represent the values of the
predicates throughout the program; and finally memory \type{Mem}, representing the state of the
memory. For each basic instruction in the hyperblock, there is an equivalent symbolic expression
which modifies the state of the registers, predicates or memory. Finally, as each instruction in
the hyperblock can be predicated, each symbolic expression is also paired with a predicate.
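
As a rough sketch, these grammars could be represented by the following OCaml types, where
\type{operation}, \type{chunk}, \type{addressing} and \type{condition} stand in for the
corresponding CompCert definitions, and where the three sorts of expressions from the grammar are
merged into a single type for brevity (all names are illustrative):

\starttyping
(* Stand-ins for the corresponding CompCert definitions. *)
type operation
type chunk
type addressing
type condition

(* Resources: registers, predicate registers and memory. *)
type resource = Reg of int | PredReg of int | Mem

(* Predicates P over predicate-register literals. *)
type pred =
  | Plit of bool * int   (* polarity and predicate register *)
  | Pand of pred * pred
  | Por of pred * pred
  | Ptrue
  | Pfalse

(* Symbolic expressions; Ebase r is the initial value r^0. *)
type expr =
  | Ebase of resource
  | Eop of operation * expr list
  | Eload of chunk * addressing * expr list * expr
  | Estore of chunk * addressing * expr list * expr * expr
  | Esetpred of condition * expr list * expr

(* The abstract state maps each resource to a list of predicated
   expressions whose predicates are mutually exclusive. *)
type forest = resource -> (pred * expr) list
\stoptyping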

Assigning the same resource multiple times with instructions that have different predicates means
that the symbolic expression must also express this conditional result in terms of the
predicates. This is done by assigning a list of predicated symbolic expressions to each resource,
where each predicate needs to be mutually exclusive from the other predicates in the list.

The expressions are then constructed using an update function which maintains the symbolic
expressions assigned to each resource. This is done using multiple primitives which act on
predicated types, which are made up of a list of pairs of predicates and elements of that type. The
first important primitive is the multiplication of two predicated types, which is implemented as a
Cartesian product between the predicated types in the two lists, anding the two predicates and
joining the two elements of each pair using a function.

\placeformula[eq:1]\startformula
  P_{1} \otimes_{f} P_{2} \equiv \text{\tt map } (\lambda ((p_{1}, e_{1}), (p_{2}, e_{2})) . (p_{1}
  \land p_{2}, f\ e_{1}\ e_{2})) P_{1} \times P_{2} \stopformula

In addition to that, another primitive that is needed is the following append operation: first,
$\mu(p, P)$ ands the predicate $p$ onto every predicate inside the predicated type $P$; the append
operation $P_{1} \oplus_{p} P_{2}$ then keeps $P_{1}$ under $\neg p$ and $P_{2}$ under $p$, and
appends the two resulting lists:

\placeformula[eq:2]\startformula
  \mu(p, P) \equiv \text{\tt map } (\lambda (p', e') . (p \land p', e'))\ P \stopformula

\placeformula[eq:3]\startformula
  P_{1} \oplus_{p} P_{2} \equiv \mu(\neg p, P_{1}) + \mu(p, P_{2}) \stopformula
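
Reusing the \type{pred} type from the earlier sketch, the two primitives could be implemented in
OCaml roughly as follows (an illustrative sketch, not the code of the actual development):

\starttyping
(* A predicated type: a list of guard/value pairs. *)
type 'a pred_expr = (pred * 'a) list

(* Negation, pushing through connectives and flipping literals. *)
let rec negate = function
  | Plit (b, p) -> Plit (not b, p)
  | Pand (a, b) -> Por (negate a, negate b)
  | Por (a, b) -> Pand (negate a, negate b)
  | Ptrue -> Pfalse
  | Pfalse -> Ptrue

(* Multiplication: Cartesian product of two predicated types,
   anding the guards and joining the values with f. *)
let prod (f : 'a -> 'b -> 'c) (p1 : 'a pred_expr) (p2 : 'b pred_expr) =
  List.concat_map
    (fun (g1, e1) ->
      List.map (fun (g2, e2) -> (Pand (g1, g2), f e1 e2)) p2)
    p1

(* mu: strengthen every guard with an extra predicate. *)
let mu p pe = List.map (fun (g, e) -> (Pand (p, g), e)) pe

(* Guarded append: keep p1 where p is false, p2 where p is true. *)
let app p p1 p2 = mu (negate p) p1 @ mu p p2
\stoptyping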

\subsection[example-of-translation]{Example of translation}

\placeformula[eq:4]\startformula\startalign[n=1,align={1:left}]
\NC \text{\tt r1} =
\startcases
  \NC \mono{r1}^{0} + \mono{r4}^{0} + \mono{r4}^{0}, \quad \NC \mono{p1} \NR
  \NC \mono{r1}^{0}, \quad \NC \mono{!p1} \NR
\stopcases \NR
\NC \mono{r2} = \mono{r1}^{0} + \mono{r4}^{0}\NR
\NC\mono{r3} =
\startcases
  \NC \left( \mono{r1}^{0} \times \mono{r1}^{0} \right) \times \left( \mono{r1}^{0} \times
  \mono{r1}^{0} \right),\quad \NC \mono{!p2 \&\& !p1} \NR
  \NC \mono{r1}^{0} \times \mono{r4}^{0},\quad \NC \mono{p2 \&\& !p1} \NR
  \NC \left( \mono{r1}^{0} + \mono{r4}^{0} + \mono{r4}^{0} \right) \times \mono{r4}^{0},\quad
  \NC \mono{p2 \&\& p1}\NR
  \NC \mono{r3}^{0} \times \mono{r3}^{0},\quad \NC\mono{!p2 \&\& p1}\NR
\stopcases\NR
\stopalign \stopformula

Using the example shown in \in{Figure}[fig:op_chain], the \rtlblock\ hyperblock shown in
\in{Figure}{a}[fig:op_chain] is scheduled into the \rtlpar\ hyperblock shown in
\in{Figure}{b}[fig:op_chain]. The first step is to translate the input and the output of the
scheduler into the \abstr\ intermediate language by calculating the symbolic values that will be
stored in each of the registers. Symbolically evaluating the \rtlblock\ hyperblock results in the
expressions shown in \in{Equation}[eq:4] for registers \type{r1}, \type{r2} and \type{r3}, where the
state of the register \type{r1} at the start of the hyperblock is denoted by $\text{\tt r1}^{0}$.

These symbolic expressions are generated by sequentially going through the list of predicated
instructions and applying the update function defined in \in{Section}[abstr_language]. The
resulting predicated expressions are guaranteed to have mutually exclusive predicates, meaning if
one predicate evaluates to true, all other predicates must evaluate to false.
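
In terms of the earlier sketches, this sequential application is simply a fold of a hypothetical
\type{update} function over the block body:

\starttyping
(* Symbolically evaluate a hyperblock body by folding the update
   function over its instructions, starting from the initial forest
   in which every resource maps to its own initial value. *)
let abstract_block (update : forest -> instr -> forest)
    (init : forest) (body : instr list) : forest =
  List.fold_left update init body
\stoptyping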

\subsection[linear-flat-predicated-expressions]{Linear (Flat) Predicated Expressions}

\index{symbolic expression}The update function for predicated expressions is quite a bit more
complex than standard update functions used for symbolic execution. The main reason for this is the
non-recursive nature of the predicated expressions. As the predicates are only available at the top
level of the expressions, combining multiple expressions means that the conditions and multiple
possibilities also need to trickle up to the top.

If the predicated expressions were instead recursive, one would not have to change any predicates
when evaluating the argument lists of instructions, for example, as these could just stay
predicated. The evaluation semantics would also be much simpler, because a rule could be added for
evaluating predicates.

\section[sec:correctness]{Correctness of the Abstract Interpretation}

The correctness of the scheduling step is dependent on the correctness of the equivalence check
between the two abstract languages. As the abstract languages contain predicates, this comparison
check cannot just be a syntactic comparison between the two languages, because the predicates might
be different but logically equivalent. Instead, a mixed equivalence check is performed using a
verified SAT solver and a syntactic equivalence check of the actual symbolic expressions.

\subsection[comparing-symbolic-abstract-expressions]{Comparing Symbolic Abstract Expressions}

Abstract expressions comprise a predicate and a symbolic expression. The translation from \rtlblock\
and \rtlpar\ to \abstr\ ensures that each register contains a predicated symbolic expression that
describes the value that will be assigned to it after evaluating the basic block.  This means that
if an expression always produces the same result as another expression, the register will always
contain the same value. If the \rtlblock\ and \rtlpar\ programs translate to two \abstr\
representations in which all the registers' symbolic expressions match, then both programs will
always produce the same values.

The main comparison between two symbolic expressions can be done using a simple syntactic
comparison, because this ensures that each of the expressions always evaluates to the same value,
and it is also general enough for the expressions that need to be compared. However, as all
expressions might be predicated, this simple syntactic comparison cannot be performed for the
predicates. This is because predicated instructions can be executed in many different orders, and
the predicates are used to calculate dependence information and therefore also determine the order
of operations.

Instead, comparing predicates can be done by checking logical equivalence of the predicates using a
formally verified \SAT\ solver. This ensures that the predicates will always evaluate to the same
value, thereby making it a fair comparison.

\subsection[verified_sat]{Verified Decidable DPLL SAT Solver}

A SAT solver is needed to correctly and reliably compare predicated expressions. This makes it
possible to compare many different possible schedules, even ones that remove unnecessary
expressions completely. The SAT solver is used by the verifier, which itself needs to be proven
correct in Coq to prove the scheduling translation correct; the SAT solver therefore has to be
verified as well, as it could otherwise not be used in the proof of correctness.

The DPLL algorithm is used to implement the SAT solver. This means that the SAT solver takes in a
\CNF\ formula. The conversion from a recursive predicate type into a \CNF\ formula is done by
recursively expanding the predicate into equivalent \CNF\ formulas.
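
A naïve version of this expansion, over the \type{pred} type from the earlier sketches, could look
as follows; this is an illustrative sketch, and the encoding in the actual development may differ:

\starttyping
(* A CNF formula is a conjunction of clauses; a clause is a
   disjunction of literals (polarity, predicate register). *)
type lit = bool * int
type clause = lit list
type cnf = clause list

(* Disjunction of two CNFs by distributing clauses pairwise. *)
let cnf_or (a : cnf) (b : cnf) : cnf =
  List.concat_map (fun c1 -> List.map (fun c2 -> c1 @ c2) b) a

let rec to_cnf = function
  | Plit (b, p) -> [ [ (b, p) ] ]
  | Ptrue -> []                 (* empty conjunction: true *)
  | Pfalse -> [ [] ]            (* empty clause: false *)
  | Pand (x, y) -> to_cnf x @ to_cnf y
  | Por (x, y) -> cnf_or (to_cnf x) (to_cnf y)
\stoptyping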

The SAT solver is implemented in Coq and proven to be sound and complete. The soundness and
completeness of an algorithm that checks whether a predicate is satisfiable also directly imply
that the check is decidable. This simplifies many proofs in Coq, as it allows a \type{Setoid}
instance to be declared for equivalence between predicates, which makes rewriting with equivalent
predicates in proofs much easier by reusing many of the existing Coq tactics.

\subsection[encoding-predicated-expressions-as-sat-formulas]{Encoding Predicated Expressions as SAT
  Formulas}

Given the following two predicated expression lists, which should be tested for equivalence, one
can construct a SAT query which ensures that both lists will always result in the same value.

\startformula
\startcases
\NC e_{1},\quad \NC p_{1} \NR
\NC e_{2},\quad \NC p_{2} \NR
\NC \ \ \vdots \NC \NR
\NC e_{n},\quad \NC p_{n} \NR
\stopcases
\qquad \Longleftrightarrow \qquad
\startcases
\NC e_{1}',\quad \NC p_{1}' \NR
\NC e_{2}',\quad \NC p_{2}' \NR
\NC \ \ \vdots \NC \NR
\NC e_{m}',\quad \NC p_{m}' \NR
\stopcases
\stopformula

The SAT query that should determine if two predicated expressions are equivalent is the following,
where $\mathit{he}_{i}$ is the hashed expression of $e_{i}$:

\startformula \startmathalignment[align={1:middle}]
\NC (p_{1} \Rightarrow \mathit{he}_{1}) \land (p_{2} \Rightarrow \mathit{he}_{2}) \land \cdots \land (p_{n} \Rightarrow \mathit{he}_{n}) \NR
\NC \Longleftrightarrow \NR
\NC (p_{1}' \Rightarrow \mathit{he}_{1}') \land (p_{2}' \Rightarrow \mathit{he}_{2}') \land \cdots \land (p_{m}' \Rightarrow \mathit{he}_{m}') \NR
\stopmathalignment \stopformula

However, reasoning about the equivalence of individual expressions within the greater list of
predicated expressions is not quite so straightforward. The evaluation semantics of a predicated
expression list evaluates each predicated expression sequentially and picks the expression whose
predicate evaluates to true. Proving that both predicated expression lists return the same value,
assuming the SAT query above holds, means showing that for each predicate that is true, there is a
predicate in the other list that is also true and returns the same value. However, there is no
guarantee that there is a unique predicated expression that matches the current predicate, as there
may be duplicate expressions in the list with different predicates.

This means that it is quite difficult to prove that the SAT query implies the semantic preservation
of the scheduling transformation. The SAT query is therefore broken down into a slightly more
direct comparison that is more useful for the proof itself. The steps to perform this more
fine-grained comparison are the following (an illustrative sketch follows the list):

\startitemize[n]
\item
  Hash all of the expressions into a unique literal.
\item
  Create a map from expression literals to their corresponding predicates, thereby obtaining a
  unique predicate for each expression. If there are duplicates in the list that forms the map, the
  predicates are combined using the or operation.
\item
  Iterate through the expressions in both maps and compare the predicates using the SAT solver. If
  an expression is not present in one of the maps, then its predicate should be equivalent to
  $\perp$.
\stopitemize
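
The following OCaml sketch illustrates the three steps, where \type{hash} and \type{sat_equiv}
stand in for the expression hashing function and the SAT-based equivalence check (all names are
illustrative):

\starttyping
module IM = Map.Make (Int)

(* Steps 1 and 2: hash every expression and build a map from the
   hash to its predicate, or-ing the predicates of duplicates. *)
let to_pred_map (hash : expr -> int) (pes : (pred * expr) list) =
  List.fold_left
    (fun m (g, e) ->
      IM.update (hash e)
        (function Some g0 -> Some (Por (g0, g)) | None -> Some g)
        m)
    IM.empty pes

(* Step 3: compare the predicates of matching expressions; a
   missing entry behaves like the predicate false. *)
let equiv_check (sat_equiv : pred -> pred -> bool) hash pes1 pes2 =
  let m1 = to_pred_map hash pes1 and m2 = to_pred_map hash pes2 in
  IM.merge
    (fun _ a b ->
      Some (sat_equiv
              (Option.value a ~default:Pfalse)
              (Option.value b ~default:Pfalse)))
    m1 m2
  |> IM.for_all (fun _ ok -> ok)
\stoptyping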

This comparison is logically equivalent to performing the large SAT query; however, it maps better
onto the proof because it compares the predicates of the same expressions with each other.

\subsection[proof-of-equivalence-between-passes]{Proof of equivalence between passes}

\in{Figure}[fig:compcert_interm] shows the four different passes that were added to CompCert to form
hyperblocks.  This section will cover the proofs for each of the translations, focusing mainly on
the proof of translation of the scheduling pass.

\subsubsection[proof-of-correctness-of-the-if-conversion-pass]{Proof of Correctness of the
  If-Conversion Pass}

The if-conversion pass is quite simple to verify, as it is an optimisation that transforms the code
in the same language.

\subsubsection[proof-of-correctness-of-scheduling]{Proof of Correctness of Scheduling}

The scheduling pass is formally verified using translation validation, with the \abstr\ language as
the point of comparison between \rtlblock\ and \rtlpar. Using the algorithm defined in
\in{Section}[abstr_interp], the formally verified SAT solver described in \in{Section}[verified_sat]
can be used to prove that if the two predicated expression lists are equivalent, they will always
evaluate to the same value.

\section[implementation-details-of-static-sdc-scheduling]{\SDC\ Scheduling Implementation Details}

This section describes the implementation details of the static \SDC\ scheduling algorithm used, as
well as some details of the basic block generation and the if-conversion pass which generates the
predicated instructions.

\subsection[static-sdc-scheduling]{Static \SDC\ Scheduling}

\SDC\ scheduling~\cite[cong06_sdc] is a flexible algorithm for scheduling instructions, especially
when the target is hardware. It allows for flexible definitions of optimisation functions in terms
of the latencies and dependencies of the operations in the basic block.
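
For instance, a data dependency where operation $j$ uses the result of operation $i$, which has
latency $l_{i}$, can be encoded as a constraint on the start cycles $s_{i}$ and $s_{j}$ in the
difference form that gives \SDC\ its name:

\startformula s_{j} - s_{i} \geq l_{i} \qquad \Longleftrightarrow \qquad s_{i} - s_{j} \leq -l_{i} \stopformula

Because every constraint is kept in this difference form, the constraint matrix of the resulting
linear program is totally unimodular, which is what makes \SDC\ formulations efficiently
solvable~\cite[cong06_sdc].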

\startmode[chapter]
  \section{Bibliography}
  \placelistofpublications
\stopmode

\stopcomponent