\environment fonts_env
\environment lsr_env

\startcomponent scheduling

\chapter[sec:scheduling]{Static Scheduling}

\section[introduction]{Introduction}

\oindex{static scheduling}The use of multicore processors has been increasing drastically, making
parallelising compilers ever more important.  In addition, the need for custom hardware
accelerators is also increasing, as traditional processors are not always the best choice for every
application.  Compilers with optimisations that can automatically parallelise their input programs
are therefore becoming more important as well, since all of these back ends benefit from them.  For
processors, one wants to optimise pipeline usage as much as possible, whereas in hardware one can
create custom pipelines and exploit the spatial properties that hardware offers, making
instruction-level parallelism as well as speculative execution important.  However, more powerful
optimisations leave more room for mistakes in the compiler, making it possible to generate programs
that behave differently to the input program.  An important step towards making compilers more
reliable is to formally verify these optimisations, so that it can be guaranteed that no bugs are
introduced.

A popular optimisation that increases the instruction-level parallelism of the resulting program is
scheduling.  This optimisation is especially important for processors that have deep instruction
pipelines or multiple functional units that can execute instructions in parallel.  However, it is
also critical for high-level synthesis tools, which generate application-specific hardware designs
from a software input.  Simple scheduling algorithms often work on basic blocks, which are
sequential lists of instructions without any branches.  However, this limits the kind of reordering
the scheduling algorithm can perform: it can never move instructions past a branching instruction,
even when doing so might drastically improve performance.  Trace
scheduling~\cite[fisher81_trace_sched] is an alternative that addresses this issue by creating
paths that may cross branching instructions and scheduling the instructions along each path.
However, the main problem with trace scheduling is that it is often infeasible on large programs,
and because of the large design space, it is sometimes hard to find an optimal
schedule. \index{superblock}Superblock~\cite[hwu93_super] and
\index{hyperblock}hyperblock~\cite[mahlke92_effec_compil_suppor_predic_execut_using_hyper]
scheduling are two restricted forms of trace scheduling that trade some of its flexibility for more
tractable algorithms.  Superblock scheduling produces more efficient code for processors with a low
issue rate, whereas hyperblock scheduling produces more performant code for processors with a high
issue rate~\cite[mahlke92_effec_compil_suppor_predic_execut_using_hyper,aiken16_trace_sched].

A hardware back end is an ideal target for hyperblock scheduling, as the issue rate is practically
infinite.  This chapter describes the implementation of a hyperblock instruction scheduler in
CompCert, so that it can benefit existing CompCert back ends as well as more general targets, such
as scheduling instructions for custom hardware.

\subsection[key-points]{Key Points}

The goal of this chapter is to introduce hyperblock scheduling into CompCert to allow for advanced
and flexible scheduling operations that could benefit various back ends.  To this end, the key
points of this chapter are the following:

\startitemize
\item Description of how hyperblocks are introduced for improved scheduling of instructions, and how
  these can be built and translated to hardware.
\item Description of the implementation and correctness proof of a \SDC\ scheduling algorithm.  The
  correctness proof does not rely on the algorithm itself and so can safely be replaced by a
  different scheduling algorithm with similar properties.
\stopitemize

\section{Progress Summary}

\startplacemarginfigure[reference={fig:compcert_interm},title={New intermediate languages introduced into
    CompCert.  Progress bars are shown for each transformation pass, where green represents the
    estimated portion of the work that is completed, and red the work that remains.  Orange stands
    for a pass that is currently not used in the translation flow, but could be added back in the
    future.}]
  \hbox{\starttikzpicture[shorten >=1pt,>=stealth] \node at (-3, 0) (dotsin) {$\cdots$}; \node at
    (10, 0) (dotsout) {$\cdots$}; \node[draw] at (-1, 0) (rtl) {\rtl}; \node[draw] at (2, 0)
    (rtlblock) {\rtlblock}; \node[draw] at (2, -2) (abstrblock) {\abstr}; \node[draw] at (6, 0)
    (rtlpar) {\rtlpar}; \node[draw] at (6, -2) (abstrpar) {\abstr}; \node[draw] at (8.5, 0) (htl)
    {\htl};

    \draw[->] (rtl) --
         node[above,midway,align=center,font=\small] {basic block \\[-0.3em] creation}
         node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=1.5cm] {}
         (rtlblock);
    \draw[->] (rtlblock) --
         node[above,midway,font=\small] {scheduling}
         node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=1.5cm] {}
         (rtlpar);
    \draw[->] (rtlpar) --
         node[below,yshift=-3mm,xshift=-5mm,anchor=west,fill=VenetianRed,minimum width=1cm] {}
         node[below,yshift=-3mm,xshift=-5mm,anchor=west,fill=ForestGreen,minimum width=2.5mm] {}
         (htl);
    \draw[->,dashed] (rtlblock) --
         node[left,yshift=-5mm,xshift=-3mm,anchor=south,fill=VenetianRed,minimum height=1cm] {}
         node[left,yshift=-5mm,xshift=-3mm,anchor=south,fill=ForestGreen,minimum height=8mm] {}
         (abstrblock);
    \draw[->,dashed] (rtlpar) --
         node[right,yshift=-5mm,xshift=3mm,anchor=south,fill=VenetianRed,minimum height=1cm] {}
         node[right,yshift=-5mm,xshift=3mm,anchor=south,fill=ForestGreen,minimum height=6mm] {}
         (abstrpar);
    \draw[dashed,shorten >=0pt] (abstrblock) --
         node[above,midway] {$\sim$}
         node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=VenetianRed,minimum width=1.5cm] {}
         node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=1.2cm] {}
         (abstrpar);
    \draw[->] (rtlblock) to [out=130,in=50,loop,looseness=10]
         node[above,midway,font=\small] {if-conversion}
         node[below,yshift=6mm,xshift=-0.75cm,anchor=west,fill=VenetianRed,minimum width=1.5cm] {}
         node[below,yshift=6mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=1cm] {}
         (rtlblock);
    \draw[->] (rtlpar) to [out=130,in=50,loop,looseness=10]
         node[above,midway,align=center,font=\small] { operation \\[-0.3em] pipelining}
         node[below,yshift=1cm,xshift=-0.75cm,anchor=west,fill=TigersEye,minimum width=1.5cm] {}
         (rtlpar);
    \draw[->] (dotsin) -- (rtl);
    \draw[->] (htl) -- (dotsout);
  \stoptikzpicture}
\stopplacemarginfigure

This chapter describes work that has not yet been fully completed.
\in{Figure}[fig:compcert_interm] shows the current progress of implementing the scheduling pass,
which relies heavily on the basic block generation, the if-conversion pass and the \htl\
generation.  The operation pipelining pass is not critical to scheduling, but allows pipelined
operations such as dividers to be supported.

In general, the implementation of all the passes is now complete, and so current work focuses on
completing the correctness proofs of all the passes.  Starting with the basic block creation pass,
which is further described in \in{Section}[sec:basic-block-generation], the proof of basic block
generation is quite complex, but has now been completed.  Secondly, the if-conversion pass has been
implemented but not yet verified.  Even though the transformation is quite direct, it is still
difficult to formulate the correct matching condition for the proof; however, this has nearly been
completed.  Next, for the scheduling pass, the verification of the pass that translates the input
and the output of the scheduling algorithm into the \abstr\ language and then compares the two has
largely been completed.  However, due to changes in the \rtlblock\ and \rtlpar\ language semantics,
some of these proofs will have to be redone.  Finally, for the generation of \htl\ from \rtlpar,
the proof has not been started yet.  However, the difficult part of this proof is the translation
of loads and stores into proper reads and writes from RAM, which has already been completely proven
for the \rtl\ to \htl\ translation and can be directly copied over.  The operation pipelining pass
has so far only been implemented; it will likely be removed from the current workflow if it proves
to take too long to complete.

\section[sec:basic-block-generation]{Basic Block Generation}

\index{basic block}The basic block generation proof is surprisingly difficult, as the semantics of
\rtl\ and \rtlblock\ do not match nicely.  This is because \rtl\ executes each of its instructions
sequentially in a \CFG.  \rtlblock, however, executes one basic block in one go, so it is not
possible to refer to the execution of only one instruction.  The traditional simulation diagram for
proofs in CompCert looks like the following:

\placefigure[here,none][]{}{\hbox{\starttikzpicture[shorten >=1pt,>=stealth]
  \node (s2) at (0,0) {$S_{2}$};
  \node (s1) at (0,2) {$S_{1}$};
  \node (s2p) at (5,0) {$S_{2}'$};
  \node (s1p) at (5,2) {$S_{1}'$};
  \draw (s1) -- node[above] {$\sim$} (s1p);
  \draw[dashed] (s2) -- node[above] {$\sim$} (s2p);
  \draw[->] (s1) -- (s2);
  \draw[->,dashed] (s1p) -- node[right] {$+$} (s2p);
\stoptikzpicture}}

This essentially says that given a step in the input language from state $S_{1}$ to $S_{2}$, there
should be one or more steps from the matching state $S_{1}'$ in the translated language to a state
$S_{2}'$ that matches $S_{2}$.  Most of the proofs in CompCert follow this pattern, as the
languages normally get refined over time and produce more instructions than before.

However, when generating basic blocks one wants a slightly different simulation diagram:

\placefigure[here,none][]{}{\hbox{\starttikzpicture[shorten >=1pt,>=stealth]
  \node (s2) at (0,0) {$S_{2}$};
  \node (s1) at (0,2) {$S_{1}$};
  \node (s2p) at (5,0) {$S_{2}'$};
  \node (s1p) at (5,2) {$S_{1}'$};
  \draw (s1) -- node[above] {$\sim$} (s1p);
  \draw[dashed] (s2) -- node[above] {$\sim$} (s2p);
  \draw[->] (s1) -- node[left] {$+$} (s2);
  \draw[->,dashed] (s1p) -- (s2p);
\stoptikzpicture}}

This says that, starting from matching states $S_{1}$ and $S_{1}'$, multiple steps in the input
language from $S_{1}$ to $S_{2}$ should correspond to a single step in the translated language from
$S_{1}'$ to $S_{2}'$, where $S_{2}'$ matches $S_{2}$.  This matches the semantics of how basic
blocks are executed, as multiple instructions are grouped into one block.  However, in this form
such a lemma cannot even be expressed in CompCert's framework, as the proof is inductive over the
semantics of the input language (i.e. executing one instruction at a time).

The main trick, which is also what \cite[authoryears][six22_formal_verif_super_sched] used to
generate superblocks, is to separate the above simulation diagram into the following two diagrams,
where each diagram now only has one transition on the left-hand side.

\startplacefigure[location={here,none}]
  \startfloatcombination[nx=2,distance=2cm]
    \startplacefigure[location=none]
      \hbox{\starttikzpicture[shorten >=1pt,>=stealth]
        \node (s2) at (0,0) {$S_{2}$};
        \node (s1) at (0,2) {$S_{1}$};
        \node (s2p) at (3,0) {$S_{2}'$};
        \node (s1p) at (3,2) {$S_{1}'$};
        \draw (s1) -- node[above] {$\sim_{i::\varnothing}$} (s1p);
        \draw[dashed] (s2) -- node[above] {$\exists b, \sim_{b}$} (s2p);
        \draw[->] (s1) -- (s2);
        \draw[->,dashed] (s1p) -- (s2p);
      \stoptikzpicture}
    \stopplacefigure
    \startplacefigure[location=none]
      \hbox{\starttikzpicture[shorten >=1pt,>=stealth]
        \node (s2) at (0,0) {$S_{2}$};
        \node (s1) at (0,2) {$S_{1}$};
        \node (s1p) at (3,2) {$S_{1}'$};
        \draw (s1) -- node[above] {$\sim_{i::b}$} (s1p);
        \draw[->] (s1) -- (s2);
        \draw[dashed] (s2) -- node[below,xshift=3mm] {$\sim_{b}$} (s1p);
      \stoptikzpicture}
    \stopplacefigure
  \stopfloatcombination
\stopplacefigure

The diagram on the left represents a control-flow transition, which is performed in lock-step.  This
means one has encountered a control-flow instruction, which also implies that the basic block has
finished executing and that the next block needs to be retrieved.  In addition, the matching
relation is parameterised by the basic block that is currently executing; since the lock-step
transition is always the control-flow instruction at the end of a basic block, it populates the
matching relation with the new basic block that is jumped to.  The diagram on the right is the more
interesting one, as it allows for the multiple transitions in the input language.  It requires a
matching predicate $\sim_{i::b}$ which matches the start of the basic block in \rtlblock\ with any
state in the original \rtl\ program, as long as one has not finished executing the basic block and
has not encountered a control-flow instruction.  Then, at every step, one has to show that the
block in the matching relation reduces, to establish termination, and one also has to keep track of
the proof of simulation from the beginning of the basic block to the current point, to then show
that the final control-flow step also matches.

This is done by the following two properties contained in the simulation relation
$S_{2} \sim_{b} S_{1}'$.  Property \in{Equation}[eq:sem_extrap] states that if executing the
remaining instructions $b$, which the simulation relation is parameterised by, from the current
state $S_{2}$ results in a final state $S$, then executing the whole basic block $B$, found in the
code $c$ of the translated \rtlblock\ function at the program counter of $S_{1}'$, from the initial
state $S_{1}'$ results in the same final state $S$.

\placeformula[eq:sem_extrap]\startformula
  \forall S,\ \ c[S_{1}'.\mathit{pc}] = \Some{B} \land (b, S_2) \Downarrow S \implies (B, S_{1}') \Downarrow S
\stopformula

The following \in{Equation}[eq:star] then states that there exist zero or more small steps in \rtl\
from $S_{1}'$ to the current \rtl\ state $S_{2}$.

\placeformula[eq:star]\startformula
  S_{1}' \rightarrow_* S_{2}
\stopformula

By also demonstrating that a metric stored in the $\sim$ relation decreases, one can show that the
simulation cannot iterate forever, effectively obtaining the behaviour of executing multiple
instructions in the input code for one big step in the output language.  \in{Equation}[eq:star]
also incrementally builds a proof of the execution from the start of the basic block to the current
state, so that when a control-flow instruction is encountered, this proof can be used to show that
the whole output block jumps over all of the corresponding input instructions.

\subsection[scheduling:hyperblocks]{Hyperblocks as a Generalisation of Basic Blocks}

Basic blocks are popular in intermediate languages because they describe a list of sequential
instructions that are not separated by control-flow. This information can be used by various
analysis passes to perform optimisations, as these sequences of instructions can be reordered in any
way as long as their data dependencies are met. One such optimisation is scheduling, which reorders
instructions to eliminate any delay slots caused by the CPU pipeline.

However, as control-flow is not allowed in basic blocks, the number of instructions that can be
scheduled at any one time is limited.  The boundaries to other basic blocks act as a hard barrier
to the scheduling
algorithm.  Hyperblocks~\cite[mahlke92_effec_compil_suppor_predic_execut_using_hyper] allow
instructions to be predicated, thereby introducing some control-flow information into the blocks,
which the scheduler can also take into account.  For example, any data dependency between
instructions that have mutually exclusive predicates can be removed, as these instructions will
never be active at the same time.
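
To illustrate, the following OCaml sketch shows how such a dependency check can be relaxed.  The
predicate type and the \type{sat} oracle are illustrative stand-ins (the latter for a real
satisfiability check), not Vericert's actual definitions: two predicated writes to the same
register conflict only if their predicates can hold simultaneously.

\starttyping
(* Sketch only: a minimal predicate type and an abstract SAT oracle. *)
type pred =
  | Pvar of int            (* predicate register, e.g. p1 *)
  | Pnot of pred
  | Pand of pred * pred
  | Por  of pred * pred

(* Placeholder for a real satisfiability check. *)
let sat (_ : pred) : bool = true

(* Two predicated writes to the same resource conflict only if their
   predicates can both be true in the same execution. *)
let may_conflict (p1 : pred) (p2 : pred) : bool =
  sat (Pand (p1, p2))
\stoptyping

With a real oracle, mutually exclusive predicates such as {\tt p2} and {\tt !p2} make
\type{may_conflict} return false, freeing the scheduler to reorder or parallelise the two writes.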

\section[implementation-of-if-conversion]{Implementation of If-Conversion}

If-conversion is the transformation that introduces the predicated instructions which form
hyperblocks.  It has a few limitations on the kind of conditional statements it can translate.
First, only conditional statements without loops can be translated; one must therefore identify
cycles in the control-flow graph before performing the if-conversion.  Secondly, if-conversion does
not always result in more efficient code, so it should not be applied to every conditional
statement.

From an implementation point of view, if-conversion can be implemented naïvely; however, it can be
improved by adding better heuristics to determine which conditional statements to transform into
predicated instructions.  For hyperblocks, conditionals whose branches execute in approximately the
same number of cycles are ideal candidates for the transformation, as no time is then lost
executing extra predicated instructions whose predicates are false.  However, as hyperblocks are an
extension of superblocks, single paths through the control-flow graph can also be represented.  In
our implementation, we use simple static heuristics to pick these
paths~\cite[ball93_branc_predic_free].
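
As a rough illustration of the transformation itself (a sketch, not Vericert's Coq implementation),
a naïve if-conversion over two straight-line branches can be written in OCaml as follows, where the
hypothetical \type{add_pred} conjoins the branch guard onto an instruction's optional predicate:

\starttyping
(* Sketch: instructions carry an optional guard; [body] is a stand-in
   for the actual operation. *)
type pred = Pvar of int | Pnot of pred | Pand of pred * pred
type instr = { guard : pred option; body : string }

let add_pred (g : pred) (i : instr) : instr =
  match i.guard with
  | None   -> { i with guard = Some g }
  | Some p -> { i with guard = Some (Pand (g, p)) }

(* Replace a conditional on predicate [p] by a single block that
   predicates the then-branch with p and the else-branch with !p. *)
let if_convert (p : int) (then_b : instr list) (else_b : instr list)
    : instr list =
  List.map (add_pred (Pvar p)) then_b
  @ List.map (add_pred (Pnot (Pvar p))) else_b
\stoptyping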

\subsection[proof-if-conversion]{Correctness of If-Conversion}

The correctness of the if-conversion pass does not rely on which paths are chosen to be combined.
The proof therefore does not have to cover the static heuristics, because these only affect
performance.  Even though the translation is quite simple, reasoning about its correctness has its
difficulties.  We face a similar problem as before, where we sometimes need to execute multiple
steps in the input program to match one step in the output program.  The solution is to perform a
proof in a similar style to the one for basic block generation.

The main difference is the matching condition, as the hyperblock is no longer reduced instruction
by instruction, but section by section, depending on whether a section was if-converted or not.
This means that at each point, one is either at the head of a section or in the middle of an
if-converted section.

\section[sec:scheduling-implementation]{Scheduling Implementation}

This section discusses the implementation of hyperblock scheduling in Vericert, which is still a
work in progress.  The scheduling step consists of many smaller transformations, covering the
initial basic block generation, hyperblock construction using if-conversion, the scheduling itself
and finally the hardware generation.

Vericert uses \SDC\ scheduling~\cite[cong06_sdc], a flexible algorithm for scheduling instructions,
especially when the target is hardware.  It allows for flexible definitions of optimisation
functions in terms of the latencies and dependencies of the operations in the basic block.  \HLS\
tools often use \SDC\ scheduling performed on hyperblocks to find an optimal schedule.  The main
benefit of \SDC\ scheduling is that it generates a set of constraints and an objective function to
be minimised, which can then be passed to a linear program solver.  By changing the objective
function, different attributes of the hardware can be optimised.  In addition, the \SDC\ scheduling
algorithm has been extended with further constraints so that loops can also be scheduled using
modulo scheduling~\cite[zhang13_sdc].

\subsection[sec:rtlblockdef]{\rtlblock\ and \rtlpar\ Intermediate Language Definition}

\lindex{\rtlblock}\lindex{\rtlpar}\in{Figure}[fig:compcert_interm] shows the intermediate languages
introduced to implement hyperblocks in CompCert.  These are the new \rtlblock\ and \rtlpar\
intermediate languages, which implement the sequential and parallel semantics of basic blocks
respectively.  The semantics of \rtlblock\ and \rtlpar\ are based on the CompCert \rtl\
intermediate language.  However, instead of mapping program counters to instructions, \rtlblock\
maps program counters to hyperblocks, and \rtlpar\ maps them to parallel hyperblocks, which are
described in the following sections.

\startplacefigure[location={here,none}]
  \startfloatcombination[nx=2]
    \startplacefigure[reference={eq:standard},title={Syntax for instructions within a hyperblock.
        There is a predicated exit instruction that serves as an early exit from the hyperblock,
        thereby making hyperblocks single-entry and multiple-exit.}]
      \startframedtext[frame=off,offset=none,width={0.45\textwidth}]
        \startformula \startmathalignment
          \NC i\ \ \eqdef \NC   \ \ \text{\tt RBnop}                                                  \NR
          \NC             \NC \|\ \ \text{\tt RBop}\ p?\ \mathit{op}\ \vec{r}\ d                      \NR
          \NC             \NC \|\ \ \text{\tt RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d  \NR
          \NC             \NC \|\ \ \text{\tt RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \NR
          \NC             \NC \|\ \ \text{\tt RBsetpred}\ p?\ c\ \vec{r}\ d                           \NR
          \NC             \NC \|\ \ \text{\tt RBexit}\ p?\ i_{\mathit{cf}}                            \NR
        \stopmathalignment \stopformula
      \stopframedtext

    \stopplacefigure
    \startplacefigure[reference={eq:cf-instr},title={Control-flow instructions ending a hyperblock.}]
      \startframedtext[frame=off,offset=none,width={0.45\textwidth}]
        \startformula \startmathalignment
          \NC i_{\mathit{cf}}\ \ \eqdef \NC   \ \ \text{\tt RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n     \NR
          \NC                           \NC \|\ \ \text{\tt RBtailcall}\ \mathit{sig}\ f\ \vec{r}       \NR
          \NC                           \NC \|\ \ \text{\tt RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \NR
          \NC                           \NC \|\ \ \text{\tt RBcond}\ c\ \vec{r}\ n_{1}\ n_{2}           \NR
          \NC                           \NC \|\ \ \text{\tt RBjumptable}\ r\ \vec{n}                    \NR
          \NC                           \NC \|\ \ \text{\tt RBreturn}\ r?                               \NR
          \NC                           \NC \|\ \ \text{\tt RBgoto}\ n                                  \NR
        \stopmathalignment \stopformula
      \stopframedtext
    \stopplacefigure
  \stopfloatcombination
\stopplacefigure

\rtlblock\ instructions are split into two kinds: standard instructions and control-flow
instructions.  Standard instructions are the instructions that can be placed inside hyperblocks,
whereas control-flow instructions are the instructions that end hyperblocks.  The standard
instructions are shown in \in{Figure}[eq:standard], and the control-flow instructions in
\in{Figure}[eq:cf-instr].  Most instructions are quite similar to their \rtl\ counterparts;
however, some instructions have been added.  \type{RBsetpred} is a standard instruction which sets
a predicate register to the value of an evaluated condition; it is used during if-conversion to
capture the condition of the conditional statement being converted.  To support predicated
execution, each instruction also carries an optional predicate that has been set beforehand, which
is what turns a block into a hyperblock.  In addition, there is a predicated exit instruction,
\type{RBexit}, which serves as an early exit from the hyperblock.
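
The syntax of \in{Figure}[eq:standard] and \in{Figure}[eq:cf-instr] can be transcribed into OCaml
variant types as follows.  This is a simplified sketch: the actual definitions are Coq inductive
types, the stand-in types are illustrative, and the optional predicate is really a full predicate
expression rather than a bare predicate register.

\starttyping
(* Illustrative stand-ins for CompCert's auxiliary types. *)
type reg = int and node = int and pred = int
type operation = string and chunk = string and addressing = string
type condition = string and signature = string

type instr =
  | RBnop
  | RBop      of pred option * operation * reg list * reg
  | RBload    of pred option * chunk * addressing * reg list * reg
  | RBstore   of pred option * chunk * addressing * reg list * reg
  | RBsetpred of pred option * condition * reg list * pred
  | RBexit    of pred option * cf_instr

and cf_instr =
  | RBcall      of signature * reg * reg list * reg * node
  | RBtailcall  of signature * reg * reg list
  | RBbuiltin   of string * reg list * reg * node
  | RBcond      of condition * reg list * node * node
  | RBjumptable of reg * node list
  | RBreturn    of reg option
  | RBgoto      of node
\stoptyping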

\placeformula[fig:hb_def]\startformula\startmathalignment[align={1:left}]
\NC \blockbb \qquad \eqdef \qquad (\text{\tt slist}\ i) \times i_{\mathit{cf}} \NR \NC \parbb \qquad
\eqdef \qquad (\text{\tt slist}\ (\text{\tt plist}\ (\text{\tt slist}\ i))) \times
i_{\mathit{cf}}\NR
\stopmathalignment\stopformula

These instructions are used in \rtlblock\ as well as in \rtlpar.  The main difference between the
two languages is how these instructions are arranged within the hyperblock and the execution
semantics of the hyperblock in each language.  Let $\blockbb$ be the definition of a hyperblock for
\rtlblock, and let $\parbb$ be the definition of a hyperblock for \rtlpar; then $\blockbb$ and
$\parbb$ are defined as in \in{Equation}[fig:hb_def].  The definition shows the different nesting
levels of the basic blocks in \rtlblock\ and \rtlpar, and where the parallel semantics of \rtlpar\
come into play.  An \rtlblock\ hyperblock is made of a list of instructions and a control-flow
instruction that ends the hyperblock; each instruction in the list is executed sequentially.  An
\rtlpar\ hyperblock is made of three nested lists.  The outer list behaves like the list in
\rtlblock\ and executes its elements sequentially.  Each of these elements is a list whose contents
execute in parallel, meaning every element uses the same state of the registers and memory while
executing.  Finally, instead of directly containing instructions, there is an additional layer of
sequentially executed lists of instructions, so that small groups of instructions can execute in
sequence, but still in parallel with the other groups in the current parallel block.  This last
construct enables operation chaining to be expressed without having to create new operations for
each kind of chaining that could occur.  Operation chaining is critical for getting the most
performance out of the hardware, as it fills the extra latency that may be available within one
clock cycle when instructions with larger latencies are also being scheduled.
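
Expressed as OCaml types (a sketch, parameterised over the instruction and control-flow instruction
types), the nesting of \in{Equation}[fig:hb_def] looks as follows:

\starttyping
(* RTLBlock: a flat, sequentially executed list of instructions. *)
type ('i, 'cf) block_bb = 'i list * 'cf

(* RTLPar: outer list = sequential steps (clock cycles); middle list =
   groups executed in parallel within a step; inner list = a chain of
   instructions executed sequentially within one parallel group. *)
type ('i, 'cf) par_bb = 'i list list list * 'cf
\stoptyping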

\subsection[translating-rtlblock-to-rtlpar]{Translating \rtlblock\ to \rtlpar}

\definecolor[s1col][x=7FC97F]
\definecolor[s2col][x=BEAED4]
\definecolor[s3col][x=FDC086]
\definecolor[s4col][x=FFFF99]
\definecolor[s5col][x=386CB0]

\define[2]\makedfgstate{\raisebox{-2pt}\hbox{\tikz{\node[circle,draw=black,minimum
      size=4mm,fill=#1,scale=0.5]{#2};}}}
\define\sI{\makedfgstate{s1col}{1}}
\define\sII{\makedfgstate{s2col}{2}}
\define\sIII{\makedfgstate{s3col}{3}}
\define\sIV{\makedfgstate{s4col}{4}}
\define\sV{\makedfgstate{s5col}{5}}

\startplacefigure[reference={fig:op_chain},title={Example of scheduling a hyperblock into multiple
    clock cycles.}]
  \startfloatcombination[ny=2,nx=2]
    \startplacesubfigure[title={RTLBlock hyperblock to be scheduled.}]
      \startframedtext[frame=off,offset=none,width={0.43\textwidth},style=\rmxx]
        \starthlC
          /BTEX \sI/ETEX                r2 = r1 + r4;
          /BTEX \sII/ETEX if(p1)         r1 = r2 + r4;
          /BTEX \sIII/ETEX if(!p2 && !p1) r3 = r1 * r1;
          /BTEX \sIV/ETEX if(p2)         r3 = r1 * r4;
          /BTEX \sV/ETEX if(!p2)        r3 = r3 * r3;
        \stophlC
      \stopframedtext
    \stopplacesubfigure
    \startplacesubfigure[title={Scheduled RTLPar hyperblock.}]
      \startframedtext[frame=off,offset=none,width={0.53\textwidth},style=\rmxx]
        \starthlC
          1: [r2 = r1 + r4; if(p1) r1 = r2 + r4];
          [if(!p2 && !p1) r3 = r1 * r1]
          2: [if(p2) r3 = r1 * r4]; [if(!p2) r3 = r3 * r3]
        \stophlC
      \stopframedtext
    \stopplacesubfigure
    \startplacesubfigure[title={Data dependencies between each operation of the RTLBlock
        hyperblock.}]
      \startframedtext[frame=off,offset=none,width={0.43\textwidth}]
        \hbox{\starttikzpicture[>=stealth,shorten >=1pt,auto,node distance=1.3cm]
          \tikzstyle{every state}=[draw=black,thick,minimum size=4mm]
          \node[state,fill=s1col] (s1) {1};
          \node[state,fill=s2col] (s2) [below left of=s1] {2};
          \node[state,fill=s3col] (s3) [right of=s1] {3};
          \node[state,fill=s4col] (s4) [below right of=s2] {4};
          \node[state,fill=s5col] (s5) [below of=s3] {5};

          \draw[->] (s1) -- (s2);
          \draw[->] (s2) -- (s4);
          %\draw[->] (s3) -- (s4);
          \draw[->] (s3) -- (s5);
        \stoptikzpicture}
      \stopframedtext
    \stopplacesubfigure
    \startplacesubfigure[title={Timing diagram showing how the operations in RTLPar will be executed.}]
      \externalfigure[figures/timing-3.pdf][width=0.5\textwidth]
    \stopplacesubfigure
  \stopfloatcombination
\stopplacefigure

The translation from \rtlblock\ to \rtlpar\ is performed by a scheduling algorithm, which takes
hyperblocks and rearranges their instructions to maximise parallelism based on the data
dependencies.  It then arranges the instructions in \rtlpar\ by putting independent instructions
into the same parallel block.  Scheduling is an optimisation that is well suited to translation
validation, as the algorithm itself can be quite complex, whereas the output is normally relatively
easy to check for equivalence with the input.  \in{Figure}[fig:compcert_interm] shows that the
scheduling step comes right after the if-conversion step that creates the hyperblocks.

\in{Figure}[fig:op_chain] shows how the scheduling step transforms \rtlblock\ into \rtlpar.  The
\rtlblock\ hyperblock being scheduled is shown in \in{Figure}{a}[fig:op_chain]; it contains five
predicated operations, comprising two additions and three multiplications.  The data dependencies
of the instructions in the hyperblock are shown in \in{Figure}{c}[fig:op_chain].  Interestingly,
even though two operations would normally depend on each other due to a data conflict, the
dependency can be removed when their predicates are mutually exclusive.

The scheduler transforms the \rtlblock\ hyperblock into the \rtlpar\ hyperblock shown in
\in{Figure}{b}[fig:op_chain].  Even though the addition in \sII\ depends on \sI, the two can still
be placed in the same cycle, as the additions do not take as long to complete as the
multiplications.  This optimisation is called operation chaining.  The multiplication in \sIII\ can
also be placed in the same clock cycle, as it has no data dependencies.  In the next clock cycle,
either the multiplication in \sIV\ or the multiplication in \sV\ takes place; as their predicates
are mutually exclusive, the two multiplications can be placed in the same clock cycle.  This gives
the final schedule shown in \in{Figure}{d}[fig:op_chain].

\subsection[abstr_interp]{Abstract Interpretation of Hyperblocks}

\index{symbolic execution}\seeindex{abstract interpretation}{symbolic execution}The correctness of
the scheduling algorithm is checked by comparing the symbolic expressions of each register before
and after scheduling. This section describes how these symbolic expressions are constructed for each
hyperblock.

\subsubsection[abstr_language]{Construction of the Abstract Language}

The main difficulty when constructing \index{symbolic expression}symbolic expressions for
hyperblocks is the presence of predicates, which the scheduling algorithm can use to reorder
instructions further, even if they contain data dependencies.  The scheduling algorithm will
manipulate predicates in the following ways, which the comparison function needs to take into
account.

\startitemize
\item
  Instructions with predicates that are equivalent to false can be removed.
\item
  Instructions with predicates that are mutually exclusive can be executed in parallel.
\item
  Any predicate can be replaced with an equivalent predicate.
\stopitemize

\define\sep{\ |\ }

\noindent The symbolic expressions have the following form:

\noindent Resources:

\startformula \rho\ \eqdef\ r \sep p \sep \text{\tt Mem} \stopformula

Expressions:

\startformula e\ \eqdef\ r^{0} \sep P \Rightarrow \text{\tt Op}(\mathit{op}, \vec{e}) \sep P \Rightarrow
  \text{\tt Load}(\mathit{chunk}, \mathit{mode}, \vec{e}, e_{m}) \stopformula

Memory expressions:

\startformula e_{m}\ \eqdef\ \text{\tt Mem}^{0} \sep P \Rightarrow \text{\tt Store}(\mathit{chunk},
  \mathit{mode}, \vec{e}, e_{m}, e) \stopformula

Predicate expressions:

\startformula e_{p}\ \eqdef\ p^{0} \sep P \Rightarrow \text{\tt SetPred}(\mathit{cond}, \vec{e}, e_{p})\stopformula

Code:

\startformula m\ \eqdef\ \rho \mapsto (\vec{e} \sep \vec{e_{m}} \sep \vec{e_{p}}) \stopformula

Predicates:

\startformula P\ \eqdef\ (\text{\tt Bool} * p) \sep P \land P \sep P \lor P \sep \top\ \sep \perp \stopformula

There are three main resources present in symbolic expressions: registers $r$, which represent the
values of the registers in the program; predicates $p$, which represent the values of the
predicates throughout the program; and finally memory \type{Mem}, representing the state of the
memory.  For each standard instruction in the hyperblock, there is an equivalent symbolic
expression which modifies the state of the registers, the predicates or the memory.  Finally, as
each instruction in the hyperblock can be predicated, each symbolic expression is also paired with
a predicate.
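
Transcribed into OCaml, this grammar corresponds roughly to the following types (a sketch with
illustrative stand-ins; \type{RegInit} $r$ denotes $r^{0}$, the value of the register at the start
of the block, and predicate expressions $e_{p}$ are analogous and omitted):

\starttyping
(* Sketch of the symbolic expression grammar. *)
type pred = Pvar of int | Pnot of pred | Pand of pred * pred
type operation = string and chunk = string and addressing = string

type resource = Reg of int | Pred of int | Mem

type expr =
  | RegInit of int                                  (* r^0 *)
  | Op   of pred * operation * expr list            (* P => Op(op, es) *)
  | Load of pred * chunk * addressing * expr list * mexpr

and mexpr =
  | MemInit                                         (* Mem^0 *)
  | Store of pred * chunk * addressing * expr list * mexpr * expr
\stoptyping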

Assigning the same resource multiple times with instructions that carry different predicates means
that the symbolic expression must also express this conditional result in terms of the predicates.
This is done by assigning a list of predicated symbolic expressions to each resource, where each
predicate must be mutually exclusive with the other predicates in the list.

The expressions are then constructed by a function which updates the symbolic expression assigned
to each resource.  This is done using multiple primitives that act on predicated types, which are
lists of pairs of a predicate and an element of the underlying type.  The first important primitive
is the multiplication of two predicated types, implemented as a Cartesian product of the two lists,
conjoining the predicates and joining the two elements of each pair using a given function.
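
A minimal OCaml sketch of this primitive, assuming the simple predicate type from before, could
look like the following:

\starttyping
(* A predicated type is a list of (predicate, value) pairs. *)
type pred = Pvar of int | Pnot of pred | Pand of pred * pred
type 'a predicated = (pred * 'a) list

(* Multiplication: Cartesian product of the two lists, conjoining the
   predicates and joining the values with [f]. *)
let mult (f : 'a -> 'b -> 'c) (xs : 'a predicated) (ys : 'b predicated)
    : 'c predicated =
  List.concat_map
    (fun (p, x) -> List.map (fun (q, y) -> (Pand (p, q), f x y)) ys)
    xs
\stoptyping

For example, multiplying a predicated operation with its predicated argument list yields every
combination of the two, each guarded by the conjunction of the corresponding predicates.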

\subsection[example-of-translation]{Example of Translation}

\placeformula[eq:4]\startformula \startmathalignment[n=1,align={1:left}]
\NC \text{\tt r1} =
\startcases
  \NC \text{\tt r1}^{0} + \text{\tt r4}^{0} + \text{\tt r4}^{0}, \quad \NC \text{\tt p1} \NR
  \NC \text{\tt r1}^{0}, \quad \NC \text{\tt !p1} \NR
\stopcases \NR
\NC \text{\tt r2} = \text{\tt r1}^{0} + \text{\tt r4}^{0}\NR
\NC\text{\tt r3} =
\startcases
  \NC \left( \text{\tt r1}^{0} \times \text{\tt r1}^{0} \right) \times \left( \text{\tt r1}^{0} \times
  \text{\tt r1}^{0} \right),\quad \NC \text{\tt !p2 \&\& !p1} \NR
  \NC \text{\tt r1}^{0} \times \text{\tt r4}^{0},\quad \NC \text{\tt p2 \&\& !p1} \NR
  \NC \left( \text{\tt r1}^{0} + \text{\tt r4}^{0} + \text{\tt r4}^{0} \right) \times \text{\tt r4}^{0},\quad
  \NC \text{\tt p2 \&\& p1}\NR
  \NC \text{\tt r3}^{0} \times \text{\tt r3}^{0},\quad \NC\text{\tt !p2 \&\& p1}\NR
\stopcases\NR
\stopmathalignment \stopformula

Using the example shown in \in{Figure}[fig:op_chain], the \rtlblock\ hyperblock shown in
\in{Figure}{a}[fig:op_chain] is scheduled into the \rtlpar\ hyperblock shown in
\in{Figure}{b}[fig:op_chain].  The first step is to translate the input and the output of the
scheduler into the \abstr\ intermediate language by calculating the symbolic values that will be
stored in each of the registers.  Symbolically evaluating the \rtlblock\ hyperblock results in the
expressions shown in \in{Equation}[eq:4] for registers \type{r1}, \type{r2} and \type{r3}, where
the state of register \type{r1} at the start of the hyperblock is denoted $\text{\tt r1}^{0}$.

These symbolic expressions are generated by sequentially going through the instructions of the
hyperblock and applying the update function defined in \in{Section}[abstr_language].  The resulting
predicated expressions are guaranteed to have mutually exclusive predicates, meaning that if one
predicate evaluates to true, all other predicates must evaluate to false.

\subsubsection[linear-flat-predicated-expressions]{Linear (Flat) Predicated Expressions}

\index{symbolic expression}The update function for predicated expressions is substantially more
complex than the standard update functions for symbolic execution, such as the one used by
\cite[entry][tristan08_formal_verif_trans_valid].  The main reason for this is the non-recursive
nature of the predicated expressions.  As predicates are only available at the top level of an
expression, combining multiple expressions means that the conditions, and the multiple
possibilities they introduce, also need to trickle up to the top.

If the predicated expressions were recursive instead, one would not have to change any predicates
when evaluating the argument lists of instructions, for example, as these could simply remain
predicated.  The evaluation semantics would also be much simpler, because a rule could be added for
evaluating predicates.

\subsection[sec:correctness]{Correctness of the Abstract Interpretation}

The correctness of the scheduling step depends on the correctness of the equivalence check between
the two abstract representations.  As the abstract language contains predicates, this check cannot
simply be a syntactic comparison, because the predicates might be syntactically different but
logically equivalent.  Instead, a mixed equivalence check is performed, using a verified SAT solver
together with a syntactic equivalence check on the actual symbolic expressions.

\subsubsection[comparing-symbolic-abstract-expressions]{Comparing Symbolic Abstract Expressions}

Abstract expressions comprise a predicate and a symbolic expression.  The translation from
\rtlblock\ and \rtlpar\ to \abstr\ ensures that each register contains a predicated symbolic
expression that describes the value it will hold after evaluating the basic block.  This means that
if an expression always produces the same result as another expression, the corresponding register
will always contain the same value.  If the \rtlblock\ and \rtlpar\ programs translate to two
\abstr\ representations in which all the registers' symbolic expressions match, then both programs
will always produce the same values.

The main comparison between two symbolic expressions can be done using a simple syntactic
comparison, because this ensures that the expressions always evaluate to the same value, and it is
general enough for the expressions that need to be compared.  However, as all expressions may be
predicated, this simple syntactic comparison cannot be performed on the predicates.  This is
because predicated instructions can be executed in many different orders, and the predicates are
used to calculate dependence information and therefore also determine the order of operations.

Instead, predicates are compared by checking their logical equivalence using a formally verified
\SAT\ solver.  This ensures that the two predicates will always evaluate to the same value, making
the comparison sound.

\subsubsection[verified_sat]{Verified Decidable DPLL SAT Solver}

A SAT solver is needed to correctly and reliably compare predicated expressions.  This makes it
possible to compare many different possible schedules, even ones that remove expressions entirely
when they are unnecessary.  The SAT solver is used by the verifier, which needs to be proven
correct in Coq to prove the scheduling translation correct.  The SAT solver itself therefore has to
be verified, as it could otherwise not be used in the proof of correctness.

The SAT solver implements the DPLL algorithm and therefore takes its input as a \CNF\ formula.  The
conversion from the recursive predicate type into a \CNF\ formula is done by recursively expanding
the predicate into equivalent \CNF\ formulas.
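
For illustration, the shape of a textbook DPLL loop over \CNF\ can be sketched in OCaml as follows;
this is not the verified Coq solver, merely the underlying algorithm, with clauses represented as
lists of signed literals.

\starttyping
type lit = int * bool          (* variable and polarity *)
type clause = lit list
type cnf = clause list

(* Assign variable [v] the value [b]: clauses containing the literal
   (v, b) become true and are dropped; the opposite literal becomes
   false and is removed from the remaining clauses. *)
let assign ((v, b) : lit) (f : cnf) : cnf =
  List.filter_map
    (fun c ->
      if List.mem (v, b) c then None
      else Some (List.filter (fun (v', _) -> v' <> v) c))
    f

let rec dpll (f : cnf) : bool =
  if f = [] then true                      (* all clauses satisfied *)
  else if List.mem [] f then false         (* empty clause: conflict *)
  else
    match List.find_opt (fun c -> List.length c = 1) f with
    | Some [l] -> dpll (assign l f)        (* unit propagation *)
    | _ ->
      let (v, _) = List.hd (List.hd f) in  (* branch on a variable *)
      dpll (assign (v, true) f) || dpll (assign (v, false) f)
\stoptyping

Logical equivalence of two predicates $p$ and $q$ can then be decided by checking that
$(p \land \lnot q) \lor (\lnot p \land q)$ is unsatisfiable.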

The SAT solver is implemented in Coq and proven to be sound and complete.  The soundness and
completeness of the satisfiability check directly imply that the check is decidable.  This
simplifies many proofs in Coq, as a \type{Setoid} instance can be defined for equivalence between
predicates, which makes rewriting equivalent predicates in proofs much easier by reusing many of
the existing Coq tactics.

\subsubsection[encoding-predicated-expressions-as-sat-formulas]{Encoding Predicated Expressions as SAT
  Formulas}

Given the following two predicated expressions that should be tested for equivalence, one can
construct a SAT query which ensures that both expressions always result in the same value.

\startformula \bgroup \left.\startmatrix
    \NC e_{1},\quad \NC p_{1} \NR
    \NC e_{2},\quad \NC p_{2}\NR
    \NC \quad \vdots \NC \NR
    \NC e_{n},\quad \NC p_{n}\NR
  \stopmatrix\right\rbrace\egroup
  \Longleftrightarrow
  \startmathcases
    \NC e_{1}',\NC p_{1}' \NR
    \NC e_{2}',\NC p_{2}' \NR
    \NC \quad \vdots \NC \NR
    \NC e_{m}',\NC p_{m}'\NR
  \stopmathcases \stopformula

The SAT query that determines whether two predicated expressions are equivalent is the following,
where $\mathit{he}_{i}$ is the hashed expression of $e_{i}$:

\startformula \startmathalignment[align={1:middle}]
  \NC (p_{1} \implies \mathit{he}_{1}) \land (p_{2} \implies \mathit{he}_{2}) \land \cdots \land (p_{n}
  \implies \mathit{he}_{n}) \NR
  \NC \Longleftrightarrow \NR
  \NC (p_{1}' \implies \mathit{he}_{1}') \land (p_{2}' \implies \mathit{he}_{2}') \land \cdots \land
  (p_{m}' \implies \mathit{he}_{m}') \NR
\stopmathalignment \stopformula

However, reasoning about the equivalence of individual expressions within the larger list of
predicated expressions is not quite straightforward.  The evaluation semantics of a predicated
expression list is the sequential evaluation of each predicated expression, picking the expression
whose predicate evaluates to true.  Proving that two predicated expression lists return the same
value, assuming that the SAT query above holds, means that for each predicate that is true, one
needs to show that there is a predicate in the other list that is also true and returns the same
value.  However, there is no guarantee that there is a unique predicated expression matching the
current predicate, as there may be duplicate expressions in the list with different predicates.

This means that it is quite difficult to prove that the SAT query implies semantic preservation of
the scheduling transformation.  The SAT query is therefore broken down into a slightly more direct
comparison that is more useful for the proof itself.  The steps of this more fine-grained
comparison are the following:

\startitemize[n]
\item
  Hash each expression into a unique literal.
\item
  Create a map from expression literals to their corresponding predicates, thereby obtaining a
  unique predicate for each expression.  If the list that forms the map contains duplicate
  expressions, their predicates are combined using the or operation.
\item
  Iterate through the expressions in both maps and compare the predicates using the SAT solver.  If
  an expression is not present in one of the maps, then its predicate in the other map should be
  equivalent to $\perp$.
\stopitemize

This comparison is logically equivalent to performing the large SAT query above; however, it maps
better to the proof because it compares the predicates of identical expressions with each other.
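
A sketch of this procedure in OCaml, with integers standing in for hashed expressions and a
placeholder \type{equiv} oracle standing in for the verified SAT-based equivalence check, could
look as follows:

\starttyping
module IntMap = Map.Make (Int)

type pred = Pvar of int | Pnot of pred | Pand of pred * pred
          | Por of pred * pred | Ptrue | Pfalse

(* Placeholder for the SAT-based logical equivalence check. *)
let equiv (_ : pred) (_ : pred) : bool = true

(* Step 2: map each hashed expression to a predicate, or-ing together
   the predicates of duplicate expressions. *)
let to_map (l : (pred * int) list) : pred IntMap.t =
  List.fold_left
    (fun m (p, he) ->
       IntMap.update he
         (function None -> Some p | Some q -> Some (Por (p, q))) m)
    IntMap.empty l

(* Step 3: compare predicates pointwise; an expression missing from
   one side must have a predicate equivalent to false. *)
let compare_maps (m1 : pred IntMap.t) (m2 : pred IntMap.t) : bool =
  IntMap.for_all
    (fun he p ->
       equiv p (Option.value (IntMap.find_opt he m2) ~default:Pfalse))
    m1
  && IntMap.for_all (fun he q -> IntMap.mem he m1 || equiv q Pfalse) m2
\stoptyping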

\subsection[proof-of-correctness-of-scheduling]{Proof of Correctness of Scheduling}

The scheduling pass is formally verified using translation validation, with the \abstr\ language as
the point of comparison between \rtlblock\ and \rtlpar.  Using the algorithm defined in
\in{Section}[abstr_interp], the formally verified SAT solver described in \in{Section}[verified_sat]
can be used to prove that if two predicated expression lists are equivalent, they will always
evaluate to the same value.  This follows directly from the proofs of correctness of the
translations into \abstr.  The proof that the two abstract representations behave identically
whenever the SAT-based comparison function succeeds can then be composed with the proofs of
correctness of the generation of \abstr.

\section[scheduling:conclusion]{Conclusion}

In general, this proof of correctness for scheduling was the most challenging, because as the
scheduling algorithm evolved, the language semantics of \rtlblock\ and \rtlpar\ evolved with it,
which in turn required large changes in the proofs of correctness for each step of the scheduling
proof.  In addition, the whole scheduling flow, from \rtlblock\ generation to hardware generation,
required quite different approaches to the correctness proof at each step.  Even simple
translations like the basic block generation and the if-conversion pass required several insights
into how the algorithms actually work and how the semantics behave, in order to come up with
matching relations that are powerful enough to prove the final correctness result, but are also
provable in the first place.  Simply formulating these relations requires a lot of experimentation.

\startmode[chapter]
  \section{Bibliography}
  \placelistofpublications
\stopmode

\stopcomponent