\environment fonts_env
\environment lsr_env

\startcomponent scheduling

\chapter[sec:scheduling]{Static Scheduling}

\section[introduction]{Introduction}

\oindex{static scheduling}The use of multicore processors has been increasing drastically, making parallelising compilers ever more important. In addition, the need for custom hardware accelerators is also increasing, as traditional processors are not always the best choice for all applications. Compilers with optimisations that can automatically parallelise their input programs are therefore becoming more important as well, as all of these back ends benefit from them. For processors, one wants to optimise pipeline usage as much as possible, whereas for hardware one can create custom pipelines and exploit the spatial properties that hardware offers, making instruction parallelism as well as speculative execution important. However, more powerful optimisations also increase the chance of introducing mistakes into the compiler, making it possible to generate programs that behave differently to the input program. An important step towards making compilers more reliable is to formally verify these optimisations, so that it can be guaranteed that no bugs are introduced.

A popular optimisation that increases the instruction parallelism of the resulting program is scheduling. This optimisation is especially important for processors that have deep instruction pipelines or multiple functional units that can execute instructions in parallel. It is also critical for high-level synthesis tools, which generate application-specific hardware designs from a software input.

Simple scheduling algorithms often work on basic blocks, which are sequential lists of instructions without any branches. However, this limits the kind of reordering the scheduling algorithm can perform: it can never move instructions past a branching instruction, even when this might drastically improve performance. Trace scheduling~ is an alternative that addresses this issue by creating paths that may cross branching instructions, and scheduling the instructions within each path. However, the main problem with trace scheduling is that it is often infeasible on large programs, and because of the large design space, it is sometimes hard to find an optimal schedule. \index{superblock}Superblock~ and \index{hyperblock}hyperblock~ scheduling are two restricted forms of trace scheduling, which compromise on its flexibility to instead obtain more tractable algorithms. Superblock scheduling produces more efficient code for processors with a low issue rate, whereas hyperblock scheduling produces more performant code for processors with a high issue rate~.

CompCert~ is a compiler that is formally verified in Coq. In addition, we have added a formally verified hardware back end to CompCert, allowing it to also generate provably correct hardware designs. A hardware back end is an ideal target for hyperblock scheduling, as the issue rate is practically infinite. This paper describes the implementation of a hyperblock instruction scheduler in CompCert, so that it can benefit existing CompCert back ends as well as more general targets such as scheduling instructions for custom hardware.
\subsection[key-points]{Key Points}

The goal of this paper is to introduce hyperblock scheduling into CompCert, allowing for advanced and flexible scheduling operations that could benefit various different back ends. To this end, the key points of this paper are the following:

\startitemize
\item A description of how hyperblocks are introduced for improved scheduling of instructions, how they can be built, and how they can be translated back to instructions without predicates.
\item A description of the implementation and correctness proof of an \SDC\ scheduling algorithm, which can be adapted to various different scheduling strategies.
\stopitemize

\section[sec:scheduling]{Implementation of Hyperblocks in CompCert}

\index{hyperblock}This section describes the structure of hyperblocks in Section~\goto{2.1}[scheduling:hyperblocks]. Then, the structure of the two extra intermediate languages that were added to CompCert to implement hyperblocks is shown in Section~\goto{2.2}[sec:rtlblockdef].

\subsection[scheduling:hyperblocks]{Hyperblocks as a Generalisation of Basic Blocks}

Basic blocks are popular in intermediate languages because they describe a list of sequential instructions that are not separated by control flow. This information can be used by various analysis passes to perform optimisations, as these sequences of instructions can be reordered in any way as long as their data dependencies are met. One such optimisation is scheduling, which reorders instructions to eliminate any delay slots caused by the CPU pipeline.

However, because control flow is not allowed in basic blocks, the number of instructions that can be scheduled at any one time is limited. The boundaries to other basic blocks act as a hard boundary for the scheduling algorithm. Hyperblocks~ allow instructions to be predicated, thereby introducing some control-flow information into the blocks, which can also be taken into account by the scheduler. For example, any data dependency between instructions that have mutually exclusive predicates can be removed, as these instructions will never be active at the same time.

\subsection[sec:rtlblockdef]{RTLBlock and RTLPar Intermediate Language Definition}

\lindex{RTLBlock}\lindex{RTLPar}Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the intermediate languages introduced to implement hyperblocks in CompCert. These are the new RTLBlock and RTLPar intermediate languages, which implement the sequential and parallel semantics of basic blocks respectively. The semantics of RTLBlock and RTLPar are based on the CompCert RTL intermediate language. However, instead of mapping from states to instructions, RTLBlock maps from states to hyperblocks, and RTLPar maps from states to parallel hyperblocks, which are described in the next sections.
\placeformula
\startformula
\startmathalignment
\NC i\ \ \colon\colon= \NC \ \ \mono{RBnop} \NR
\NC \NC |\ \ \mono{RBop}\ p?\ \mathit{op}\ \vec{r}\ d \NR
\NC \NC |\ \ \mono{RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \NR
\NC \NC |\ \ \mono{RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \NR
\NC \NC |\ \ \mono{RBsetpred}\ p?\ c\ \vec{r}\ d \NR
\stopmathalignment
\stopformula

\placeformula
\startformula
\startmathalignment
\NC i_{\mathit{cf}}\ \ \colon\colon= \NC \ \ \mono{RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \NR
\NC \NC |\ \ \mono{RBtailcall}\ \mathit{sig}\ f\ \vec{r} \NR
\NC \NC |\ \ \mono{RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \NR
\NC \NC |\ \ \mono{RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \NR
\NC \NC |\ \ \mono{RBjumptable}\ r\ \vec{n} \NR
\NC \NC |\ \ \mono{RBreturn}\ r? \NR
\NC \NC |\ \ \mono{RBgoto}\ n \NR
\NC \NC |\ \ \mono{RBpred\_cf}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \NR
\stopmathalignment
\stopformula

RTLBlock instructions are split into two types: standard instructions and control-flow instructions. Standard instructions are the instructions that can be placed into hyperblocks, whereas control-flow instructions are the instructions that can end hyperblocks. The standard instructions are shown in Figure~\goto{{[}fig:standard_instruction{]}}[fig:standard_instruction], whereas the control-flow instructions are shown in Figure~\goto{{[}fig:control_flow_instruction{]}}[fig:control_flow_instruction]. Most instructions are quite similar to their RTL counterparts; however, some instructions have been added. \type{RBsetpred} is a standard instruction which sets a predicate to the result of evaluating a condition. This instruction is used during if-conversion to set the predicate to the value of the condition in the conditional statement being converted. To support properly predicated instructions, each standard instruction also carries an optional predicate that has been set beforehand; this is what turns a block into a hyperblock. In addition, there is an extra control-flow instruction called \type{RBpred_cf}, which branches on a predicate and nests two further control-flow instructions inside of it. This is also necessary for if-conversion, when converting an already converted conditional statement, as, depending on the predicate, a different control-flow instruction may be necessary.

These instructions are used in RTLBlock as well as in RTLPar. The main difference between the two languages is how the instructions are arranged within the hyperblock, and the execution semantics of the hyperblock in each language. Let $\mathit{hb}_{b}$ be the definition of a hyperblock for RTLBlock, and let $\mathit{hb}_{p}$ be the definition of a hyperblock for RTLPar; then $\mathit{hb}_{b}$ and $\mathit{hb}_{p}$ can be defined as in Figure~\goto{{[}fig:hb_def{]}}[fig:hb_def]. The figure shows the different nesting levels of the basic blocks in RTLBlock and RTLPar, and where the parallel semantics of RTLPar come into play. A hyperblock in RTLBlock is made of a list of instructions and a control-flow instruction that ends the hyperblock; each instruction is executed sequentially. A hyperblock in RTLPar is instead made of three nested lists. The outer list behaves like the list in RTLBlock, and executes each of its elements sequentially. Each of these elements is in turn a list that executes its contents in parallel, meaning each instruction uses the same state of the registers and memory while executing. Finally, instead of directly containing instructions, there is an additional layer of sequentially executed lists of instructions, making it possible to execute small groups of instructions in sequence, but still in parallel with the other instructions in the current parallel group. A sketch of these block structures is shown below.
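The following Coq sketch illustrates this nesting. The type and field names are hypothetical simplifications rather than the development's actual definitions.

\starttyping
(* Sketch of the two hyperblock structures. [instr] and [cf_instr]
   stand for the standard and control-flow instructions; all names
   here are hypothetical. *)
Parameter instr : Type.     (* RBnop, RBop, RBload, ... *)
Parameter cf_instr : Type.  (* RBcall, RBcond, RBgoto, ... *)

(* RTLBlock: a sequential list of instructions, ended by exactly one
   control-flow instruction. *)
Record hb_b := {
  b_body : list instr;
  b_exit : cf_instr
}.

(* RTLPar: the outer list executes sequentially, the middle list in
   parallel, and the innermost list again sequentially, which is
   what enables operation chaining. *)
Record hb_p := {
  p_body : list (list (list instr));
  p_exit : cf_instr
}.
\stoptyping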
RTLPar is structured in this way to give the scheduler more flexibility: not only can it parallelise operations in the basic blocks, but it can also define sequential operations within these parallel operations. This is especially useful for the hardware back end, as it means that sequential operations that do not quite fill one clock cycle can be executed sequentially, while still being executed in parallel with other independent operations. This optimisation is called operation chaining, and it is critical for getting the most performance out of the hardware.

\subsection[translating-rtlblock-to-rtlpar]{Translating RTLBlock to RTLPar}

The translation from RTLBlock to RTLPar is performed using a scheduling algorithm, which takes hyperblocks and rearranges their instructions to maximise parallelism based on their data dependencies. It then arranges the instructions in RTLPar by putting independent instructions into the same parallel block. Scheduling is an optimisation that is well suited to translation validation, as the algorithm itself can be quite complex, whereas the output is normally relatively easy to check for equivalence with the input. Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows that the scheduling step comes right after the if-conversion step, which originally creates the hyperblocks.

Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain] shows how the scheduling step transforms RTLBlock into RTLPar. The RTLBlock hyperblock being scheduled is shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a], which contains five predicated operations, comprising two additions and three multiplications. The data dependencies of the instructions in the hyperblock are shown in Figure~\goto{{[}fig:op_chain_b{]}}[fig:op_chain_b]. Interestingly, even though two of the operations would normally depend on each other due to a write-after-write conflict, the conflict can effectively be removed by the scheduler because the predicates of the two operations are independent.

The scheduler transforms the RTLBlock hyperblock into the RTLPar hyperblock shown in Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. Even though one of the additions depends on the other, they can still be placed into the same cycle, as the additions do not take as long to complete as the multiplications. This optimisation is called operation chaining. One of the multiplications can also be placed in the same clock cycle, as it does not have any data dependencies. In the next clock cycle, only one of the two remaining multiplications will actually take place, depending on the predicates, meaning these two multiplications can also be placed in the same clock cycle. This gives the final schedule shown in Figure~\goto{{[}fig:op_chain_d{]}}[fig:op_chain_d]. A sketch of the predicate-aware conflict check that enables this reordering is shown below.
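To illustrate how predication removes conflicts, the following Coq sketch shows a predicate-aware write-after-write check. The satisfiability check \type{sat} and all other names are assumptions made for the purpose of this sketch, not the development's actual interface.

\starttyping
(* Sketch: two predicated writes to the same register only conflict
   if both predicates can be true at the same time. *)
Parameter pred : Type.
Parameter Pand : pred -> pred -> pred.
Parameter sat : pred -> bool.  (* assumed satisfiability check *)
Definition reg := nat.

Definition waw_conflict (d1 d2 : reg) (p1 p2 : option pred) : bool :=
  Nat.eqb d1 d2 &&
  match p1, p2 with
  | Some q1, Some q2 => sat (Pand q1 q2) (* can both fire at once? *)
  | _, _ => true       (* an unpredicated write always conflicts *)
  end.
\stoptyping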
\section[abstr_interp]{Abstract Interpretation of Hyperblocks}

\index{abstract interpretation}The correctness of the scheduling algorithm is checked by comparing the symbolic expressions of each register before and after scheduling. This section describes how these symbolic expressions are constructed for each hyperblock.

\subsection[abstr_language]{Construction of Abstract Language}

The main difficulty when constructing \index{symbolic expression}symbolic expressions for hyperblocks is the presence of predicates, which can be used by the scheduling algorithm to reorder instructions further, even if these contain data dependencies. The scheduling algorithm will manipulate predicates in the following ways, which the comparison function needs to take into account.

\startitemize
\item Instructions with predicates that are equivalent to false can be removed.
\item Instructions with predicates that are mutually exclusive can be executed in parallel.
\item Any predicate can be replaced with an equivalent predicate.
\stopitemize

There are three main resources that are present in symbolic expressions: registers $r$, which represent the value of the registers in the program; predicates $p$, which represent the value of the predicates throughout the program; and finally memory \type{Mem}, representing the state of the memory. For each standard instruction in the hyperblock, there is an equivalent symbolic expression which modifies the state of the registers, the predicates or the memory. Finally, as each instruction in the hyperblock can be predicated, each symbolic expression is also paired with a predicate.

Assigning the same resource multiple times with instructions that have different predicates means that the symbolic expression must also express this conditional result based on the predicates. This is expressed by assigning a list of predicated symbolic expressions to each resource, where each predicate needs to be mutually exclusive from the other predicates in the list.

The expressions are then constructed using a function which updates the symbolic expression assigned to each resource. This is done using multiple primitives which act on predicated types, which are made up of a list of pairs of predicates and elements of that type. The first important primitive is the multiplication of two predicated types, which is implemented as a Cartesian product between the two lists, conjoining the two predicates and joining the two elements of each pair using a function $f$:

\placeformula[eq:1]
\startformula
P_{1} \otimes_{f} P_{2} \equiv \mono{map}\ (\lambda ((p_{1}, e_{1}), (p_{2}, e_{2}))\ .\ (p_{1} \land p_{2}, f\ e_{1}\ e_{2}))\ (P_{1} \times P_{2})
\stopformula

The other primitive that is needed is an append operation $\oplus_{p}$, which, given a predicate $p$, conjoins $\neg p$ with every predicate in the first predicated type and $p$ with every predicate in the second, before appending the two lists. It is defined using the helper $\mu$, which conjoins a predicate with every predicate in a predicated type:

\placeformula[eq:2]
\startformula
\mu(p, P) \equiv \mono{map}\ (\lambda (p', e')\ .\ (p \land p', e'))\ P
\stopformula

\placeformula[eq:3]
\startformula
P_{1} \oplus_{p} P_{2} \equiv \mu(\neg p, P_{1}) \mathbin{++} \mu(p, P_{2})
\stopformula

A functional sketch of these primitives is given below.
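The following Coq sketch gives one possible reading of these primitives over a simplified predicated type. The names and the abstract predicate interface are assumptions; the development's actual definitions may differ.

\starttyping
Require Import Coq.Lists.List.
Import ListNotations.

Parameter pred : Type.
Parameter Pand : pred -> pred -> pred.
Parameter Pnot : pred -> pred.

(* A predicated type: a list of predicate/element pairs. *)
Definition pred_t (A : Type) := list (pred * A).

(* Multiplication: Cartesian product, conjoining the predicates and
   combining the elements with f. *)
Definition pmul {A B C} (f : A -> B -> C)
    (P1 : pred_t A) (P2 : pred_t B) : pred_t C :=
  flat_map (fun '(p1, e1) =>
    map (fun '(p2, e2) => (Pand p1 p2, f e1 e2)) P2) P1.

(* mu: conjoin a predicate with every predicate in the list. *)
Definition pmu {A} (p : pred) (P : pred_t A) : pred_t A :=
  map (fun '(p', e') => (Pand p p', e')) P.

(* Append: guard P1 with the negated predicate and P2 with the
   predicate itself, then concatenate. *)
Definition papp {A} (p : pred) (P1 P2 : pred_t A) : pred_t A :=
  pmu (Pnot p) P1 ++ pmu p P2.
\stoptyping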
\subsection[example-of-translation]{Example of Translation}

\placeformula[eq:4]
\startformula
\startmathalignment[n=1]
\NC \mono{r1} = \startmathcases
  \NC \mono{r1}^{0} + \mono{r4}^{0} + \mono{r4}^{0}, \NC \mono{p1} \NR
  \NC \mono{r1}^{0}, \NC \neg\mono{p1} \NR
\stopmathcases \NR
\NC \mono{r2} = \mono{r1}^{0} + \mono{r4}^{0} \NR
\NC \mono{r3} = \startmathcases
  \NC \left( \mono{r1}^{0} \times \mono{r1}^{0} \right) \times \left( \mono{r1}^{0} \times \mono{r1}^{0} \right), \NC \neg\mono{p2} \land \neg\mono{p1} \NR
  \NC \mono{r1}^{0} \times \mono{r4}^{0}, \NC \mono{p2} \land \neg\mono{p1} \NR
  \NC \left( \mono{r1}^{0} + \mono{r4}^{0} + \mono{r4}^{0} \right) \times \mono{r4}^{0}, \NC \neg\mono{p2} \land \mono{p1} \NR
  \NC \mono{r3}^{0} \times \mono{r3}^{0}, \NC \mono{p2} \land \mono{p1} \NR
\stopmathcases \NR
\stopmathalignment
\stopformula

Using the example shown in Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain], the RTLBlock hyperblock shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a] is scheduled into the RTLPar hyperblock shown in Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. The first step is to translate the input and output of the scheduler into the Abstr intermediate language by calculating the symbolic values that will be stored in each of the registers. Symbolically evaluating the RTLBlock hyperblock results in the expressions shown in Figure~\goto{{[}fig:symbolic_expressions{]}}[fig:symbolic_expressions] for registers \type{r1}, \type{r2} and \type{r3}, where the state of the register \type{r1} at the start of the hyperblock is denoted as $\mono{r1}^{0}$. These symbolic expressions are generated by sequentially going through the list of predicated expressions and applying the update function defined in Section~\goto{3.1}[abstr_language]. The resulting predicated expressions are guaranteed to have mutually exclusive predicates, meaning that if one predicate evaluates to true, all other predicates must evaluate to false.

\subsection[linear-flat-predicated-expressions]{Linear (Flat) Predicated Expressions}

\index{symbolic expression}The update function for predicated expressions is quite a bit more complex than standard update functions for symbolic execution, such as the one used by~. The main reason for this is the non-recursive nature of the predicated expressions. As the predicates are only available at the top level of the expressions, combining multiple expressions means that the conditions, and the multiple possibilities they introduce, also need to trickle up to the top. If the predicated instructions were instead recursive, one would not have to change any predicates when evaluating the argument lists of instructions, for example, as these could just stay predicated. The evaluation semantics would also be much simpler, because a rule could be added for evaluating predicates directly. The difference between the two representations is sketched below.
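The following Coq sketch contrasts the two representations; both types are hypothetical simplifications.

\starttyping
Parameter pred : Type.
Parameter expression : Type.

(* Flat representation: predicates appear only at the top level, so
   combining expressions forces all guards up into the outer list. *)
Definition pred_expr := list (pred * expression).

(* Recursive alternative: a predicate can guard subexpressions in
   place, so combining expressions leaves existing guards
   untouched. *)
Inductive rpred_expr : Type :=
| Rbase : expression -> rpred_expr
| Rcond : pred -> rpred_expr -> rpred_expr -> rpred_expr.
\stoptyping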
\section[sec:correctness]{Correctness of the Abstract Interpretation}

The correctness of the scheduling step depends on the correctness of the equivalence check between the two abstract programs. As the abstract language contains predicates, this check cannot just be a syntactic comparison, because the predicates might be different but still logically equivalent. Instead, a mixed equivalence check is performed, using a verified SAT solver together with a syntactic equivalence check of the actual symbolic expressions.

\subsection[comparing-symbolic-abstract-expressions]{Comparing Symbolic Abstract Expressions}

Abstract expressions comprise a predicate and a symbolic expression. The translation from RTLBlock and RTLPar to Abstr ensures that each register contains a predicated symbolic expression that describes the value that will be assigned to it after evaluating the basic block. This means that if two expressions always produce the same result, the corresponding register will always contain the same value. If the RTLBlock program and the RTLPar program translate to abstract programs in which the symbolic expressions of all registers match, then both programs will always produce the same values.

The main comparison between two symbolic expressions can be done using a simple syntactic comparison, because this ensures that each of the expressions always evaluates to the same value, and it is also general enough for the expressions that need to be compared. However, as all expressions might be predicated, this simple syntactic comparison cannot be used for the predicates. This is because predicated instructions can be executed in many different orders, and the predicates are used to calculate dependence information and therefore also determine the order of operations. Instead, predicates are compared by checking their logical equivalence using a formally verified \SAT\ solver. This ensures that the predicates will always evaluate to the same value, thereby making the comparison sound.

\subsection[verified_sat]{Verified Decidable DPLL SAT Solver}

A SAT solver is needed to correctly and reliably compare predicated expressions. This makes it possible to compare many different possible schedules, even ones that remove unnecessary expressions entirely. The SAT solver is used by the verifier, which needs to be proven correct in Coq for the scheduling translation to be proven correct; the SAT solver itself therefore also has to be verified, as it could otherwise not be used in the correctness proof.

The SAT solver is implemented using the DPLL algorithm, which means that it takes a \CNF\ formula as input. The conversion from a recursive predicate type into a \CNF\ formula is done by recursively expanding the predicate into equivalent \CNF\ formulas. The SAT solver is implemented in Coq and proven to be sound and complete. The soundness and completeness of an algorithm that checks whether a predicate is satisfiable also directly implies that the check is decidable, which simplifies many proofs in Coq, as a \type{Setoid} instance can be declared for equivalence between predicates. This makes rewriting with equivalent predicates much easier in proofs, reusing many of the existing Coq tactics. A sketch of the predicate type and the equivalence the solver decides is shown below.
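The following Coq sketch shows a simplified predicate type together with its evaluation function and the equivalence relation that the solver decides. The constructor names are hypothetical, and the development differs in the details.

\starttyping
(* A recursive predicate type over boolean-valued predicate
   registers, identified here by [nat]. *)
Inductive pred : Type :=
| Plit : nat -> pred
| Pnot : pred -> pred
| Pand : pred -> pred -> pred
| Por  : pred -> pred -> pred.

Fixpoint eval (a : nat -> bool) (p : pred) : bool :=
  match p with
  | Plit n => a n
  | Pnot q => negb (eval a q)
  | Pand q r => eval a q && eval a r
  | Por q r => eval a q || eval a r
  end.

(* Equivalence: agreement under every assignment. A sound and
   complete satisfiability check decides this, for example by
   checking that the exclusive or of the two predicates is
   unsatisfiable. *)
Definition equiv (p q : pred) : Prop :=
  forall a, eval a p = eval a q.
\stoptyping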
\subsection[encoding-predicated-expressions-as-sat-formulas]{Encoding Predicated Expressions as SAT Formulas}

Given the following two predicated expression lists, which should be tested for equivalence, one can construct a SAT query which ensures that both lists will always result in the same value:

\placeformula
\startformula
\startmathcases
\NC e_{1}, \NC p_{1} \NR
\NC e_{2}, \NC p_{2} \NR
\NC \ \ \vdots \NC \NR
\NC e_{n}, \NC p_{n} \NR
\stopmathcases
\quad\Longleftrightarrow\quad
\startmathcases
\NC e_{1}', \NC p_{1}' \NR
\NC e_{2}', \NC p_{2}' \NR
\NC \ \ \vdots \NC \NR
\NC e_{m}', \NC p_{m}' \NR
\stopmathcases
\stopformula

The SAT query that should determine whether the two predicated expression lists are equivalent is the following, where $\mathit{he}_{i}$ is the hashed expression of $e_{i}$:

\placeformula
\startformula
\startmathalignment[n=1]
\NC (p_{1} \implies \mathit{he}_{1}) \land (p_{2} \implies \mathit{he}_{2}) \land \cdots \land (p_{n} \implies \mathit{he}_{n}) \NR
\NC \Longleftrightarrow \NR
\NC (p_{1}' \implies \mathit{he}_{1}') \land (p_{2}' \implies \mathit{he}_{2}') \land \cdots \land (p_{m}' \implies \mathit{he}_{m}') \NR
\stopmathalignment
\stopformula

However, reasoning about the equivalence of individual expressions within the greater list of predicated expressions is not quite as straightforward. The evaluation semantics of a predicated expression list evaluates each predicate sequentially and picks the expression whose predicate evaluates to true. Proving that both predicated expression lists return the same value, assuming that the SAT query above holds, therefore requires showing that for each predicate that is true, there is a predicate in the other list that is also true and whose expression returns the same value. However, there is no guarantee that there is a unique predicated expression matching the current predicate, as there may be duplicate expressions in the list with different predicates. This makes it quite difficult to prove that the SAT query implies the semantic preservation of the scheduling transformation.

The SAT query can therefore be broken down into a slightly more direct comparison that is more useful for the proof itself. The steps to perform this more fine-grained comparison are the following:

\startitemize[n]
\item Hash each of the expressions into a unique literal.
\item Create a map from expression literals to their corresponding predicates, thereby obtaining a unique predicate for each expression. If there are duplicate expressions in the list that forms the map, their predicates are combined using the or operation.
\item Iterate through the expressions in both maps and compare the predicates using the SAT solver. If an expression is not present in one of the maps, then its predicate must be equivalent to $\perp$.
\stopitemize

This comparison is logically equivalent to performing the large SAT query, but it maps better to the proof, because it compares the predicates of identical expressions with each other. A sketch of the normalisation step is shown below.
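The following Coq sketch illustrates the second step over a simple association list, with hashed expressions represented as \type{nat} literals. The names are hypothetical, and the development uses more efficient maps.

\starttyping
Require Import Coq.Lists.List.
Import ListNotations.

Parameter pred : Type.
Parameter Por : pred -> pred -> pred.

(* Insert one hashed expression: duplicates get their predicates
   combined with or. *)
Fixpoint add_expr (he : nat) (p : pred)
    (m : list (nat * pred)) : list (nat * pred) :=
  match m with
  | [] => [(he, p)]
  | (he', p') :: m' =>
      if Nat.eqb he he'
      then (he', Por p p') :: m'
      else (he', p') :: add_expr he p m'
  end.

(* Normalise a predicated expression list into a map from hashed
   expressions to a single predicate each. *)
Definition norm (l : list (pred * nat)) : list (nat * pred) :=
  fold_left (fun m '(p, he) => add_expr he p m) l [].
\stoptyping

The third step then walks both maps, using the SAT solver to check that the predicates of matching literals are equivalent, and that any literal missing from one map has a predicate equivalent to $\perp$.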
\subsection[proof-of-equivalence-between-passes]{Proof of Equivalence Between Passes}

Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the four different passes that were added to CompCert to implement hyperblocks. This section covers the proofs for each of the translations, focusing mainly on the proof of the scheduling pass.

\subsubsection[proof-of-correctness-of-the-if-conversion-pass]{Proof of Correctness of the If-Conversion Pass}

The if-conversion pass is quite simple to verify, as it is an optimisation that transforms the code within the same language.

\subsubsection[proof-of-correctness-of-scheduling]{Proof of Correctness of Scheduling}

The scheduling pass is formally verified using translation validation, with the Abstr language as the point of comparison between RTLBlock and RTLPar. Using the algorithm defined in Section~\goto{3}[abstr_interp], the formally verified SAT solver described in Section~\goto{4.2}[verified_sat] can be used to prove that if two predicated expression lists are equivalent, they will always evaluate to the same value.

\section[implementation-details-of-static-sdc-scheduling]{\SDC\ Scheduling Implementation Details}

This section describes the implementation details of the static \SDC\ scheduling algorithm that is used, as well as some details of the basic block generation and of the if-conversion pass, which generates the predicated instructions.

\subsection[implementation-of-if-conversion]{Implementation of If-Conversion}

If-conversion is the transformation that introduces the predicated instructions which make up hyperblocks; it converts conditional statements into predicated instructions. This transformation has a few limitations on the kind of conditional statements that it can translate. First, only conditional statements without loops can be translated, so cycles in the control flow must be identified before performing the if-conversion. Second, if-conversion does not always result in more efficient code, so it should not be applied to every conditional statement. Instead, it is best applied to conditional statements whose branches take a similar amount of time to execute.

For back ends that do not support predicates, it is also important to eliminate the predicates again, using a process called reverse if-conversion, which creates branches out of the predicates and groups the instructions together again. This allows hyperblock scheduling to also be used with back ends that do not support predicates, which includes most of the default back ends in CompCert.

\subsection[static-sdc-scheduling]{Static \SDC\ Scheduling}

\Word{\infull{SDC}} (\SDC) scheduling~\cite[cong06_sdc] is a flexible algorithm for scheduling instructions, especially when the target is hardware directly. It allows for flexible definitions of the optimisation function in terms of the latencies and dependencies of the operations in the basic block.

\section[performance-comparison]{Performance Comparison}

\section[sec:related-work]{Related Work}

\section[conclusion]{Conclusion}

This material is based upon work supported by the National Science Foundation under Grant No.~ and Grant No.~. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.

\stopcomponent