authorYann Herklotz <git@yannherklotz.com>2022-04-15 11:46:10 +0100
committerYann Herklotz <git@yannherklotz.com>2022-04-15 11:46:10 +0100
commit2a716db2fb59f448590b189f7b953a80d02bc5ee (patch)
tree46ee92387aa0180f18d19ced871dd395fb0b49de /chapters
parent3d6c27a18bdcf77edfd746982229f609d5448432 (diff)
Add changes
Diffstat (limited to 'chapters')
-rw-r--r--  chapters/background.tex   306
-rw-r--r--  chapters/conclusion.tex     7
-rw-r--r--  chapters/dynamic.tex        9
-rw-r--r--  chapters/hls.tex            5
-rw-r--r--  chapters/pipelining.tex     9
-rw-r--r--  chapters/schedule.tex     118
-rw-r--r--  chapters/scheduling.tex   779
7 files changed, 768 insertions, 465 deletions
diff --git a/chapters/background.tex b/chapters/background.tex
index 4636918..81ca653 100644
--- a/chapters/background.tex
+++ b/chapters/background.tex
@@ -1,7 +1,309 @@
+\environment fonts_env
+\environment lsr_env
+
\startcomponent background
-\chapter{Background}
+\chapter[sec:background]{Background}
+
+%\begin{table}
+% \centering
+% \begin{tabular}{lccccc}\toprule
+% \textbf{Tool} & \textbf{Implementation} & \textbf{Mainstream} & \textbf{Proof} & \textbf{Mechanised} & \textbf{HLS}\\
+% \midrule
+% Catapult C~\cite{mentor20_catap_high_level_synth} & \cmark{} & \cmark{} & \cmark{} & \xmark{} & \cmark{}\\
+% Chapman \textit{et al.}~\cite{chapman92_verif_bedroc} & \cmark{} & \xmark{} & \cmark{} & \cmark{} & \cmark{}\\
+% Clarke \textit{et al.}~\cite{clarke03_behav_c_veril} & \cmark{} & \cmark{} & \cmark{} & \xmark{} & \cmark{}\\
+% Ellis~\cite{ellis08} & \xmark{} & \xmark{} & \cmark{} & \cmark{} & \cmark{}\\
+% Karfa \textit{et al.}~\cite{karfa06_formal_verif_method_sched_high_synth} & \cmark{} & \xmark{} & \cmark{} & \xmark{} & \cmark{}\\
+% Kundu \textit{et al.}~\cite{kundu08_valid_high_level_synth} & \cmark{} & \xmark{} & \cmark{} & \xmark{} & \cmark\\
+% LegUp~\cite{canis13_legup} & \cmark{} & \cmark{} & \xmark{} & \xmark{} & \cmark{}\\
+% Lööw \textit{et al.}~\cite{loow19_verif_compil_verif_proces} & \cmark{} & \xmark{} & \cmark{} & \cmark{} & \xmark{}\\
+% OpenCL SDK~\cite{intel20_sdk_openc_applic} & \cmark{} & \xmark{} & \xmark{} & \xmark{} & \cmark{}\\
+% Perna \textit{et al.}~\cite{perna12_mechan_wire_wise_verif_handel_c_synth} & \cmark{} & \xmark{} & \cmark{} & \cmark{} & \xmark{}\\
+% Vivado HLS~\cite{xilinx20_vivad_high_synth} & \cmark{} & \cmark{} & \xmark{} & \xmark{} & \cmark{}\\
+% \bottomrule
+% \end{tabular}
+% \caption{Table showing the various related works and comparing them.}\label{tab:existing_tools}
+%\end{table}
+
+This section describes the relevant literature on high-level synthesis, verified high-level
+synthesis, and finally mechanised proofs of HLS. Table~\ref{tab:existing_tools} classifies
+existing tools in the area of formal verification and high-level synthesis. The following sections
+cover the main categories into which the tools in Table~\ref{tab:existing_tools} fall: tools that
+provide an \emph{implementation}, tools that come with a \emph{proof} and finally tools that are
+\emph{mechanically verified}.
+
+\section{Implementation: High-level Synthesis}
+
+High-level synthesis (HLS) is the transformation of software directly into hardware. There are
+many different types of HLS, which can vary greatly in what languages they accept; however, they
+often share similar steps in how the translation is performed, as they all go from a higher-level,
+behavioural description of the algorithm to a timed hardware description. The main steps performed
+in the translation are the following~\cite{coussy09_introd_to_high_level_synth,canis13_legup}:
+
+\desc{Compilation of specification} First, the input specification has to be compiled into a
+representation on which all the necessary optimisations can be applied. This can be an existing
+intermediate language such as the low-level virtual machine intermediate representation (LLVM IR),
+or a custom representation of the code.
+
+\desc{Hardware resource allocation} It is possible to select a device to target, which gives
+information about how many LUTs can be used to implement the logic. However, tools often assume
+that an unlimited amount of resources is available instead. This is because resource sharing can
+be too expensive for small operations such as adders, as the multiplexing logic needed to share an
+adder costs approximately as much as adding a separate adder. However, multipliers and divide
+blocks should be shared, especially as divide blocks are often implemented in LUTs and are
+therefore quite expensive.
+
+\desc{Operation scheduling} The operations then need to be scheduled into specific clock cycles,
+as the input specification language is normally an untimed behavioural representation. This is
+where the spatial parallelism of the hardware can be taken advantage of, meaning operations that
+are not dependent on each other can run in parallel. There are various possible scheduling
+methods that could be used, such as static or dynamic scheduling, which are described further in
+Sections~\ref{sec:static_scheduling} and~\ref{sec:dynamic_scheduling}.
+
+\desc{Operation and variable binding} After the operations have been scheduled, the operations
+and the variables need to be bound to hardware units and registers respectively. It is often not
+that practical to share many hardware units, though, as this requires adding multiplexers, which
+are often the same size as the hardware units themselves. It is therefore more practical to only
+share operations that take up a lot of space, such as modulo or divide circuits, as well as
+multipliers if no more are available on the FPGA.
+
+\desc{Hardware description generation} Finally, the hardware description is generated from the code
+that was described in the intermediate language. This is often a direct translation of the
+instructions into equivalent hardware constructs.
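+
+As a rough illustration of this last step, the following OCaml fragment sketches how a code
+generator might map IR instructions one-to-one onto hardware constructs, here rendered as
+Verilog-style assignments; the IR type and the output strings are purely illustrative and do not
+correspond to any particular tool.
+
+\starttyping
+(* Toy IR: each instruction writes one destination register. *)
+type ir =
+  | Add of string * string * string   (* d := a + b *)
+  | Load of string * string           (* d := mem[a] *)
+
+(* Each IR instruction maps directly to an equivalent hardware construct. *)
+let to_hardware = function
+  | Add (d, a, b) -> Printf.sprintf "%s <= %s + %s;" d a b
+  | Load (d, a)   -> Printf.sprintf "%s <= mem[%s];" d a
+\stoptyping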
+
+There are many examples of existing high-level synthesis tools, the most popular ones being
+LegUp~\cite{canis13_legup}, Vivado HLS~\cite{xilinx20_vivad_high_synth}, Catapult
+C~\cite{mentor20_catap_high_level_synth} and Intel's OpenCL SDK~\cite{intel20_sdk_openc_applic}.
+These HLS tools all accept general programming languages such as C or OpenCL; however, some HLS
+tools instead take in languages that were designed for hardware, such as
+Handel-C~\cite{aubury96_handel_c_languag_refer_guide}, where time is encoded as part of the
+semantics.
+
+\subsection[sec:static_scheduling]{Static Scheduling}
+
+Static scheduling is used by the majority of synthesis
+tools~\cite{canis13_legup,xilinx20_vivad_high_synth,mentor20_catap_high_level_synth,intel20_sdk_openc_applic}
+and means that the time at which each operation will execute is known at compile time. Static
+analysis is used to gather the data dependencies between all the operations, to determine in which
+clock cycle each operation should be placed and whether operations can be parallelised. Once the
+data-flow analysis has been performed, various scheduling schemes can be used to place operations
+into clock cycles. Some common static-scheduling schemes are the following:
+
+\desc{As soon as possible (ASAP)} scheduling places operations into the first clock cycle in which
+they can be executed (see the sketch after this list).
+
+\desc{As late as possible (ALAP)} scheduling places operations into the last possible clock cycle,
+right before they are used by later operations.
+
+\desc{List scheduling} associates a priority with each operation and chooses operations from each
+priority level to be placed into the current clock cycle if all their data dependencies are met.
+This is done until all the resources for the current clock cycle are used up.
+
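+As an illustration of the simplest of these schemes, the following OCaml sketch implements ASAP
+scheduling over a data-dependence graph, assuming operations are listed in dependence order and
+ignoring resource constraints; the representation is hypothetical and not taken from any
+particular tool.
+
+\starttyping
+(* An operation and the identifiers of the operations it depends on. *)
+type op = { id : int; preds : int list }
+
+(* ASAP: place each operation one cycle after its latest predecessor. *)
+let asap (ops : op list) : (int * int) list =
+  List.fold_left
+    (fun sched op ->
+      let cycle =
+        List.fold_left (fun c p -> max c (List.assoc p sched + 1)) 0 op.preds
+      in
+      (op.id, cycle) :: sched)
+    [] ops
+
+(* [asap [{id=0; preds=[]}; {id=1; preds=[]}; {id=2; preds=[0;1]}]]
+   places operations 0 and 1 in cycle 0 and operation 2 in cycle 1. *)
+\stoptyping
+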
+Static scheduling can normally produce extremely small circuits; however, it is often not
+possible to guarantee that the circuits will have the best
+throughput~\cite{cheng20_combin_dynam_static_sched_high_level_synth}, as this requires extensive
+control-flow analysis and complex optimisations. Especially for loops, finding the optimal
+initiation interval (II) can be tricky if there are loads and stores in the loop or if the loops
+are nested.
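+
+To see why the initiation interval matters so much, note that for a pipelined loop executing $N$
+iterations, where a single iteration takes $D$ cycles to complete and a new iteration starts every
+$\mathit{II}$ cycles, the total execution time is roughly
+
+\startformula
+\mathit{cycles} \approx D + (N - 1) \cdot \mathit{II},
+\stopformula
+
+so for large $N$, lowering the $\mathit{II}$ from 2 to 1 almost halves the running time of the
+loop.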
+
+\subsection[sec:dynamic_scheduling]{Dynamic Scheduling}
+
+On the other hand, dynamic scheduling~\cite{josipovic18_dynam_sched_high_level_synth} does not
+require the schedule to be known at compile time; it instead generates circuits that use tokens to
+schedule the operations in parallel at run time. Whenever the data for an operation is available,
+it sends a token to the next operation, signalling that the data is ready to be read. The next
+operation does not start until all its required inputs are available, and once that is the case,
+it computes the result and then sends another token declaring that the result of that operation is
+also ready. The benefit of this approach is that only basic data-flow analysis is needed to
+connect the tokens correctly; the scheduling is then done dynamically at run time, depending on
+how long each primitive takes to finish and on when the tokens activate the next operations.
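+
+A toy model of this firing rule is sketched below in OCaml: an operation may only fire once a
+token is present on every input wire, and firing consumes the input tokens and places a token on
+the output. The representation is illustrative only and much simpler than what real dynamically
+scheduled circuits implement.
+
+\starttyping
+type wire = { mutable token : bool }
+
+(* Fire an operation if all of its input tokens are present: consume the
+   inputs, perform the computation and announce the result on the output. *)
+let try_fire ~inputs ~output ~compute =
+  if List.for_all (fun w -> w.token) inputs then begin
+    List.iter (fun w -> w.token <- false) inputs;
+    compute ();
+    output.token <- true;
+    true
+  end else false
+\stoptyping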
+
+The benefit of this approach over static scheduling is that the latency of these circuits is
+normally significantly lower than the latency of statically scheduled circuits, because they can
+take advantage of run-time information about the circuit. However, because of the signalling
+required to perform the run-time scheduling, the area of these circuits is usually much larger
+than the area of statically scheduled circuits. In addition to that, much more analysis is needed
+to properly parallelise loads and stores without introducing bugs, which requires the addition of
+buffers in certain locations.
+
+An example of a dynamically scheduled synthesis tool is
+Dynamatic~\cite{josipovic18_dynam_sched_high_level_synth}, which uses a load-store queue
+(LSQ)~\cite{josipovic17_out_of_order_load_store} to order memory operations correctly even when
+loops are pipelined and there are dependencies between iterations. In addition to that, the
+performance of the dynamically scheduled circuits is improved by careful buffer
+placement~\cite{josipovic20_buffer_placem_sizin_high_perfor_dataf_circuit}, which allows for
+better parallelisation and pipelining of loops.
+
+\section{Proof: Verified HLS}
+
+Work is being done to prove the equivalence between the generated hardware and the original
+behavioural description in C. An example of a tool that implements this is Mentor's
+Catapult~\cite{mentor20_catap_high_level_synth}, which tries to match the states in the register
+transfer level (RTL) description to states in the original C code after an unverified translation.
+This technique is called translation validation~\cite{pnueli98_trans}, whereby the translation
+that the HLS tool performed is proven to have been correct for that input, by showing that the
+input and the output behave in the same way for all possible inputs. Translation validation is
+quite effective for proving complex optimisations such as
+scheduling~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth}
+or code
+motion~\cite{banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops};
+however, the validation has to be run every time the high-level synthesis is performed. In
+addition to that, the proofs are often not mechanised or directly related to the actual
+implementation, meaning the verifying algorithm might be wrong and could give false positives or
+false negatives.
+
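+A minimal sketch of the translation validation pattern is shown below in OCaml: the transformation
+itself is untrusted, and only the validator has to be trusted (or, ideally, mechanically
+verified). The function names are hypothetical.
+
+\starttyping
+(* Run an untrusted transformation, then check its result.  If the check
+   fails, the transformed program is rejected at compile time. *)
+let transform_validated ~transform ~validate input =
+  let output = transform input in
+  if validate input output then Some output else None
+\stoptyping
+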
+More examples of translation validation for proofs about HLS
+algorithms~\cite{karfa06_formal_verif_method_sched_high_synth,karfa07_hand_verif_high_synth,kundu07_autom,karfa08_equiv_check_method_sched_verif,kundu08_valid_high_level_synth,karfa10_verif_datap_contr_gener_phase,karfa12_formal_verif_code_motion_techn,chouksey19_trans_valid_code_motion_trans_invol_loops,chouksey20_verif_sched_condit_behav_high_level_synth}
+are performed using an HLS tool called SPARK~\cite{gupta03_spark}. These translation validation
+algorithms can check the correctness of complicated optimisations such as code motion or loop
+inversions. However, even though the correctness of the verifier is proven in the papers, the
+proof does not translate directly to the algorithm that was implemented for the verifier. It is
+therefore possible that output is accepted even though it is not equivalent to the input. In
+addition to that, these papers reason about the correctness of the algorithms in terms of the
+intermediate language of SPARK; the proofs do not extend to the high-level input language that
+SPARK takes in, or to the hardware description language that SPARK outputs.
+
+Finally, there has also been work on proving HLS correct without using translation validation, but
+by directly showing that the translation is correct. The first instance of this is the proof that
+the BEDROC~\cite{chapman92_verif_bedroc} HLS tool is correct. This HLS tool converts a high-level
+description of an algorithm, supporting loops and conditional statements, to a netlist, and proves
+that the output is correct. It works in two stages, first generating a data-flow graph (DFG) from
+the input language, HardwarePal. It then optimises the DFG to improve the routing of the design
+and also to improve the scheduling of operations. Finally, the netlist is generated from the DFG
+by placing all the operations that do not depend on each other into the same clock cycle. Datapath
+and register allocation is performed by an unverified clique partitioning algorithm. The
+equivalence proof between the DFG and HardwarePal is done by proof by simulation, where it is
+proven that, given a valid input configuration, applying a translation or optimisation rule will
+result in a valid DFG with the same behaviour as the input.
+
+There has also been work on proving the translation from Occam to gates~\cite{page91_compil_occam}
+correct using algebraic proofs~\cite{jifeng93_towar}. This translation resembles dynamic
+scheduling, as tokens are used to start the next operations. However, Occam is timed by design and
+will execute assignments sequentially. To take advantage of the parallel nature of hardware, Occam
+uses explicit parallel constructs with channels to share state. Handel-C, a version of Occam with
+many more features such as memory and pointers, has also been used to prove HLS correct, and is
+described further in Section~\ref{sec:wire-to-wire}.
+
+\section{Mechanised: Compiler Proofs}
+
+Even though a proof for the correctness of an algorithm might exist, this does not guarantee that
+the implemented algorithm behaves in the same way as the algorithm assumed in the proof. The
+implementation is separate from the proof, meaning there could be various implementation bugs that
+cause the algorithm to behave incorrectly. C compilers are a good example of this: many
+optimisations performed by the compilers have been proven correct, however these proofs are not
+linked directly to the actual implementations of these algorithms in GCC or Clang. Yang et
+al.~\cite{yang11_findin_under_bugs_c_compil} found more than 300 bugs in GCC and Clang, many of
+them in the optimisation phases of the compilers. One way to link the proofs to the actual
+implementations in these compilers is to write the compiler in a language which allows a theorem
+prover to check properties about the algorithms. This allows the proofs to be directly linked to
+the algorithms, ensuring that the actual implementations are proven correct. Yang et
+al.~\cite{yang11_findin_under_bugs_c_compil} found only five bugs in CompCert, a formally verified
+C compiler, all in its unverified parts, suggesting that this method of proving algorithms correct
+goes a long way towards ensuring a correct compiler.
+
+\subsection{CompCert}
+
+CompCert~\cite{leroy09_formal_verif_compil_back_end} is a formally verified C compiler written in
+Coq~\cite{bertot04_inter_theor_provin_progr_devel}. Coq is a theorem prover, meaning algorithms
+written in Coq can be reasoned about in Coq itself by proving various properties about them. To
+run these algorithms once they have been proven correct, they can be extracted directly to OCaml
+code and then executed, as there is a straightforward correspondence between Coq and OCaml code.
+During this extraction, all the proofs are erased, as they are not needed during the execution of
+the compiler; they are only needed when the correctness of the compiler is being checked. With
+this process, one obtains a compiler that satisfies various correctness properties and can
+therefore be proven to preserve the behaviour of the code.
+
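+In a simplified form, ignoring CompCert's careful treatment of undefined source behaviours, the
+top-level correctness property states that any behaviour $B$ of the generated assembly is also a
+behaviour of the source program:
+
+\startformula
+\forall\, C\, \mathit{asm}\, B,\quad \mono{compile}(C) = \mono{OK}(\mathit{asm}) \land
+\mathit{asm} \Downarrow B \Rightarrow C \Downarrow B
+\stopformula
+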
+CompCert contains eleven intermediate languages, which are used to gradually translate C code into
+assembly that has the same behaviour. Proving the translation directly, without going through the
+intermediate languages, would be infeasible, especially with the many optimisations that are
+performed during the translation, as assembly is very different to the abstract C code the
+compiler receives. The first three intermediate languages (C\#minor, Cminor, CminorSel) are used
+to transform Clight into a more assembly-like language called register transfer language (RTL).
+This language consists of a control-flow graph of instructions, and is therefore well suited for
+various compiler optimisations such as constant propagation, dead-code elimination or function
+inlining~\cite{tristan08_formal_verif_trans_valid}. After RTL, each intermediate language is used
+to get closer to the assembly language of the architecture, performing operations such as register
+allocation and making sure that the stack variables are correctly aligned.
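+
+For orientation, a simplified view of the full chain of languages is the following, omitting the
+optimisation passes that stay within a single language:
+
+\starttyping
+CompCert C -> Clight -> C#minor -> Cminor -> CminorSel -> RTL
+           -> LTL -> Linear -> Mach -> Asm
+\stoptyping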
+
+\subsection{Isabelle HLS}
+
+Martin Ellis' work on correct synthesis~\cite{ellis08} is the first example of a mechanically
+verified high-level synthesis tool. Such tools are also called hardware/software compilers, as
+they produce software as well as hardware for special functional units that should be hardware
+accelerated. The main goal of the thesis is to provide a framework for proving hardware/software
+compilers correct, by defining semantics for an intermediate representation (IR) which supports
+partitioning of code into hardware and software parts, as well as a custom netlist format which is
+used to describe the hardware parts and is the final target of the hardware/software compiler. The
+proof of correctness then says that the hardware/software design should have the same behaviour as
+if the design had only been implemented in software. The framework used to prove the correctness
+of the compilation from the IR to the netlist format is written in Isabelle, a theorem prover
+comparable to Coq. Any proofs in the framework are therefore automatically checked by Isabelle for
+correctness.
+
+This work first defines a static single assignment (SSA) IR, with the capability of marking
+sections that should be hardware accelerated and sections that should be executed on the CPU.
+This language supports \emph{hyperblocks}, which are lists of basic blocks with one entry node and
+multiple exit nodes, and which are well suited for various HLS optimisations and better
+scheduling. However, as this is quite a low-level intermediate language, the first thing that
+would be required is a tool that can actually translate a high-level language to this intermediate
+language. In addition to that, they also describe a netlist format in which the hardware
+accelerated code will be represented.
+
+As both the input IR and the output netlist format have been designed from scratch, they are not
+very useful for real-world applications, as they require a back end different from those of
+existing compilers to be implemented. In addition to that, it would most likely mean that the
+translation from a higher-level language to the IR is unverified and could therefore contain bugs.
+As the resulting netlist also uses a custom format, it cannot be fed directly to tools that can
+translate it to a bitstream to place onto an FPGA. The reason for designing a custom IR and
+netlist was so that these were compatible with each other, making proofs of equivalence between
+the two simpler.
+
+Finally, it is unclear whether or not a translation algorithm from the IR to the netlist format
+was actually implemented, as the only example in the thesis seems to have been compiled by hand to
+explain the proof. There are also no benchmarks on real input programs showing the efficiency of
+the translation algorithm, and it is therefore unclear whether the framework would be able to
+prove more complicated optimisations that a compiler might perform on the source code. The thesis
+seems to describe the correctness proofs by assuming a compiler exists which outputs various
+properties that are needed by the equivalence proof, such as a mapping between variables and
+netlist wires and registers.
+
+Even though this Isabelle framework does provide an equivalence relation between an IR and a
+netlist and describes how the translation would be proven correct by matching states to one
+another, it is not actually usable in practice. First of all, the IR is only represented in
+Isabelle and does not have a parser or a compiler which can target it. In addition to that, the
+netlist format cannot be passed to any existing tool to then be placed onto an FPGA, meaning an
+additional, unverified translation would have to take place. This further reduces the
+effectiveness of the correctness proof, as there are various stages that are not addressed by the
+proofs and therefore have to be assumed to be correct.
+
+\subsection[sec:wire-to-wire]{Mechanised Handel-C to Netlist Translation}
+
+Handel-C~\cite{bowen98_handel_c_languag_refer_manual} is a C-like language for hardware development.
+It supports many C features such as assignments, if-statements, loops, pointers and functions. In
+addition to these constructs, Handel-C also supports explicit timing constructs, such as explicit
+parallelism as well as sequential or parallel assignment, similar to blocking and nonblocking
+assignments in Verilog. Perna et al.~\cite{perna12_mechan_wire_wise_verif_handel_c_synth} developed
+a mechanically verified Handel-C to netlist translation written in HOL. The translation is based
+on previous work describing a translation from Occam to gates by Page et
+al.~\cite{page91_compil_occam}, which was proven correct by Jifeng et al.~\cite{jifeng93_towar}
+using algebraic proofs. As Handel-C is mostly an extension of Occam with C-like operations, the
+translation from Handel-C to gates can proceed in a similar way.
+
+Perna et al. verify the compilation of a subset of Handel-C to gates, which does not include
+memory, arrays or function calls. In addition to the constructs presented by Page et al., the
+prioritised choice construct is also added to the supported Handel-C subset. The verification
+proceeds by first defining the algorithm to perform the compilation, chaining operations together
+with start and end signals that determine the next construct to be executed. The circuits
+themselves are treated as black boxes by the compilation algorithm and are chosen based on the
+current Handel-C statement being translated. The compilation algorithm then has to correctly link
+the black boxes to one another, using the start and end signals that the circuits expose, to
+correctly compile the Handel-C code.
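+
+The chaining itself can be pictured with the following OCaml sketch, where each compiled black box
+exposes a start and an end wire, and sequential composition connects the end of one box to the
+start of the next; the types and names are illustrative and are not Perna et al.'s definitions.
+
+\starttyping
+type wire = Wire of int
+type box = { start : wire; finish : wire }
+
+(* Sequential composition: [b] starts when [a] signals completion. *)
+let seq connect (a : box) (b : box) : box =
+  connect a.finish b.start;
+  { start = a.start; finish = b.finish }
+\stoptyping
+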
-aroistenaoirstenoiaresntoien
+The verification of the compilation is done by proving that the control signal is propagated
+through the circuits correctly, and that each circuit is activated at the right time. It is also
+proven that the internal circuits of the constructs propagate the control signal in the correct
+way, and will be active at the right clock cycle. However, it is not proven that the circuits have
+the same functionality as the Handel-C constructs, only that the control signals are propagated in
+the correct manner. In addition to that, the assumption is made that the semantics for time are
+the same in Handel-C as in the generated netlist format; this assumption could be discharged by
+showing that the generated circuits behave in exactly the same way as the Handel-C constructs.
\stopcomponent
diff --git a/chapters/conclusion.tex b/chapters/conclusion.tex
index 6422396..91ca14f 100644
--- a/chapters/conclusion.tex
+++ b/chapters/conclusion.tex
@@ -1 +1,8 @@
+\environment fonts_env
+\environment lsr_env
+
+\startcomponent conclusion
+
\chapter{Conclusion}
+
+\stopcomponent
diff --git a/chapters/dynamic.tex b/chapters/dynamic.tex
index a1744e8..5edd812 100644
--- a/chapters/dynamic.tex
+++ b/chapters/dynamic.tex
@@ -1 +1,8 @@
-\chapter{\ymhfw{} Dynamic Scheduling}
+\environment fonts_env
+\environment lsr_env
+
+\startcomponent dynamic
+
+\chapter[sec:dynamic]{Dynamic Scheduling}
+
+\stopcomponent
diff --git a/chapters/hls.tex b/chapters/hls.tex
index 362309f..8ff7143 100644
--- a/chapters/hls.tex
+++ b/chapters/hls.tex
@@ -1,8 +1,5 @@
-\environment fonts_env
-\environment lsr_env
-
\startcomponent hls
-\product main
+\product ../main.tex
\def\slowdownOrig{27}
\def\slowdownDiv{2}
diff --git a/chapters/pipelining.tex b/chapters/pipelining.tex
index 4308188..5e07bd7 100644
--- a/chapters/pipelining.tex
+++ b/chapters/pipelining.tex
@@ -1 +1,8 @@
-\chapter{\ymhfw{} Loop Pipelining}
+\environment fonts_env
+\environment lsr_env
+
+\startcomponent pipelining
+
+\chapter[sec:pipelining]{Loop Pipelining}
+
+\stopcomponent
diff --git a/chapters/schedule.tex b/chapters/schedule.tex
index 9dc8d38..1711108 100644
--- a/chapters/schedule.tex
+++ b/chapters/schedule.tex
@@ -1 +1,117 @@
-\chapter{Schedule}
+\environment fonts_env
+\environment lsr_env
+
+\startcomponent schedule
+
+\definecolor[hyperblockcolour][x=66C2A5]
+\definecolor[dynamiccolour][x=FC8D62]
+\definecolor[modulocolour][x=8DA0CB]
+\definecolor[thesiscolour][x=E78AC3]
+
+\chapter[sec:schedule]{Schedule}
+
+The structure of the thesis has been described in the previous chapters. Out of these chapters,
+the work for \in{Chapter}[sec:hls] has been completed, the work for \in{Chapter}[sec:scheduling]
+is still ongoing, and the work in \in{Chapter}[sec:pipelining] and \in{Chapter}[sec:dynamic] has
+not been started yet. This chapter describes how the rest of the work will be completed. The
+current plan is to finish all the work, including the thesis itself, by June 2023, which is when
+the funding will run out.
+
+\section{Timeline}
+
+The main timeline for the rest of the PhD is shown in \in{Figure}[fig:future-gantt]. It is split
+into four main sections:
+
+\desc{Hyperblock scheduling} The first section describes the ongoing work on hyperblock
+scheduling, which is presented in \in{Chapter}[sec:scheduling]. Currently, the main proofs need to
+be completed, as well as the submission of a paper about the work to FMCAD/POPL/PLDI. This work
+was originally supposed to be finished in November 2021; however, due to complications with the
+formulation of the semantics, as well as the number of proofs that need to be completed, the work
+is still ongoing.
+
+\desc{Modulo scheduling} The second piece of work is described in \in{Chapter}[sec:pipelining] and
+covers the implementation of modulo scheduling to schedule loops. This work will be based on
+existing work by \cite[alternative=authoryears][tristan10_simpl_verif_valid_softw_pipel].
+However, it is worth noting that the proof by
+\cite[alternative=author][tristan10_simpl_verif_valid_softw_pipel] includes parts that are not
+completely verified in Coq, meaning there may be various improvements that could be made to the
+proof.
+
+\desc{Dynamic scheduling} Dynamic scheduling is another route that could be explored, by using
+Vericert to pass through \SSA, as well as GSA, and finally generate dynamic HTL (or some other
+similar hardware language). During my visit at Inria Rennes, I worked with Sandrine Blazy and
+Delphine Demange to extend CompCertSSA~\cite[barthe14_formal_verif_ssa_based_middl_end_compc] with
+\infull{GSA} (\GSA)~\cite[ottenstein90_progr_depen_web]. Not only should this allow for global
+optimisations of \SSA\ code, but it should also allow for a more straightforward translation to
+hardware, as the \GSA\ code can simply be viewed as a \DFG. Instead of working on modulo
+scheduling, it could therefore be interesting to work on a formally verified dynamic scheduling
+translation.
+
+\startplacefigure[
+ title={Gantt chart describing the future plan for 2022 and 2023. The main funding runs out
+ mid-2023, so the plan is to finish the thesis writing by then as well.},
+ reference={fig:future-gantt},
+ location=top,
+]
+ \hbox{\startgantt[xunitlength=17.5mm]{15}{6}
+ \startganttitle
+ \numtitle{2022}{1}{2022}{4}
+ \numtitle{2023}{1}{2023}{2}
+ \stopganttitle
+ \startganttitle
+ \numtitle{1}{1}{4}{1}
+ \numtitle{1}{1}{2}{1}
+ \stopganttitle
+ \ganttgroup{\rmx Hyperblock Scheduling}{0}{3}
+ \ganttbar[color=hyperblockcolour]{\rmx Proof}{0}{2}
+ \ganttbarcon[color=hyperblockcolour]{\rmx Paper Submission}{2}{1}
+ \ganttgroup{\rmx Modulo Scheduling}{2}{3}
+ \ganttbar[color=modulocolour]{\rmx Implementation}{2}{1}
+ \ganttbarcon[color=modulocolour]{\rmx Proof}{3}{2}
+ \ganttgroup{\rmx Dynamic Scheduling}{0}{5}
+ \ganttbar[color=dynamiccolour]{\rmx RTL $\rightarrow$ GSA}{0}{1}
+ \ganttbarcon[color=dynamiccolour]{\rmx GSA $\rightarrow$ RTL}{1}{1}
+ \ganttbar[color=dynamiccolour]{\rmx GSA $\rightarrow$ HTL}{2}{3}
+ \ganttcon{1}{9}{2}{11}
+ \ganttgroup{\rmx Thesis}{4}{2}
+ \ganttbar[color=thesiscolour]{\rmx Thesis Writing}{4}{2}
+ \stopgantt}
+\stopplacefigure
+
+\section{Detailed Work Plan}
+
+This section describes a more detailed work plan for finalising the thesis. First, I describe the
+current plan for finishing the short-term work of proving hyperblock scheduling correct, followed
+by a plan for tackling either dynamic scheduling or modulo scheduling.
+
+\subsection{Finishing Work on Hyperblock Scheduling}
+
+As described in \in{Chapter}[sec:scheduling], the current milestone that needs to be achieved is
+proving hyperblock scheduling correct. This requires proofs of the following translations:
+
+\desc{\sc RTL $\rightarrow$ RTLBlock} This translation generates basic blocks to ease the scheduling
+proof as well as simplify the scheduling algorithm.
+
+\desc{\sc IfConv} The {\sc IfConv} transformation introduces predicated instructions by converting
+conditional statements. This is necessary to generate hyperblocks and therefore improve the
+schedule by passing larger blocks to the scheduler.
+
+\desc{\sc RTLBlock $\rightarrow$ RTLPar} The proof of the scheduling algorithm itself, which is
+performed by validating the schedule and verifying that the validator is sound.
+
+\desc{\sc RTLPar $\rightarrow$ HTL} This translation generates the hardware from the parallel
+description of the schedule, which can then be translated to Verilog using an existing translation.
+As a translation from RTL to HTL has already been proven correct, it should be possible to reuse
+many proofs from that translation and only have to modify the top-level proofs that deal with the
+slightly different structure of the program.
+
+\subsection{Starting Work on Modulo Scheduling}
+
+After the proofs about hyperblock scheduling are finished, a natural extension is to work on
+implementing modulo scheduling, which entails rearranging instructions in loops.
+
+\subsection{Possible Implementation for Dynamic Scheduling}
+
+Another plausible extension to Vericert is adding dynamic scheduling, as an alternative to, or in
+addition to, static scheduling.
+
+\stopcomponent
diff --git a/chapters/scheduling.tex b/chapters/scheduling.tex
index 5edad2e..2201f72 100644
--- a/chapters/scheduling.tex
+++ b/chapters/scheduling.tex
@@ -1,320 +1,235 @@
+\environment fonts_env
+\environment lsr_env
+
\startcomponent scheduling
-\project main
-\chapter{Static Scheduling}
+\chapter[sec:scheduling]{Static Scheduling}
\section[introduction]{Introduction}
-The use of multicore processors has been increasing drastically, thereby
-making parallelising compilers ever more important. In addition to that,
-the need for custom hardware accelerators is also increasing, as
-traditional processors are not always the best choice for all
-applications. Compilers that support optimisations which can
-automatically parallelise it's input programs are therefore also
-becoming more important, as these back ends all benefit from it. For
-processors, one wants to optimise the pipeline usage as much as
-possible, whereas in hardware one can create custom pipelines, as well
-as making more use out of the spacial properties that hardware offers,
-making instruction parallelism important as well as speculative
-execution. However, with more powerful optimisations there is a greater
-chance to make mistakes in the compiler, making it possible to generate
-programs that behave differently to the input program. An important step
-towards making compilers more reliable is to formally verify these
-optimisations so that it can be guaranteed that bugs are not introduced.
-
-A popular optimisation for processors increasing the instruction
-parallelism of the resulting program is to use scheduling. This is an
-optimisation that is especially important for processors that have deep
-instruction pipelines or multiple functional units that can be used to
-execute instructions in parallel. However, this optimisation is also
-critical for high-level synthesis tools, which generate application
-specific hardware designs based on a software input. Simple scheduling
-algorithms often work on basic blocks, which are sequential lists of
-instructions without any branches. However, this limits what kind of
-reordering the scheduling algorithm can perform. It can never move
-instructions past a branching instruction, even when this might
-drastically improve the performance. Trace scheduling~ is an alternative
-that addresses this issue, by creating paths that might cross branching
-instructions, and scheduling the instructions in that path. However, the
-main problem with trace scheduling is that it is often infeasible on
-large programs, and because of the large design space, it's sometimes
-hard to find an optimal schedule. Superblock~ and hyperblock~ scheduling
-are two subsets of trace scheduling, which compromise on the flexibility
-of trace scheduling to instead build more tractable algorithms.
-Superblock scheduling produces more efficient code for processors with a
-low issue rate, whereas hyperblock scheduling produces more performant
-code for processors with a high issue rate~.
-
-CompCert~ is a compiler that is formally verified in Coq. In addition to
-that, have also added a formally verified hardware back end to CompCert,
-allowing it to also generate provably correct hardware designs. A
-hardware back end is an ideal target for hyperblock scheduling, as the
-issue rate is practically infinite. This paper describes the
-implementation of a hyperblock instruction scheduler in CompCert so that
-it can benefit existing CompCert back ends, as well as support more
+The use of multicore processors has been increasing drastically, thereby making parallelising
+compilers ever more important. In addition to that, the need for custom hardware accelerators is
+also increasing, as traditional processors are not always the best choice for all applications.
+Compilers that support optimisations which can automatically parallelise their input programs are
+therefore also becoming more important, as these back ends all benefit from them. For processors,
+one wants to optimise the pipeline usage as much as possible, whereas in hardware one can create
+custom pipelines, as well as making more use of the spatial properties that hardware offers,
+making instruction parallelism as well as speculative execution important. However, with more
+powerful optimisations there is a greater chance of making mistakes in the compiler, making it
+possible to generate programs that behave differently to the input program. An important step
+towards making compilers more reliable is to formally verify these optimisations, so that it can
+be guaranteed that bugs are not introduced.
+
+A popular optimisation for processors that increases the instruction parallelism of the resulting
+program is scheduling. This optimisation is especially important for processors that have deep
+instruction pipelines or multiple functional units that can be used to execute instructions in
+parallel. However, this optimisation is also critical for high-level synthesis tools, which
+generate application-specific hardware designs based on a software input. Simple scheduling
+algorithms often work on basic blocks, which are sequential lists of instructions without any
+branches. However, this limits what kind of reordering the scheduling algorithm can perform: it
+can never move instructions past a branching instruction, even when this might drastically improve
+the performance. Trace scheduling~ is an alternative that addresses this issue by creating paths
+that might cross branching instructions, and scheduling the instructions in that path. However,
+the main problem with trace scheduling is that it is often infeasible on large programs, and
+because of the large design space, it is sometimes hard to find an optimal schedule. Superblock~
+and hyperblock~ scheduling are two subsets of trace scheduling, which compromise on the
+flexibility of trace scheduling to instead build more tractable algorithms. Superblock scheduling
+produces more efficient code for processors with a low issue rate, whereas hyperblock scheduling
+produces more performant code for processors with a high issue rate~.
+
+CompCert~ is a compiler that is formally verified in Coq. In addition to that, we have also added
+a formally verified hardware back end to CompCert, allowing it to also generate provably correct
+hardware designs. A hardware back end is an ideal target for hyperblock scheduling, as the issue
+rate is practically infinite. This chapter describes the implementation of a hyperblock
+instruction scheduler in CompCert so that it can benefit existing CompCert back ends, as well as
+support more
general targets such as scheduling instructions for custom hardware.
\subsection[key-points]{Key Points}
-The goal of this paper is to introduce hyperblock scheduling into
-CompCert to allow for advanced and flexible scheduling operations that
-could be beneficial to various different back ends. To this end, the key
-points of this paper are the following:
+The goal of this chapter is to introduce hyperblock scheduling into CompCert to allow for advanced
+and flexible scheduling operations that could be beneficial to various different back ends. To
+this end, the key points of this chapter are the following:
\startitemize
\item
- Description of how hyperblocks are introduced for improved scheduling
- of instructions, and how these can be built and translated back to
- instructions without predicates.
+ Description of how hyperblocks are introduced for improved scheduling of instructions, and how
+ these can be built and translated back to instructions without predicates.
\item
- Description of the implementation and correctness proof of a system of
- difference constraints (SDC) scheduling algorithm, which can be
- adapted to various different scheduling strategies. {\bf {[}{[}JW:}
- What's the link between hyperblocks and SDC-based scheduling? Are
- these orthogonal?{\bf {]}{]}}{\bf {[}{[}YH:} Well the original SDC
- papers only talk about basic blocks, and not SDC hyperblock
- scheduling. But it's easy enough to adapt. SDC just requires some kind
- of blocks without control-flow.{\bf {]}{]}} {\bf {[}{[}JW:} Ok cool,
- so you can have SDC scheduling without hyperblocks. Can you have
- hyperblocks without SDC scheduling? (I guess so, but just wanted to be
- completely sure.){\bf {]}{]}}{\bf {[}{[}YH:} Yes exactly, you can
- perform any type of basic block scheduling on hyperblocks as well I
- think, as long as you mark the dependencies correctly. But in the
- worst case you could even just see hyperblocks as basic blocks and
- ignore predicates. It would still be correct but just less
- efficient.{\bf {]}{]}}
-
- {\bf {[}{[}JW:} It would be interesting to make the case for SDC-based
- scheduling. Is it superior to other scheduling mechanisms? Is it so
- much superior that it's worth the additional complexity, compared to,
- say, list scheduling? {\bf {]}{]}}{\bf {[}{[}YH:} SDC scheduling is
- more flexible with the type of constraint to optimise for (latency /
- throughput). I don't think it's too meaningful for compilers, where
- there isn't as much flexibility.{\bf {]}{]}} {\bf {[}{[}JW:} Thanks.
- Does Vericert have any idea that the scheduling is being performed
- using SDC? Or is the SDC stuff entirely internal to your unverified
- OCaml code? That is, could you swap your scheduler with a different
- one that isn't based on SDC, and not have to change anything in
- Vericert?{\bf {]}{]}}{\bf {[}{[}YH:} The verified part doesn't know
- about SDC at all, so it could be changed to any other scheduler that
- acts on hyperblocks.{\bf {]}{]}}
+ Description of the implementation and correctness proof of a system of difference constraints
+ (SDC) scheduling algorithm, which can be adapted to various different scheduling strategies.
\stopitemize
\section[sec:scheduling-impl]{Implementation of Hyperblocks in CompCert}
This section describes the structure of hyperblocks in
-Section~\goto{2.1}[scheduling:hyperblocks]. Then, the structure of two
-extra intermediate languages that were added to CompCert to implement
-hyperblocks are also shown in Section~\goto{2.2}[sec:rtlblockdef].
+Section~\goto{2.1}[scheduling:hyperblocks]. Then, the structure of the two extra intermediate
+languages that were added to CompCert to implement hyperblocks is shown in
+Section~\goto{2.2}[sec:rtlblockdef].
\subsection[scheduling:hyperblocks]{Hyperblocks as a Generalisation of Basic Blocks}
-Basic blocks are popular in intermediate languages because they describe
-a list of sequential instructions that are not separated by
-control-flow. This information can be used by various analysis passes to
-perform optimisations, as these sequences of instructions can be
-reordered in any way as long as their data dependencies are met. One
-such optimisation is scheduling, which reorders instructions to
-eliminate of any delay slots caused by the CPU pipeline.
-
-However, due to control-flow not being allowed in basic blocks, the
-number of instructions that can be scheduled at any one time is limited.
-The boundaries to other basic blocks act as hard boundary to the
-scheduling algorithm. hyperblocks~ allow instructions to be predicated,
-thereby introducing some control-flow information into the blocks, which
-can also be taken into account by the scheduler. For example, any data
-dependency between instructions that have mutually exclusive predicates
-can be removed, as these instructions will never be activate at the same
-time.
-
-\subsection[sec:rtlblockdef]{RTLBlock and RTLPar Intermediate Language Definition}
+Basic blocks are popular in intermediate languages because they describe a list of sequential
+instructions that are not separated by control-flow. This information can be used by various
+analysis passes to perform optimisations, as these sequences of instructions can be reordered in any
+way as long as their data dependencies are met. One such optimisation is scheduling, which reorders
+instructions to eliminate any delay slots caused by the CPU pipeline.
-Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the
-intermediate languages introduced to implement hyperblocks in CompCert.
-This consists of new RTLBlock and RTLPar intermediate languages, which
-implement the sequential and parallel semantics of basic blocks
-respectively. The semantics of RTLBlock and RTLPar are based on the
-CompCert RTL intermediate language. However, instead of mapping from
-states to instructions, RTLBlock maps from states to hyperblocks, and
-RTLPar maps from states to parallel hyperblocks, which will be described
-in the next sections.
+However, as control-flow is not allowed in basic blocks, the number of instructions that can be
+scheduled at any one time is limited. The boundaries to other basic blocks act as a hard boundary
+for the scheduling algorithm. Hyperblocks~ allow instructions to be predicated, thereby
+introducing some control-flow information into the blocks, which can also be taken into account by
+the scheduler. For example, any data dependency between instructions that have mutually exclusive
+predicates can be removed, as these instructions will never be active at the same time.
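+
+Concretely, two predicated instructions guarded by predicates $p_{1}$ and $p_{2}$ can never be
+active at the same time whenever
+
+\startformula
+p_{1} \land p_{2} \equiv \bot ,
+\stopformula
+
+so, for example, a write-after-write conflict between a write predicated on $p$ and a write
+predicated on $\lnot p$ can safely be ignored by the scheduler.
+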
-%\startformula \begin{aligned}
-% i\ ::=&\ \ \mono{RBnop} \\
-% &|\ \mono{RBop}\ p?\ \mathit{op}\ \vec{r}\ d \\
-% &|\ \mono{RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \\
-% &|\ \mono{RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \\
-% &|\ \mono{RBsetpred}\ p?\ c\ \vec{r}\ d \\
-% \end{aligned} \stopformula
-%
-%\startformula \begin{aligned}
-% i_{\mathit{cf}}\ ::=&\ \ \mono{RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \\
-% &|\ \mono{RBtailcall}\ \mathit{sig}\ f\ \vec{r} \\
-% &|\ \mono{RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \\
-% &|\ \mono{RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \\
-% &|\ \mono{RBjumptable}\ r\ \vec{n} \\
-% &|\ \mono{RBreturn}\ r? \\
-% &|\ \mono{RBgoto}\ n \\
-% &|\ \mono{RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \\
-% \end{aligned} \stopformula
+\subsection[sec:rtlblockdef]{RTLBlock and RTLPar Intermediate Language Definition}
-RTLBlock instructions are split into two types of instructions, standard
-instructions and control-flow instructions. The standard instructions
-are the instructions that can be placed into hyperblocks, whereas
-control-flow instructions are the instructions that can end hyperblocks.
-The standard instructions are shown in
-Figure~\goto{{[}fig:standard_instruction{]}}[fig:standard_instruction],
-whereas the control-flow instructions are shown in
-Figure~\goto{{[}fig:control_flow_instruction{]}}[fig:control_flow_instruction].
-Most instructions are quite similar to their RTL counterparts, however,
-there are some instructions that have been added. \type{RBsetpred} is a
-standard instruction which sets a predicate equal to an evaluated
-condition. This instruction is used during if-conversion to set the
-predicate to the value of the condition in the conditional statement
-instruction. To support proper predicated instructions, each instruction
-therefore also contains an optional predicate which has been set
-beforehand, which produces the hyperblock. In additon to that, there is
-also an extra control-flow instruction called \type{RBpred_cf}, which
-can branch on a predicate an nests more control-flow instructions inside
-of it. This is also necessary for if-conversion, when converting an
-already converted conditional statement, as depending on the predicate,
-a different control-flow instruction may be necessary. {\bf {[}{[}JW:} I
-think there'd be a lot of value in having a picture showing op-chaining
-in action. See tex source for the kind of thing I mean. Probably best
-done as a running example throughout the paper. {\bf {]}{]}}
-
-These instructions are use in RTLBlock as well as in RTLPar. The main
-difference between these two languages is how these instructions are
-arranged within the hyperblock and the execution semantics of the
-hyperblock in the languages. Let $\mathit{hb}_{b}$ be the definition of
-a hyperblock for RTLBlock, and let $\mathit{hb}_{p}$ be the definition
-of a hyperblock for RTLPar, then $\mathit{hb}_{b}$ and $\mathit{hb}_{p}$
-be defined as in Figure~\goto{{[}fig:hb_def{]}}[fig:hb_def]. The Figure
-shows the different nesting levels of the basic blocks in RTLBlock as
-well as RTLPar, and where the parallel semantics of RTLPar come into
-play. RTLBlock is made of a list of instructions and a control-flow
-instruction that ends the hyperblock. Each instruction in RTLBlock is
-executed sequentially. RTLPar is made of three separate lists. The outer
-list behaves like the list in RTLBlock, and executes each list inside of
-it sequentially. RTLPar then has a list which executes it's contents in
-parallel, meaning each instruction uses the same state of the registers
-and memory while executing. Finally, instead of directly containing
-instructions, there is an additional layer of sequentially executing
-lists of instructions, to be able to execute small groups of
-instructions in sequence, but still in parallel to the other
-instructions in the current parallel group.
-
-RTLPar is structured in this way so that the scheduler has more
-flexibility, so that it can not only parallelise operations in the basic
-blocks, but that it can also define sequential operations in these
-parallel operations. This is especially useful for the hardware back
-end, as it means that sequential operations that do not quite fill one
-clock cycle can be executed sequentially, but can still be executed in
-parallel with other independent operations. This optimisation is called
-operation chaining, and is critical to get the most performance out of
-the hardware.
+Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the intermediate languages
+introduced to implement hyperblocks in CompCert. These consist of the new RTLBlock and RTLPar
+intermediate languages, which implement the sequential and parallel semantics of basic blocks
+respectively. The semantics of RTLBlock and RTLPar are based on the CompCert RTL intermediate
+language. However, instead of mapping from states to instructions, RTLBlock maps from states to
+hyperblocks, and RTLPar maps from states to parallel hyperblocks, which will be described in the
+next sections.
+
+\placeformula \startformula \startmathalignment
+\NC i\ \ \colon\colon= \NC \ \ \mono{RBnop} \NR
+\NC \NC |\ \ \mono{RBop}\ p?\ \mathit{op}\ \vec{r}\ d \NR
+\NC \NC |\ \ \mono{RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \NR
+\NC \NC |\ \ \mono{RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \NR
+\NC \NC |\ \ \mono{RBsetpred}\ p?\ c\ \vec{r}\ d \NR
+\stopmathalignment \stopformula
+
+\placeformula \startformula \startmathalignment
+\NC i_{\mathit{cf}}\ \ \colon\colon= \NC \ \ \mono{RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \NR
+\NC \NC |\ \ \mono{RBtailcall}\ \mathit{sig}\ f\ \vec{r} \NR
+\NC \NC |\ \ \mono{RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \NR
+\NC \NC |\ \ \mono{RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \NR
+\NC \NC |\ \ \mono{RBjumptable}\ r\ \vec{n} \NR
+\NC \NC |\ \ \mono{RBreturn}\ r? \NR
+\NC \NC |\ \ \mono{RBgoto}\ n \NR
+\NC \NC |\ \ \mono{RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \NR
+\stopmathalignment \stopformula
+
+RTLBlock instructions are split into two types: standard instructions and control-flow
+instructions. The standard instructions are the instructions that can be placed into hyperblocks,
+whereas control-flow instructions are the instructions that can end hyperblocks. The standard
+instructions are shown in
+Figure~\goto{{[}fig:standard_instruction{]}}[fig:standard_instruction], whereas the control-flow
+instructions are shown in
+Figure~\goto{{[}fig:control_flow_instruction{]}}[fig:control_flow_instruction]. Most instructions
+are quite similar to their RTL counterparts; however, some instructions have been
+added. \type{RBsetpred} is a standard instruction which sets a predicate equal to an evaluated
+condition. This instruction is used during if-conversion to set the predicate to the value of the
+condition in the conditional statement instruction. To support predicated instructions properly,
+each instruction therefore also contains an optional predicate which has been set beforehand; this
+is what produces the hyperblock. In addition to that, there is also an extra control-flow
+instruction called \type{RBpred_cf}, which branches on a predicate and nests more control-flow
+instructions inside of it. This is also necessary for if-conversion, when converting an already
+converted conditional statement, as depending on the predicate, a different control-flow
+instruction may be necessary.
+
+These instructions are used in RTLBlock as well as in RTLPar. The main difference between these
+two languages is how the instructions are arranged within the hyperblock and the execution
+semantics of the hyperblock in each language. Let $\mathit{hb}_{b}$ be the definition of a
+hyperblock for RTLBlock, and let $\mathit{hb}_{p}$ be the definition of a hyperblock for RTLPar;
+then $\mathit{hb}_{b}$ and $\mathit{hb}_{p}$ can be defined as in
+Figure~\goto{{[}fig:hb_def{]}}[fig:hb_def]. The figure shows the different nesting levels of the
+basic blocks in RTLBlock as well as RTLPar, and where the parallel semantics of RTLPar come into
+play. An RTLBlock hyperblock is made of a list of instructions and a control-flow instruction that
+ends the hyperblock. Each instruction in RTLBlock is executed sequentially. An RTLPar hyperblock
+is made of three nested lists. The outer list behaves like the list in RTLBlock, and executes each
+list inside of it sequentially. RTLPar then has a list which executes its contents in parallel,
+meaning each instruction uses the same state of the registers and memory while executing. Finally,
+instead of directly containing instructions, there is an additional layer of sequentially
+executing lists of instructions, to be able to execute small groups of instructions in sequence,
+but still in parallel to the other instructions in the current parallel group.
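+
+The nesting described above can be summarised by the following OCaml type sketch; the names are
+hypothetical, and the development itself defines these structures in Coq.
+
+\starttyping
+(* RTLBlock: one flat list, executed sequentially, ended by a control-flow
+   instruction 'cf.  RTLPar: the outer list is sequential, the middle list
+   is parallel, and the inner lists are again sequential. *)
+type ('i, 'cf) hb_b = 'i list * 'cf
+type ('i, 'cf) hb_p = 'i list list list * 'cf
+\stoptyping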
+
+RTLPar is structured in this way to give the scheduler more flexibility: it can not only
+parallelise operations in the basic blocks, but can also define sequential operations in
+these parallel operations. This is especially useful for the hardware back end, as it means that
+sequential operations that do not quite fill one clock cycle can be executed sequentially, but can
+still be executed in parallel with other independent operations. This optimisation is called
+operation chaining, and is critical to get the most performance out of the hardware.
\subsection[translating-rtlblock-to-rtlpar]{Translating RTLBlock to RTLPar}
-The translation from RTLBlock to RTLPar is performed using a scheduling
-algorithm, which will takes hyperblocks and rearranges their
-instructions to maximise parallelism based on the data dependencies. It
-then arranges the instructions in RTLPar by putting independent
-instructions into the same parallel block. Scheduling is an optimisation
-that is well suited to translation validation, as the algorithm itself
-can be quite complex, whereas the output is normally relatively easy to
-check for equivalence with the input.
-Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows that
-the scheduling step comes right after the if-conversion step which
-originally creates the hyperblocks.
-
-Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain] shows how the scheduling
-step transforms RTLBlock into RTLPar. The RTLBlock hyperblock being
-scheduled is shown in
-Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a], which contains five
-predicated operations, comprising two additions and three
-multiplications. The data dependencies of the instructions in the
-hyperblock are shown in
-Figure~\goto{{[}fig:op_chain_b{]}}[fig:op_chain_b]. Curiously, even
-though operations and are normally dependent on each other due to a
-write-after-write conflict, but because the predicates are independent
-the write-after-write conflict can effectively be removed by the
+The translation from RTLBlock to RTLPar is performed using a scheduling algorithm, which takes
+hyperblocks and rearranges their instructions to maximise parallelism based on the data
+dependencies. It then arranges the instructions in RTLPar by putting independent instructions into
+the same parallel block. Scheduling is an optimisation that is well suited to translation
+validation, as the algorithm itself can be quite complex, whereas the output is normally relatively
+easy to check for equivalence with the input.
+Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows that the scheduling step comes
+right after the if-conversion step which originally creates the hyperblocks.
+
+Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain] shows how the scheduling step transforms RTLBlock
+into RTLPar. The RTLBlock hyperblock being scheduled is shown in
+Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a], which contains five predicated operations,
+comprising two additions and three multiplications. The data dependencies of the instructions in the
+hyperblock are shown in Figure~\goto{{[}fig:op_chain_b{]}}[fig:op_chain_b]. Curiously, even though
+two of the operations would normally depend on each other due to a write-after-write conflict,
+their independent predicates mean that the conflict can effectively be removed by the
scheduler.
-The scheduler transforms the RTLBlock hyperblock into the RTLPar
-hyperblock shown in Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c].
-Even though the addition in is dependent on , they can still be put into
-the same cycle as the additions do not take as long to complete as the
-multiplications. This optimisation is called operation chaining. Then,
-the multiplication in can also be placed in the same clock cycle as it
-does not have any data dependencies. Then, in the next clock cycle,
-either the multiplication in can take place, or the multiplication in
-will take place, meaning these two multiplications can also be placed in
-the same clock cycle. This gives the final schedule that is shown in
-Figure~\goto{{[}fig:op_chain_d{]}}[fig:op_chain_d].
+The scheduler transforms the RTLBlock hyperblock into the RTLPar hyperblock shown in
+Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. Even though one addition depends on the other,
+the two can still be put into the same cycle, as the additions do not take as long to complete as
+the multiplications. This optimisation is called operation chaining. The multiplication without any
+data dependencies can also be placed in the same clock cycle. In the next clock cycle, only one of
+the two remaining multiplications will actually execute, as their predicates are mutually
+exclusive, so they too can be placed in the same clock cycle. This gives the final schedule shown
+in Figure~\goto{{[}fig:op_chain_d{]}}[fig:op_chain_d].
\section[abstr_interp]{Abstract Interpretation of Hyperblocks}
-The correctness of the scheduling algorithm is checked by comparing the
-symbolic expressions of each register before and after scheduling. This
-section describes how these symbolic expressions are constructed for
-each hyperblock.
+The correctness of the scheduling algorithm is checked by comparing the symbolic expressions of each
+register before and after scheduling. This section describes how these symbolic expressions are
+constructed for each hyperblock.
\subsection[abstr_language]{Construction of Abstract Language}
-The main difficulty when constructing symbolic expressions for
-hyperblocks is the presence of predicates, which can be used by the
-scheduling algorithm to reorder instructions further, even if these
-contain data dependencies. The scheduling algorithm will manipulate
-predicates in the following ways, which the comparison function needs to
-take into account.
+The main difficulty when constructing symbolic expressions for hyperblocks is the presence of
+predicates, which can be used by the scheduling algorithm to reorder instructions further, even if
+these contain data dependencies. The scheduling algorithm will manipulate predicates in the
+following ways, which the comparison function needs to take into account.
\startitemize
\item
- Instructions with predicates that are equivalent to false can be
- removed.
+ Instructions with predicates that are equivalent to false can be removed.
\item
- Instructions with predicates that are mutually exclusive can be
- executed in parallel.
+ Instructions with predicates that are mutually exclusive can be executed in parallel.
\item
Any predicate can be replaced with an equivalent predicate.
\stopitemize
The symbolic expressions have the following form:
-There are three main resources that are present in symbolic expressions:
-registers $r$, which represent the value of the registers in the
-program, predicates $p$, which represent the value of the predicates
-throughout the program, and finally, memory \type{Mem}, representing the
-state of the memory. For each basic instruction in the hyperblock, there
-is an equivalent symbolic expression which either modifies the state of
-the registers, predicates or memory. Finally, as each instruction in the
-hyperblock can be predicates, each symbolic expression is also paired
-with a predicate as well.
-
-Assigning the same resource multiple times with instructions that
-contain different predicates, means that the symbolic expression must
-also express this conditional result based on the predicates. This is
-expressed by assigning lists of predicated symbolic expressions to each
-resource, where each predicate needs to be mutually exclusive from the
-other predicates in the list.
-
-The expressions are then constructed using a function which updates the
-symbolic expressions assigned for each resource. This is done using
-multiple primitives which act on predicated types, which are made up of
-a list of pairs of predicates and the element of that type. The first
-important primitive is multiplication of two predicated types, which is
-implemented as performing a cartesian multiplication between the
-predicated types in the two lists, anding the two predicates and joining
+There are three main resources that are present in symbolic expressions: registers $r$, which
+represent the value of the registers in the program, predicates $p$, which represent the value of
+the predicates throughout the program, and finally, memory \type{Mem}, representing the state of the
+memory. For each basic instruction in the hyperblock, there is an equivalent symbolic expression
+which either modifies the state of the registers, predicates or memory. Finally, as each instruction
+in the hyperblock can be predicated, each symbolic expression is also paired with a predicate.
+
+Assigning the same resource multiple times with instructions that contain different predicates
+means that the symbolic expression must also express this conditional result based on the
+predicates. This is expressed by assigning lists of predicated symbolic expressions to each
+resource, where each predicate needs to be mutually exclusive from the other predicates in the list.
+
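+Concretely, such predicated values can be sketched in Coq as follows, where \type{pred_op} and
+\type{expression} are assumed placeholders for the predicate and symbolic expression types.
+
+\starttyping
+(* Assumed placeholders for predicates and symbolic expressions. *)
+Parameter pred_op : Type.
+Parameter expression : Type.
+
+(* A predicated value: a list of guarded alternatives whose
+   predicates are expected to be mutually exclusive. *)
+Definition predicated (A : Type) := list (pred_op * A).
+Definition pred_expr := predicated expression.
+\stoptyping
+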
+The expressions are then constructed using a function which updates the symbolic expressions
+assigned for each resource. This is done using multiple primitives which act on predicated types,
+which are made up of a list of pairs of predicates and the element of that type. The first important
+primitive is multiplication of two predicated types, which is implemented as a Cartesian product of
+the two lists, conjoining the two predicates and joining the two elements of each pair using a
+function.
%\startformula \label{eq:1}
% P_{1} \otimes_{f} P_{2} \equiv \mono{map } (\lambda ((p_{1}, e_{1}), (p_{2}, e_{2})) . (p_{1} \land p_{2}, f\ e_{1}\ e_{2})) P_{1} \times P_{2} \stopformula
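+
+This product can be sketched in Coq as below, matching the commented formula above; \type{Pand} is
+an assumed constructor for predicate conjunction, and \type{predicated} is the type sketched
+earlier.
+
+\starttyping
+Require Import Coq.Lists.List.
+
+(* Assumed: conjunction on the predicate type. *)
+Parameter Pand : pred_op -> pred_op -> pred_op.
+
+(* Cartesian product of two predicated values: conjoin the guards
+   and combine the elements with [f]. *)
+Definition prod_pred {A B C} (f : A -> B -> C)
+    (P1 : predicated A) (P2 : predicated B) : predicated C :=
+  map (fun '((p1, e1), (p2, e2)) => (Pand p1 p2, f e1 e2))
+      (list_prod P1 P2).
+\stoptyping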
-In addition to that, another primitive that is needed is the following
-append operation, which will negate the combination of all predicates in
-the second predicated type, and it to the first predicated type and
-append the first predicated type to the second:
+In addition to that, another primitive that is needed is the following append operation. Given a
+predicate, it strengthens every predicate in the first predicated type with the negation of that
+predicate, strengthens every predicate in the second predicated type with the predicate itself, and
+appends the two resulting lists:
%\startformula \label{eq:2}
% \mu(p, P) \equiv \mono{map } (\lambda (p', e') . (p \land p', e'))\ P \stopformula
@@ -322,8 +237,7 @@ append the first predicated type to the second:
%\startformula \label{eq:3}
% P_{1} \oplus_{p} P_{2} \equiv \mu(\neg p, P_{1}) \mathbin{++} \mu(p, P_{2}) \stopformula
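+
+These two operations can be sketched in Coq as follows, again over the \type{predicated} type from
+before; \type{Pnot} is an assumed constructor for predicate negation.
+
+\starttyping
+(* Assumed: negation on the predicate type. *)
+Parameter Pnot : pred_op -> pred_op.
+
+(* Strengthen every guard in [P] with the predicate [p]. *)
+Definition mu {A} (p : pred_op) (P : predicated A) : predicated A :=
+  map (fun '(p', e) => (Pand p p', e)) P.
+
+(* Append under a predicate [p]: the first list applies when [p] is
+   false, the second when [p] is true. *)
+Definition app_pred {A} (p : pred_op)
+    (P1 P2 : predicated A) : predicated A :=
+  mu (Pnot p) P1 ++ mu p P2.
+\stoptyping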
-\subsection[title={Example of
-translation},reference={example-of-translation}]
+\subsection[example-of-translation]{Example of translation}
%\startformula \begin{aligned}
% \label{eq:4}
@@ -342,113 +256,87 @@ translation},reference={example-of-translation}]
% \end{cases}
% \end{aligned} \stopformula
-Using the example shown in
-Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain], the RTLBlock hyperblock
-shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a] is scheduled
-into the hyperblock RTLPar shown in
-Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. The first step is to
-translate the input and output of the scheduler into the Abstr
-intermediate language by calculating the symbolic values that will be
-stored in each of the registers. Symbolically evaluating the RTLBlock
-hyperblock results in the expressions shown in
-Figure~\goto{{[}fig:symbolic_expressions{]}}[fig:symbolic_expressions]
-for registers \type{r1}, \type{r2} and \type{r3}, where the state of the
-register \type{r1} at the start of the hyperblock is denoted as
-$\mono{r1}^{0}$.
-
-This symbolic expression can be generated by sequentially going through
-the list of predicated expressions and applying the update function
-defined in Section~\goto{3.1}[abstr_language]. The resulting predicated
-expressions is guaranteed to have predicates which are mutually
-exclusive, meaning if the predicate evaluates to true, all other
-predicates must evaluate to false.
-
-\subsection[title={Linear (Flat) Predicated
-Expressions},reference={linear-flat-predicated-expressions}]
-
-The update function for predicated expressions is quite a bit more
-complex compared to standard update functions for symbolic execution,
-such as the one used by . The main reason for this is the non-recursive
-nature of the predicated expressions. As the predicates are only
-available at the top-level of the expressions, combining multiple
-expressions means that the conditions and multiple possibilities also
-need to trickle up to the top.
-
-Instead, if the predicated instructions were recursive, one would not
-have to change any predicates when evaluating the argument lists of
-instructions, for example, as these can just stay predicated. The
-evaluation semantics would also be much simpler, because a rule can be
-added for evaluating predicates.
-
-\section[title={Correctness of the Abstract
-Interpretation},reference={sec:correctness}]
-
-The correctness of the scheduling step is dependent on the correctness
-of the equivalence check between the two abstract languages. As the
-abstract languages contain predicates, this comparison check cannot just
-be a syntactic comparison between the two languages, because the
-predicates might be different but logically equivalent. Instead, a mixed
-equivalence check is performed using a verified SAT solver and a
-syntactic equivalence check of the actual symbolic expressions.
-
-\subsection[title={Comparing Symbolic Abstract
-Expressions},reference={comparing-symbolic-abstract-expressions}]
-
-Abstract expressions comprise a predicate and a symbolic expression. The
-translation from RTLBlock and RTLPar to Abstract ensures that each
-register will contain a predicated symbolic expression that describes
-the value that will be assigned to it after evaluating the basic block.
-This means that if an expression always produces the same result as
-another expression, that the register will always contain the same
-value. If the RTLBlock program and RTLPar program translate to two
-Abstract languages where all the register's symbolic expressions match,
-then this means that both programs will always produce the same value.
-
-The main comparison between two symbolic expressions can be done using a
-simple syntactic comparison, because this ensures that each of the
-expressions always evaluate to the same value, and is also general
-enough for the expressions that need to be compared. However, as all
-expressions might be predicated, this simple syntactic comparison cannot
-be performed for the predicates. This is because predicated instructions
-can be executed in many different orders, and the predicates are used to
-calculate dependence information and therefore also determine the order
-of operations.
-
-Instead, comparing predicates can be done by checking logical
-equivalence of the predicates using a formally verified SAT solver. This
-ensures that the predicates will always evaluate to the same value,
-thereby making it a fair comparison.
-
-\subsection[title={Verified Decidable DPLL SAT
-Solver},reference={verified_sat}]
-
-A SAT solver is needed to correctly and reliably compare predicated
-expressions. This makes it possible to then compare many different
-possible schedules, even if they remove expressions completely if these
-are unnecessary. The SAT solver is needed in the verifier, which needs
-to be proven correct in Coq to prove the scheduling translation correct.
-The SAT solver therefore has to be verified, because it could otherwise
-not be used in the proof of correctness.
-
-The DPLL algorithm is used to implement the SAT solver. This means that
-the SAT solver takes in a conjunctive normal form (CNF) formula. The
-conversion from a recursive predicate type into a CNF formula is done by
-recursively expanding the predicate in equivalent CNF formulas.
-
-The SAT solver is implemented in Coq and proven to be sound and
-complete. The soundness and completeness of an algorithm to check
-whether a predicate is satisfiable or not also directly implies that the
-check is decidable, simplifying many proofs in Coq by using a
-\type{Setoid} instance for equivalence between predicates. This makes
-rewriting predicates if they are equivalent much easier in proofs by
-reusing many of the existing Coq tactics.
-
-\subsection[title={Encoding Predicated Expressions as SAT
-Formulas},reference={encoding-predicated-expressions-as-sat-formulas}]
-
-Given the following two predicated expressions which should be tested
-for equivalence, one can construct a SAT expression which will ensure
-that both expressions will always result in the same value.
+Using the example shown in Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain], the RTLBlock hyperblock
+shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a] is scheduled into the RTLPar hyperblock
+shown in Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. The first step is to translate the
+input and output of the scheduler into the Abstr intermediate language by calculating the symbolic
+values that will be stored in each of the registers. Symbolically evaluating the RTLBlock hyperblock
+results in the expressions shown in
+Figure~\goto{{[}fig:symbolic_expressions{]}}[fig:symbolic_expressions] for registers \type{r1},
+\type{r2} and \type{r3}, where the state of the register \type{r1} at the start of the hyperblock is
+denoted as $\mono{r1}^{0}$.
+
+This symbolic expression can be generated by sequentially going through the list of predicated
+expressions and applying the update function defined in Section~\goto{3.1}[abstr_language]. The
+resulting predicated expressions are guaranteed to have mutually exclusive predicates, meaning that
+if one predicate evaluates to true, all other predicates must evaluate to false.
+
+\subsection[linear-flat-predicated-expressions]{Linear (Flat) Predicated Expressions}
+
+The update function for predicated expressions is considerably more complex than standard update
+functions for symbolic execution, such as the one used by . The main reason for this is the
+non-recursive nature of the predicated expressions. As the predicates are only available at the
+top level of the expressions, combining multiple expressions means that the conditions and multiple
+possibilities also need to trickle up to the top.
+
+If the predicated expressions were recursive instead, one would not have to change any predicates
+when evaluating the argument lists of instructions, for example, as these could just stay
+predicated. The evaluation semantics would also be much simpler, because a rule could be added for
+evaluating predicates directly.
+
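+The contrast can be sketched in Coq as follows, with \type{operation} as an assumed placeholder for
+machine operations; the recursive variant is shown only for comparison and is not the
+representation that is used.
+
+\starttyping
+(* Assumed placeholder for machine operations. *)
+Parameter operation : Type.
+
+(* Flat form: predicates appear only at the top level. *)
+Inductive expr_flat : Type :=
+| Fop : operation -> list expr_flat -> expr_flat.
+Definition pred_expr_flat := list (pred_op * expr_flat).
+
+(* Recursive alternative, for comparison only: a predicate may
+   choose between two sub-expressions anywhere in the tree. *)
+Inductive expr_rec : Type :=
+| Rop : operation -> list expr_rec -> expr_rec
+| Rchoice : pred_op -> expr_rec -> expr_rec -> expr_rec.
+\stoptyping
+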
+\section[sec:correctness]{Correctness of the Abstract Interpretation}
+
+The correctness of the scheduling step depends on the correctness of the equivalence check between
+the two abstract representations. As these contain predicates, the comparison cannot just be a
+syntactic one, because the predicates might be different but still logically equivalent. Instead, a
+mixed equivalence check is performed using a verified SAT solver and a syntactic equivalence check
+of the actual symbolic expressions.
+
+\subsection[comparing-symbolic-abstract-expressions]{Comparing Symbolic Abstract Expressions}
+
+Abstract expressions comprise a predicate and a symbolic expression. The translation from RTLBlock
+and RTLPar to Abstract ensures that each register will contain a predicated symbolic expression that
+describes the value that will be assigned to it after evaluating the basic block. This means that
+if an expression always produces the same result as another expression, then the register will
+always contain the same value. If the RTLBlock program and RTLPar program translate to two Abstract
+languages where all the registers' symbolic expressions match, then both programs will always
+produce the same values.
+
+The main comparison between two symbolic expressions can be done using a simple syntactic
+comparison, because this ensures that the two expressions always evaluate to the same value, and it
+is also general enough for the expressions that need to be compared. However, as all expressions
+might be predicated, this simple syntactic comparison cannot be performed for the predicates. This
+is because predicated instructions can be executed in many different orders, and the predicates are
+used to calculate dependence information and therefore also determine the order of operations.
+
+Instead, comparing predicates can be done by checking the logical equivalence of the predicates
+using a formally verified SAT solver. This ensures that the predicates will always evaluate to the
+same value, making the comparison sound.
+
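+For instance, two predicates $p$ and $p'$ are logically equivalent exactly when the formula
+$(p \land \neg p') \lor (\neg p \land p')$ is unsatisfiable, so a single satisfiability query
+suffices to decide the equivalence of two predicates.
+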
+\subsection[verified_sat]{Verified Decidable DPLL SAT Solver}
+
+A SAT solver is needed to correctly and reliably compare predicated expressions. This makes it
+possible to compare many different possible schedules, even ones that remove unnecessary
+expressions entirely. The SAT solver is used by the verifier, which needs to be proven correct in
+Coq to prove the scheduling translation correct. The SAT solver itself therefore also has to be
+verified, as it could otherwise not be used in the proof of correctness.
+
+The DPLL algorithm is used to implement the SAT solver. This means that the SAT solver takes in a
+conjunctive normal form (CNF) formula. The conversion from a recursive predicate type into a CNF
+formula is done by recursively expanding the predicate in equivalent CNF formulas.
+
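+A naive version of this expansion can be sketched in Coq as below, assuming that negation has
+already been pushed down to the literals; a literal is a polarity paired with a variable index.
+This distribution-based expansion can grow exponentially, but it is straightforward to verify.
+
+\starttyping
+Require Import Coq.Lists.List.
+
+(* A literal is a polarity and a variable index; a CNF formula is a
+   list of clauses, each a list of literals. *)
+Definition lit := (bool * nat)%type.
+Definition clause := list lit.
+Definition cnf := list clause.
+
+(* Predicates in negation normal form. *)
+Inductive pform : Type :=
+| Plit : lit -> pform
+| Pand' : pform -> pform -> pform
+| Por' : pform -> pform -> pform.
+
+Fixpoint to_cnf (p : pform) : cnf :=
+  match p with
+  | Plit l => ((l :: nil) :: nil)
+  | Pand' a b => to_cnf a ++ to_cnf b
+  | Por' a b =>
+      (* distribute: join every clause of [a] with every clause of [b] *)
+      map (fun '(c1, c2) => c1 ++ c2)
+          (list_prod (to_cnf a) (to_cnf b))
+  end.
+\stoptyping
+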
+The SAT solver is implemented in Coq and proven to be sound and complete. The soundness and
+completeness of an algorithm that checks whether a predicate is satisfiable directly implies that
+the check is decidable. This simplifies many proofs in Coq, as it allows a \type{Setoid} instance
+to be defined for equivalence between predicates, making it much easier to rewrite equivalent
+predicates in proofs by reusing many of the existing Coq tactics.
+
+\subsection[encoding-predicated-expressions-as-sat-formulas]{Encoding Predicated Expressions as SAT
+ Formulas}
+
+Given the following two predicated expressions which should be tested for equivalence, one can
+construct a SAT expression which will ensure that both expressions will always result in the same
+value.
%\startformula \begin{aligned}
% \bgroup \left.\begin{aligned}
@@ -465,33 +353,28 @@ that both expressions will always result in the same value.
% e_{m}',\quad &p_{m}'\\
% \end{cases}\end{aligned} \stopformula
-The SAT query that should determine if two predicated expressions are
-equivalent is the following, where $\mathit{he}_{i}$ is the hashed
-expression of $e_{i}$:
+The SAT query that should determine if two predicated expressions are equivalent is the following,
+where $\mathit{he}_{i}$ is the hashed expression of $e_{i}$:
%\startformula \begin{gathered}
% (p_{1} \implies \mathit{he}_{1}) \land (p_{2} \implies \mathit{he}_{2}) \land \cdots \land (p_{n} \implies \mathit{he}_{n}) \\
% \Longleftrightarrow\\
% (p_{1}' \implies \mathit{he}_{1}') \land (p_{2}' \implies \mathit{he}_{2}') \land \cdots \land (p_{m}' \implies \mathit{he}_{m}') \\\end{gathered} \stopformula
-However, reasoning about the equivalence of individual expressions
-within the greater list of predicated expressions is not quite
-straightforward. The evaluation semantics of predicated expression list
-will evaluate each predicate expression sequentially, and pick the
-expression whose predicate evaluates to true. However, proving that both
-predicated expression lists will return the same value assuming that the
-SAT query above is true, means that for each predicate that is true, One
-needs to be able to show that there is a predicate in the other list
-that will also be true, which will return the same value. However, there
-is no guarantee that there is a unique predicated expression that will
-match the current predicate, as there may be duplicate expressions in
-the list with different predicates.
-
-This means that it is quite difficult to prove that the SAT query
-implies the semantic preservation of the scheduling transformation. The
-SAT query can therefore be broken down into a slightly more direct
-comparison that is more useful for the proof itself. The steps to
-perform the more fine grained comparison are the following:
+However, reasoning about the equivalence of individual expressions within the greater list of
+predicated expressions is not quite straightforward. The evaluation semantics of a predicated
+expression list will evaluate each predicated expression sequentially, and pick the expression whose
+predicate evaluates to true. However, proving that both predicated expression lists will return the
+same value, assuming that the SAT query above holds, means that for each predicate that is true, one
+needs to be able to show that there is a predicate in the other list that will also be true and
+that will return the same value. However, there is no guarantee that there is a unique predicated
+expression that will match the current predicate, as there may be duplicate expressions in the list
+with different predicates.
+
+This means that it is quite difficult to prove that the SAT query implies the semantic preservation
+of the scheduling transformation. The SAT query can therefore be broken down into a slightly more
+direct comparison that is more useful for the proof itself. The steps to perform this more
+fine-grained comparison are the following:
%\startenumerate[n]
%\item
@@ -507,81 +390,65 @@ perform the more fine grained comparison are the following:
% one of the maps, then it's predicate should be equivalent to $\perp$.
%\stopenumerate
-This comparison is logically equivalent to performing the large SAT
-query, however, it maps better to the proof because it is comparing the
-predicates of the same expressions with each other.
+This comparison is logically equivalent to performing the large SAT query; however, it maps better
+to the proof because it compares the predicates of the same expressions directly with each other.
\subsection[proof-of-equivalence-between-passes]{Proof of equivalence between passes}
-Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the
-four different passes that were added to CompCert to form hyperblocks.
-This section will cover the proofs for each of the translations,
-focusing mainly on the proof of translation of the scheduling pass.
+Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the four different passes that
+were added to CompCert to form hyperblocks. This section will cover the proofs for each of the
+translations, focusing mainly on the proof of translation of the scheduling pass.
\subsubsection[proof-of-correctness-of-the-if-conversion-pass]{Proof of Correctness of the
If-Conversion Pass}
-The if-conversion pass is quite simple to verify, as it is an
-optimisation that transforms the code in the same language.
+The if-conversion pass is quite simple to verify, as it is an optimisation that transforms the code
+in the same language.
\subsubsection[proof-of-correctness-of-scheduling]{Proof of Correctness of Scheduling}
-The scheduling pass is formally verified using translation validation,
-using the Abstr language as a point of comparison between RTLBlock and
-RTLPar. Using the algorithm defined in Section~\goto{3}[abstr_interp],
-the formally verified SAT solver described in
-Section~\goto{4.2}[verified_sat] can be used prove that if the two
-predicated expression lists are equivalent, that they will always
-evaluate to the same value.
+The scheduling pass is formally verified using translation validation, using the Abstr language as a
+point of comparison between RTLBlock and RTLPar. Using the algorithm defined in
+Section~\goto{3}[abstr_interp], the formally verified SAT solver described in
+Section~\goto{4.2}[verified_sat] can be used to prove that if the two predicated expression lists
+are equivalent, they will always evaluate to the same value.
\section[implementation-details-of-static-sdc-scheduling]{Implementation Details of Static SDC
Scheduling}
-This section describes the implementation details of the static SDC
-scheduling algorithm used, and also describes some of the details of the
-basic block generation as well as the if-conversion pass which generates
-the predicated instructions.
+This section describes the implementation details of the static SDC scheduling algorithm used, as
+well as some details of the basic block generation and the if-conversion pass which generates the
+predicated instructions.
\subsection[implementation-of-if-conversion]{Implementation of If-Conversion}
-If-conversion is the conversion that introduces predicated instructions
-which can make use of the hyperblocks. It converts conditional
-statements into predicated instructions. This transformation has a few
-limitations on the kind of conditional statements that it can translate.
-First, only conditional statements without loops can be translated,
-therefore, one must identify cycles in the control-flow before
-performing the if-conversion. Secondly, if-conversion will not always
-result in more efficient code, so it should not be applied to any
-conditional statements. Instead, it is best applied to conditional
-statements where each branch will take a similar amount of time to
-execute.
-
-For back ends that do not support predicates, it is also important to
-eliminate the predicates again using a process called reverse
-if-conversion, which creates branches out of the predicates and groups
-these together again. This allows for hyperblock scheduling to also be
-used with back ends that do not support predicates, which are most the
-default back ends included in CompCert.
-
-\subsection[title={Static SDC
-Scheduling},reference={static-sdc-scheduling}]
-
-System of difference constraints (SDC) scheduling~ is a flexible
-algorithm to perform scheduling of instructions, especially if the
-target is hardware directly. It allows for flexible definitions of
-optimisation functions in terms of latencies and dependencies of the
-operations in the basic block.
-
-\section[title={Performance
-Comparison},reference={performance-comparison}]
-
-\section[title={Related Work},reference={sec:related-work}]
-
-\section[title={Conclusion},reference={conclusion}]
-
-This material is based upon work supported by the under Grant No.~ and
-Grant No.~. Any opinions, findings, and conclusions or recommendations
-expressed in this material are those of the author and do not
-necessarily reflect the views of the National Science Foundation.
+If-conversion is the transformation that introduces the predicated instructions which hyperblocks
+make use of. It converts conditional statements into predicated instructions. This transformation
+has a few limitations on the kind of conditional statements that it can translate. First, only
+conditional statements without loops can be translated, therefore one must identify cycles in the
+control-flow graph before performing the if-conversion. Secondly, if-conversion will not always
+result in more efficient code, so it should not be applied indiscriminately to every conditional
+statement. Instead, it is best applied to conditional statements where each branch will take a
+similar amount of time to execute.
+
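+As a small illustrative example, written in pseudocode rather than the actual intermediate
+representation, if-conversion turns a conditional assignment into a predicate definition followed
+by two predicated instructions:
+
+\starttyping
+// before if-conversion
+if (r1 < r2) { r3 = r1 + 1 } else { r3 = r2 + 1 }
+
+// after if-conversion
+p1 = (r1 < r2)
+p1  ? r3 = r1 + 1
+!p1 ? r3 = r2 + 1
+\stoptyping
+
+Both predicated instructions can then be placed in the same hyperblock and, since their predicates
+are mutually exclusive, even in the same clock cycle.
+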
+For back ends that do not support predicates, it is also important to eliminate the predicates again
+using a process called reverse if-conversion, which creates branches out of the predicates and
+groups these together again. This allows hyperblock scheduling to also be used with back ends that
+do not support predicates, which includes most of the default back ends in CompCert.
+
+\subsection[static-sdc-scheduling]{Static SDC Scheduling}
+
+System of difference constraints (SDC) scheduling~ is a flexible algorithm for scheduling
+instructions, especially when the target is hardware directly. It allows for flexible definitions of
+optimisation functions in terms of the latencies and dependencies of the operations in the basic
+block.
+
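+In its standard formulation, each operation $i$ is given an integer schedule variable $s_{i}$, and
+every constraint is expressed as a difference between two such variables. For example, a data
+dependency from operation $i$ to operation $j$, where $i$ has latency $l_{i}$, yields the
+constraint $s_{i} - s_{j} \leq -l_{i}$. Because every constraint has this difference form, the
+constraint matrix of the resulting linear program is totally unimodular, so an optimal integral
+schedule can be found efficiently with a standard LP solver.
+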
+\section[performance-comparison]{Performance Comparison}
+
+\section[sec:related-work]{Related Work}
+
+\section[conclusion]{Conclusion}
+
+This material is based upon work supported by the under Grant No.~ and Grant No.~. Any opinions,
+findings, and conclusions or recommendations expressed in this material are those of the author and
+do not necessarily reflect the views of the National Science Foundation.
\stopcomponent