author     Yann Herklotz <git@yannherklotz.com>  2022-04-15 11:46:10 +0100
committer  Yann Herklotz <git@yannherklotz.com>  2022-04-15 11:46:10 +0100
commit     2a716db2fb59f448590b189f7b953a80d02bc5ee (patch)
tree       46ee92387aa0180f18d19ced871dd395fb0b49de
parent     3d6c27a18bdcf77edfd746982229f609d5448432 (diff)
Add changes
-rw-r--r--  Makefile                 |   9
-rw-r--r--  chapters/background.tex  | 306
-rw-r--r--  chapters/conclusion.tex  |   7
-rw-r--r--  chapters/dynamic.tex     |   9
-rw-r--r--  chapters/hls.tex         |   5
-rw-r--r--  chapters/pipelining.tex  |   9
-rw-r--r--  chapters/schedule.tex    | 118
-rw-r--r--  chapters/scheduling.tex  | 779
-rw-r--r--  lsr_env.tex              |  99
-rw-r--r--  main.tex                 |   9
-rw-r--r--  references.bib           | 844
11 files changed, 1699 insertions, 495 deletions
diff --git a/Makefile b/Makefile
index 9898cf9..259cfa8 100644
--- a/Makefile
+++ b/Makefile
@@ -1,8 +1,7 @@
-PROJNAME=main
+.PHONY: all clean
-.PHONY: $(PROJNAME).pdf all clean
+all: main.pdf
-all: $(PROJNAME).pdf
-
-$(PROJNAME).pdf: $(PROJNAME).tex
+%.pdf: %.tex
context --nonstopmode $<
+ mv $(notdir $@) $@
diff --git a/chapters/background.tex b/chapters/background.tex
index 4636918..81ca653 100644
--- a/chapters/background.tex
+++ b/chapters/background.tex
@@ -1,7 +1,309 @@
+\environment fonts_env
+\environment lsr_env
+
\startcomponent background
-\chapter{Background}
+\chapter[sec:background]{Background}
+
+%\begin{table}
+% \centering
+% \begin{tabular}{lccccc}\toprule
+% \textbf{Tool} & \textbf{Implementation} & \textbf{Mainstream} & \textbf{Proof} & \textbf{Mechanised} & \textbf{HLS}\\
+% \midrule
+% Catapult C~\cite{mentor20_catap_high_level_synth} & \cmark{} & \cmark{} & \cmark{} & \xmark{} & \cmark{}\\
+% Chapman \textit{et al.}~\cite{chapman92_verif_bedroc} & \cmark{} & \xmark{} & \cmark{} & \cmark{} & \cmark{}\\
+% Clarke \textit{et al.}~\cite{clarke03_behav_c_veril} & \cmark{} & \cmark{} & \cmark{} & \xmark{} & \cmark{}\\
+% Ellis~\cite{ellis08} & \xmark{} & \xmark{} & \cmark{} & \cmark{} & \cmark{}\\
+% Karfa \textit{et al.}~\cite{karfa06_formal_verif_method_sched_high_synth} & \cmark{} & \xmark{} & \cmark{} & \xmark{} & \cmark{}\\
+% Kundu \textit{et al.}~\cite{kundu08_valid_high_level_synth} & \cmark{} & \xmark{} & \cmark{} & \xmark{} & \cmark\\
+% LegUp~\cite{canis13_legup} & \cmark{} & \cmark{} & \xmark{} & \xmark{} & \cmark{}\\
+% Lööw \textit{et al.}~\cite{loow19_verif_compil_verif_proces} & \cmark{} & \xmark{} & \cmark{} & \cmark{} & \xmark{}\\
+% OpenCL SDK~\cite{intel20_sdk_openc_applic} & \cmark{} & \xmark{} & \xmark{} & \xmark{} & \cmark{}\\
+% Perna \textit{et al.}~\cite{perna12_mechan_wire_wise_verif_handel_c_synth} & \cmark{} & \xmark{} & \cmark{} & \cmark{} & \xmark{}\\
+% Vivado HLS~\cite{xilinx20_vivad_high_synth} & \cmark{} & \cmark{} & \xmark{} & \xmark{} & \cmark{}\\
+% \bottomrule
+% \end{tabular}
+% \caption{Table showing the various related works and comparing them.}\label{tab:existing_tools}
+%\end{table}
+
+This chapter describes the relevant literature around high-level synthesis, verified high-level
+synthesis, and finally mechanised proofs of HLS. Table~\ref{tab:existing_tools} classifies existing
+tools in the area of formal verification and high-level synthesis. The following sections cover the
+main categories into which the tools in Table~\ref{tab:existing_tools} fall: tools that cover the
+\emph{implementation} of HLS, tools that provide a \emph{proof} of correctness, and finally tools
+whose proofs are \emph{mechanically verified}.
+
+\section{Implementation: High-level Synthesis}
+
+High-level synthesis (HLS) is the transformation of software directly into hardware. There are many
+different types of HLS, which can vary greatly in what languages they accept; however, they often
+share similar steps in how the translation is performed, as they all go from a higher-level,
+behavioural description of the algorithm to a timed hardware description. The main steps performed
+in the translation are the following~\cite{coussy09_introd_to_high_level_synth,canis13_legup}:
+
+\desc{Compilation of specification} First, the input specification has to be compiled into a
+representation that can be used to apply all the necessary optimisations. This can be an
+intermediate language such as the low-level virtual machine intermediate representation (LLVM IR),
+or a custom representation of the code.
+
+\desc{Hardware resource allocation} It is possible to select a device to target, which gives
+information about how many LUTs can be used to implement the logic. However, tools often assume
+that an unlimited amount of resources is available instead. This is because resource sharing can be
+too expensive for adders, as the logic for sharing an adder would have approximately the same cost
+as adding a separate adder. However, multipliers and divider blocks should be shared, especially as
+divider blocks are often implemented in LUTs and are therefore quite expensive.
+
+\desc{Operation scheduling} The operations then need to be scheduled into a specific clock cycle, as
+the input specification language is normally an untimed behavioural representation. This is where
+the spatial parallelism of the hardware can be taken advantage of, meaning operations that are not
+dependent on each other can run in parallel. There are various possible scheduling methods that
+could be used, such as static or dynamic scheduling, which are described further in
+sections~\ref{sec:static_scheduling} and~\ref{sec:dynamic_scheduling}.
+
+\desc{Operation and variable binding} After the operations have been scheduled, the operations and
+the variables need to be bound to hardware units and registers respectively. It is often not that
+practical to share many hardware units, as this requires adding multiplexers, which are often the
+same size as the hardware units themselves. It is therefore more practical to only share operations
+that take up a lot of space, such as modulo or divide circuits, as well as multipliers if there are
+none left on the FPGA.
+
+\desc{Hardware description generation} Finally, the hardware description is generated from the code
+that was described in the intermediate language. This is often a direct translation of the
+instructions into equivalent hardware constructs.
+
+There are many examples of existing high-level synthesis tools, the most popular ones being
+LegUp~\cite{canis13_legup}, Vivado HLS~\cite{xilinx20_vivad_high_synth}, Catapult
+C~\cite{mentor20_catap_high_level_synth} and Intel's OpenCL SDK~\cite{intel20_sdk_openc_applic}.
+These HLS tools all accept general programming languages such as C or OpenCL; however, some HLS
+tools take in languages that were designed for hardware, such as
+Handel-C~\cite{aubury96_handel_c_languag_refer_guide}, where time is encoded as part of the
+semantics.
+
+\subsection[sec:static_scheduling]{Static Scheduling}
+
+Static scheduling is used by the majority of synthesis
+tools~\cite{canis13_legup,xilinx20_vivad_high_synth,mentor20_catap_high_level_synth,intel20_sdk_openc_applic}
+and means that the time at which each operation will execute is known at compile time. Static
+analysis is used to gather all the data dependencies between all the operations to determine in
+which clock cycle the operations should be placed, or if these operations can be parallelised. Once
+the data-flow analysis has been performed, various scheduling schemes can be used to place
+operations into specific clock cycles. Some common static-scheduling schemes are the following:
+
+\desc{As soon as possible (ASAP)} scheduling places operations into the first possible clock cycle
+in which they can be executed; a sketch of this scheme is given after this list.
+
+\desc{As late as possible (ALAP)} scheduling places operations into the last possible clock cycle,
+right before they are used by later operations.
+
+\desc{List scheduling} uses priorities associated with each operation and chooses operations from
+each priority level to be placed into the current clock cycle if all the data-dependencies are
+met. This is done until all the resources for the current clock cycle are used up.
+
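+To make the ASAP scheme concrete, the following OCaml sketch (illustrative only, and not taken from
+any of the tools mentioned above) computes an ASAP schedule for a single basic block, assuming a
+data-dependency graph in which every operation takes one clock cycle:
+
+\starttyping
+(* asap_schedule deps: deps.(i) lists the instructions that
+   instruction i depends on.  The dependency graph of a basic block
+   is acyclic, so the recursion terminates.  Returns the clock cycle
+   assigned to each instruction. *)
+let asap_schedule (deps : int list array) : int array =
+  let cycle = Array.make (Array.length deps) (-1) in
+  let rec visit i =
+    if cycle.(i) < 0 then begin
+      List.iter visit deps.(i);
+      (* earliest legal cycle: one after the latest dependency *)
+      cycle.(i) <-
+        1 + List.fold_left (fun m j -> max m cycle.(j)) (-1) deps.(i)
+    end
+  in
+  Array.iteri (fun i _ -> visit i) deps;
+  cycle
+
+(* Example: instruction 2 depends on the independent instructions 0
+   and 1, so 0 and 1 are placed in cycle 0 and 2 in cycle 1. *)
+let () =
+  asap_schedule [| []; []; [0; 1] |]
+  |> Array.iteri (Printf.printf "instruction %d -> cycle %d\n")
+\stoptyping
+
+An ALAP schedule can be computed analogously by walking the graph backwards from the last use of
+each result, and list scheduling additionally checks a resource budget before placing an operation
+into the current cycle.
+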
+Static scheduling can normally produce extremely small circuits; however, it is often not possible
+to guarantee that the circuits will have the best
+throughput~\cite{cheng20_combin_dynam_static_sched_high_level_synth}, as this requires extensive
+control-flow analysis and complex optimisations. Especially for loops, finding the optimal
+initiation interval (II) can be tricky if there are loads and stores in the loop or if the loops are
+nested.
+
+\subsection[sec:dynamic_scheduling]{Dynamic Scheduling}
+
+On the other hand, dynamic scheduling~\cite{josipovic18_dynam_sched_high_level_synth} does not
+require the schedule to be known at compile time and it instead generates circuits using tokens to
+schedule the operations in parallel at run time. Whenever the data for an operation is available,
+it sends a token to the next operation, signalling that the data is ready to be read. The next
+operation does not start until all the required inputs to the operation are available, and once that
+is the case, it computes the result and then sends another token declaring that the result of that
+operation is also ready. The benefit of this approach is that only basic data-flow analysis is
+needed to connect the tokens correctly, however, the scheduling is then done dynamically at run
+time, depending on how long each primitive takes to finish and when the tokens activate the next
+operations.
+
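+As a minimal illustration of this token-based execution model (a toy simulation of the idea, not
+how any particular tool implements it in hardware), each operation can be modelled as a node that
+fires only once every one of its input ports holds a token:
+
+\starttyping
+(* A port holds at most one token carrying a value. *)
+type 'a port = 'a option ref
+
+(* try_fire ins out f: if every input port holds a token, consume
+   the tokens, apply the operation and emit a result token on the
+   output port.  Returns whether the node fired this step. *)
+let try_fire (ins : 'a port list) (out : 'a port) (f : 'a list -> 'a) =
+  if List.for_all (fun p -> !p <> None) ins then begin
+    let args = List.map (fun p -> Option.get !p) ins in
+    List.iter (fun p -> p := None) ins;  (* consume input tokens *)
+    out := Some (f args);  (* signal that the result is ready *)
+    true
+  end else
+    false  (* some operand is not yet available: keep waiting *)
+\stoptyping
+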
+The benefit of this approach over static scheduling is that the latency of these circuits is
+normally significantly lower than the latency of statically scheduled circuits, because they can
+take advantage of runtime information about the circuit. However, because of the signalling
+required to perform the runtime scheduling, the area of these circuits is usually much larger than
+the area of statically scheduled circuits. In addition to that, much more analysis is needed to
+properly parallelise loads and stores without introducing bugs, which requires the addition of
+buffers in certain locations.
+
+An example of a dynamically scheduled synthesis tool is
+Dynamatic~\cite{josipovic18_dynam_sched_high_level_synth}, which uses a load-store queue
+(LSQ)~\cite{josipovic17_out_of_order_load_store} to order memory operations correctly even when
+loops are pipelined and there are dependencies between iterations. In addition to that, performance
+of the dynamically scheduled code is improved by careful buffer
+placement~\cite{josipovic20_buffer_placem_sizin_high_perfor_dataf_circuit}, which allows for better
+parallelisation and pipelining of loops.
+
+\section{Proof: Verified HLS}
+
+Work is being done to prove the equivalence between the generated hardware and the original
+behavioural description in C. An example of a tool that implements this is Mentor's
+Catapult~\cite{mentor20_catap_high_level_synth}, which tries to match the states in the register
+transfer level (RTL) description to states in the original C code after an unverified translation.
+This technique is called translation validation~\cite{pnueli98_trans}, whereby the translation that
+the HLS tool performed is proven to have been correct for that specific run, by showing that the
+input and the output behave in the same way for all possible inputs. Translation validation is
+quite effective for proving
+complex optimisations such as
+scheduling~\cite{kim04_autom_fsmd,karfa06_formal_verif_method_sched_high_synth,chouksey20_verif_sched_condit_behav_high_level_synth}
+or code
+motion~\cite{banerjee14_verif_code_motion_techn_using_value_propag,chouksey19_trans_valid_code_motion_trans_invol_loops},
+however, the validation has to be run every time the high-level synthesis is performed. In addition
+to that, the proofs are often not mechanised or directly related to the actual implementation,
+meaning the verifying algorithm might be wrong and could give false positives or false negatives.
+
+More examples of translation validation for proofs about HLS
+algorithms~\cite{karfa06_formal_verif_method_sched_high_synth,karfa07_hand_verif_high_synth,kundu07_autom,karfa08_equiv_check_method_sched_verif,kundu08_valid_high_level_synth,karfa10_verif_datap_contr_gener_phase,karfa12_formal_verif_code_motion_techn,chouksey19_trans_valid_code_motion_trans_invol_loops,chouksey20_verif_sched_condit_behav_high_level_synth}
+are performed on an HLS tool called SPARK~\cite{gupta03_spark}. These translation validation
+algorithms can check the correctness of complicated optimisations such as code motion or loop
+inversions. However, even though the correctness of the verifier is proven in the papers, the proof
+does not translate directly to the algorithm that was implemented for the verifier. It is therefore
+possible that an output is accepted even though it is not equivalent to the input. In addition to
+that, these papers reason about the correctness of the algorithms in terms of the intermediate
+language of SPARK, and do not extend to the high-level input language that SPARK takes in, or the
+hardware description language that SPARK outputs.
+
+Finally, there has also been work proving HLS correct without using translation validation, but by
+directly showing that the translation is correct. The first instance of this is a proof that the
+BEDROC~\cite{chapman92_verif_bedroc} HLS tool is correct. This HLS tool converts a high-level
+description of an algorithm, supporting loops and conditional statements, into a netlist, and the
+output is proven to be correct. It works in two stages: first, a DFG is generated from the input
+language, HardwarePal. The DFG is then optimised to improve the routing of the design as well as
+the scheduling of operations. Finally, the netlist is generated from the DFG by placing all the
+operations that do not depend on each other into the same clock cycle. Datapath and register
+allocation is performed by an unverified clique partitioning algorithm. The equivalence proof
+between the DFG and HardwarePal is a proof by simulation, showing that, given a valid input
+configuration, applying a translation or optimisation rule results in a valid DFG with the same
+behaviour as the input.
+
+There has also been work on proving the translation from Occam to gates~\cite{page91_compil_occam}
+correct using algebraic proofs~\cite{jifeng93_towar}. This translation resembles dynamic scheduling
+as tokens are used to start the next operations. However, Occam is timed by design, and will
+execute assignments sequentially. To take advantage of the parallel nature of hardware, Occam uses
+explicit parallel constructs with channels to share state. Handel-C, a version of Occam with many
+more features such as memory and pointers, has also been used to prove HLS correct, and is described
+further in Section~\ref{sec:wire-to-wire}.
+
+\section{Mechanised: Compiler Proofs}
+
+Even though a proof of the correctness of an algorithm might exist, this does not guarantee that
+the implementation of that algorithm behaves in the same way as the algorithm assumed in the proof.
+The implementation is separate from the proof, meaning there could be various implementation bugs
+that cause the algorithm not to behave correctly. C compilers are a good example of this: many
+optimisations performed by the compilers have been proven correct, however these proofs are not
+linked directly to the actual implementations of these algorithms in GCC or Clang. Yang et
+al.~\cite{yang11_findin_under_bugs_c_compil} found more than 300 bugs in GCC and Clang, many of
+them in the optimisation phases of the compiler. One way to link the proofs to the actual
+implementations in these compilers is to write the compiler in a language which allows a theorem
+prover to check properties about the algorithms. This allows the proofs to be directly linked to
+the algorithms, ensuring that the actual implementations are proven correct. Yang et
+al.~\cite{yang11_findin_under_bugs_c_compil} found that CompCert, a formally verified C compiler,
+only had five bugs, all in unverified parts of the compiler, suggesting that this method of proving
+algorithms correct leads to a much more reliable compiler.
+
+\subsection{CompCert}
+
+CompCert~\cite{leroy09_formal_verif_compil_back_end} is a formally verified C compiler written in
+Coq~\cite{bertot04_inter_theor_provin_progr_devel}. Coq is a theorem prover, meaning algorithms
+written in Coq can be reasoned about in Coq itself by proving various properties about them. To
+run these algorithms once they have been proven correct, they can be extracted directly to OCaml
+code and then executed, as there is a straightforward correspondence between Coq and OCaml code.
+During this translation, all the proofs are erased, as they are not needed during the execution of
+the compiler; they are only needed when the correctness of the compiler is being checked. With this
+process, one obtains a compiler that satisfies various correctness properties and can therefore be
+proven to preserve the behaviour of the code.
+
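+As a small, hypothetical illustration of what extraction produces, a Coq definition such as
+\type{Definition add_one (n : nat) : nat := S n.}, together with a lemma about it such as
+\type{Lemma add_one_gt : forall n, n < add_one n.}, extracts to the plain OCaml below; the lemma is
+a proof term, so it is erased and leaves no trace in the extracted code:
+
+\starttyping
+(* Coq's nat is extracted to a unary-number datatype by default. *)
+type nat =
+  | O
+  | S of nat
+
+(* The extracted function; the proof of add_one_gt has been erased. *)
+let add_one n = S n
+\stoptyping
+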
+CompCert contains eleven intermediate languages, which are used to gradually translate C code into
+assembly that has the same behaviour. Proving the translation directly without going through the
+intermediate languages would be infeasible, especially with the many optimisations that are
+performed during the translation, as assembly is very different to the abstract C code the compiler
+receives. The first few intermediate languages (C\#minor, Cminor, CminorSel) are used to transform
+Clight into a more assembly-like language called register transfer language (RTL). This language
+consists of a control-flow graph of instructions and is therefore well suited for various compiler
+optimisations such as constant propagation, dead-code elimination or function
+inlining~\cite{tristan08_formal_verif_trans_valid}. After RTL, each intermediate language is used
+to get closer to the assembly language of the architecture, performing operations such as register
+allocation and making sure that the stack variables are correctly aligned.
+
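+To give a flavour of RTL, the sketch below models a function body as a finite map from CFG nodes to
+instructions, where each instruction names its successor node(s). This is a simplification: the
+real definition in CompCert has more instruction forms and its own operation, condition and
+addressing types, which are abbreviated to strings here.
+
+\starttyping
+type node = int
+type reg = int
+
+(* A cut-down version of RTL instructions. *)
+type instruction =
+  | Inop of node                              (* do nothing, go to node *)
+  | Iop of string * reg list * reg * node     (* dst := op(args)        *)
+  | Icond of string * reg list * node * node  (* two-way branch         *)
+  | Ireturn of reg option
+
+(* A function body: a finite map from CFG nodes to instructions. *)
+module NodeMap = Map.Make (Int)
+type code = instruction NodeMap.t
+\stoptyping
+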
+\subsection{Isabelle HLS}
+
+Martin Ellis' work on correct synthesis~\cite{ellis08} is the first example of a mechanically
+verified high-level synthesis tool, also called a hardware/software compiler, as it produces
+software as well as hardware for special functional units that should be hardware accelerated. The
+main goal of the thesis is to provide a framework for proving hardware/software compilers correct,
+by defining semantics for an intermediate language (IR) which supports partitioning of code into
+hardware and software parts, as well as a custom netlist format which is used to describe the
+hardware parts and is the final target of the hardware/software compiler. The correctness proof
+then states that the hardware/software design has the same behaviour as if the design had only
+been implemented in software. The framework used to prove the correctness of the compilation from
+the IR to the netlist format is written in Isabelle, a theorem prover comparable to Coq. Any
+proofs in the framework are therefore automatically checked by Isabelle for correctness.
+
+This work first defines a static single assignment (SSA) IR with the capability of marking
+sections that should be hardware accelerated and sections that should be executed on the CPU. The
+language supports \emph{hyperblocks}, which are lists of basic blocks with one entry node and
+multiple exit nodes, and which are well suited for various HLS optimisations and better scheduling.
+However, as this is quite a low-level intermediate language, the first thing that would be required
+is a tool that can actually translate a high-level language to this intermediate language. In
+addition to that, they also describe a netlist format in which the hardware-accelerated code will
+be represented.
+
+As both the input IR and the output netlist format have been designed from scratch, they are not
+very useful for real-world applications, as they require a different back end to be implemented in
+existing compilers. In addition to that, it most likely means that the translation from a
+higher-level language to the IR is unverified and could therefore contain bugs. As the resulting
+netlist also uses a custom format, it cannot be fed directly to tools that translate it to a
+bitstream to place onto an FPGA. The reason for designing a custom IR and netlist was to make the
+two compatible with each other, making proofs of equivalence between them simpler.
+
+Finally, it is unclear whether or not a translation algorithm from the IR to the netlist format was
+actually implemented, as the only example in the thesis seems to have been compiled by hand to
+explain the proof. There are also no benchmarks on real input programs showing the efficiency of
+the translation algorithm, and it is therefore unclear whether the framework would be able to prove
+more complicated optimisations that a compiler might perform on the source code. The thesis seems
+to describe the correctness proofs by assuming a compiler exists which outputs various properties
+that are needed by the equivalence proof, such as a mapping between variables and netlist wires and
+registers.
+
+Even though this Isabelle framework does provide an equivalence relation between an IR and a
+netlist, and describes how the translation would be proven correct by matching states to one
+another, it is not usable in practice. First of all, the IR is only represented in Isabelle and
+does not have a parser or a compiler which can target it. In addition to that, the netlist format
+cannot be passed to any existing tool to be placed onto an FPGA, meaning an additional, unverified
+translation would have to take place. This further reduces the effectiveness of the correctness
+proof, as there are various stages that are not addressed by the proofs and therefore have to be
+assumed to be correct.
+
+\subsection[sec:wire-to-wire]{Mechanised Handel-C to Netlist Translation}
+
+Handel-C~\cite{bowen98_handel_c_languag_refer_manual} is a C-like language for hardware development.
+It supports many C features such as assignments, if-statements, loops, pointers and functions. In
+addition to these constructs, Handel-C also supports explicit timing constructs, such as explicit
+parallelism as well as sequential or parallel assignment, similar to blocking and nonblocking
+assignments in Verilog. Perna et al.~\cite{perna12_mechan_wire_wise_verif_handel_c_synth} developed
+a mechanically verified Handel-C to netlist translation written in HOL. The translation is based
+on previous work describing the translation from Occam to gates by Page et
+al.~\cite{page91_compil_occam}, which was proven correct by Jifeng et al.~\cite{jifeng93_towar}
+using algebraic proofs. As Handel-C is mostly an extension of Occam with C-like operations, the
+translation from Handel-C to gates can proceed in a similar way.
+
+Perna et al. verify the compilation of a subset of Handel-C to gates, which does not include
+memory, arrays or function calls. In addition to the constructs presented by Page et al., the
+prioritised choice construct is also added to the supported Handel-C subset. The verification
+proceeds by first defining the algorithm that performs the compilation, chaining circuits together
+with start and end signals that determine which construct will be executed next. The circuits
+themselves are treated as black boxes by the compilation algorithm and are chosen based on the
+Handel-C statement currently being translated. The compilation algorithm then has to correctly
+link each black box to the next using the start and end signals that the circuits expose.
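+
+As an illustration of this chaining scheme (a sketch with assumed names, not Perna et al.'s actual
+formalisation), each compiled construct can be modelled as a black box exposing a start and an end
+wire, and sequential composition simply connects the end signal of the first box to the start
+signal of the second:
+
+\starttyping
+type wire = int
+
+(* A compiled construct: a black box with a start and an end signal,
+   together with the wire connections made so far. *)
+type circuit = { start : wire; finish : wire; links : (wire * wire) list }
+
+(* Sequential composition: the end signal of c1 triggers c2, so the
+   combined circuit starts with c1 and finishes with c2. *)
+let seq (c1 : circuit) (c2 : circuit) : circuit =
+  { start = c1.start;
+    finish = c2.finish;
+    links = (c1.finish, c2.start) :: c1.links @ c2.links }
+\stoptyping
+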
-aroistenaoirstenoiaresntoien
+The verification of the compilation is done by proving that the control signal is propagated
+through the circuits correctly, and that each circuit is activated at the right time. It is also
+proven that the internal circuits of the constructs propagate the control signal in the correct
+way, and are active at the right clock cycle. However, it is not proven that the circuits have the
+same functionality as the Handel-C constructs, only that the control signals are propagated in the
+correct manner. In addition to that, the assumption is made that the semantics for time are the
+same in Handel-C and in the generated netlist format, which could be proven by showing that the
+circuits behave in exactly the same way as the Handel-C constructs.
\stopcomponent
diff --git a/chapters/conclusion.tex b/chapters/conclusion.tex
index 6422396..91ca14f 100644
--- a/chapters/conclusion.tex
+++ b/chapters/conclusion.tex
@@ -1 +1,8 @@
+\environment fonts_env
+\environment lsr_env
+
+\startcomponent conclusion
+
\chapter{Conclusion}
+
+\stopcomponent
diff --git a/chapters/dynamic.tex b/chapters/dynamic.tex
index a1744e8..5edd812 100644
--- a/chapters/dynamic.tex
+++ b/chapters/dynamic.tex
@@ -1 +1,8 @@
-\chapter{\ymhfw{} Dynamic Scheduling}
+\environment fonts_env
+\environment lsr_env
+
+\startcomponent dynamic
+
+\chapter[sec:dynamic]{Dynamic Scheduling}
+
+\stopcomponent
diff --git a/chapters/hls.tex b/chapters/hls.tex
index 362309f..8ff7143 100644
--- a/chapters/hls.tex
+++ b/chapters/hls.tex
@@ -1,8 +1,5 @@
-\environment fonts_env
-\environment lsr_env
-
\startcomponent hls
-\product main
+\product ../main.tex
\def\slowdownOrig{27}
\def\slowdownDiv{2}
diff --git a/chapters/pipelining.tex b/chapters/pipelining.tex
index 4308188..5e07bd7 100644
--- a/chapters/pipelining.tex
+++ b/chapters/pipelining.tex
@@ -1 +1,8 @@
-\chapter{\ymhfw{} Loop Pipelining}
+\environment fonts_env
+\environment lsr_env
+
+\startcomponent pipelining
+
+\chapter[sec:pipelining]{Loop Pipelining}
+
+\stopcomponent
diff --git a/chapters/schedule.tex b/chapters/schedule.tex
index 9dc8d38..1711108 100644
--- a/chapters/schedule.tex
+++ b/chapters/schedule.tex
@@ -1 +1,117 @@
-\chapter{Schedule}
+\environment fonts_env
+\environment lsr_env
+
+\startcomponent schedule
+
+\definecolor[hyperblockcolour][x=66C2A5]
+\definecolor[dynamiccolour][x=FC8D62]
+\definecolor[modulocolour][x=8DA0CB]
+\definecolor[thesiscolour][x=E78AC3]
+
+\chapter[sec:schedule]{Schedule}
+
+The structure of the thesis has been described in the previous chapters. Out of these chapters, the
+work for \in{Chapter}[sec:hls] has been completed, the work for \in{Chapter}[sec:scheduling] is
+still ongoing and finally the work in \in{Chapter}[sec:pipelining] and \in{Chapter}[sec:dynamic] has
+not been started yet. This chapter will describe how the rest of the work will be completed. The
+current plan is to finish all the work including the thesis by June 2023, which is when the funding
+will run out.
+
+\section{Timeline}
+
+The main timeline for the rest of the PhD is shown in \in{Figure}[fig:future-gantt]. It is split
+into four main sections:
+
+\desc{Hyperblock scheduling} The first section describes the ongoing work on hyperblock
+scheduling, which is presented in \in{Chapter}[sec:scheduling]. Currently, the main proofs need to
+be completed, as well as the submission of a paper about the work to FMCAD/POPL/PLDI. In general,
+this work was supposed to be finished in November 2021; however, due to complications with the
+actual formulation of the semantics, as well as the number of proofs that need to be completed, the
+work is still ongoing.
+
+\desc{Modulo scheduling} The second piece of work is described in \in{Chapter}[sec:pipelining] and
+describes the implementation of modulo scheduling to perform scheduling on loops. This work will be
+based on existing work by \cite[alternative=authoryears][tristan10_simpl_verif_valid_softw_pipel].
+However, it is worth noting that the proof by
+\cite[alternative=author][tristan10_simpl_verif_valid_softw_pipel] does include parts that are not
+completely verified in Coq, meaning there may be various improvements that could be made to the
+proof.
+
+\desc{Dynamic scheduling} Dynamic scheduling is another route that could be explored by using
+Vericert to pass through \SSA, as well as GSA, and finally generate dynamic HTL (or some other
+similar hardware language). During my visit at Inria Rennes, I worked with Sandrine Blazy and
+Delphine Demange to extend CompCertSSA~\cite[barthe14_formal_verif_ssa_based_middl_end_compc] with
+\infull{GSA} (\GSA)~\cite[ottenstein90_progr_depen_web]. Not only should this allow for global
+optimisations of \SSA\ code, but it should also allow for a more straightforward translation to
+hardware as the GSA code can just be viewed as a \DFG. Instead of working on modulo scheduling, it
+could therefore be interesting to work on a formally verified dynamic scheduling translation.
+
+\startplacefigure[
+ title={Gantt chart describing the future plan for 2022 and 2023. The main funding runs out
+ mid-2023, so the plan is to finish the thesis writing by then as well.},
+ reference={fig:future-gantt},
+ location=top,
+]
+ \hbox{\startgantt[xunitlength=17.5mm]{15}{6}
+ \startganttitle
+ \numtitle{2022}{1}{2022}{4}
+ \numtitle{2023}{1}{2023}{2}
+ \stopganttitle
+ \startganttitle
+ \numtitle{1}{1}{4}{1}
+ \numtitle{1}{1}{2}{1}
+ \stopganttitle
+ \ganttgroup{\rmx Hyperblock Scheduling}{0}{3}
+ \ganttbar[color=hyperblockcolour]{\rmx Proof}{0}{2}
+ \ganttbarcon[color=hyperblockcolour]{\rmx Paper Submission}{2}{1}
+ \ganttgroup{\rmx Modulo Scheduling}{2}{3}
+ \ganttbar[color=modulocolour]{\rmx Implementation}{2}{1}
+ \ganttbarcon[color=modulocolour]{\rmx Proof}{3}{2}
+ \ganttgroup{\rmx Dynamic Scheduling}{0}{5}
+ \ganttbar[color=dynamiccolour]{\rmx RTL $\rightarrow$ GSA}{0}{1}
+ \ganttbarcon[color=dynamiccolour]{\rmx GSA $\rightarrow$ RTL}{1}{1}
+ \ganttbar[color=dynamiccolour]{\rmx GSA $\rightarrow$ HTL}{2}{3}
+ \ganttcon{1}{9}{2}{11}
+ \ganttgroup{\rmx Thesis}{4}{2}
+ \ganttbar[color=thesiscolour]{\rmx Thesis Writing}{4}{2}
+ \stopgantt}
+\stopplacefigure
+
+\section{Detailed Work Plan}
+
+This section describes a more detailed work plan for finalising the thesis. First I will describe
+what the current plan is to finish the short term work of proving hyperblock scheduling, followed by
+a plan of either tackling dynamic scheduling or modulo scheduling.
+
+\subsection{Finishing Work on Hyperblock Scheduling}
+
+As described in \in{Chapter}[sec:scheduling], the current milestone that needs to be achieved is
+proving hyperblock scheduling. This requires proofs of the following translations:
+
+\desc{\sc RTL $\rightarrow$ RTLBlock} This translation generates basic blocks to ease the scheduling
+proof as well as simplify the scheduling algorithm.
+
+\desc{\sc IfConv} The {\sc IfConv} transformation introduces predicated instructions by converting
+conditional statements. This is necessary to generate hyperblocks and thereby improve the schedule
+by passing larger blocks to the scheduler; a sketch of the transformation is given after this list.
+
+\desc{\sc RTLBlock $\rightarrow$ RTLPar} The proof of the scheduling algorithm itself, which is
+performed by validating the schedule and verifying that the validator is sound.
+
+\desc{\sc RTLPar $\rightarrow$ HTL} This translation generates the hardware from the parallel
+description of the schedule, which can then be translated to Verilog using an existing translation.
+As a translation from RTL to HTL has already been proven correct, it should be possible to reuse
+many proofs from that translation and only have to modify the top-level proofs that deal with the
+slightly different structure of the program.
+
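+To illustrate what the {\sc IfConv} transformation produces, the hypothetical sketch below uses a
+cut-down version of the RTLBlock instructions (not the Coq definitions): a conditional such as
+\type{if (x < y) then z := x + y else z := x - y} becomes a single hyperblock in which a predicate
+is set once and both assignments are kept, guarded by the predicate and its negation respectively.
+
+\starttyping
+type reg = int
+type pred = Pvar of int | Pnot of pred
+
+(* Cut-down standard instructions: an optional predicate guards each
+   instruction, and operations are abbreviated to strings. *)
+type instr =
+  | RBsetpred of pred option * string * reg list * int
+  | RBop of pred option * string * reg list * reg
+
+let x, y, z, p = 1, 2, 3, 0
+
+let if_converted : instr list =
+  [ RBsetpred (None, "lt", [x; y], p);            (* p := x < y       *)
+    RBop (Some (Pvar p), "add", [x; y], z);       (* p  -> z := x + y *)
+    RBop (Some (Pnot (Pvar p)), "sub", [x; y], z) (* !p -> z := x - y *) ]
+\stoptyping
+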
+\subsection{Starting Work on Modulo Scheduling}
+
+After the proofs about hyperblock scheduling are finished, a natural extension is to work on
+implementing modulo scheduling, which entails rearranging instructions in loops.
+
+\subsection{Possible Implementation for Dynamic Scheduling}
+
+Another plausible extension to Vericert is adding dynamic scheduling, as an alternative, or in
+addition to, static scheduling.
+
+\stopcomponent
diff --git a/chapters/scheduling.tex b/chapters/scheduling.tex
index 5edad2e..2201f72 100644
--- a/chapters/scheduling.tex
+++ b/chapters/scheduling.tex
@@ -1,320 +1,235 @@
+\environment fonts_env
+\environment lsr_env
+
\startcomponent scheduling
-\project main
-\chapter{Static Scheduling}
+\chapter[sec:scheduling]{Static Scheduling}
\section[introduction]{Introduction}
-The use of multicore processors has been increasing drastically, thereby
-making parallelising compilers ever more important. In addition to that,
-the need for custom hardware accelerators is also increasing, as
-traditional processors are not always the best choice for all
-applications. Compilers that support optimisations which can
-automatically parallelise it's input programs are therefore also
-becoming more important, as these back ends all benefit from it. For
-processors, one wants to optimise the pipeline usage as much as
-possible, whereas in hardware one can create custom pipelines, as well
-as making more use out of the spacial properties that hardware offers,
-making instruction parallelism important as well as speculative
-execution. However, with more powerful optimisations there is a greater
-chance to make mistakes in the compiler, making it possible to generate
-programs that behave differently to the input program. An important step
-towards making compilers more reliable is to formally verify these
-optimisations so that it can be guaranteed that bugs are not introduced.
-
-A popular optimisation for processors increasing the instruction
-parallelism of the resulting program is to use scheduling. This is an
-optimisation that is especially important for processors that have deep
-instruction pipelines or multiple functional units that can be used to
-execute instructions in parallel. However, this optimisation is also
-critical for high-level synthesis tools, which generate application
-specific hardware designs based on a software input. Simple scheduling
-algorithms often work on basic blocks, which are sequential lists of
-instructions without any branches. However, this limits what kind of
-reordering the scheduling algorithm can perform. It can never move
-instructions past a branching instruction, even when this might
-drastically improve the performance. Trace scheduling~ is an alternative
-that addresses this issue, by creating paths that might cross branching
-instructions, and scheduling the instructions in that path. However, the
-main problem with trace scheduling is that it is often infeasible on
-large programs, and because of the large design space, it's sometimes
-hard to find an optimal schedule. Superblock~ and hyperblock~ scheduling
-are two subsets of trace scheduling, which compromise on the flexibility
-of trace scheduling to instead build more tractable algorithms.
-Superblock scheduling produces more efficient code for processors with a
-low issue rate, whereas hyperblock scheduling produces more performant
-code for processors with a high issue rate~.
-
-CompCert~ is a compiler that is formally verified in Coq. In addition to
-that, have also added a formally verified hardware back end to CompCert,
-allowing it to also generate provably correct hardware designs. A
-hardware back end is an ideal target for hyperblock scheduling, as the
-issue rate is practically infinite. This paper describes the
-implementation of a hyperblock instruction scheduler in CompCert so that
-it can benefit existing CompCert back ends, as well as support more
+The use of multicore processors has been increasing drastically, thereby making parallelising
+compilers ever more important. In addition to that, the need for custom hardware accelerators is
+also increasing, as traditional processors are not always the best choice for all
+applications. Compilers that support optimisations which can automatically parallelise their input
+programs are therefore also becoming more important, as these back ends all benefit from them. For
+processors, one wants to optimise the pipeline usage as much as possible, whereas in hardware one
+can create custom pipelines, as well as making more use of the spatial properties that hardware
+offers, making instruction parallelism important, as well as speculative execution. However, with
+more powerful optimisations there is a greater chance of making mistakes in the compiler, making it
+possible to generate programs that behave differently to the input program. An important step
+towards making compilers more reliable is to formally verify these optimisations so that it can be
+guaranteed that bugs are not introduced.
+
+A popular optimisation for processors increasing the instruction parallelism of the resulting
+program is to use scheduling. This is an optimisation that is especially important for processors
+that have deep instruction pipelines or multiple functional units that can be used to execute
+instructions in parallel. However, this optimisation is also critical for high-level synthesis
+tools, which generate application specific hardware designs based on a software input. Simple
+scheduling algorithms often work on basic blocks, which are sequential lists of instructions without
+any branches. However, this limits what kind of reordering the scheduling algorithm can perform. It
+can never move instructions past a branching instruction, even when this might drastically improve
+the performance. Trace scheduling~ is an alternative that addresses this issue, by creating paths
+that might cross branching instructions, and scheduling the instructions in that path. However, the
+main problem with trace scheduling is that it is often infeasible on large programs, and because of
+the large design space, it is sometimes hard to find an optimal schedule. Superblock~ and hyperblock~
+scheduling are two subsets of trace scheduling, which compromise on the flexibility of trace
+scheduling to instead build more tractable algorithms. Superblock scheduling produces more
+efficient code for processors with a low issue rate, whereas hyperblock scheduling produces more
+performant code for processors with a high issue rate~.
+
+CompCert~ is a compiler that is formally verified in Coq. In addition to that, we have also added a
+formally verified hardware back end to CompCert, allowing it to also generate provably correct
+hardware designs. A hardware back end is an ideal target for hyperblock scheduling, as the issue
+rate is practically infinite. This chapter describes the implementation of a hyperblock instruction
+scheduler in CompCert so that it can benefit existing CompCert back ends, as well as support more
general targets such as scheduling instructions for custom hardware.
\subsection[key-points]{Key Points}
-The goal of this paper is to introduce hyperblock scheduling into
-CompCert to allow for advanced and flexible scheduling operations that
-could be beneficial to various different back ends. To this end, the key
-points of this paper are the following:
+The goal of this chapter is to introduce hyperblock scheduling into CompCert, allowing for advanced
+and flexible scheduling operations that could be beneficial to various different back ends. To this
+end, the key points of this chapter are the following:
\startitemize
\item
- Description of how hyperblocks are introduced for improved scheduling
- of instructions, and how these can be built and translated back to
- instructions without predicates.
+ Description of how hyperblocks are introduced for improved scheduling of instructions, and how
+ these can be built and translated back to instructions without predicates.
\item
- Description of the implementation and correctness proof of a system of
- difference constraints (SDC) scheduling algorithm, which can be
- adapted to various different scheduling strategies. {\bf {[}{[}JW:}
- What's the link between hyperblocks and SDC-based scheduling? Are
- these orthogonal?{\bf {]}{]}}{\bf {[}{[}YH:} Well the original SDC
- papers only talk about basic blocks, and not SDC hyperblock
- scheduling. But it's easy enough to adapt. SDC just requires some kind
- of blocks without control-flow.{\bf {]}{]}} {\bf {[}{[}JW:} Ok cool,
- so you can have SDC scheduling without hyperblocks. Can you have
- hyperblocks without SDC scheduling? (I guess so, but just wanted to be
- completely sure.){\bf {]}{]}}{\bf {[}{[}YH:} Yes exactly, you can
- perform any type of basic block scheduling on hyperblocks as well I
- think, as long as you mark the dependencies correctly. But in the
- worst case you could even just see hyperblocks as basic blocks and
- ignore predicates. It would still be correct but just less
- efficient.{\bf {]}{]}}
-
- {\bf {[}{[}JW:} It would be interesting to make the case for SDC-based
- scheduling. Is it superior to other scheduling mechanisms? Is it so
- much superior that it's worth the additional complexity, compared to,
- say, list scheduling? {\bf {]}{]}}{\bf {[}{[}YH:} SDC scheduling is
- more flexible with the type of constraint to optimise for (latency /
- throughput). I don't think it's too meaningful for compilers, where
- there isn't as much flexibility.{\bf {]}{]}} {\bf {[}{[}JW:} Thanks.
- Does Vericert have any idea that the scheduling is being performed
- using SDC? Or is the SDC stuff entirely internal to your unverified
- OCaml code? That is, could you swap your scheduler with a different
- one that isn't based on SDC, and not have to change anything in
- Vericert?{\bf {]}{]}}{\bf {[}{[}YH:} The verified part doesn't know
- about SDC at all, so it could be changed to any other scheduler that
- acts on hyperblocks.{\bf {]}{]}}
+  Description of the implementation and correctness proof of a system of difference constraints
+  (SDC) scheduling algorithm, which can be adapted to various different scheduling strategies; a
+  sketch of the underlying constraint system is given after this list.
\stopitemize
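+
+As a sketch of the constraint system such a scheduler is built on (the exact formulation used in
+the implementation may differ), every instruction $i$ is assigned an integer start cycle $s_{i}$,
+and every data dependency from instruction $i$, with latency $\lambda_{i}$, to instruction $j$
+contributes one difference constraint:
+
+\startformula
+s_{j} - s_{i} \geq \lambda_{i}
+\stopformula
+
+Because each constraint relates exactly two variables, the resulting linear program has integral
+optimal solutions, so a valid schedule can be read off directly from an LP solver, and different
+goals such as minimising latency can be expressed by changing the objective function over the
+$s_{i}$.
+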
\section[sec:scheduling]{Implementation of Hyperblocks in CompCert}
This section describes the structure of hyperblocks in
-Section~\goto{2.1}[scheduling:hyperblocks]. Then, the structure of two
-extra intermediate languages that were added to CompCert to implement
-hyperblocks are also shown in Section~\goto{2.2}[sec:rtlblockdef].
+\in{Section}[scheduling:hyperblocks]. Then, the structure of the two extra intermediate languages
+that were added to CompCert to implement hyperblocks is shown in
+\in{Section}[sec:rtlblockdef].
\subsection[scheduling:hyperblocks]{Hyperblocks as a Generalisation of Basic Blocks}
-Basic blocks are popular in intermediate languages because they describe
-a list of sequential instructions that are not separated by
-control-flow. This information can be used by various analysis passes to
-perform optimisations, as these sequences of instructions can be
-reordered in any way as long as their data dependencies are met. One
-such optimisation is scheduling, which reorders instructions to
-eliminate of any delay slots caused by the CPU pipeline.
-
-However, due to control-flow not being allowed in basic blocks, the
-number of instructions that can be scheduled at any one time is limited.
-The boundaries to other basic blocks act as hard boundary to the
-scheduling algorithm. hyperblocks~ allow instructions to be predicated,
-thereby introducing some control-flow information into the blocks, which
-can also be taken into account by the scheduler. For example, any data
-dependency between instructions that have mutually exclusive predicates
-can be removed, as these instructions will never be activate at the same
-time.
-
-\subsection[sec:rtlblockdef]{RTLBlock and RTLPar Intermediate Language Definition}
+Basic blocks are popular in intermediate languages because they describe a list of sequential
+instructions that are not separated by control-flow. This information can be used by various
+analysis passes to perform optimisations, as these sequences of instructions can be reordered in any
+way as long as their data dependencies are met. One such optimisation is scheduling, which reorders
+instructions to eliminate any delay slots caused by the CPU pipeline.
-Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the
-intermediate languages introduced to implement hyperblocks in CompCert.
-This consists of new RTLBlock and RTLPar intermediate languages, which
-implement the sequential and parallel semantics of basic blocks
-respectively. The semantics of RTLBlock and RTLPar are based on the
-CompCert RTL intermediate language. However, instead of mapping from
-states to instructions, RTLBlock maps from states to hyperblocks, and
-RTLPar maps from states to parallel hyperblocks, which will be described
-in the next sections.
+However, due to control-flow not being allowed in basic blocks, the number of instructions that can
+be scheduled at any one time is limited. The boundaries to other basic blocks act as a hard boundary
+for the scheduling algorithm. Hyperblocks~ allow instructions to be predicated, thereby introducing
+some control-flow information into the blocks, which can also be taken into account by the
+scheduler. For example, any data dependency between instructions that have mutually exclusive
+predicates can be removed, as these instructions will never be active at the same time.
-%\startformula \begin{aligned}
-% i\ ::=&\ \ \mono{RBnop} \\
-% &|\ \mono{RBop}\ p?\ \mathit{op}\ \vec{r}\ d \\
-% &|\ \mono{RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \\
-% &|\ \mono{RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \\
-% &|\ \mono{RBsetpred}\ p?\ c\ \vec{r}\ d \\
-% \end{aligned} \stopformula
-%
-%\startformula \begin{aligned}
-% i_{\mathit{cf}}\ ::=&\ \ \mono{RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \\
-% &|\ \mono{RBtailcall}\ \mathit{sig}\ f\ \vec{r} \\
-% &|\ \mono{RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \\
-% &|\ \mono{RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \\
-% &|\ \mono{RBjumptable}\ r\ \vec{n} \\
-% &|\ \mono{RBreturn}\ r? \\
-% &|\ \mono{RBgoto}\ n \\
-% &|\ \mono{RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \\
-% \end{aligned} \stopformula
+\subsection[sec:rtlblockdef]{RTLBlock and RTLPar Intermediate Language Definition}
-RTLBlock instructions are split into two types of instructions, standard
-instructions and control-flow instructions. The standard instructions
-are the instructions that can be placed into hyperblocks, whereas
-control-flow instructions are the instructions that can end hyperblocks.
-The standard instructions are shown in
-Figure~\goto{{[}fig:standard_instruction{]}}[fig:standard_instruction],
-whereas the control-flow instructions are shown in
-Figure~\goto{{[}fig:control_flow_instruction{]}}[fig:control_flow_instruction].
-Most instructions are quite similar to their RTL counterparts, however,
-there are some instructions that have been added. \type{RBsetpred} is a
-standard instruction which sets a predicate equal to an evaluated
-condition. This instruction is used during if-conversion to set the
-predicate to the value of the condition in the conditional statement
-instruction. To support proper predicated instructions, each instruction
-therefore also contains an optional predicate which has been set
-beforehand, which produces the hyperblock. In additon to that, there is
-also an extra control-flow instruction called \type{RBpred_cf}, which
-can branch on a predicate an nests more control-flow instructions inside
-of it. This is also necessary for if-conversion, when converting an
-already converted conditional statement, as depending on the predicate,
-a different control-flow instruction may be necessary. {\bf {[}{[}JW:} I
-think there'd be a lot of value in having a picture showing op-chaining
-in action. See tex source for the kind of thing I mean. Probably best
-done as a running example throughout the paper. {\bf {]}{]}}
-
-These instructions are use in RTLBlock as well as in RTLPar. The main
-difference between these two languages is how these instructions are
-arranged within the hyperblock and the execution semantics of the
-hyperblock in the languages. Let $\mathit{hb}_{b}$ be the definition of
-a hyperblock for RTLBlock, and let $\mathit{hb}_{p}$ be the definition
-of a hyperblock for RTLPar, then $\mathit{hb}_{b}$ and $\mathit{hb}_{p}$
-be defined as in Figure~\goto{{[}fig:hb_def{]}}[fig:hb_def]. The Figure
-shows the different nesting levels of the basic blocks in RTLBlock as
-well as RTLPar, and where the parallel semantics of RTLPar come into
-play. RTLBlock is made of a list of instructions and a control-flow
-instruction that ends the hyperblock. Each instruction in RTLBlock is
-executed sequentially. RTLPar is made of three separate lists. The outer
-list behaves like the list in RTLBlock, and executes each list inside of
-it sequentially. RTLPar then has a list which executes it's contents in
-parallel, meaning each instruction uses the same state of the registers
-and memory while executing. Finally, instead of directly containing
-instructions, there is an additional layer of sequentially executing
-lists of instructions, to be able to execute small groups of
-instructions in sequence, but still in parallel to the other
-instructions in the current parallel group.
-
-RTLPar is structured in this way so that the scheduler has more
-flexibility, so that it can not only parallelise operations in the basic
-blocks, but that it can also define sequential operations in these
-parallel operations. This is especially useful for the hardware back
-end, as it means that sequential operations that do not quite fill one
-clock cycle can be executed sequentially, but can still be executed in
-parallel with other independent operations. This optimisation is called
-operation chaining, and is critical to get the most performance out of
-the hardware.
+\in{Figure}[fig:compcert_interm] shows the intermediate languages introduced to implement
+hyperblocks in CompCert. These consist of the new RTLBlock and RTLPar intermediate languages, which
+implement the sequential and parallel semantics of basic blocks respectively. The semantics of
+RTLBlock and RTLPar are based on the CompCert RTL intermediate language. However, instead of mapping
+from states to instructions, RTLBlock maps from states to hyperblocks, and RTLPar maps from states
+to parallel hyperblocks, which will be described in the next sections.
+
+\placeformula \startformula \startmathalignment
+\NC i\ \ \colon\colon= \NC \ \ \mono{RBnop} \NR
+\NC \NC |\ \ \mono{RBop}\ p?\ \mathit{op}\ \vec{r}\ d \NR
+\NC \NC |\ \ \mono{RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \NR
+\NC \NC |\ \ \mono{RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \NR
+\NC \NC |\ \ \mono{RBsetpred}\ p?\ c\ \vec{r}\ d \NR
+\stopmathalignment \stopformula
+
+\placeformula \startformula \startmathalignment
+\NC i_{\mathit{cf}}\ \ \colon\colon= \NC \ \ \mono{RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \NR
+\NC \NC |\ \ \mono{RBtailcall}\ \mathit{sig}\ f\ \vec{r} \NR
+\NC \NC |\ \ \mono{RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n \NR
+\NC \NC |\ \ \mono{RBcond}\ c\ \vec{r}\ n_{1}\ n_{2} \NR
+\NC \NC |\ \ \mono{RBjumptable}\ r\ \vec{n} \NR
+\NC \NC |\ \ \mono{RBreturn}\ r? \NR
+\NC \NC |\ \ \mono{RBgoto}\ n \NR
+\NC \NC |\ \ \mono{RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \NR
+\stopmathalignment \stopformula
+
+RTLBlock instructions are split into two types: standard instructions and control-flow
+instructions. The standard instructions are those that can be placed into hyperblocks, whereas
+control-flow instructions are those that can end hyperblocks. The standard instructions are shown
+in \in{Figure}[fig:standard_instruction], whereas the control-flow instructions are shown in
+\in{Figure}[fig:control_flow_instruction]. Most instructions are quite similar to their RTL
+counterparts; however, some instructions have been added. \type{RBsetpred} is a standard
+instruction which sets a predicate equal to an evaluated condition. This instruction is used during
+if-conversion to set the predicate to the value of the condition in the conditional statement
+instruction. To support properly predicated instructions, each instruction also contains an
+optional predicate which has been set beforehand; this is what produces the hyperblock. In addition
+to that, there is also an extra control-flow instruction called \type{RBpred_cf}, which can branch
+on a predicate and nests more control-flow instructions inside of it. This is also necessary for
+if-conversion, when converting an already converted conditional statement, as depending on the
+predicate, a different control-flow instruction may be necessary.
+
+These instructions are used in RTLBlock as well as in RTLPar. The main difference between the two
+languages is how the instructions are arranged within the hyperblock and the execution semantics of
+the hyperblock in each language. Let $\mathit{hb}_{b}$ be the definition of a hyperblock for
+RTLBlock, and let $\mathit{hb}_{p}$ be the definition of a hyperblock for RTLPar; then
+$\mathit{hb}_{b}$ and $\mathit{hb}_{p}$ are defined as in \in{Figure}[fig:hb_def]. The figure shows
+the different nesting levels of the basic blocks in RTLBlock as well as RTLPar, and where the
+parallel semantics of RTLPar come into play. An RTLBlock hyperblock is made of a list of
+instructions and a control-flow instruction that ends the hyperblock. Each instruction in RTLBlock
+is executed sequentially. An RTLPar hyperblock is made of three nested lists. The outer list
+behaves like the list in RTLBlock, executing each list inside of it sequentially. RTLPar then has a
+list which executes its contents in parallel, meaning each instruction uses the same state of the
+registers and memory while executing. Finally, instead of directly containing instructions, there
+is an additional layer of sequentially executing lists of instructions, making it possible to
+execute small groups of instructions in sequence, but still in parallel to the other instructions
+in the current parallel group.
+
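+The nesting just described can be summarised by the following type sketch (hypothetical OCaml, not
+the Coq definitions used in the development), where \type{'i} stands for the standard instructions
+and \type{'c} for the closing control-flow instruction:
+
+\starttyping
+(* An RTLBlock hyperblock: a sequential list of instructions ended by
+   one control-flow instruction. *)
+type ('i, 'c) hb_b = 'i list * 'c
+
+(* An RTLPar hyperblock: a sequential list of parallel groups, where
+   each group is a set of sequential chains of instructions
+   (operation chaining). *)
+type ('i, 'c) hb_p = 'i list list list * 'c
+\stoptyping
+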
+RTLPar is structured in this way to give the scheduler more flexibility: it can not only
+parallelise operations in the basic blocks, but also define sequential operations within these
+parallel operations. This is especially useful for the hardware back end, as it means that
+sequential operations that do not quite fill one clock cycle can be executed sequentially, but can
+still be executed in parallel with other independent operations. This optimisation is called
+operation chaining, and it is critical to get the most performance out of the hardware.
\subsection[translating-rtlblock-to-rtlpar]{Translating RTLBlock to RTLPar}
-The translation from RTLBlock to RTLPar is performed using a scheduling
-algorithm, which will takes hyperblocks and rearranges their
-instructions to maximise parallelism based on the data dependencies. It
-then arranges the instructions in RTLPar by putting independent
-instructions into the same parallel block. Scheduling is an optimisation
-that is well suited to translation validation, as the algorithm itself
-can be quite complex, whereas the output is normally relatively easy to
-check for equivalence with the input.
-Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows that
-the scheduling step comes right after the if-conversion step which
-originally creates the hyperblocks.
-
-Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain] shows how the scheduling
-step transforms RTLBlock into RTLPar. The RTLBlock hyperblock being
-scheduled is shown in
-Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a], which contains five
-predicated operations, comprising two additions and three
-multiplications. The data dependencies of the instructions in the
-hyperblock are shown in
-Figure~\goto{{[}fig:op_chain_b{]}}[fig:op_chain_b]. Curiously, even
-though operations and are normally dependent on each other due to a
-write-after-write conflict, but because the predicates are independent
-the write-after-write conflict can effectively be removed by the
+The translation from RTLBlock to RTLPar is performed using a scheduling algorithm, which takes
+hyperblocks and rearranges their instructions to maximise parallelism based on the data
+dependencies. It then arranges the instructions in RTLPar by putting independent instructions into
+the same parallel block. Scheduling is an optimisation that is well suited to translation
+validation, as the algorithm itself can be quite complex, whereas the output is normally relatively
+easy to check for equivalence with the input. \in{Figure}[fig:compcert_interm] shows that the
+scheduling step comes right after the if-conversion step which originally creates the hyperblocks.
+
+Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain] shows how the scheduling step transforms RTLBlock
+into RTLPar. The RTLBlock hyperblock being scheduled is shown in
+Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a], which contains five predicated operations,
+comprising two additions and three multiplications. The data dependencies of the instructions in the
+hyperblock are shown in Figure~\goto{{[}fig:op_chain_b{]}}[fig:op_chain_b]. Curiously, even though
+operations and would normally be dependent on each other due to a write-after-write conflict, the
+independence of their predicates means that this conflict can effectively be removed by the
scheduler.
-The scheduler transforms the RTLBlock hyperblock into the RTLPar
-hyperblock shown in Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c].
-Even though the addition in is dependent on , they can still be put into
-the same cycle as the additions do not take as long to complete as the
-multiplications. This optimisation is called operation chaining. Then,
-the multiplication in can also be placed in the same clock cycle as it
-does not have any data dependencies. Then, in the next clock cycle,
-either the multiplication in can take place, or the multiplication in
-will take place, meaning these two multiplications can also be placed in
-the same clock cycle. This gives the final schedule that is shown in
-Figure~\goto{{[}fig:op_chain_d{]}}[fig:op_chain_d].
+The scheduler transforms the RTLBlock hyperblock into the RTLPar hyperblock shown in
+Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. Even though the addition in is dependent on ,
+the two additions can still be put into the same cycle, as additions do not take as long to
+complete as multiplications. This optimisation is called operation chaining. The multiplication in
+can also be placed in that clock cycle, as it does not have any data dependencies. In the next
+clock cycle, either the multiplication in or the multiplication in will take place, so these two
+mutually exclusive multiplications can also be placed in the same clock cycle. This gives the final
+schedule that is shown in Figure~\goto{{[}fig:op_chain_d{]}}[fig:op_chain_d].
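+
+The shape of the resulting RTLPar hyperblock can be sketched in Coq as follows, reusing the
+illustrative \type{instr} type from earlier and writing \type{i1}, \type{i2} for the chained
+additions and \type{m1}, \type{m2}, \type{m3} for the multiplications (all hypothetical names):
+
+\starttyping
+Require Import List. Import ListNotations.
+Parameters i1 i2 m1 m2 m3 : instr.
+
+(* Cycle 1 chains the two additions sequentially, in parallel with
+   one multiplication; cycle 2 holds the two multiplications whose
+   predicates are mutually exclusive. *)
+Definition schedule : list (list (list instr)) :=
+  [ [ [i1; i2]; [m1] ];
+    [ [m2]; [m3] ] ].
+\stoptyping
+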
\section[abstr_interp]{Abstract Interpretation of Hyperblocks}
-The correctness of the scheduling algorithm is checked by comparing the
-symbolic expressions of each register before and after scheduling. This
-section describes how these symbolic expressions are constructed for
-each hyperblock.
+The correctness of the scheduling algorithm is checked by comparing the symbolic expressions of each
+register before and after scheduling. This section describes how these symbolic expressions are
+constructed for each hyperblock.
\subsection[abstr_language]{Construction of Abstract Language}
-The main difficulty when constructing symbolic expressions for
-hyperblocks is the presence of predicates, which can be used by the
-scheduling algorithm to reorder instructions further, even if these
-contain data dependencies. The scheduling algorithm will manipulate
-predicates in the following ways, which the comparison function needs to
-take into account.
+The main difficulty when constructing symbolic expressions for hyperblocks is the presence of
+predicates, which can be used by the scheduling algorithm to reorder instructions further, even if
+these contain data dependencies. The scheduling algorithm will manipulate predicates in the
+following ways, which the comparison function needs to take into account.
\startitemize
\item
- Instructions with predicates that are equivalent to false can be
- removed.
+ Instructions with predicates that are equivalent to false can be removed.
\item
- Instructions with predicates that are mutually exclusive can be
- executed in parallel.
+ Instructions with predicates that are mutually exclusive can be executed in parallel.
\item
Any predicate can be replaced with an equivalent predicate.
\stopitemize
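+
+For example, an instruction guarded by $p \land \neg p$ can be deleted outright; two instructions
+that write the same register but are guarded by $p$ and $\neg p$ respectively can safely be
+executed in parallel; and a guard such as $\neg (a \lor b)$ can be replaced by the equivalent
+$\neg a \land \neg b$.
+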
The symbolic expressions have the following form:
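+The grammar below is a sketch reconstructed from the description that follows; the constructor
+names are illustrative and need not match the development exactly:
+
+\startformula
+e \;::=\; r^{0} \;\mid\; p^{0} \;\mid\; \mono{Mem}^{0} \;\mid\; \mono{Op}(\mathit{op}, \vec{e}\,) \;\mid\; \mono{Load}(\vec{e}\,, e_{\mono{Mem}}) \;\mid\; \mono{Store}(\vec{e}\,, e_{\mono{Mem}}, e)
+\stopformula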
-There are three main resources that are present in symbolic expressions:
-registers $r$, which represent the value of the registers in the
-program, predicates $p$, which represent the value of the predicates
-throughout the program, and finally, memory \type{Mem}, representing the
-state of the memory. For each basic instruction in the hyperblock, there
-is an equivalent symbolic expression which either modifies the state of
-the registers, predicates or memory. Finally, as each instruction in the
-hyperblock can be predicates, each symbolic expression is also paired
-with a predicate as well.
-
-Assigning the same resource multiple times with instructions that
-contain different predicates, means that the symbolic expression must
-also express this conditional result based on the predicates. This is
-expressed by assigning lists of predicated symbolic expressions to each
-resource, where each predicate needs to be mutually exclusive from the
-other predicates in the list.
-
-The expressions are then constructed using a function which updates the
-symbolic expressions assigned for each resource. This is done using
-multiple primitives which act on predicated types, which are made up of
-a list of pairs of predicates and the element of that type. The first
-important primitive is multiplication of two predicated types, which is
-implemented as performing a cartesian multiplication between the
-predicated types in the two lists, anding the two predicates and joining
+There are three main resources present in symbolic expressions: registers $r$, which represent the
+values of the registers in the program; predicates $p$, which represent the values of the
+predicates throughout the program; and finally memory \type{Mem}, representing the state of the
+memory. For each basic instruction in the hyperblock, there is an equivalent symbolic expression
+which either modifies the state of the registers, predicates or memory. Finally, as each
+instruction in the hyperblock can be predicated, each symbolic expression is also paired with a
+predicate.
+
+Assigning the same resource multiple times with instructions that have different predicates means
+that the symbolic expression must also express this conditional result based on the predicates.
+This is expressed by assigning a list of predicated symbolic expressions to each resource, where
+each predicate needs to be mutually exclusive with the other predicates in the list.
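+
+For example, if a register \type{r1} is assigned an expression $e_{1}$ under predicate $p$ and an
+expression $e_{2}$ under $\neg p$, its symbolic value becomes the predicated list
+$[(p, e_{1}), (\neg p, e_{2})]$.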
+
+The expressions are then constructed using a function which updates the symbolic expression
+assigned to each resource. This is done using multiple primitives which act on predicated types,
+where a predicated type is a list of pairs of a predicate and an element of that type. The first
+important primitive is the multiplication of two predicated types, which is implemented as a
+Cartesian product between the two lists, conjoining the two predicates and joining
the two types of each list using a function.
%\startformula \label{eq:1}
% P_{1} \otimes_{f} P_{2} \equiv \mono{map } (\lambda ((p_{1}, e_{1}), (p_{2}, e_{2})) . (p_{1} \land p_{2}, f\ e_{1}\ e_{2})) P_{1} \times P_{2} \stopformula
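+
+A minimal Coq sketch of this primitive is shown below, assuming a predicate type \type{pred} with
+a conjunction \type{Pand}; the names are illustrative and not necessarily those of the
+development:
+
+\starttyping
+Require Import List.
+
+Parameter pred : Type.
+Parameter Pand : pred -> pred -> pred.
+
+(* Multiplication of two predicated types: a Cartesian product of
+   the two lists, conjoining the predicates of each pair and
+   combining the two elements with f. *)
+Definition pred_mul {A B C} (f : A -> B -> C)
+    (P1 : list (pred * A)) (P2 : list (pred * B)) : list (pred * C) :=
+  flat_map (fun pe1 =>
+    map (fun pe2 => (Pand (fst pe1) (fst pe2), f (snd pe1) (snd pe2)))
+        P2) P1.
+\stoptyping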
-In addition to that, another primitive that is needed is the following
-append operation, which will negate the combination of all predicates in
-the second predicated type, and it to the first predicated type and
-append the first predicated type to the second:
+In addition to that, another primitive that is needed is the following append operation which,
+given a predicate $p$, strengthens every predicate in the first predicated type with $\neg p$ and
+every predicate in the second with $p$, before appending the two lists:
%\startformula \label{eq:2}
% \mu(p, P) \equiv \mono{map } (\lambda (p', e') . (p \land p', e'))\ P \stopformula
@@ -322,8 +237,7 @@ append the first predicated type to the second:
%\startformula \label{eq:3}
% P_{1} \oplus_{p} P_{2} \equiv \mu(\neg p, P_{1}) \mathbin{++} \mu(p, P_{2}) \stopformula
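+
+Continuing the sketch above with a hypothetical negation \type{Pnot}, these two operations can be
+written as follows:
+
+\starttyping
+Parameter Pnot : pred -> pred.
+
+(* mu p P strengthens every predicate in P with p. *)
+Definition mu {A} (p : pred) (P : list (pred * A)) : list (pred * A) :=
+  map (fun pe => (Pand p (fst pe), snd pe)) P.
+
+(* The append operation keeps P1 under ~p and P2 under p. *)
+Definition pred_app {A} (p : pred) (P1 P2 : list (pred * A)) :=
+  mu (Pnot p) P1 ++ mu p P2.
+\stoptyping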
-\subsection[title={Example of
-translation},reference={example-of-translation}]
+\subsection[example-of-translation]{Example of translation}
%\startformula \begin{aligned}
% \label{eq:4}
@@ -342,113 +256,87 @@ translation},reference={example-of-translation}]
% \end{cases}
% \end{aligned} \stopformula
-Using the example shown in
-Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain], the RTLBlock hyperblock
-shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a] is scheduled
-into the hyperblock RTLPar shown in
-Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. The first step is to
-translate the input and output of the scheduler into the Abstr
-intermediate language by calculating the symbolic values that will be
-stored in each of the registers. Symbolically evaluating the RTLBlock
-hyperblock results in the expressions shown in
-Figure~\goto{{[}fig:symbolic_expressions{]}}[fig:symbolic_expressions]
-for registers \type{r1}, \type{r2} and \type{r3}, where the state of the
-register \type{r1} at the start of the hyperblock is denoted as
-$\mono{r1}^{0}$.
-
-This symbolic expression can be generated by sequentially going through
-the list of predicated expressions and applying the update function
-defined in Section~\goto{3.1}[abstr_language]. The resulting predicated
-expressions is guaranteed to have predicates which are mutually
-exclusive, meaning if the predicate evaluates to true, all other
-predicates must evaluate to false.
-
-\subsection[title={Linear (Flat) Predicated
-Expressions},reference={linear-flat-predicated-expressions}]
-
-The update function for predicated expressions is quite a bit more
-complex compared to standard update functions for symbolic execution,
-such as the one used by . The main reason for this is the non-recursive
-nature of the predicated expressions. As the predicates are only
-available at the top-level of the expressions, combining multiple
-expressions means that the conditions and multiple possibilities also
-need to trickle up to the top.
-
-Instead, if the predicated instructions were recursive, one would not
-have to change any predicates when evaluating the argument lists of
-instructions, for example, as these can just stay predicated. The
-evaluation semantics would also be much simpler, because a rule can be
-added for evaluating predicates.
-
-\section[title={Correctness of the Abstract
-Interpretation},reference={sec:correctness}]
-
-The correctness of the scheduling step is dependent on the correctness
-of the equivalence check between the two abstract languages. As the
-abstract languages contain predicates, this comparison check cannot just
-be a syntactic comparison between the two languages, because the
-predicates might be different but logically equivalent. Instead, a mixed
-equivalence check is performed using a verified SAT solver and a
-syntactic equivalence check of the actual symbolic expressions.
-
-\subsection[title={Comparing Symbolic Abstract
-Expressions},reference={comparing-symbolic-abstract-expressions}]
-
-Abstract expressions comprise a predicate and a symbolic expression. The
-translation from RTLBlock and RTLPar to Abstract ensures that each
-register will contain a predicated symbolic expression that describes
-the value that will be assigned to it after evaluating the basic block.
-This means that if an expression always produces the same result as
-another expression, that the register will always contain the same
-value. If the RTLBlock program and RTLPar program translate to two
-Abstract languages where all the register's symbolic expressions match,
-then this means that both programs will always produce the same value.
-
-The main comparison between two symbolic expressions can be done using a
-simple syntactic comparison, because this ensures that each of the
-expressions always evaluate to the same value, and is also general
-enough for the expressions that need to be compared. However, as all
-expressions might be predicated, this simple syntactic comparison cannot
-be performed for the predicates. This is because predicated instructions
-can be executed in many different orders, and the predicates are used to
-calculate dependence information and therefore also determine the order
-of operations.
-
-Instead, comparing predicates can be done by checking logical
-equivalence of the predicates using a formally verified SAT solver. This
-ensures that the predicates will always evaluate to the same value,
-thereby making it a fair comparison.
-
-\subsection[title={Verified Decidable DPLL SAT
-Solver},reference={verified_sat}]
-
-A SAT solver is needed to correctly and reliably compare predicated
-expressions. This makes it possible to then compare many different
-possible schedules, even if they remove expressions completely if these
-are unnecessary. The SAT solver is needed in the verifier, which needs
-to be proven correct in Coq to prove the scheduling translation correct.
-The SAT solver therefore has to be verified, because it could otherwise
-not be used in the proof of correctness.
-
-The DPLL algorithm is used to implement the SAT solver. This means that
-the SAT solver takes in a conjunctive normal form (CNF) formula. The
-conversion from a recursive predicate type into a CNF formula is done by
-recursively expanding the predicate in equivalent CNF formulas.
-
-The SAT solver is implemented in Coq and proven to be sound and
-complete. The soundness and completeness of an algorithm to check
-whether a predicate is satisfiable or not also directly implies that the
-check is decidable, simplifying many proofs in Coq by using a
-\type{Setoid} instance for equivalence between predicates. This makes
-rewriting predicates if they are equivalent much easier in proofs by
-reusing many of the existing Coq tactics.
-
-\subsection[title={Encoding Predicated Expressions as SAT
-Formulas},reference={encoding-predicated-expressions-as-sat-formulas}]
-
-Given the following two predicated expressions which should be tested
-for equivalence, one can construct a SAT expression which will ensure
-that both expressions will always result in the same value.
+Using the example shown in Figure~\goto{{[}fig:op_chain{]}}[fig:op_chain], the RTLBlock hyperblock
+shown in Figure~\goto{{[}fig:op_chain_a{]}}[fig:op_chain_a] is scheduled into the RTLPar hyperblock
+shown in Figure~\goto{{[}fig:op_chain_c{]}}[fig:op_chain_c]. The first step is to translate the
+input and output of the scheduler into the Abstr intermediate language by calculating the symbolic
+values that will be stored in each of the registers. Symbolically evaluating the RTLBlock hyperblock
+results in the expressions shown in
+Figure~\goto{{[}fig:symbolic_expressions{]}}[fig:symbolic_expressions] for registers \type{r1},
+\type{r2} and \type{r3}, where the state of the register \type{r1} at the start of the hyperblock is
+denoted as $\mono{r1}^{0}$.
+
+These symbolic expressions can be generated by sequentially going through the instructions of the
+hyperblock and applying the update function defined in Section~\goto{3.1}[abstr_language]. The
+resulting predicated expressions are guaranteed to have mutually exclusive predicates, meaning that
+if one predicate evaluates to true, all other predicates must evaluate to false.
+
+\subsection[linear-flat-predicated-expressions]{Linear (Flat) Predicated Expressions}
+
+The update function for predicated expressions is considerably more complex than standard update
+functions for symbolic execution, such as the one used by . The main reason for this is the
+non-recursive nature of the predicated expressions. As the predicates are only available at the
+top level of the expressions, combining multiple expressions means that the conditions and the
+multiple possibilities also need to trickle up to the top.
+
+If the predicated expressions were instead recursive, one would not have to change any predicates
+when evaluating the argument lists of instructions, for example, as these could simply stay
+predicated. The evaluation semantics would also be much simpler, because a rule could be added for
+evaluating predicates directly.
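+
+The contrast can be sketched with two hypothetical Coq types, reusing the illustrative \type{pred}
+type from above and letting \type{expr} stand for the symbolic expression type:
+
+\starttyping
+Parameter expr : Type.
+
+(* Flat form used here: predicates occur only at the top level. *)
+Definition pred_expr := list (pred * expr).
+
+(* A recursive alternative: a predicate may guard a choice anywhere
+   inside an expression, so conditions would not need to be lifted
+   to the top when expressions are combined. *)
+Inductive rexpr : Type :=
+| Rbase : expr -> rexpr
+| Rcond : pred -> rexpr -> rexpr -> rexpr.
+\stoptyping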
+
+\section[sec:correctness]{Correctness of the Abstract Interpretation}
+
+The correctness of the scheduling step depends on the correctness of the equivalence check between
+the two abstract representations. As these contain predicates, the comparison cannot just be a
+syntactic one, because the predicates might be different but logically equivalent. Instead, a
+mixed equivalence check is performed using a verified SAT solver and a syntactic equivalence check
+of the actual symbolic expressions.
+
+\subsection[comparing-symbolic-abstract-expressions]{Comparing Symbolic Abstract Expressions}
+
+Abstract expressions comprise a predicate and a symbolic expression. The translation from RTLBlock
+and RTLPar to Abstr ensures that each register will contain a predicated symbolic expression that
+describes the value that will be assigned to it after evaluating the basic block. This means that
+if an expression always produces the same result as another expression, the register will always
+contain the same value. If the RTLBlock program and the RTLPar program translate to two Abstr
+programs in which all the registers' symbolic expressions match, then the two programs will always
+produce the same values.
+
+The main comparison between two symbolic expressions can be done using a simple syntactic
+comparison, because this ensures that the two expressions always evaluate to the same value, and it
+is also general enough for the expressions that need to be compared. However, as all expressions
+might be predicated, this simple syntactic comparison cannot be performed for the predicates. This
+is because predicated instructions can be executed in many different orders, and the predicates are
+used to calculate dependence information and therefore also determine the order of operations.
+
+Instead, comparing predicates can be done by checking logical equivalence of the predicates using a
+formally verified SAT solver. This ensures that the predicates will always evaluate to the same
+value, thereby making it a fair comparison.
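+
+For example, the predicates $p_{1} \land p_{2}$ and $\neg (\neg p_{2} \lor \neg p_{1})$ are
+syntactically different but logically equivalent, so a purely syntactic comparison would wrongly
+reject a schedule that rewrote one into the other.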
+
+\subsection[verified_sat]{Verified Decidable DPLL SAT Solver}
+
+A SAT solver is needed to correctly and reliably compare predicated expressions. This makes it
+possible to compare many different possible schedules, even ones that remove unnecessary
+expressions entirely. The SAT solver is used by the verifier, which needs to be proven correct in
+Coq to prove the scheduling translation correct; the SAT solver therefore has to be verified as
+well, because it could otherwise not be used in the proof of correctness.
+
+The DPLL algorithm is used to implement the SAT solver, meaning the SAT solver takes in a
+conjunctive normal form (CNF) formula. The conversion from the recursive predicate type into a CNF
+formula is done by recursively expanding the predicate into equivalent CNF formulas.
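+
+For instance, a conjunction of two CNF formulas is already in CNF, whereas a disjunction is
+expanded by distributing the clauses, so that $(a \land b) \lor c$ becomes
+$(a \lor c) \land (b \lor c)$; negation is pushed inwards using De Morgan's laws.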
+
+The SAT solver is implemented in Coq and proven to be sound and complete. The soundness and
+completeness of an algorithm that checks whether a predicate is satisfiable or not directly imply
+that the check is decidable, which simplifies many proofs in Coq through a \type{Setoid} instance
+for equivalence between predicates. Rewriting with equivalent predicates in proofs thereby becomes
+much easier, as many of the existing Coq tactics can be reused.
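+
+The decidability statement can be sketched in Coq as follows, reusing the illustrative \type{pred}
+type, where \type{eval} is a hypothetical evaluation of predicates under an assignment:
+
+\starttyping
+Parameter eval : (nat -> bool) -> pred -> bool.
+
+Definition pred_equiv (p q : pred) : Prop :=
+  forall a, eval a p = eval a q.
+
+(* Soundness and completeness of the SAT check yield a decision
+   procedure, which underpins the Setoid instance. *)
+Parameter pred_equiv_dec :
+  forall p q, {pred_equiv p q} + {~ pred_equiv p q}.
+\stoptyping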
+
+\subsection[encoding-predicated-expressions-as-sat-formulas]{Encoding Predicated Expressions as SAT
+ Formulas}
+
+Given the following two predicated expressions that should be tested for equivalence, one can
+construct a SAT query which ensures that both expressions will always result in the same
+value.
%\startformula \begin{aligned}
% \bgroup \left.\begin{aligned}
@@ -465,33 +353,28 @@ that both expressions will always result in the same value.
% e_{m}',\quad &p_{m}'\\
% \end{cases}\end{aligned} \stopformula
-The SAT query that should determine if two predicated expressions are
-equivalent is the following, where $\mathit{he}_{i}$ is the hashed
-expression of $e_{i}$:
+The SAT query that should determine if two predicated expressions are equivalent is the following,
+where $\mathit{he}_{i}$ is the hashed expression of $e_{i}$:
%\startformula \begin{gathered}
% (p_{1} \implies \mathit{he}_{1}) \land (p_{2} \implies \mathit{he}_{2}) \land \cdots \land (p_{n} \implies \mathit{he}_{n}) \\
% \Longleftrightarrow\\
% (p_{1}' \implies \mathit{he}_{1}') \land (p_{2}' \implies \mathit{he}_{2}') \land \cdots \land (p_{m}' \implies \mathit{he}_{m}') \\\end{gathered} \stopformula
-However, reasoning about the equivalence of individual expressions
-within the greater list of predicated expressions is not quite
-straightforward. The evaluation semantics of predicated expression list
-will evaluate each predicate expression sequentially, and pick the
-expression whose predicate evaluates to true. However, proving that both
-predicated expression lists will return the same value assuming that the
-SAT query above is true, means that for each predicate that is true, One
-needs to be able to show that there is a predicate in the other list
-that will also be true, which will return the same value. However, there
-is no guarantee that there is a unique predicated expression that will
-match the current predicate, as there may be duplicate expressions in
-the list with different predicates.
-
-This means that it is quite difficult to prove that the SAT query
-implies the semantic preservation of the scheduling transformation. The
-SAT query can therefore be broken down into a slightly more direct
-comparison that is more useful for the proof itself. The steps to
-perform the more fine grained comparison are the following:
+However, reasoning about the equivalence of individual expressions within the greater list of
+predicated expressions is not quite straightforward. The evaluation semantics of a predicated
+expression list evaluates each predicated expression sequentially and picks the expression whose
+predicate evaluates to true. Proving that both predicated expression lists will return the same
+value, assuming that the SAT query above is true, means that for each predicate that is true, one
+needs to be able to show that there is a predicate in the other list that is also true and that
+returns the same value. However, there is no guarantee that there is a unique predicated
+expression matching the current predicate, as there may be duplicate expressions in the list with
+different predicates.
+
+This means that it is quite difficult to prove that the SAT query implies the semantic preservation
+of the scheduling transformation. The SAT query can therefore be broken down into a slightly more
+direct comparison that is more useful for the proof itself. The steps to perform this more
+fine-grained comparison are the following:
%\startenumerate[n]
%\item
@@ -507,81 +390,65 @@ perform the more fine grained comparison are the following:
% one of the maps, then it's predicate should be equivalent to $\perp$.
%\stopenumerate
-This comparison is logically equivalent to performing the large SAT
-query, however, it maps better to the proof because it is comparing the
-predicates of the same expressions with each other.
+This comparison is logically equivalent to performing the large SAT query; however, it maps better
+onto the proof because it compares the predicates of the same expressions with each other.
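+
+A sketch of this fine-grained check is given below, assuming hypothetical helpers \type{keys} and
+\type{find_or_false} over a finite map from hashed expressions to their accumulated predicates,
+where a missing entry is treated as $\perp$:
+
+\starttyping
+Require Import List.
+
+Parameter pmap : Type.
+Parameter keys : pmap -> list nat.
+Parameter find_or_false : nat -> pmap -> pred.
+Parameter equiv_check : pred -> pred -> bool.
+
+(* For every hashed expression in either map, its accumulated
+   predicate in one map must be equivalent to the one in the other. *)
+Definition check (m1 m2 : pmap) : bool :=
+  forallb (fun e => equiv_check (find_or_false e m1) (find_or_false e m2))
+          (keys m1 ++ keys m2).
+\stoptyping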
\subsection[proof-of-equivalence-between-passes]{Proof of equivalence between passes}
-Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the
-four different passes that were added to CompCert to form hyperblocks.
-This section will cover the proofs for each of the translations,
-focusing mainly on the proof of translation of the scheduling pass.
+Figure~\goto{{[}fig:compcert_interm{]}}[fig:compcert_interm] shows the four different passes that
+were added to CompCert to form hyperblocks. This section covers the proofs for each of the
+translations, focusing mainly on the proof of correctness of the scheduling pass.
\subsubsection[proof-of-correctness-of-the-if-conversion-pass]{Proof of Correctness of the
If-Conversion Pass}
-The if-conversion pass is quite simple to verify, as it is an
-optimisation that transforms the code in the same language.
+The if-conversion pass is quite simple to verify, as it is an optimisation that transforms the code
+within the same language.
\subsubsection[proof-of-correctness-of-scheduling]{Proof of Correctness of Scheduling}
-The scheduling pass is formally verified using translation validation,
-using the Abstr language as a point of comparison between RTLBlock and
-RTLPar. Using the algorithm defined in Section~\goto{3}[abstr_interp],
-the formally verified SAT solver described in
-Section~\goto{4.2}[verified_sat] can be used prove that if the two
-predicated expression lists are equivalent, that they will always
-evaluate to the same value.
+The scheduling pass is formally verified using translation validation, with the Abstr language as a
+point of comparison between RTLBlock and RTLPar. Using the algorithm defined in
+Section~\goto{3}[abstr_interp], the formally verified SAT solver described in
+Section~\goto{4.2}[verified_sat] can be used to prove that if the two predicated expression lists
+are equivalent, then they will always evaluate to the same value.
\section[implementation-details-of-static-sdc-scheduling]{Implementation Details of Static SDC
Scheduling}
-This section describes the implementation details of the static SDC
-scheduling algorithm used, and also describes some of the details of the
-basic block generation as well as the if-conversion pass which generates
-the predicated instructions.
+This section describes the implementation details of the static SDC scheduling algorithm used, as
+well as some details of the basic-block generation and the if-conversion pass which generates the
+predicated instructions.
\subsection[implementation-of-if-conversion]{Implementation of If-Conversion}
-If-conversion is the conversion that introduces predicated instructions
-which can make use of the hyperblocks. It converts conditional
-statements into predicated instructions. This transformation has a few
-limitations on the kind of conditional statements that it can translate.
-First, only conditional statements without loops can be translated,
-therefore, one must identify cycles in the control-flow before
-performing the if-conversion. Secondly, if-conversion will not always
-result in more efficient code, so it should not be applied to any
-conditional statements. Instead, it is best applied to conditional
-statements where each branch will take a similar amount of time to
-execute.
-
-For back ends that do not support predicates, it is also important to
-eliminate the predicates again using a process called reverse
-if-conversion, which creates branches out of the predicates and groups
-these together again. This allows for hyperblock scheduling to also be
-used with back ends that do not support predicates, which are most the
-default back ends included in CompCert.
-
-\subsection[title={Static SDC
-Scheduling},reference={static-sdc-scheduling}]
-
-System of difference constraints (SDC) scheduling~ is a flexible
-algorithm to perform scheduling of instructions, especially if the
-target is hardware directly. It allows for flexible definitions of
-optimisation functions in terms of latencies and dependencies of the
-operations in the basic block.
-
-\section[title={Performance
-Comparison},reference={performance-comparison}]
-
-\section[title={Related Work},reference={sec:related-work}]
-
-\section[title={Conclusion},reference={conclusion}]
-
-This material is based upon work supported by the under Grant No.~ and
-Grant No.~. Any opinions, findings, and conclusions or recommendations
-expressed in this material are those of the author and do not
-necessarily reflect the views of the National Science Foundation.
+If-conversion is the transformation that introduces the predicated instructions which hyperblocks
+make use of. It converts conditional statements into predicated instructions. This transformation
+has a few limitations on the kind of conditional statements that it can translate. First, only
+conditional statements without loops can be translated; therefore, one must identify cycles in the
+control flow before performing if-conversion. Secondly, if-conversion will not always result in
+more efficient code, so it should not be applied indiscriminately to every conditional statement.
+Instead, it is best applied to conditional statements whose branches take a similar amount of time
+to execute.
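+
+As a hypothetical example, a conditional such as \type{if (c) x = a; else x = b;} would be
+translated into the two predicated instructions \type{c ? x = a} and \type{!c ? x = b}, which can
+then be merged into the surrounding hyperblock.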
+
+For back ends that do not support predicates, it is also important to eliminate the predicates
+again using a process called reverse if-conversion, which creates branches out of the predicates
+and groups these together again. This allows hyperblock scheduling to be used even with such back
+ends, which include most of the default back ends in CompCert.
+
+\subsection[static-sdc-scheduling]{Static SDC Scheduling}
+
+System of difference constraints (SDC) scheduling~ is a flexible algorithm for scheduling
+instructions, especially when the target is hardware directly. It allows optimisation objectives
+to be defined flexibly in terms of the latencies and dependencies of the operations in the basic
+block.
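+
+As an illustrative instance of such a constraint, a data dependence from operation $i$ to
+operation $j$, where $i$ has a latency of $l_{i}$ clock cycles, gives rise to the difference
+constraint
+
+\startformula
+s_{j} - s_{i} \geq l_{i}
+\stopformula
+
+where $s_{i}$ and $s_{j}$ denote the clock cycles in which the operations are scheduled;
+minimising a linear objective over the $s$ variables subject to all such constraints then yields
+the schedule.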
+
+\section[performance-comparison]{Performance Comparison}
+
+\section[sec:related-work]{Related Work}
+
+\section[conclusion]{Conclusion}
+
+This material is based upon work supported by the under Grant No.~ and Grant No.~. Any opinions,
+findings, and conclusions or recommendations expressed in this material are those of the author and
+do not necessarily reflect the views of the National Science Foundation.
\stopcomponent
diff --git a/lsr_env.tex b/lsr_env.tex
index e27681a..4ce5761 100644
--- a/lsr_env.tex
+++ b/lsr_env.tex
@@ -30,27 +30,28 @@
%\setupalign[hz,hanging,nothyphenated]
\setupalign[hz,hanging]
\setuptolerance[stretch,tolerant]
+
\setuplayout[
- edgedistance=1cm,
- backspace=2cm,
- rightmargin=5cm,
- leftmargin=0cm,
- topspace=0cm,
- header=2cm,
- headerdistance=0.5cm,
- width=fit,
- height=fit,
+ backspace=3cm,
+ margin=2cm,
+ topspace=1.25cm,
+ header=1.25cm,
+ footer=1.25cm,
+ height=middle,
+ width=middle,
]
-%\setuplayout[
-% backspace=3cm,
-% margin=2cm,
-% topspace=1.25cm,
-% header=1.25cm,
-% footer=1.25cm,
-% height=middle,
-% width=middle,
-%]
+\startsectionblockenvironment [backpart]
+\setuplayout[
+ backspace=3cm,
+ margin=2cm,
+ topspace=1.25cm,
+ header=1.25cm,
+ footer=1.25cm,
+ height=middle,
+ width=middle,
+]
+\stopsectionblockenvironment
\startsectionblockenvironment [frontpart]
\setuppagenumbering[location={footer,middle}]
@@ -67,17 +68,71 @@
header=empty,
]
-\define[1]\marginhead{\margintitle[location=right]{\switchtobodyfont[36pt] #1}}
+\startMPinclusions
+ numeric ChapterPageDone[];
+\stopMPinclusions
+
+\startuseMPgraphic{chapterbackground}
+ StartPage;
+ n := \somenamedheadnumber{chapter}{current};
+
+ x := PaperWidth - BackSpace - RightMarginWidth - RightMarginDistance/2;
+ x2 := PaperWidth - BackSpace - 20;
+
+ if n > 0 : % ignore pages before the first chapter
+    if unknown ChapterPageDone[n] : % This is the first page of a new chapter
+ draw (x,PaperHeight) -- (x,PaperHeight-TopSpace+Header-HeaderDistance-75)
+        withpen pencircle scaled 0.5mm;
+ ChapterPageDone[n] := 1 ;
+ else :
+ draw (x2,PaperHeight) -- (x2,PaperHeight-TopSpace+Header)
+ fi;
+ fi;
+ StopPage;
+\stopuseMPgraphic
+
+\define\PlaceFootnote
+ {\inrightmargin{\vtop{\placelocalnotes[footnote][before=,after=]}}}
+
+\setupnote
+ [footnote]
+ [location=text,
+ textstyle={\rmxx},
+ align={yes,tolerant},
+ next=\PlaceFootnote]
+
+\setupnotation
+ [footnote]
+ [alternative=serried]
+
+\defineoverlay[chapterbackground][\useMPgraphic{chapterbackground}]
+\setupbackgrounds[page][background=chapterbackground]
+
+\define[1]\marginhead{\margintitle[location=right]{\switchtobodyfont[30pt] #1}}
\setuphead[section][style={\bfc}]
\setuphead[subsection][style={\bfb}]
\setuphead[subsubsection][style={\bfa},after={\blank[0.2cm]}]
\startsectionblockenvironment [bodypart]
+\setuplayout[
+ edgedistance=1cm,
+ backspace=2cm,
+ rightmargin=5cm,
+ leftmargin=0cm,
+ topspace=1cm,
+ header=1cm,
+ footer=1cm,
+ headerdistance=0.5cm,
+ width=fit,
+ height=fit,
+]
\setuppagenumbering[location=]
\setuplabeltext[chapter=]
- \setupheadertexts[{\someheadnumber[section][current] \getmarking[section]}][pagenumber]
- \setupheader[text][style=smallcaps,after=\hrule]
+ \setupheadertexts[margin][][{{\it \getmarking[chapter]}\wordright{\pagenumber}}]
+ \setupheadertexts[text][][{\someheadnumber[chapter][current]}]
+ %\setupheader[text][style=smallcaps,after=\hrule]
+ %\setupheader[margin][style=smallcaps]
\setuphead[chapter][
style={\bfd},
header=empty,
@@ -153,7 +208,7 @@
% ==========================================================================
% Figure
% ==========================================================================
-\setupcaptions[figure][location=bottom]
+\setupcaptions[figure][location={high,rightmargin},width=5cm,align={yes, tolerant},style=\rmx]
\definefloat[subfigure][local=yes]
\setupcaption[subfigure][numbercommand=\groupedcommand{(}{)},numberconversion=a,prefix=no,way=bytext]
diff --git a/main.tex b/main.tex
index f1df505..d31609c 100644
--- a/main.tex
+++ b/main.tex
@@ -1,10 +1,10 @@
-\startproduct main
-
% Common layout settings.
\environment fonts_env
\environment lsr_env
\enabletrackers[fonts.missing]
+\startproduct main
+
\mainlanguage[en]
% Common module shared by all.
@@ -26,11 +26,14 @@
\stopfrontmatter
\startbodymatter
- \setupinterlinespace[small]
\component introduction
\component background
\component hls
\component scheduling
+ \component pipelining
+ \component dynamic
+ \component schedule
+ \component conclusion
\stopbodymatter
\startbackmatter
diff --git a/references.bib b/references.bib
index 03e91cc..8dd4b3b 100644
--- a/references.bib
+++ b/references.bib
@@ -1001,3 +1001,847 @@ series = {CPP 2021}
pages = {219-223},
doi = {10.1109/FCCM51124.2021.00034}
}
+
+@article{leroy09_formal_verif_realis_compil,
+ author = {Leroy, Xavier},
+ title = {Formal Verification of a Realistic Compiler},
+ journal = {Commun. ACM},
+ volume = 52,
+ number = 7,
+ pages = {107-115},
+ year = 2009,
+ doi = {10.1145/1538788.1538814},
+ address = {New York, NY, USA},
+ issn = {0001-0782},
+ issue_date = {July 2009},
+ month = jul,
+ numpages = 9,
+ publisher = {ACM},
+}
+
+@inproceedings{lidbury15_many_core_compil_fuzzin,
+ author = {Lidbury, C. and Lascu, A. and Chong, N. and Donaldson, A. F.},
+ title = {Many-Core Compiler Fuzzing},
+ booktitle = {Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design
+ and Implementation},
+ year = 2015,
+ pages = {65-76},
+ doi = {10.1145/2737924.2737986},
+ address = {New York, NY, USA},
+ isbn = 9781450334686,
+ location = {Portland, OR, USA},
+ numpages = 12,
+ publisher = {Association for Computing Machinery},
+ series = {PLDI '15},
+}
+
+@article{takach16_high_level_synth,
+ author = {A. {Takach}},
+ title = {High-Level Synthesis: Status, Trends, and Future
+ Directions},
+ journal = {IEEE Design Test},
+ volume = {33},
+ number = {3},
+ pages = {116-124},
+ year = {2016},
+ doi = {10.1109/MDAT.2016.2544850},
+ ISSN = {2168-2364},
+ month = {June},
+}
+
+@inproceedings{liu16_effic_high_level_synth_desig,
+ author = { {Dong Liu} and B. C. {Schafer}},
+ title = {Efficient and reliable High-Level Synthesis Design Space
+ Explorer for FPGAs},
+ booktitle = {2016 26th International Conference on Field Programmable
+ Logic and Applications (FPL)},
+ year = 2016,
+ pages = {1-8},
+ doi = {10.1109/FPL.2016.7577370},
+ ISSN = {1946-1488},
+ month = {Aug},
+}
+
+@article{lahti19_are_we_there_yet,
+ author = {S. {Lahti} and P. {Sj{\"o}vall} and J. {Vanne} and
+ T. D. {H{\"a}m{\"a}l{\"a}inen}},
+ title = {Are We There Yet? a Study on the State of High-Level
+ Synthesis},
+ journal = {IEEE Transactions on Computer-Aided Design of Integrated
+ Circuits and Systems},
+ volume = {38},
+ number = {5},
+ pages = {898-911},
+ year = {2019},
+ doi = {10.1109/TCAD.2018.2834439},
+ ISSN = {1937-4151},
+ month = {May},
+}
+
+@misc{mentor20_catap_high_level_synth,
+ author = {Mentor},
+ title = {Catapult High-Level Synthesis},
+ url = {https://bit.ly/32xhADw},
+ urldate = {2020-06-06},
+ year = 2020,
+}
+
+@misc{xilinx20_vivad_high_synth,
+ author = {Xilinx},
+ title = {Vivado High-level Synthesis},
+ url = {https://bit.ly/39ereMx},
+ urldate = {2020-07-20},
+ year = 2020,
+}
+
+@misc{intel20_sdk_openc_applic,
+ author = {Intel},
+ title = {{SDK} for {OpenCL} Applications},
+ url = {https://intel.ly/30sYHz0},
+ urldate = {2020-07-20},
+ year = 2020,
+}
+
+@inproceedings{gupta03_spark,
+ author = {S. {Gupta} and N. {Dutt} and R. {Gupta} and A. {Nicolau}},
+ title = {{SPARK}: a high-level synthesis framework for applying parallelizing compiler
+ transformations},
+ booktitle = {16th International Conference on VLSI Design, 2003. Proceedings.},
+ year = 2003,
+ pages = {461-466},
+ doi = {10.1109/ICVD.2003.1183177},
+ ISSN = {1063-9667},
+ month = {Jan},
+}
+
+@article{chouksey20_verif_sched_condit_behav_high_level_synth,
+  author = {R. {Chouksey} and C. {Karfa}},
+  title = {Verification of Scheduling of Conditional Behaviors in
+                  High-Level Synthesis},
+  journal = {IEEE Transactions on Very Large Scale Integration (VLSI)
+                  Systems},
+  pages = {1-14},
+  year = {2020},
+  doi = {10.1109/TVLSI.2020.2978242},
+  ISSN = {1557-9999},
+}
+
+@inproceedings{pnueli98_trans,
+ author = "Pnueli, A. and Siegel, M. and Singerman, E.",
+ title = "Translation validation",
+ booktitle = "Tools and Algorithms for the Construction and Analysis of Systems",
+ year = 1998,
+ pages = "151--166",
+ address = "Berlin, Heidelberg",
+ editor = "Steffen, Bernhard",
+ isbn = "978-3-540-69753-4",
+ publisher = "Springer",
+}
+
+@article{chouksey19_trans_valid_code_motion_trans_invol_loops,
+ author = {R. {Chouksey} and C. {Karfa} and P. {Bhaduri}},
+ title = {Translation Validation of Code Motion Transformations
+ Involving Loops},
+ journal = {IEEE Transactions on Computer-Aided Design of Integrated
+ Circuits and Systems},
+ volume = 38,
+ number = 7,
+ pages = {1378-1382},
+ year = 2019,
+ doi = {10.1109/TCAD.2018.2846654},
+ ISSN = {1937-4151},
+ month = {July},
+}
+
+@article{banerjee14_verif_code_motion_techn_using_value_propag,
+ author = {K. {Banerjee} and C. {Karfa} and D. {Sarkar} and C. {Mandal}},
+ title = {Verification of Code Motion Techniques Using Value
+ Propagation},
+ journal = {IEEE Transactions on Computer-Aided Design of Integrated
+ Circuits and Systems},
+ volume = 33,
+ number = 8,
+ pages = {1180-1193},
+ year = 2014,
+ doi = {10.1109/TCAD.2014.2314392},
+ ISSN = {1937-4151},
+ month = {Aug}
+}
+
+@inproceedings{kim04_autom_fsmd,
+ author = { Kim, Y. and {Kopuri}, S. and {Mansouri}, N.},
+ title = {Automated formal verification of scheduling process using
+ finite state machines with datapath (FSMD)},
+ booktitle = {International Symposium on Signals, Circuits and
+ Systems. Proceedings, SCS 2003. (Cat. No.03EX720)},
+ year = 2004,
+ pages = {110-115},
+ doi = {10.1109/ISQED.2004.1283659},
+ month = {March}
+}
+
+@inproceedings{karfa06_formal_verif_method_sched_high_synth,
+ author = {Karfa, C. and Mandal, C. and Sarkar, D. and Pentakota, S. R. and
+ Reade, C.},
+ title = {A Formal Verification Method of Scheduling in High-level
+ Synthesis},
+ booktitle = {Proceedings of the 7th International Symposium on Quality
+ Electronic Design},
+ year = 2006,
+ pages = {71--78},
+ doi = {10.1109/ISQED.2006.10},
+ acmid = 1126731,
+ address = {Washington, DC, USA},
+ isbn = {0-7695-2523-7},
+ numpages = 8,
+ publisher = {IEEE Computer Society},
+ series = {ISQED '06},
+}
+
+@inproceedings{leroy06_formal_certif_compil_back_end,
+ author = {Leroy, Xavier},
+ title = {Formal Certification of a Compiler Back-End or: Programming
+ a Compiler with a Proof Assistant},
+ booktitle = {Conference Record of the 33rd ACM SIGPLAN-SIGACT Symposium
+ on Principles of Programming Languages},
+ year = 2006,
+ pages = {42-54},
+ doi = {10.1145/1111037.1111042},
+ address = {New York, NY, USA},
+ isbn = 1595930272,
+ location = {Charleston, South Carolina, USA},
+ numpages = 13,
+ publisher = {Association for Computing Machinery},
+ series = {POPL '06}
+}
+
+@book{bertot04_inter_theor_provin_progr_devel,
+ author = {Yves Bertot and Pierre Cast{\'{e}}ran},
+ title = {Interactive Theorem Proving and Program Development},
+ year = 2004,
+ publisher = {Springer Berlin Heidelberg},
+ doi = {10.1007/978-3-662-07964-5},
+}
+
+@inproceedings{tristan08_formal_verif_trans_valid,
+ author = {Tristan, Jean-Baptiste and Leroy, Xavier},
+ title = {Formal Verification of Translation Validators: A Case Study on
+ Instruction Scheduling Optimizations},
+ booktitle = {Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on
+ Principles of Programming Languages},
+ year = 2008,
+ pages = {17-27},
+ doi = {10.1145/1328438.1328444},
+ address = {New York, NY, USA},
+ isbn = 9781595936899,
+ location = {San Francisco, California, USA},
+ numpages = 11,
+ publisher = {Association for Computing Machinery},
+ series = {POPL '08}
+}
+
+@inproceedings{kundu08_valid_high_level_synth,
+ author = "Kundu, Sudipta and Lerner, Sorin and Gupta, Rajesh",
+ title = "Validating High-Level Synthesis",
+ booktitle = "Computer Aided Verification",
+ year = 2008,
+ pages = "459--472",
+ address = "Berlin, Heidelberg",
+ editor = "Gupta, Aarti and Malik, Sharad",
+ isbn = "978-3-540-70545-1",
+ publisher = "Springer",
+}
+
+@inproceedings{chapman92_verif_bedroc,
+ author = {R. {Chapman} and G. {Brown} and M. {Leeser}},
+ title = {Verified high-level synthesis in BEDROC},
+ booktitle = {[1992] Proceedings The European Conference on Design Automation},
+ year = 1992,
+ pages = {59--63},
+ doi = {10.1109/EDAC.1992.205894},
+ month = {March},
+ publisher = {IEEE Computer Society},
+}
+
+@article{hwang91_formal_approac_to_sched_probl,
+ author = {C. -. {Hwang} and J. -. {Lee} and Y. -. {Hsu}},
+ title = {A Formal Approach To the Scheduling Problem in High Level
+ Synthesis},
+ journal = {IEEE Transactions on Computer-Aided Design of Integrated
+ Circuits and Systems},
+ volume = 10,
+ number = 4,
+ pages = {464-475},
+ year = 1991,
+ doi = {10.1109/43.75629},
+ month = {April}
+}
+
+@inproceedings{page91_compil_occam,
+ author = {Page, Ian and Luk, Wayne},
+ title = {Compiling Occam into field-programmable gate arrays},
+ booktitle = {FPGAs, Oxford Workshop on Field Programmable Logic and
+ Applications},
+ year = 1991,
+ volume = 15,
+ pages = {271--283},
+}
+
+@inproceedings{grass94_high,
+ author = {W. {Grass} and M. {Mutz} and W. -. {Tiedemann}},
+ title = {High level synthesis based on formal methods},
+ booktitle = {Proceedings of Twentieth Euromicro Conference. System
+ Architecture and Integration},
+ year = 1994,
+ pages = {83-91},
+ doi = {10.1109/EURMIC.1994.390403},
+ month = {Sep.}
+}
+
+@inproceedings{jifeng93_towar,
+ author = "Jifeng, He and Page, Ian and Bowen, Jonathan",
+ title = "Towards a provably correct hardware implementation of occam",
+ booktitle = "Correct Hardware Design and Verification Methods",
+ year = 1993,
+ pages = "214--225",
+ address = "Berlin, Heidelberg",
+ editor = "Milne, George J. and Pierre, Laurence",
+ isbn = "978-3-540-70655-7",
+ publisher = "Springer"
+}
+
+@article{perna12_mechan_wire_wise_verif_handel_c_synth,
+ author = "Juan Perna and Jim Woodcock",
+ title = {Mechanised Wire-Wise Verification of {Handel-C} Synthesis},
+ journal = "Science of Computer Programming",
+ volume = 77,
+ number = 4,
+ pages = "424 - 443",
+ year = 2012,
+ doi = "10.1016/j.scico.2010.02.007",
+ issn = "0167-6423",
+}
+
+@article{perna11_correc_hardw_synth,
+ author = "Perna, Juan and Woodcock, Jim and Sampaio, Augusto and Iyoda,
+ Juliano",
+ title = {Correct Hardware Synthesis},
+ journal = "Acta Informatica",
+ volume = 48,
+ number = 7,
+ pages = "363--396",
+ year = 2011,
+ doi = "10.1007/s00236-011-0142-y",
+ day = 01,
+ issn = "1432-0525",
+ month = "Dec"
+}
+
+@phdthesis{ellis08,
+ author = {Ellis, Martin},
+ title = {Correct synthesis and integration of compiler-generated function units},
+ school = {Newcastle University},
+ url = {https://theses.ncl.ac.uk/jspui/handle/10443/828},
+ year = {2008},
+}
+
+@article{06_ieee_stand_veril_hardw_descr_languag,
+  title = {IEEE Standard for Verilog Hardware Description Language},
+  journal = {IEEE Std 1364-2005 (Revision of IEEE Std 1364-2001)},
+  pages = {1-590},
+  year = {2006},
+  doi = {10.1109/IEEESTD.2006.99495},
+  month = {April},
+}
+
+@inproceedings{loow19_verif_compil_verif_proces,
+ author = {L\"{o}\"{o}w, Andreas and Kumar, Ramana and Tan, Yong Kiam and
+ Myreen, Magnus O. and Norrish, Michael and Abrahamsson, Oskar
+ and Fox, Anthony},
+ title = {Verified Compilation on a Verified Processor},
+ tags = {verification},
+ booktitle = {Proceedings of the 40th ACM SIGPLAN Conference on Programming
+ Language Design and Implementation},
+ year = 2019,
+ pages = {1041--1053},
+ doi = {10.1145/3314221.3314622},
+ acmid = 3314622,
+ address = {New York, NY, USA},
+ isbn = {978-1-4503-6712-7},
+ keywords = {compiler verification, hardware verification, program
+ verification, verified stack},
+ location = {Phoenix, AZ, USA},
+ numpages = 13,
+ publisher = {ACM},
+ series = {PLDI 2019},
+}
+
+@inproceedings{canis11_legup,
+ author = {Andrew Canis and Jongsok Choi and Mark Aldham and Victor Zhang and Ahmed
+ Kammoona and Jason Helge Anderson and Stephen Dean Brown and Tomasz S. Czajkowski},
+ title = {{LegUp}: high-level synthesis for {FPGA}-based processor/accelerator systems},
+ booktitle = {{FPGA}},
+ year = 2011,
+ pages = {33--36},
+ publisher = {{ACM}},
+}
+
+@inproceedings{choi18_hls,
+ author = {Young{-}kyu Choi and Jason Cong},
+ title = {{HLS}-based optimization and design space exploration for applications with
+ variable loop bounds},
+ booktitle = {{ICCAD}},
+ year = 2018,
+ publisher = {{ACM}},
+}
+
+@article{canis13_legup,
+ author = {Canis, Andrew and Choi, Jongsok and Aldham, Mark and Zhang, Victor and
+ Kammoona, Ahmed and Czajkowski, Tomasz and Brown, Stephen D. and Anderson, Jason
+ H.},
+  title = {{LegUp}: An Open-Source High-Level Synthesis Tool for {FPGA}-Based
+               Processor/Accelerator Systems},
+ journal = {ACM Trans. Embed. Comput. Syst.},
+ volume = {13},
+ number = {2},
+ year = {2013},
+ doi = {10.1145/2514740},
+ address = {New York, NY, USA},
+ articleno = {24},
+ issn = {1539-9087},
+ issue_date = {September 2013},
+ keywords = {High-level synthesis, FPGAs, hardware/software codesign, synthesis,
+ performance, power, field-programmable gate arrays},
+ month = sep,
+ numpages = {27},
+ publisher = {Association for Computing Machinery},
+}
+
+@article{coussy09_introd_to_high_level_synth,
+ author = {P. {Coussy} and D. D. {Gajski} and M. {Meredith} and
+ A. {Takach}},
+ title = {An Introduction To High-Level Synthesis},
+ tags = {hls},
+ journal = {IEEE Design Test of Computers},
+ volume = 26,
+ number = 4,
+ pages = {8-17},
+ year = 2009,
+ doi = {10.1109/MDT.2009.69},
+ keywords = {high level synthesis;high-level synthesis;optimized RTL
+ hardware;abstraction level design;HLS techniques;High level
+ synthesis;Assembly;Application software;Circuit
+ simulation;Design methodology;Space exploration;Computer
+ architecture;Design optimization;Hardware design
+ languages;Circuit synthesis;high-level synthesis;RTL
+ abstraction;custom processors;hardware synthesis and
+ verification;architectures;design and test},
+ month = {July},
+}
+
+@article{aubury96_handel_c_languag_refer_guide,
+ author = {Aubury, Matthew and Page, Ian and Randall, Geoff and Saul,
+ Jonathan and Watts, Robin},
+ title = {Handel-C Language Reference Guide},
+ tags = {hls},
+ journal = {Computing Laboratory. Oxford University, UK},
+ year = 1996,
+}
+
+@inproceedings{clarke03_behav_c_veril,
+ author = {E. {Clarke} and D. {Kroening} and K. {Yorav}},
+ title = {Behavioral consistency of C and Verilog programs using bounded model checking},
+ booktitle = {Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451)},
+ year = 2003,
+ pages = {368-371},
+ doi = {10.1145/775832.775928},
+ month = {June},
+}
+
+@inproceedings{josipovic18_dynam_sched_high_level_synth,
+ author = {Josipovi\'{c}, Lana and Ghosal, Radhika and Ienne, Paolo},
+ title = {Dynamically Scheduled High-Level Synthesis},
+ booktitle = {Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable
+ Gate Arrays},
+ year = 2018,
+ pages = {127-136},
+ doi = {10.1145/3174243.3174264},
+ address = {New York, NY, USA},
+ isbn = 9781450356145,
+ keywords = {dynamically scheduled circuits, compiler, pipelining, high-level synthesis},
+ location = {Monterey, CALIFORNIA, USA},
+ numpages = 10,
+ publisher = {Association for Computing Machinery},
+ series = {FPGA '18},
+}
+
+@inproceedings{cheng20_combin_dynam_static_sched_high_level_synth,
+ author = {Cheng, Jianyi and Josipovic, Lana and Constantinides, George A. and Ienne,
+ Paolo and Wickerson, John},
+ title = {Combining Dynamic \& Static Scheduling in High-Level Synthesis},
+ booktitle = {The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
+ year = 2020,
+ pages = {288-298},
+ doi = {10.1145/3373087.3375297},
+ address = {New York, NY, USA},
+ isbn = 9781450370998,
+ keywords = {high-level synthesis, dynamic scheduling, static analysis},
+ location = {Seaside, CA, USA},
+ numpages = 11,
+ publisher = {Association for Computing Machinery},
+ series = {FPGA '20},
+}
+
+@article{josipovic17_out_of_order_load_store,
+ author = {Josipovic, Lana and Brisk, Philip and Ienne, Paolo},
+ title = {An Out-Of-Order Load-Store Queue for Spatial Computing},
+ journal = {ACM Trans. Embed. Comput. Syst.},
+ volume = {16},
+ number = {5s},
+ year = {2017},
+ doi = {10.1145/3126525},
+ address = {New York, NY, USA},
+ articleno = {125},
+ issn = {1539-9087},
+ issue_date = {October 2017},
+ keywords = {Load-store queue, dynamic scheduling, allocation, spatial computing},
+ month = sep,
+ numpages = {19},
+ publisher = {Association for Computing Machinery},
+}
+
+@inproceedings{josipovic20_buffer_placem_sizin_high_perfor_dataf_circuit,
+ author = {Josipovi\'{c}, Lana and Sheikhha, Shabnam and Guerrieri, Andrea and Ienne,
+ Paolo and Cortadella, Jordi},
+ title = {Buffer Placement and Sizing for High-Performance Dataflow Circuits},
+ booktitle = {The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
+ year = 2020,
+ pages = {186-196},
+ doi = {10.1145/3373087.3375314},
+ address = {New York, NY, USA},
+ isbn = 9781450370998,
+ keywords = {buffers, high-level synthesis, dataflow circuits, timing optimization},
+ location = {Seaside, CA, USA},
+ numpages = 11,
+ publisher = {Association for Computing Machinery},
+ series = {FPGA '20},
+}
+
+@article{leroy09_formal_verif_compil_back_end,
+ author = {Leroy, Xavier},
+ title = {A Formally Verified Compiler Back-End},
+ journal = {Journal of Automated Reasoning},
+ volume = {43},
+ number = {4},
+ pages = {363},
+ year = {2009},
+ doi = {10.1007/s10817-009-9155-4},
+ issn = {1573-0670},
+}
+
+@inproceedings{yang11_findin_under_bugs_c_compil,
+ author = {Yang, Xuejun and Chen, Yang and Eide, Eric and Regehr, John},
+ title = {Finding and Understanding Bugs in C Compilers},
+ booktitle = {Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design
+ and Implementation},
+ year = 2011,
+ pages = {283-294},
+ doi = {10.1145/1993498.1993532},
+ address = {New York, NY, USA},
+ isbn = 9781450306638,
+ keywords = {random program generation, compiler defect, automated testing, compiler
+ testing, random testing},
+ location = {San Jose, California, USA},
+ numpages = 12,
+ publisher = {Association for Computing Machinery},
+ series = {PLDI '11},
+}
+
+@article{bowen98_handel_c_languag_refer_manual,
+ author = {Bowen, Matthew},
+ title = {Handel-C Language Reference Manual},
+ journal = {Embedded Solutions Ltd},
+ volume = {2},
+ year = {1998},
+}
+
+@inproceedings{herklotz20_findin_under_bugs_fpga_synth_tools,
+ author = {Yann Herklotz and John Wickerson},
+ title = {Finding and Understanding Bugs in {FPGA} Synthesis Tools},
+ booktitle = {ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays},
+ year = 2020,
+ doi = {10.1145/3373087.3375310},
+ isbn = {978-1-4503-7099-8/20/02},
+ keywords = {automated testing, compiler defect, compiler testing, random program
+ generation, random testing},
+ location = {Seaside, CA, USA},
+ numpages = 11,
+}
+
+@misc{xilinx_vivad_desig_suite,
+ author = {Xilinx},
+ title = {{Vivado Design Suite}},
+ url = {https://bit.ly/2wZAmld},
+ urldate = {2019-01-14},
+ year = 2019,
+}
+
+@misc{intel_intel_quart,
+ author = {Intel},
+ title = {{Intel Quartus}},
+  url = {https://intel.ly/2m7wbCs},
+ urldate = {2019-01-14},
+ year = 2019,
+}
+
+@misc{wolf_yosys_open_synth_suite,
+ author = {Clifford Wolf},
+ title = {{Yosys Open SYnthesis Suite}},
+ url = {https://bit.ly/2kAXg0q},
+ urldate = {2019-01-11},
+ year = 2019,
+}
+
+@misc{xilinx_xst_synth_overv,
+ author = {Xilinx},
+ title = {{XST} Synthesis Overview},
+ url = {https://bit.ly/2lGtkjL},
+ urldate = {2019-01-11},
+ year = 2019,
+}
+
+@inproceedings{kildall73_unified_approac_global_progr_optim,
+ author = {Kildall, Gary A.},
+ title = {A Unified Approach to Global Program Optimization},
+ booktitle = {Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of
+ Programming Languages},
+ year = 1973,
+ pages = {194-206},
+ doi = {10.1145/512927.512945},
+ address = {New York, NY, USA},
+ isbn = 9781450373494,
+ location = {Boston, Massachusetts},
+ numpages = 13,
+ publisher = {Association for Computing Machinery},
+ series = {POPL '73},
+}
+
+@inproceedings{bertot06_struc_approac_provin_compil_optim,
+ author = "Bertot, Yves and Gr{\'e}goire, Benjamin and Leroy, Xavier",
+ title = "A Structured Approach to Proving Compiler Optimizations
+ Based on Dataflow Analysis",
+ booktitle = "Types for Proofs and Programs",
+ year = 2006,
+ pages = "66--81",
+ address = "Berlin, Heidelberg",
+ editor = "Filli{\^a}tre, Jean-Christophe and Paulin-Mohring,
+ Christine and Werner, Benjamin",
+ isbn = "978-3-540-31429-5",
+ publisher = "Springer",
+}
+
+@inproceedings{karfa07_hand_verif_high_synth,
+ author = {Karfa, C. and Sarkar, D. and Mandal, C.
+ and Reade, C.},
+ title = {Hand-in-hand Verification of High-level Synthesis},
+ tags = {hls},
+ booktitle = {Proceedings of the 17th ACM Great Lakes Symposium on VLSI},
+ year = 2007,
+ pages = {429--434},
+ doi = {10.1145/1228784.1228885},
+ acmid = 1228885,
+ address = {New York, NY, USA},
+ isbn = {978-1-59593-605-9},
+ location = {Stresa-Lago Maggiore, Italy},
+ numpages = 6,
+  publisher = {Association for Computing Machinery},
+ series = {GLSVLSI '07},
+}
+
+@article{karfa08_equiv_check_method_sched_verif,
+  author = {Karfa, C. and Sarkar, D. and Mandal, C. and Kumar, P.},
+ title = {An Equivalence-Checking Method for Scheduling Verification in
+ High-Level Synthesis},
+ tags = {hls, verification},
+ journal = {IEEE Transactions on Computer-Aided Design of Integrated
+ Circuits and Systems},
+ volume = 27,
+ number = 3,
+  pages = {556--569},
+ year = 2008,
+ doi = {10.1109/TCAD.2007.913390},
+  issn = {1937-4151},
+  month = mar,
+}
+
+@article{karfa10_verif_datap_contr_gener_phase,
+  author = {Karfa, C. and Sarkar, D. and Mandal, C.},
+ title = {Verification of Datapath and Controller Generation Phase in
+ High-Level Synthesis of Digital Circuits},
+ tags = {hls, verification},
+ journal = {IEEE Transactions on Computer-Aided Design of Integrated
+ Circuits and Systems},
+ volume = 29,
+ number = 3,
+  pages = {479--492},
+ year = 2010,
+ doi = {10.1109/TCAD.2009.2035542},
+  issn = {1937-4151},
+  month = mar,
+}
+
+@article{karfa12_formal_verif_code_motion_techn,
+ author = {Karfa, C. and Mandal, C. and Sarkar, D.},
+ title = {Formal Verification of Code Motion Techniques Using
+ Data-Flow-Driven Equivalence Checking},
+ tags = {hls, verification},
+ journal = {ACM Trans. Des. Autom. Electron. Syst.},
+ volume = 17,
+ number = 3,
+ year = 2012,
+ doi = {10.1145/2209291.2209303},
+ address = {New York, NY, USA},
+  articleno = 30,
+ issn = {1084-4309},
+ issue_date = {June 2012},
+ month = jul,
+ numpages = 37,
+ publisher = {Association for Computing Machinery},
+}
+
+@article{chouksey20_verif_sched_condit_behav_high_level_synth,
+  author = {Chouksey, R. and Karfa, C.},
+ title = {Verification of Scheduling of Conditional Behaviors in
+ High-Level Synthesis},
+ journal = {IEEE Transactions on Very Large Scale Integration (VLSI)
+ Systems},
+  pages = {1--14},
+  year = 2020,
+  doi = {10.1109/TVLSI.2020.2978242},
+  issn = {1557-9999},
+}
+
+@inproceedings{kundu08_valid_high_level_synth,
+ author = "Kundu, Sudipta and Lerner, Sorin and Gupta, Rajesh",
+ title = "Validating High-Level Synthesis",
+ booktitle = "Computer Aided Verification",
+ year = 2008,
+ pages = "459--472",
+ address = "Berlin, Heidelberg",
+ editor = "Gupta, Aarti and Malik, Sharad",
+ isbn = "978-3-540-70545-1",
+ publisher = "Springer Berlin Heidelberg",
+}
+
+@inproceedings{kundu07_autom,
+  author = {Kundu, Sudipta and Lerner, Sorin and Gupta, Rajesh},
+  title = {Automated Refinement Checking of Concurrent Systems},
+  booktitle = {2007 IEEE/ACM International Conference on Computer-Aided Design},
+  year = 2007,
+  pages = {318--325},
+  doi = {10.1109/ICCAD.2007.4397284},
+  issn = {1558-2434},
+  month = nov,
+}
+
+@inproceedings{tristan10_simpl_verif_valid_softw_pipel,
+  author = {Tristan, Jean-Baptiste and Leroy, Xavier},
+  title = {A Simple, Verified Validator for Software Pipelining},
+  booktitle = {Proceedings of the 37th Annual ACM SIGPLAN-SIGACT Symposium on
+               Principles of Programming Languages},
+  year = 2010,
+  pages = {83--92},
+  doi = {10.1145/1706299.1706311},
+  isbn = {9781605584799},
+  keywords = {symbolic execution, coq, verification, translation validation,
+              loop scheduling, compiler optimisation, software pipelining},
+  location = {Madrid, Spain},
+  publisher = {Association for Computing Machinery},
+  series = {POPL '10},
+}
+
+@article{barthe14_formal_verif_ssa_based_middl_end_compc,
+  author = {Barthe, Gilles and Demange, Delphine and Pichardie, David},
+  title = {Formal Verification of an SSA-Based Middle-End for CompCert},
+  journal = {ACM Trans. Program. Lang. Syst.},
+  volume = 36,
+  number = 1,
+  year = 2014,
+  month = mar,
+  doi = {10.1145/2579080},
+  issn = {0164-0925},
+  keywords = {CompCertSSA, CompCert, SSA, coq, verification, compiler optimisation},
+  address = {New York, NY, USA},
+  publisher = {Association for Computing Machinery},
+}
+
+@inproceedings{ottenstein90_progr_depen_web,
+  author = {Ottenstein, Karl J. and Ballance, Robert A. and MacCabe, Arthur B.},
+  title = {The Program Dependence Web: A Representation Supporting Control-,
+           Data-, and Demand-Driven Interpretation of Imperative Languages},
+  booktitle = {Proceedings of the ACM SIGPLAN 1990 Conference on Programming
+               Language Design and Implementation},
+  year = 1990,
+  pages = {257--271},
+  doi = {10.1145/93542.93578},
+  isbn = {0897913647},
+  keywords = {gated-SSA, SSA, program dependence graph},
+  location = {White Plains, New York, USA},
+  publisher = {Association for Computing Machinery},
+  series = {PLDI '90},
+}