author    Yann Herklotz <git@yannherklotz.com>  2022-06-27 15:01:51 +0100
committer Yann Herklotz <git@yannherklotz.com>  2022-06-27 15:01:51 +0100
commit    2fa27381a18e65b4af2d860dc9be17489ff26258
tree      558ce90874835ff0094af1abf56256b346402ffc
parent    b93723f5d4391333b6ed12cea8c3f49871976927
Add changes
-rw-r--r--  Makefile                    6
-rw-r--r--  chapters/background.tex    40
-rw-r--r--  chapters/hls.tex           34
-rw-r--r--  chapters/introduction.tex   2
-rw-r--r--  chapters/pipelining.tex    73
-rw-r--r--  chapters/schedule.tex       4
-rw-r--r--  chapters/scheduling.tex   327
-rw-r--r--  lsr_env.tex                12
-rw-r--r--  lsr_refs.bib               31
-rw-r--r--  title.tex                   1
10 files changed, 302 insertions, 228 deletions
diff --git a/Makefile b/Makefile
index 4278ffd..216dded 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,8 @@
MODE ?= main
REBUILD_DEPS ?= yes
DEPS :=
+DEFAULT_DEPS := lsr_env.tex fonts_env.tex lsr_refs.bib
+ALL_TEX := $(wildcard chapters/*.tex) $(wildcard *.tex)
CONTEXT ?= context
ifeq ($(REBUILD_DEPS), yes)
@@ -11,7 +13,7 @@ endif
all: main.pdf
-main.pdf: $(DEPS)
+main.pdf: $(DEFAULT_DEPS) $(DEPS) $(ALL_TEX)
figures/%.pdf: figures/%.tex
latexmk -pdf -shell-escape $<
@@ -20,7 +22,7 @@ figures/%.pdf: figures/%.tex
chapters/scheduling.pdf: figures/timing-3.pdf
# silent structure,structures,pages,resolvers,open source,close source,loading,modules
-%.pdf: %.tex
+%.pdf: %.tex $(DEFAULT_DEPS)
$(CONTEXT) --mode=$(MODE) --nonstopmode --silent='*' $<
cp $(notdir $@) $@ || true
diff --git a/chapters/background.tex b/chapters/background.tex
index d4ec91f..3eb525a 100644
--- a/chapters/background.tex
+++ b/chapters/background.tex
@@ -45,8 +45,8 @@ SMTCoq~\cite[armand11_modul_integ_sat_smt_solver].
\subsection[sec:background:interactive]{Interactive theorem provers}
-Interactive theorem provers, on the other hand, focus on the checking proofs that are provided to
-them. These can either be written manually by the user, or automatically generated by some decision
+Interactive theorem provers, on the other hand, focus on checking proofs that are provided to them.
+These can either be written manually by the user, or automatically generated by some decision
procedure. However, these two ways of generating proofs can be combined, so the general proof
structure can be manually written, and simpler steps inside of the proof can be automatically
solved.
@@ -75,7 +75,7 @@ or a custom representation of the code.
\desc{Hardware resource allocation} It is possible to select a device to target, which gives
information about how many resources are available and can be used to implement the logic. However,
-tools often assume that an unlimited amount of resources are available instead. This is resource
+tools often assume that an unlimited amount of resources are available instead. Resource
sharing can be too expensive for adders, as the logic for sharing the adder would have approximately
the same cost as adding a separate adder. However, multipliers and divider blocks should be shared,
especially as divider blocks are often implemented in logic and are therefore quite expensive.
@@ -91,8 +91,8 @@ could be used, such as static or dynamic scheduling, which are described further
the variables need to be bound to hardware units and registers respectively. It is often not that
practical to share many hardware units though, as this requires adding multiplexers, which are often
the same size as the hardware units themselves. It is therefore more practical to only share
-operations that take up a lot of space, such as modulo or divide circuits, as well as multiplies if
-there are not available any more on the FPGA.
+operations that take up a lot of space, such as modulo or divide circuits, as well as multipliers if
+they are not available any more on the FPGA.
\desc{Hardware description generation} Finally, the hardware description is generated from the code
that was described in the intermediate language. This is often a direct translation of the
@@ -109,7 +109,7 @@ semantics.
\subsection[sec:language-blocks]{Intermediate Language}
This section describes some possible characteristics of a language that could be used as an input to
-an \HLS\ tool. These language normally require some structure to allow for easier optimisation and
+an \HLS\ tool. These languages normally require some structure to allow for easier optimisation and
analysis passes. In particular, it is often useful to have contiguous blocks of instructions that
do not contain any control-flow in one list. This means that these instructions can safely be
rearranged by only looking at local information of the block itself, and in particular it allows for
@@ -168,7 +168,7 @@ be tricky if there are loads and stores in the loop or if the loops are nested.
\subsection[sec:dynamic_scheduling]{Dynamic Scheduling}
On the other hand, Dynamic scheduling~\cite{josipovic18_dynam_sched_high_level_synth} does not
-require the schedule to be known at compile time and it instead generates circuits using tokens to
+require the schedule to be known at compile time and instead it generates circuits using tokens to
schedule the operations in parallel at run time. Whenever the data for an operation is available,
it sends a token to the next operation, signalling that the data is ready to be read. The next
operation does not start until all the required inputs to the operation are available, and once that
@@ -248,17 +248,17 @@ further in \in{Section}[sec:wire-to-wire].
Even though a proof for the correctness of an algorithm might exist, this does not guarantee that
the algorithm itself behaves in the same way as the assumed algorithm in the proof. The proof is
separate from the actual implementation of the algorithm, meaning there could be
-various implementation bugs in the algorithm that mean that it does not behave correctly. C
-compilers are a good example of this, where many optimisations performed by the compilers have been
-proven correct, however these proofs are not linked directly to the actual implementations of these
-algorithms in GCC or Clang. Yang et al.~\cite{yang11_findin_under_bugs_c_compil} found
-more than 300 bugs in GCC and Clang, many of them being in the optimisation phases of the compiler.
-One way to link the proofs to the actual implementations in these compilers is to write the compiler
-in a language which allows for a theorem prover to check properties about the algorithms. This
-allows for the proofs to be directly linked to the algorithms, ensuring that the actual
-implementations are proven correct. Yang et al.~\cite{yang11_findin_under_bugs_c_compil}
-found that CompCert, a formally verified C Compiler, only had five bugs in all the unverified parts
-of the compiler, meaning this method of proving algorithms correct ensures a correct compiler.
+various implementation bugs in the algorithm that cause it to behave incorrectly. C compilers are a
+good example of this, where many optimisations performed by the compilers have been proven correct,
+however these proofs are not linked directly to the actual implementations of these algorithms in
+GCC or Clang. Yang et al.~\cite{yang11_findin_under_bugs_c_compil} found more than 300 bugs in GCC
+and Clang, many of them appearing in the optimisation phases of the compiler. One way to link the
+proofs to the actual implementations in these compilers is to write the compiler in a language which
+allows for a theorem prover to check properties about the algorithms. This allows for the proofs to
+be directly linked to the algorithms, ensuring that the actual implementations are proven correct.
+Yang et al.~\cite{yang11_findin_under_bugs_c_compil} found that CompCert, a formally verified C
+compiler, only had five bugs, all in the unverified parts of the compiler, showing that this method
+of proving algorithms correct is an effective way of producing a reliable compiler.
\subsection{CompCert}
@@ -333,8 +333,8 @@ As both the input IR and output netlist format have been designed from scratch,
useful for real world applications, as they require a different back end to be implemented from
existing compilers. In addition to that, it would most likely mean that the translation from
higher-level language to the IR is unverified and could therefore contain bugs. As the resulting
-netlist also uses a custom format, it cannot be fed directly to tools that can then translate it a
-bitstream to place onto an FPGA. The reason for designing a custom IR and netlist was so that
+netlist also uses a custom format, it cannot be fed directly to tools that can then translate it to
+a bitstream to place onto an FPGA. The reason for designing a custom IR and netlist was so that
these were compatible with each other, making proofs of equivalence between the two simpler.
Finally, it is unclear whether or not a translation algorithm from the IR to the netlist format was
diff --git a/chapters/hls.tex b/chapters/hls.tex
index 0be3f99..8a75fee 100644
--- a/chapters/hls.tex
+++ b/chapters/hls.tex
@@ -484,7 +484,7 @@ making it easier to prove optimisations like proper RAM insertion.
\stoptikzpicture
\stopplacemarginfigure
-\paragraph{Translating memory} Typically, HLS-generated hardware consists of a sea of registers and
+\paragraph{Translating memory.} Typically, HLS-generated hardware consists of a sea of registers and
RAMs. This memory view is very different from the C memory model, so we perform the following
translation from CompCert's abstract memory model to a concrete RAM. Variables that do not have
their address taken are kept in registers, which correspond to the registers in 3AC. All
@@ -500,17 +500,15 @@ and would instead implement it using many registers. This interface is shown on
Verilog code in \in{Figure}{a}[fig:accumulator_c_rtl]. A high-level overview of the architecture
and of the RAM interface can be seen in \in{Figure}[fig:accumulator_diagram].
-\paragraph{Translating instructions}
-
-Most 3AC instructions correspond to hardware constructs. For example, line 2 in
-\in{Figure}{d}[fig:accumulator_c_rtl] shows a 32-bit register \type{x5} being initialised to 3,
-after which the control flow moves execution to line 3. This initialisation is also encoded in the
-Verilog generated from HTL at state 8 in both the control logic and data-path always-blocks, shown
-at lines 33 and 16 respectively in \in{Figure}{a}[fig:accumulator_c_rtl]. Simple operator
-instructions are translated in a similar way. For example, the add instruction is just translated
-to the built-in add operator, similarly for the multiply operator. All 32-bit instructions can be
-translated in this way, but some special instructions require extra care. One such instruction is
-the \type{Oshrximm} instruction, which is discussed further in
+\paragraph{Translating instructions.} Most 3AC instructions correspond to hardware constructs. For
+example, line 2 in \in{Figure}{d}[fig:accumulator_c_rtl] shows a 32-bit register \type{x5} being
+initialised to 3, after which the control flow moves execution to line 3. This initialisation is
+also encoded in the Verilog generated from HTL at state 8 in both the control logic and data-path
+always-blocks, shown at lines 33 and 16 respectively in \in{Figure}{a}[fig:accumulator_c_rtl].
+Simple operator instructions are translated in a similar way. For example, the add instruction is
+just translated to the built-in add operator, similarly for the multiply operator. All 32-bit
+instructions can be translated in this way, but some special instructions require extra care. One
+such instruction is the \type{Oshrximm} instruction, which is discussed further in
\in{Section}[sec:algorithm:optimisation:oshrximm]. Another is the \type{Oshldimm} instruction,
which is a left rotate instruction that has no Verilog equivalent and therefore has to be
implemented in terms of other operations and proven to be equivalent. The only 32-bit instructions
@@ -983,12 +981,12 @@ in the hardware itself.
\stoptikzpicture}
\stopplacefigure
-This translation is represented in \in{Section}[fig:memory_model_transl]. defines a map from blocks
-to maps from memory addresses to memory contents. Each block represents an area in memory; for
-example, a block can represent a global variable or a stack for a function. As there are no global
-variables, the main stack can be assumed to be block 0, and this is the only block we
-translate. Meanwhile, our Verilog semantics defines two finite arrays of optional values, one for
-the blocking assignments map $\Gamma_{\rm a}$ and one for the nonblocking assignments map
+This translation is represented in \in{Figure}[fig:memory_model_transl]. CompCert's memory model
+defines a map from blocks to maps from memory addresses to memory contents. Each block represents an
+area in memory; for example, a block can represent a global variable or a stack for a function. As
+there are no global variables, the main stack can be assumed to be block 0, and this is the only
+block we translate. Meanwhile, our Verilog semantics defines two finite arrays of optional values,
+one for the blocking assignments map $\Gamma_{\rm a}$ and one for the nonblocking assignments map
$\Delta_{\rm a}$. The optional values are present to ensure correct merging of the two association
maps at the end of the clock cycle. The invariant used in the proofs is that block 0 should be
equivalent to the merged representation of the $\Gamma_{\rm a}$ and $\Delta_{\rm a}$ maps.
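+
+As a rough illustration of this invariant, the merge at the end of the clock cycle can be sketched
+in OCaml as follows, assuming both maps are represented as arrays of optional values (the names are
+hypothetical and do not mirror the Coq development):
+
+\starttyping
+(* Merge the nonblocking map (delta) into the blocking map (gamma) at the
+   end of a clock cycle: a Some entry in delta is a nonblocking write that
+   takes effect now, while None leaves the blocking value unchanged. *)
+let merge gamma delta =
+  Array.map2 (fun g d -> match d with Some v -> Some v | None -> g) gamma delta
+\stoptyping
+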
diff --git a/chapters/introduction.tex b/chapters/introduction.tex
index fecfcaa..aae48a3 100644
--- a/chapters/introduction.tex
+++ b/chapters/introduction.tex
@@ -17,7 +17,7 @@ while offering the convenient abstractions and rich ecosystems of software devel
There are two main use-cases of \HLS. Firstly, it is a very convenient tool to use for prototyping
designs, as this means that the software model can be reused for the initial design of the hardware.
-Secondly, \HLS\ is also often used design hardware that has quite regular control-flow and is
+Secondly, \HLS\ is also often used to design hardware that has quite regular control-flow and is
compute intensive, one example being \DSP\ hardware design~\cite{coussy08_gaut}. Ideally, \HLS\
should also be good for projects that require detailed functional verification, as manual \HDL\
designs often cannot be fully verified because of their size. \HLS\ would move the functional
diff --git a/chapters/pipelining.tex b/chapters/pipelining.tex
index 7d81da6..a7a54d0 100644
--- a/chapters/pipelining.tex
+++ b/chapters/pipelining.tex
@@ -97,9 +97,9 @@ $\mu$ and representing the unroll factor of the steady state $\delta$.
The goal is to verify the equivalence of the original loop and the pipelined loop using a validation
algorithm, and then prove that the validation algorithm is \emph{sound}, i.e. if it finds the two
loops to be equivalent, this implies that they will behave the same according to the language
-semantics. Essentially, as the loop pipelining algorithm only modifies loops, it is enough to show
-that the input loop $X$ is equivalent to the pipelined version of the loop $Y$. Assuming that the
-validation algorithm is called $\alpha$, we therefore want to show that the following statement
+semantics. Essentially, as the loop pipelining algorithm only modifies loops, it is sufficient to
+show that the input loop $X$ is equivalent to the pipelined version of the loop $Y$. Assuming that
+the validation algorithm is called $\alpha$, we therefore want to show that the following statement
holds, where $X_N$ means that we are considering $N$ iterations of the loop.
\placeformula\startformula
@@ -113,7 +113,7 @@ number of loop iterations $N$ are not always known at compile time, this may bec
property that needs to be checked.
This infinite property can be proven by checking smaller finite properties that loop pipelining
-should ensure. The two key insights by \cite[author][tristan10_simpl_verif_valid_softw_pipel] where
+should ensure. The two key insights by \cite[author][tristan10_simpl_verif_valid_softw_pipel] were
that essentially only the following two properties needed to be checked:
\placeformula[eq:loop1]\startformula
@@ -130,8 +130,8 @@ one more body $\mathcal{S}$ of the pipelined loop. The second property shown in
\in{Equation}[eq:loop2] describes that executing the initial loop $\mu$ times is exactly the same as
executing the prologue $\mathcal{P}$ of the loop followed by the epilogue $\mathcal{E}$. In
addition to some simpler properties, this means that the infinite property can still be
-checked.\footnote{In addition to other more trivial properties, for example that the loop structure
- is correct, and only contains one increment operation on the induction variable.}
+checked.\footnote{This is in addition to other more trivial properties, for example that the loop
+ structure is correct, and only contains one increment operation on the induction variable.}
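+
+To make the shape of these checks concrete, the following OCaml sketch assumes a symbolic evaluator
+\type{sym} over instruction lists and an equivalence check \type{equiv} on symbolic states (both
+hypothetical); \in{Equation}[eq:loop1] would be checked analogously with one more steady-state body:
+
+\starttyping
+(* Concatenate n copies of a block. *)
+let rep b n = List.concat (List.init n (fun _ -> b))
+
+(* eq:loop2: executing the original body mu times must be equivalent to
+   executing the prologue followed directly by the epilogue. *)
+let loop2_ok ~sym ~equiv body prologue epilogue mu =
+  equiv (sym (rep body mu)) (sym (prologue @ epilogue))
+\stoptyping
+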
Finally, the verifier itself needs to be proven correct, meaning the correctness of the symbolic
evaluation needs to be shown to be semantics preserving. If the comparison between two symbolic
@@ -140,22 +140,22 @@ states succeeds, then this must imply that the initial blocks themselves execute
\section{Novelties of verifying software pipelining in Vericert}
As mentioned in the previous section, software pipelining has already been implemented in CompCert
-by \cite[author][tristan10_simpl_verif_valid_softw_pipel]. From an implementation point of view for
+by \cite[author][tristan10_simpl_verif_valid_softw_pipel]. From an implementation point of view,
the actual pipelining algorithm that was implemented was a backtracking iterative modulo scheduling
algorithm paired with modulo variable expansion. However, the results were underwhelming because it
was tested on an out-of-order PowerPC G5 processor, whereas pipelining is more effective on in-order
processors, which can take full advantage of the static pipeline. In addition to that, alias analysis on
-memory dependencies is also not performed, which is perhaps the best way to improve the performance
-of the optimisation. Finally, the loop scheduling is also only performed on basic blocks, and not
-on extended basic blocks such as superblocks or hyperblocks. These are all short-comings that could
-be easily improved by using our existing {\sc RTLBlock} or {\sc RTLPar} language as the source
-language of the pipelining transformation. Paired with the scheduling step, one could extract more
-parallelism and also support conditionals in loop bodies. Memory aliasing is not directly supported
-yet in the {\sc RTLBlock} abstract interpretation, however, using some simple arithmetic properties
-it should be possible to normalise memory accesses and therefore prove reordering of memory accesses
-that do not alias under certain conditions.
-
-From a proof perspective there is also some improvements that can be made to the work. First of
+memory dependencies is also not performed, although this is perhaps the best way to improve the
+results of the optimisation. Finally, the loop scheduling is also only performed on basic blocks,
+and not on extended basic blocks such as superblocks or hyperblocks. These are all shortcomings
+that could be easily improved by using our existing {\sc RTLBlock} or {\sc RTLPar} language as the
+source language of the pipelining transformation. Paired with the scheduling step, one could
+extract more parallelism and also support conditionals in loop bodies. Memory aliasing is not
+directly supported yet in the {\sc RTLBlock} abstract interpretation, however, using some simple
+arithmetic properties it should be possible to normalise memory accesses and therefore prove
+reordering of memory accesses that do not alias under certain conditions.
+
+From a proof perspective there are also some improvements that can be made to the work. First of
all, the proof from the paper was actually never fully completed:
\startframedtext[
@@ -180,7 +180,7 @@ all, the proof from the paper was actually never fully completed:
\stopframedtext
\noindent The proof of semantic preservation going from unrolled loops with basic blocks back to a
-control-flow graph were tedious, showing that the language was not designed to handle loop
+control-flow graph was tedious, showing that the language was not designed to handle loop
transformations well. However, recent work on the formalisation of polyhedral transformations of
loops~\cite[authoryear][courant21_verif_code_gener_polyh_model] shows that this may be a better
representation for the intermediate language. However, translations to and from the {\sc Loop}
@@ -208,7 +208,7 @@ which the pipeline needs to be filled. The benefit of generating hardware direc
pipeline can already be laid out in the optimal and natural way.
It is less clear why, and especially how, software can be pipelined in the same way as hardware, and
-what the benefits are if the code is going to be executed.\footnote{It is more clear why scheduling
+what the benefits are if the code is going to be executed.\footnote{It is clearer why scheduling
would be important on a VLIW processor, but still not clear how the pipelined loop can be
represented in software without explicit parallelism.} The main difference compared to a hardware
pipeline is that in software the code is executed inherently sequentially. This means that it is
@@ -216,16 +216,45 @@ not possible to lay out a pipeline in the same way, as the next input cannot jus
loop while it is executing. However, a pipeline can be simulated in software, so that it behaves
the same as the hardware pipeline, but the loop body has to be manually unrolled and rearranged so
that all the stages can be executed by following one control-flow. The order of
-instructions will be quite different hardware, where the pipeline can be written naturally in
+instructions will be quite different in hardware, where the pipeline can be written naturally in
sequential order of the stages. In software, the pipeline is instead represented by one horizontal
slice of the pipeline, which will be the new loop body and represents the computation that the whole
hardware pipeline is performing at one point in time.
-The main question that comes up is are the two pipelining methods actually equivalent? In Vericert
+The main question that comes up is: are the two pipelining methods actually equivalent? In Vericert
we are assuming that performing software pipelining, followed by scheduling and finally by hardware
generation, approximates the hardware pipelining that traditional \HLS\ tools perform as
part of their hardware generation.
+\subsection{Hardware Pipelining Can Be Approximated by Software Pipelining}
+
+The hope is that hardware pipelining can be approximated by software pipelining, so that the
+translation can be split into separate passes. We argue that this is in fact possible, with two
+additional properties that can be added to the software intermediate language. The two main
+problems that
+software pipelining runs into when directly comparing it to the equivalent hardware pipeline are
+first the duplicated code for the prologue and the epilogue, and secondly the possible loop
+unrolling that is necessary to then allocate the registers correctly.
+
+\paragraph{Removing duplicate prologue and epilogue.} Disregarding the code duplication that
+results from generating the prologue and the epilogue, the main problem with having these pieces of
+code is that one ends up with a loop that must execute a minimum number of iterations. This is not
+the case with a hardware pipeline, and it means that in software, to keep the code correct, the
+original version of the loop needs to be kept around, so that if the iteration count falls below
+this minimum, execution can branch to the original loop
+instead. This does not really affect the performance of the code, but it does add a duplicate of
+every loop to the program, which might increase the size of the hardware dramatically. Predicated
+execution can be used to selectively turn on instructions that should be running at this point in
+time, therefore simulating the prologue and epilogue without having to duplicate any code.
+
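+As a rough sketch of this idea, each stage of the pipelined body can be guarded by a predicate that
+checks whether the iteration that stage would work on is in range; all names below are
+hypothetical:
+
+\starttyping
+(* Stage s of the pipelined body works on iteration (tick - s); it should
+   only be enabled while that iteration is in range.  On early ticks the
+   later stages are off (prologue), and on late ticks the earlier stages
+   are off (epilogue), without duplicating any code. *)
+let stage_active ~tick ~stage ~iterations =
+  let i = tick - stage in
+  0 <= i && i < iterations
+
+(* A pipeline with k stages over n iterations needs n + k - 1 ticks of the
+   single predicated steady-state body. *)
+let ticks ~stages ~n = n + stages - 1
+\stoptyping
+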
+\paragraph{Removing the need to unroll the loop.} In addition to that, sometimes register
+allocation will require the loop to be unrolled many times to remain correct. This also means that
+the original loop will have to be kept around in case the number of executions of the loop is not
+divisible by the number of times the loop was unrolled. However, by using a rotating register
+file~\cite[rau92_regis_alloc_softw_pipel_loops] the clash between registers from different
+iterations is removed, thereby eliminating the need to unroll the loop for register allocation.
+This does increase the number of registers that are needed, although this can be optimised further.
+In addition to that, rotating register files act a lot like proper hardware registers in pipelines.
+
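+The renaming performed by a rotating register file can be illustrated as follows, where a value
+written to a virtual register maps to a different physical register on each iteration (names are
+hypothetical):
+
+\starttyping
+(* Values written to the same virtual register r in successive iterations
+   land in different physical registers, so live ranges from different
+   iterations no longer clash and no unrolling is needed. *)
+let physical ~size ~iteration r = (r + iteration) mod size
+\stoptyping
+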
\startmode[chapter]
\section{Bibliography}
\placelistofpublications
diff --git a/chapters/schedule.tex b/chapters/schedule.tex
index 73cb4b2..52b93d0 100644
--- a/chapters/schedule.tex
+++ b/chapters/schedule.tex
@@ -26,7 +26,7 @@ hyperblock scheduling, and is described in \in{Chapter}[sec:scheduling]. Curren
need to be completed, as well as submitting a paper about the work to FMCAD/POPL/PLDI. In general,
this work was supposed to be finished in November 2021, however, due to complications with the
actual formulation of the semantics, as well as the number of proofs that need to be completed, the
-work is still on going.
+work is still ongoing.
\desc{Modulo scheduling} The second piece of work is described in \in{Chapter}[sec:pipelining] and
describes the implementation of modulo scheduling to perform scheduling on loops. This work will be
@@ -92,7 +92,7 @@ proving hyperblock scheduling. This requires proofs of the following translatio
proof as well as simplify the scheduling algorithm.
\desc{\sc IfConv} The {\sc IfConv} transformation introduces predicate instructions by converting
-conditional statements. This is necessary to generates hyperblocks and therefore improve the
+conditional statements. This is necessary to generate hyperblocks and therefore improve the
schedule by passing larger blocks to the scheduler.
\desc{\sc RTLBlock $\rightarrow$ RTLPar} The proof of the scheduling algorithm itself, which is
diff --git a/chapters/scheduling.tex b/chapters/scheduling.tex
index 4b897ef..d3a25f4 100644
--- a/chapters/scheduling.tex
+++ b/chapters/scheduling.tex
@@ -63,21 +63,18 @@ end, the key points of this paper are the following:
\startplacemarginfigure[reference={fig:compcert_interm},title={New intermediate languages introduced into
CompCert. Progress bars are shown for each transformation pass, where green represents the
- estimated part of the work that is completed, and red stands for the remaining work to be done.}]
- \hbox{\starttikzpicture[shorten >=1pt,>=stealth]
- \node at (-3, 0) (dotsin) {$\cdots$};
- \node at (10, 0) (dotsout) {$\cdots$};
- \node[draw] at (-1, 0) (rtl) {\rtl};
- \node[draw] at (2, 0) (rtlblock) {\rtlblock};
- \node[draw] at (2, -2) (abstrblock) {\abstr};
- \node[draw] at (6, 0) (rtlpar) {\rtlpar};
- \node[draw] at (6, -2) (abstrpar) {\abstr};
- \node[draw] at (8.5, 0) (htl) {\htl};
+ estimated part of the work that is completed, and red stands for the remaining work to be done.
+Orange stands for a pass that is currently not used in the translation but could be
+added back in the future.}]
+ \hbox{\starttikzpicture[shorten >=1pt,>=stealth] \node at (-3, 0) (dotsin) {$\cdots$}; \node at
+ (10, 0) (dotsout) {$\cdots$}; \node[draw] at (-1, 0) (rtl) {\rtl}; \node[draw] at (2, 0)
+ (rtlblock) {\rtlblock}; \node[draw] at (2, -2) (abstrblock) {\abstr}; \node[draw] at (6, 0)
+ (rtlpar) {\rtlpar}; \node[draw] at (6, -2) (abstrpar) {\abstr}; \node[draw] at (8.5, 0) (htl)
+ {\htl};
\draw[->] (rtl) --
node[above,midway,align=center,font=\small] {basic block \\[-0.3em] creation}
- node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=VenetianRed,minimum width=1.5cm] {}
- node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=0.6cm] {}
+ node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=1.5cm] {}
(rtlblock);
\draw[->] (rtlblock) --
node[above,midway,font=\small] {scheduling}
@@ -103,12 +100,11 @@ end, the key points of this paper are the following:
\draw[->] (rtlblock) to [out=130,in=50,loop,looseness=10]
node[above,midway,font=\small] {if-conversion}
node[below,yshift=6mm,xshift=-0.75cm,anchor=west,fill=VenetianRed,minimum width=1.5cm] {}
- node[below,yshift=6mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=0.2cm] {}
+ node[below,yshift=6mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=1cm] {}
(rtlblock);
\draw[->] (rtlpar) to [out=130,in=50,loop,looseness=10]
node[above,midway,align=center,font=\small] { operation \\[-0.3em] pipelining}
- node[below,yshift=1cm,xshift=-0.75cm,anchor=west,fill=VenetianRed,minimum width=1.5cm] {}
- node[below,yshift=1cm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=0.2cm] {}
+ node[below,yshift=1cm,xshift=-0.75cm,anchor=west,fill=TigersEye,minimum width=1.5cm] {}
(rtlpar);
\draw[->] (dotsin) -- (rtl);
\draw[->] (htl) -- (dotsout);
@@ -124,20 +120,18 @@ support of pipelined operations such as dividers.
In general, the implementation of all the passes is now complete, and so current work is being done
to complete the correctness proofs of all the passes. Starting with the basic block creation pass,
which is further described in \in{Section}[sec:basic-block-generation], the proof is quite complex
-for generating basic blocks. However, the main idea of how to complete the proof has now been
-understood, so it should just be a matter of time until it is implemented. Secondly, the
-if-conversion pass has only been implemented and has not been verified yet. However, it should be
-straightforward to verify it, as the transformation is quite direct. Next, for the scheduling pass
-the verification translates the input and the output of the scheduling algorithm into the \abstr\
-language and then compares the two languages has mainly been completed. However, due to changes in
-the \rtlblock\ and \rtlpar\ language semantics, some of these proofs will have to be redone, which
-should not take too long though. Finally, for the generation of \htl\ from \rtlpar\ the proof has
-not been started yet. However, the difficult parts of the proof are the translation of loads and
-stores into proper reads and writes from RAM, which has already been completely proven for the \rtl\
-to \htl\ translation.
-
-The operation pipelining pass has only been implemented until now, however, it will likely be
-removed from the current workflow if it proves to take too long to complete.
+for generating basic blocks, but has now been completed. Secondly, the if-conversion pass has been
+implemented but has not been verified yet. Even though the transformation is quite direct,
+it is still difficult to formulate the correct matching condition for the proof. However, this has
+nearly been completed. Next, for the scheduling pass, the verification, which translates the input
+and the output of the scheduling algorithm into the \abstr\ language and then compares the two,
+has mainly been completed. However, due to changes in the \rtlblock\ and \rtlpar\
+language semantics, some of these proofs will have to be redone. Finally, for the generation of
+\htl\ from \rtlpar\ the proof has not been started yet. However, the difficult parts of the proof
+are the translation of loads and stores into proper reads and writes from RAM, which has already
+been completely proven for the \rtl\ to \htl\ translation, and can be directly copied over. The
+operation pipelining pass has so far only been implemented; however, it will likely be removed
+from the current workflow if it proves to take too long to complete.
\section[sec:basic-block-generation]{Basic Block Generation}
@@ -197,8 +191,8 @@ transition in the left hand side.
\node (s1) at (0,2) {$S_{1}$};
\node (s2p) at (3,0) {$S_{2}'$};
\node (s1p) at (3,2) {$S_{1}'$};
- \draw (s1) -- node[above] {$\sim$} (s1p);
- \draw[dashed] (s2) -- node[above] {$\sim$} (s2p);
+ \draw (s1) -- node[above] {$\sim_{i::\varnothing}$} (s1p);
+ \draw[dashed] (s2) -- node[above] {$\exists b, \sim_{b}$} (s2p);
\draw[->] (s1) -- (s2);
\draw[->,dashed] (s1p) -- (s2p);
\stoptikzpicture}
@@ -208,9 +202,9 @@ transition in the left hand side.
\node (s2) at (0,0) {$S_{2}$};
\node (s1) at (0,2) {$S_{1}$};
\node (s1p) at (3,2) {$S_{1}'$};
- \draw (s1) -- node[above] {$\sim$} (s1p);
+ \draw (s1) -- node[above] {$\sim_{i::b}$} (s1p);
\draw[->] (s1) -- (s2);
- \draw[dashed] (s2) -- node[below,xshift=1mm] {$\sim$} (s1p);
+ \draw[dashed] (s2) -- node[below,xshift=3mm] {$\sim_{b}$} (s1p);
\stoptikzpicture}
\stopplacefigure
\stopfloatcombination
@@ -218,17 +212,41 @@ transition in the left hand side.
The diagram on the left represents a control-flow transition, which is performed in lock-step. This
means one has encountered a control-flow instruction, and therefore also implies that the basic
-block has finished executing and that the next block needs to be retrieved. The diagram on the
-right is the interesting one which allows for the multiple transitions in the input language. It
-requires writing a matching predicate $\sim$ which can match the start of the basic block in
-\rtlblock\ with any state in the original \rtl\ language as long as one has not finished executing
-the basic block and has not encountered a control-flow instruction. By also showing that some
-metric stored in the $\sim$ relation is decreasing, one can then show that one will not get stuck
-iterating forever, and effectively get the behaviour of executing multiple instructions in the input
-code to execute one big step in the output language. The other trick is that at each one of these
-steps, one also incrementally builds a proof of the execution from the start of the basic block to
-the current state, so that when a control-flow instruction is encountered, this proof can be used
-show that one can jump over all the input instructions with that block.
+block has finished executing and that the next block needs to be retrieved. In addition to that,
+the matching relation is parameterised by the current basic block that is executing, so in the case
+of the lock-step transition, a control-flow instruction at the end of the basic block has been
+reached, and the matching relation is therefore populated with the new basic block that is
+jumped to. The diagram on the right is the interesting one though, which allows for the multiple
+transitions in the input language. It requires matching on a predicate $\sim_{i::b}$ which matches
+the start of the basic block in \rtlblock\ with any state in the original \rtl\ language as long as
+one has not finished executing the basic block and has not encountered a control-flow instruction.
+Then, one has to show that the block in the matching relation reduces at every step, establishing
+termination, and one also has to keep track of the proof of simulation from the beginning of
+the basic block to the current point, to then show that the final control-flow step also matches.
+
+This is done by the following two properties that are inside of the simulation relation
+$S_{2} \sim_{b} S_{1}'$. The property in \in{Equation}[eq:sem_extrap] states that executing the
+basic block $B$ that is in the code $c$ of the translated function in \rtlblock\ at the program
+counter location from an initial state $S_{1}'$ will result in the same final state as executing the
+current list of instructions that the simulation relation is parameterised by (i.e. $b$).
+
+\placeformula[eq:sem_extrap]\startformula
+ \forall S,\ \ c[S_{1}'.\mathit{pc}] = \Some{B} \land (b, S_2) \Downarrow S \implies (B, S_{1}') \Downarrow S
+\stopformula
+
+The following \in{Equation}[eq:star] then also states that there exist zero or more small steps
+in \rtl\ that can go from $S_{1}'$ to the current state in \rtl\ $S_{2}$.
+
+\placeformula[eq:star]\startformula
+ S_{1}' \rightarrow_* S_{2}
+\stopformula
+
+By also demonstrating that some metric stored in the $\sim$ relation is decreasing, one can then
+show that one will not get stuck iterating forever, and effectively get the behaviour of executing
+multiple instructions in the input code to execute one big step in the output language. The
+\in{Equation}[eq:star] also incrementally builds a proof of the execution from the start of the
+basic block to the current state, so that when a control-flow instruction is encountered, this proof
+can be used to show that one can jump over all the input instructions in that block.
\subsection[scheduling:hyperblocks]{Hyperblocks as a Generalisation of Basic Blocks}
@@ -236,7 +254,7 @@ Basic blocks are popular in intermediate languages because they describe a list
instructions that are not separated by control-flow. This information can be used by various
analysis passes to perform optimisations, as these sequences of instructions can be reordered in any
way as long as their data dependencies are met. One such optimisation is scheduling, which reorders
-instructions to eliminate of any delay slots caused by the CPU pipeline.
+instructions to eliminate any delay slots caused by the CPU pipeline.
However, due to control-flow not being allowed in basic blocks, the number of instructions that can
be scheduled at any one time is limited. The boundaries to other basic blocks act as hard boundary
@@ -245,7 +263,7 @@ algorithm. Hyperblocks~\cite[mahlke92_effec_compil_suppor_predic_execut_using_hy
instructions to be predicated, thereby introducing some control-flow information into the blocks,
which can also be taken into account by the scheduler. For example, any data dependency between
instructions that have mutually exclusive predicates can be removed, as these instructions will
-never be activate at the same time.
+never be active at the same time.
\section[implementation-of-if-conversion]{Implementation of If-Conversion}
@@ -257,18 +275,19 @@ always result in more efficient code, so it should not be applied to any conditi
statements.
From an implementation point of view, if-conversion can be implemented naïvely, however, it could be
-improved by adding better heuristics for which conditional statements to transform into predicated
-instructions. For hyperblocks, conditionals where each branch approximately execute in the same
-number of cycles are an ideal candidate for the transformation, as it means that no time is lost due
-to having to execute extra predicated instructions that are false. However, as hyperblocks are an
-extension of superblocks, single paths through the control-flow graph can also be represented. In
-our implementation, we use simple static heuristics to pick these paths~\cite[??].
+improved by adding better heuristics to determine which conditional statements to transform into
+predicated instructions. For hyperblocks, conditionals where each branch executes in approximately
+the same number of cycles are an ideal candidate for the transformation, as it means that no time is
+lost due to having to execute extra predicated instructions that are false. However, as hyperblocks
+are an extension of superblocks, single paths through the control-flow graph can also be
+represented. In our implementation, we use simple static heuristics to pick these
+paths~\cite[ball93_branc_predic_free].
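+
+The transformation itself can be sketched in OCaml with a simplified instruction type standing in
+for the \rtlblock\ instructions: the condition is first materialised into a predicate (as
+\type{RBsetpred} does), and both branches are then emitted under complementary predicates. All
+names here are hypothetical:
+
+\starttyping
+type pred = Plit of int | Pnot of pred | Pand of pred * pred
+type instr = { guard : pred option; op : string }
+
+(* Attach predicate p to an instruction, conjoining it with any existing
+   guard left over from an earlier if-conversion. *)
+let with_guard p i =
+  { i with guard = Some (match i.guard with None -> p | Some q -> Pand (p, q)) }
+
+(* Naive if-conversion of a two-armed branch into one predicated block. *)
+let if_convert p set_pred then_blk else_blk =
+  set_pred :: List.map (with_guard (Plit p)) then_blk
+            @ List.map (with_guard (Pnot (Plit p))) else_blk
+\stoptyping
+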
\subsection[proof-if-conversion]{Correctness of If-Conversion}
The correctness of the if-conversion pass does not rely on which paths are chosen to be combined.
The proof therefore does not have to cover the static heuristics that are generated, because these
-only affect the performance. Even though the translation is quite simple, the reasoning about it's
+only affect the performance. Even though the translation is quite simple, the reasoning about its
correctness does have its difficulties. We have a similar problem as before, where we sometimes
need to execute multiple steps in the input program to reach one step in the output program. The
solution is to perform a similar style proof as for the basic block generation.
@@ -280,6 +299,21 @@ if-converted section.
\section[sec:scheduling]{Scheduling Implementation}
+This section will discuss the implementation of hyperblock scheduling in Vericert, which is still a
+work in progress. The scheduling step consists of many different smaller transformations, covering
+the initial basic block generation, hyperblock construction using if-conversion, then the scheduling
+and finally the hardware generation.
+
+Vericert uses \SDC\ scheduling~\cite[cong06_sdc], which is a flexible algorithm to perform
+scheduling of instructions, especially when directly targeting hardware. It allows for flexible
+definitions of optimisation functions in terms of latencies and dependencies of the operations in
+the basic block. \HLS\ tools often use \SDC\ scheduling performed on hyperblocks to find the optimal
+schedule. The main benefit of using \SDC\ scheduling is that it generates a set of constraints and
+an objective function to be minimised, which can then be passed to a linear program solver. By
+changing the objective, different attributes of the hardware can be optimised. In addition to that,
+there have been extensions to the \SDC\ scheduling algorithm that add further constraints, allowing
+loops to also be scheduled using modulo scheduling~\cite[zhang13_sdc].
+
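+Concretely, an \SDC\ formulation assigns every operation an integer start time and turns every data
+dependency into a difference constraint between two start times; these constraints, together with a
+linear objective, are what is passed to the linear program solver. The following OCaml fragment
+only generates such constraints as strings and is purely illustrative (all names are hypothetical):
+
+\starttyping
+type dep = { src : int; dst : int }
+
+(* For a dependency src -> dst, operation dst may only start once src has
+   finished: s_dst - s_src >= latency src. *)
+let constraints ~latency deps =
+  List.map
+    (fun { src; dst } -> Printf.sprintf "s%d - s%d >= %d" dst src (latency src))
+    deps
+\stoptyping
+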
\subsection[sec:rtlblockdef]{\rtlblock\ and \rtlpar\ Intermediate Language Definition}
\lindex{\rtlblock}\lindex{\rtlpar}\in{Figure}[fig:compcert_interm] shows the intermediate languages
@@ -292,14 +326,17 @@ next sections.
\startplacefigure[location={here,none}]
\startfloatcombination[nx=2]
- \startplacefigure[reference={eq:standard},title={Syntax for instructions within a hyperblock.}]
+ \startplacefigure[reference={eq:standard},title={Syntax for instructions within a hyperblock.
+ There is a predicated exit instruction which serves as an early exit from the
+ hyperblock, thereby making hyperblocks single-entry and multiple-exit.}]
\startframedtext[frame=off,offset=none,width={0.45\textwidth}]
\startformula\startmathalignment
\NC i\ \ \eqdef \NC \ \ \text{\tt RBnop} \NR
\NC \NC |\ \ \text{\tt RBop}\ p?\ \mathit{op}\ \vec{r}\ d \NR
\NC \NC |\ \ \text{\tt RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d \NR
\NC \NC |\ \ \text{\tt RBstore}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ s \NR
- \NC \NC |\ \ \text{\tt RBsetpred}\ p?\ c\ \vec{r}\ d \NR
+ \NC \NC |\ \ \text{\tt RBsetpred}\ p?\ c\ \vec{r}\ d \NR
+ \NC \NC |\ \ \text{\tt RBexit}\ p?\ i_{\mathit{cf}} \NR
\stopmathalignment\stopformula
\stopframedtext
@@ -314,7 +351,6 @@ next sections.
\NC \NC |\ \ \text{\tt RBjumptable}\ r\ \vec{n} \NR
\NC \NC |\ \ \text{\tt RBreturn}\ r? \NR
\NC \NC |\ \ \text{\tt RBgoto}\ n \NR
- \NC \NC |\ \ \text{\tt RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \NR
\stopmathalignment \stopformula
\stopframedtext
\stopplacefigure
@@ -331,10 +367,8 @@ standard instruction which sets a predicate equal to an evaluated condition. Thi
used during if-conversion to set the predicate to the value of the condition in the conditional
statement instruction. To support proper predicated instructions, each instruction therefore also
contains an optional predicate which has been set beforehand, which produces the hyperblock. In
-additon to that, there is also an extra control-flow instruction called \type{RBpred_cf}, which can
-branch on a predicate an nests more control-flow instructions inside of it. This is also necessary
-for if-conversion, when converting an already converted conditional statement, as depending on the
-predicate, a different control-flow instruction may be necessary.
+addition to that, there is also a predicated exit instruction called \type{RBexit}, which serves as
+an early exit from the hyperblock.
\placeformula[fig:hb_def]\startformula\startmathalignment[align={1:left}]
\NC \blockbb \qquad \eqdef \qquad (\text{\tt slist}\ i) \times i_{\mathit{cf}} \NR \NC \parbb \qquad
@@ -351,18 +385,15 @@ of the basic blocks in \rtlblock\ as well as \rtlpar, and where the parallel sem
come into play. \rtlblock\ is made of a list of instructions and a control-flow instruction that
ends the hyperblock. Each instruction in \rtlblock\ is executed sequentially. \rtlpar\ is made of
three separate lists. The outer list behaves like the list in \rtlblock, and executes each list
-inside of it sequentially. \rtlpar\ then has a list which executes its contents in parallel,
-meaning each instruction uses the same state of the registers and memory while executing. Finally,
-instead of directly containing instructions, there is an additional layer of sequentially executing
-lists of instructions, to be able to execute small groups of instructions in sequence, but still in
-parallel to the other instructions in the current parallel group.
-
-\rtlpar\ is structured in this way so that the scheduler has more flexibility, so that it can not
-only parallelise operations in the basic blocks, but that it can also define sequential operations
-in these parallel operations. This is especially useful for the hardware back end, as it means that
-sequential operations that do not quite fill one clock cycle can be executed sequentially, but can
-still be executed in parallel with other independent operations. This optimisation is called
-operation chaining, and is critical to get the most performance out of the hardware.
+inside of it sequentially. \rtlpar\ then has a list which executes its contents in parallel, meaning
+each instruction uses the same state of the registers and memory while executing. Finally, instead
+of directly containing instructions, there is an additional layer of sequentially executing lists of
+instructions, to be able to execute small groups of instructions in sequence, but still in parallel
+to the other instructions in the current parallel group. The latter construct enables operation
+chaining to be expressed, without having to create new operations for each type of chaining that
+could occur. Operation chaining is critical to get the most performance out of the hardware, as it
+fills the slack that may be available within one clock cycle when instructions with a larger latency
+are also being scheduled.
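+
+This nesting can be summarised by a simple type; the following OCaml sketch is only illustrative,
+as the Coq definition will differ in its details:
+
+\starttyping
+(* Innermost list: chained instructions, executed in sequence within a
+   cycle.  Middle list: groups executed in parallel from the same starting
+   state.  Outer list: cycles, executed sequentially.  The block as a whole
+   is additionally terminated by a control-flow instruction. *)
+type 'i par_bb = 'i list list list
+\stoptyping
+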
\subsection[translating-rtlblock-to-rtlpar]{Translating \rtlblock\ to \rtlpar}
@@ -384,7 +415,7 @@ operation chaining, and is critical to get the most performance out of the hardw
clock cycles.}]
\startfloatcombination[ny=2,nx=2]
\startplacesubfigure[title={RTLBlock hyperblock to be scheduled.}]
- \startframedtext[frame=off,offset=none,width={0.43\textwidth},style=\rmx]
+ \startframedtext[frame=off,offset=none,width={0.43\textwidth},style=\rmxx]
\starthlC
/BTEX \sI/ETEX r2 = r1 + r4;
/BTEX \sII/ETEX if(p1) r1 = r2 + r4;
@@ -395,7 +426,7 @@ operation chaining, and is critical to get the most performance out of the hardw
\stopframedtext
\stopplacesubfigure
\startplacesubfigure[title={Scheduled RTLPar hyperblock.}]
- \startframedtext[frame=off,offset=none,width={0.53\textwidth},style=\rmx]
+ \startframedtext[frame=off,offset=none,width={0.53\textwidth},style=\rmxx]
\starthlC
1: [r2 = r1 + r4; if(p1) r1 = r2 + r4];
[if(!p2 && !p1) r3 = r1 * r1]
@@ -439,9 +470,8 @@ scheduling step comes right after the if-conversion step which originally create
\rtlblock\ hyperblock being scheduled is shown in \in{Figure}{a}[fig:op_chain], which contains five
predicated operations, comprising two additions and three multiplications. The data dependencies of
the instructions in the hyperblock are shown in \in{Figure}{c}[fig:op_chain]. Curiously, even though
-operations and are normally dependent on each other due to a write-after-write conflict, but because
-the predicates are independent the write-after-write conflict can effectively be removed by the
-scheduler.
+operations may normally be dependent on each other due to a data conflict, the dependency can be
+removed when the predicates are independent.
The scheduler transforms the \rtlblock\ hyperblock into the \rtlpar\ hyperblock shown in
\in{Figure}{b}[fig:op_chain]. Even though the addition in \sII\ is dependent on \sI, they can still
@@ -452,14 +482,14 @@ next clock cycle, either the multiplication in \sIV\ can take place, or the mult
will take place, meaning these two multiplications can also be placed in the same clock cycle. This
gives the final schedule that is shown in \in{Figure}{d}[fig:op_chain].
-\section[abstr_interp]{Abstract Interpretation of Hyperblocks}
+\subsection[abstr_interp]{Abstract Interpretation of Hyperblocks}
\index{symbolic execution}\seeindex{abstract interpretation}{symbolic execution}The correctness of
the scheduling algorithm is checked by comparing the symbolic expressions of each register before
and after scheduling. This section describes how these symbolic expressions are constructed for each
hyperblock.
-\subsection[abstr_language]{Construction of Abstract Language}
+\subsubsection[abstr_language]{Construction of the Abstract Language}
The main difficulty when constructing \index{symbolic expression}symbolic expressions for
hyperblocks is the presence of predicates, which can be used by the scheduling algorithm to reorder
@@ -526,20 +556,6 @@ primitive is multiplication of two predicated types, which is implemented as per
multiplication between the predicated types in the two lists, conjoining the two predicates and joining
the two types of each list using a function.
-\placeformula[eq:1]\startformula
- P_{1} \otimes_{f} P_{2} \equiv \text{\tt map } (\lambda ((p_{1}, e_{1}), (p_{2}, e_{2})) . (p_{1}
- \land p_{2}, f\ e_{1}\ e_{2})) P_{1} \times P_{2} \stopformula
-
-In addition to that, another primitive that is needed is the following append operation, which will
-negate the combination of all predicates in the second predicated type, and it to the first
-predicated type and append the first predicated type to the second:
-
-\placeformula[eq:2]\startformula
- \mu(p, P) \equiv \text{\tt map } (\lambda (p', e') . (p \land p', e'))\ P \stopformula
-
-\placeformula[eq:3]\startformula
- P_{1} \oplus_{p} P_{2} \equiv \mu(\neg p, P_{1}) + \mu(p, P_{2}) \stopformula
-
\subsection[example-of-translation]{Example of translation}
\placeformula[eq:4]\startformula \startmathalignment[n=1,align={1:left}]
@@ -573,20 +589,21 @@ expressions and applying the update function defined in \in{Section}[abstr_langu
predicated expressions is guaranteed to have predicates which are mutually exclusive, meaning that
if one predicate evaluates to true, all other predicates must evaluate to false.
-\subsection[linear-flat-predicated-expressions]{Linear (Flat) Predicated Expressions}
+\subsubsection[linear-flat-predicated-expressions]{Linear (Flat) Predicated Expressions}
-\index{symbolic expression}The update function for predicated expressions is quite a bit more
-complex compared to standard update functions for symbolic execution, such as the one used by . The
-main reason for this is the non-recursive nature of the predicated expressions. As the predicates
-are only available at the top-level of the expressions, combining multiple expressions means that
-the conditions and multiple possibilities also need to trickle up to the top.
+\index{symbolic expression}The update function for predicated expressions is substantially more
+complex compared to the standard update functions for symbolic execution, such as the one used by
+\cite[entry][tristan08_formal_verif_trans_valid]. The main reason for this is the non-recursive
+nature of the predicated expressions. As the predicates are only available at the top-level of the
+expressions, combining multiple expressions means that the conditions and multiple possibilities
+also need to trickle up to the top.
Instead, if the predicated instructions were recursive, one would not have to change any predicates
when evaluating the argument lists of instructions, for example, as these can just stay
predicated. The evaluation semantics would also be much simpler, because a rule can be added for
evaluating predicates.
-\section[sec:correctness]{Correctness of the Abstract Interpretation}
+\subsection[sec:correctness]{Correctness of the Abstract Interpretation}
The correctness of the scheduling step is dependent on the correctness of the equivalence check
between the two abstract languages. As the abstract languages contain predicates, this comparison
@@ -594,7 +611,7 @@ check cannot just be a syntactic comparison between the two languages, because t
be different but logically equivalent. Instead, a mixed equivalence check is performed using a
verified SAT solver and a syntactic equivalence check of the actual symbolic expressions.
-\subsection[comparing-symbolic-abstract-expressions]{Comparing Symbolic Abstract Expressions}
+\subsubsection[comparing-symbolic-abstract-expressions]{Comparing Symbolic Abstract Expressions}
Abstract expressions comprise a predicate and a symbolic expression. The translation from \rtlblock\
and \rtlpar\ to \abstr\ ensures that each register will contain a predicated symbolic expression
@@ -615,11 +632,11 @@ Instead, comparing predicates can be done by checking logical equivalence of the
formally verified \SAT\ solver. This ensures that the predicates will always evaluate to the same
value, thereby making it a fair comparison.
-\subsection[verified_sat]{Verified Decidable DPLL SAT Solver}
+\subsubsection[verified_sat]{Verified Decidable DPLL SAT Solver}
A SAT solver is needed to correctly and reliably compare predicated expressions. This makes it
possible to then compare many different possible schedules, even if they remove expressions
-completely if these are unnecessary. The SAT solver is needed in the verifier, which needs to be
+completely if these are unnecessary. The SAT solver is required by the verifier, which needs to be
proven correct in Coq to prove the scheduling translation correct. The SAT solver therefore has to
be verified, because it could otherwise not be used in the proof of correctness.
@@ -633,45 +650,47 @@ implies that the check is decidable, simplifying many proofs in Coq by using a \
instance for equivalence between predicates. This makes rewriting predicates if they are equivalent
much easier in proofs by reusing many of the existing Coq tactics.
-\subsection[encoding-predicated-expressions-as-sat-formulas]{Encoding Predicated Expressions as SAT
+\subsubsection[encoding-predicated-expressions-as-sat-formulas]{Encoding Predicated Expressions as SAT
Formulas}
Given the following two predicated expressions which should be tested for equivalence, one can
construct a SAT expression which will ensure that both expressions will always result in the same
value.
-%\startformula \begin{aligned}
-% \bgroup \left.\begin{aligned}
-% e_{1},\qquad &p_{1}\\
-% e_{2},\qquad &p_{2}\\
-% \vdots\quad& \\
-% e_{n},\qquad &p_{n}\\
-% \end{aligned}\right\rbrace\egroup
-% \Longleftrightarrow
-% \begin{cases}
-% e_{1}',\quad &p_{1}'\\
-% e_{2}',\quad &p_{2}'\\
-% \;\;\;\quad\vdots \notag \\
-% e_{m}',\quad &p_{m}'\\
-% \end{cases}\end{aligned} \stopformula
+\startformula \bgroup \left.\startmatrix
+ \NC e_{1},\quad \NC p_{1} \NR
+ \NC e_{2},\quad \NC p_{2}\NR
+ \NC \quad \vdots \NC \NR
+ \NC e_{n},\quad \NC p_{n}\NR
+ \stopmatrix\right\rbrace\egroup
+ \Longleftrightarrow
+ \startmathcases
+ \NC e_{1}',\NC p_{1}' \NR
+ \NC e_{2}',\NC p_{2}' \NR
+ \NC \quad \vdots \NC \NR
+ \NC e_{m}',\NC p_{m}'\NR
+ \stopmathcases \stopformula
The SAT query that should determine if two predicated expressions are equivalent is the following,
where $\mathit{he}_{i}$ is the hashed expression of $e_{i}$:
-%\startformula \begin{gathered}
-% (p_{1} \implies \mathit{he}_{1}) \land (p_{2} \implies \mathit{he}_{2}) \land \cdots \land (p_{n} \implies \mathit{he}_{n}) \\
-% \Longleftrightarrow\\
-% (p_{1}' \implies \mathit{he}_{1}') \land (p_{2}' \implies \mathit{he}_{2}') \land \cdots \land (p_{m}' \implies \mathit{he}_{m}') \\\end{gathered} \stopformula
+\startformula \startmathalignment[align={1:middle}]
+ \NC (p_{1} \implies \mathit{he}_{1}) \land (p_{2} \implies \mathit{he}_{2}) \land \cdots \land (p_{n}
+ \implies \mathit{he}_{n}) \NR
+ \NC \Longleftrightarrow \NR
+ \NC (p_{1}' \implies \mathit{he}_{1}') \land (p_{2}' \implies \mathit{he}_{2}') \land \cdots \land
+ (p_{m}' \implies \mathit{he}_{m}') \NR
+\stopmathalignment \stopformula
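+
+A sketch of how this query could be constructed is shown below, assuming that each expression
+$e_{i}$ has already been hashed to a propositional literal and that the formula type is as given
+(all names are hypothetical):
+
+\starttyping
+type fmla = T | Lit of int | Not of fmla | And of fmla * fmla | Iff of fmla * fmla
+
+let implies a b = Not (And (a, Not b))
+
+(* One side of the query: (p1 => he1) /\ ... /\ (pn => hen), where each
+   predicated expression is a pair of a predicate and a hashed literal. *)
+let encode side =
+  List.fold_left (fun acc (p, he) -> And (acc, implies p (Lit he))) T side
+
+let query lhs rhs = Iff (encode lhs, encode rhs)
+\stoptyping
+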
However, reasoning about the equivalence of individual expressions within the greater list of
-predicated expressions is not quite straightforward. The evaluation semantics of predicated
-expression list will evaluate each predicate expression sequentially, and pick the expression whose
-predicate evaluates to true. However, proving that both predicated expression lists will return the
-same value assuming that the SAT query above is true, means that for each predicate that is true,
-One needs to be able to show that there is a predicate in the other list that will also be true,
-which will return the same value. However, there is no guarantee that there is a unique predicated
-expression that will match the current predicate, as there may be duplicate expressions in the list
-with different predicates.
+predicated expressions is not quite straightforward. The evaluation semantics of a predicated
+expression list is the sequential evaluation of each predicated expression, picking the expression
+whose predicate evaluates to true. However, proving that two predicated expression lists return
+the same value, assuming that the SAT query above holds, means that for each predicate that is
+true, one needs to show that there is a predicate in the other list that is also true and whose
+expression returns the same value. However, there is no guarantee that a unique predicated
+expression matches the current predicate, as there may be duplicate expressions in the list with
+different predicates.
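+
+A sketch of this evaluation order, with a hypothetical valuation \type{holds} deciding each
+predicate, might look as follows:
+
+\starttyping
+(* Try the predicates in order; the first that holds selects the
+   expression, so later duplicates are shadowed by earlier entries. *)
+let rec eval_list (holds : pred -> bool) (pe : pred_expr) : expr option =
+  match pe with
+  | [] -> None
+  | (p, e) :: rest -> if holds p then Some e else eval_list holds rest
+\stoptyping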
This means that it is quite difficult to prove that the SAT query implies the semantic preservation
of the scheduling transformation. The SAT query can therefore be broken down into a slightly more
@@ -694,37 +713,29 @@ grained comparison are the following:
This comparison is logically equivalent to performing the large SAT query; however, it maps better
to the proof because it compares the predicates of the same expressions with each other.
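+
+One possible shape of such a per-expression check is sketched below: for each expression, the
+disjunction of all predicates guarding it in one list is tested for equivalence against the
+corresponding disjunction in the other list, which also accounts for duplicate expressions. This
+is only an illustration, not the exact comparison used in the development:
+
+\starttyping
+(* Disjunction of all predicates guarding expression e in pe. *)
+let guard_of (e : expr) (pe : pred_expr) : pred =
+  List.fold_left
+    (fun acc (p, e') -> if e' = e then Por (acc, p) else acc)
+    (Pnot Ptrue) pe
+
+(* Compare the guards of every expression occurring in either list. *)
+let equiv_per_expr (a : pred_expr) (b : pred_expr) : bool =
+  List.for_all (fun (_, e) -> equiv (guard_of e a) (guard_of e b)) (a @ b)
+\stoptyping
+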
-\subsection[proof-of-equivalence-between-passes]{Proof of equivalence between passes}
+\subsection[proof-of-correctness-of-scheduling]{Proof of Correctness of Scheduling}
-\in{Figure}[fig:compcert_interm] shows the four different passes that were added to CompCert to form
-hyperblocks. This section will cover the proofs for each of the translations, focusing mainly on
-the proof of translation of the scheduling pass.
-
-\subsubsection[proof-of-correctness-of-the-if-conversion-pass]{Proof of Correctness of the
- If-Conversion Pass}
-
-The if-conversion pass is quite simple to verify, as it is an optimisation that transforms the code
-in the same language.
-
-\subsubsection[proof-of-correctness-of-scheduling]{Proof of Correctness of Scheduling}
-
-The scheduling pass is formally verified using translation validation, using the \abstr\ language as
+The scheduling pass is formally verified via translation validation, using the \abstr\ language as
a point of comparison between \rtlblock\ and \rtlpar. Using the algorithm defined in
\in{Section}[abstr_interp], the formally verified SAT solver described in \in{Section}[verified_sat]
-can be used prove that if the two predicated expression lists are equivalent, that they will always
-evaluate to the same value.
-
-\section[implementation-details-of-static-sdc-scheduling]{\SDC\ Scheduling Implementation Details}
-
-This section describes the implementation details of the static \SDC\ scheduling algorithm used, and
-also describes some of the details of the basic block generation as well as the if-conversion pass
-which generates the predicated instructions.
-
-\subsection[static-sdc-scheduling]{Static \SDC\ Scheduling}
-
-\SDC\ scheduling~\cite[cong06_sdc] is a flexible algorithm to perform scheduling of instructions,
-especially if the target is hardware directly. It allows for flexible definitions of optimisation
-functions in terms of latencies and dependencies of the operations in the basic block.
+can be used to prove that if the two predicated expression lists are equivalent, then they will
+always evaluate to the same value. This follows directly from the proofs of correctness for the
+translation to \abstr. The proof that two \abstr\ programs behave identically whenever the
+SAT-based comparison between them succeeds can then be composed with the proofs of correctness of
+the generation of \abstr.
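+
+The comparison at the heart of the validator might be sketched as follows, assuming the \abstr\
+value of each register is kept in an association list keyed by hypothetical integer register
+names:
+
+\starttyping
+(* Every register's predicated expression list computed from the
+   RTLBlock code must be equivalent to the one computed from the
+   RTLPar code. *)
+let check_abstr (a : (int * pred_expr) list)
+                (b : (int * pred_expr) list) : bool =
+  List.for_all
+    (fun (r, pe) ->
+       match List.assoc_opt r b with
+       | Some pe' -> equiv_per_expr pe pe'
+       | None -> false)
+    a
+\stoptyping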
+
+\section[scheduling:conclusion]{Conclusion}
+
+In general, the proof of correctness for scheduling was the most challenging: as the scheduling
+algorithm evolved, the language semantics of \rtlblock\ and \rtlpar\ evolved with it, which in
+turn required large changes to the proofs of correctness for each step of the scheduling proof.
+In addition, the whole scheduling proof, from \rtlblock\ generation to hardware generation,
+required markedly different approaches to the correctness proofs at each step. Even simple
+translations like basic block generation and the if-conversion pass required several insights into
+how the algorithms actually work and how the semantics behave in order to come up with matching
+relations that are powerful enough to prove the final correctness result, yet still provable in
+the first place. Simply formulating these relations requires a lot of experimentation.
\startmode[chapter]
\section{Bibliography}
diff --git a/lsr_env.tex b/lsr_env.tex
index a97e380..3fe0321 100644
--- a/lsr_env.tex
+++ b/lsr_env.tex
@@ -139,6 +139,8 @@
[footnote]
[alternative=serried]
+\setupfootnotes[align={stretch,verytolerant,hz,hanging,hyphenated}]
+
\defineoverlay[chapterbackground][\useMPgraphic{chapterbackground}]
\setupbackgrounds[page][background=chapterbackground]
@@ -213,8 +215,8 @@
\setupcaptions[figure][
location={high,rightmargin},
width=\rightmarginwidth,
- align={yes, tolerant},
- style=\rmx,
+ align={verytolerant,hz,hanging,hyphenated},
+ style={\rmx\setupinterlinespace[reset]},
]
\setupfloat[figure][default=top,location=middle]
@@ -224,7 +226,7 @@
numberconversion=a,
prefix=no,
way=bytext,
- style=\rmx,
+ style={\rmx\setupinterlinespace[reset]},
]
\setuplabeltext[subfigure=]
@@ -472,8 +474,6 @@
[before={\blank[big]\startnarrower\startstyle[it]},
after={\stopstyle\stopnarrower\blank[big]\indenting[next]}]
-\setupfootnotes[align={stretch,verytolerant,hz,hanging,hyphenated}]
-
\define[1]\lindex{\index{intermediate language+#1}}
\define[1]\oindex{\index{optimisation+#1}}
\define[1]\pindex{\index{proof+#1}}
@@ -492,4 +492,6 @@
\define\htl{{\sc Htl}}
\define\abstr{{\sc Abstr}}
+\define[1]\Some{\left\lfloor #1 \right\rfloor}
+
\stopenvironment
diff --git a/lsr_refs.bib b/lsr_refs.bib
index 992410d..aabc116 100644
--- a/lsr_refs.bib
+++ b/lsr_refs.bib
@@ -1410,3 +1410,34 @@
year = {2013}
}
+@inproceedings{zhang13_sdc,
+ abstract = {Modulo scheduling is a popular technique to enable pipelined execution of successive loop iterations for performance improvement. While a variety of modulo scheduling algorithms exist for software pipelining, they are not amenable to many complex design constraints and optimization goals that arise in the hardware synthesis context. In this paper we describe a modulo scheduling framework based on the formulation of system of difference constraints (SDC). Our framework can systematically model a rich set of performance constraints that are specific to the hardware design. The scheduler also exploits the unique mathematical properties of SDC to carry out efficient global optimization and fast incremental update on the constraint system to minimize the resource usage of the synthesized pipeline. Experiments demonstrate that our proposed technique provides efficient solutions for a set of real-life applications and compares favorably against a widely used lifetime-sensitive modulo scheduling algorithm.},
+ author = {Zhang, Z. and Liu, B.},
+ url = {https://doi.org/10.1109/ICCAD.2013.6691121},
+ booktitle = {2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)},
+ doi = {10.1109/ICCAD.2013.6691121},
+ issn = {1558-2434},
+ keywords = {high level synthesis;pipeline processing;scheduling;SDC-based modulo scheduling;pipeline synthesis;hardware design;mathematical properties;global optimization;incremental update;Schedules;Pipeline processing;Registers;Optimal scheduling;Scheduling algorithms;Timing},
+ month = nov,
+ pages = {211--218},
+ title = {SDC-based modulo scheduling for pipeline synthesis},
+ year = {2013}
+}
+
+@inproceedings{ball93_branc_predic_free,
+ keywords = {if-conversion},
+ author = {Ball, Thomas and Larus, James R.},
+ title = {Branch Prediction for Free},
+ year = {1993},
+ isbn = {0897915984},
+ publisher = {Association for Computing Machinery},
+ address = {New York, NY, USA},
+ url = {https://doi.org/10.1145/155090.155119},
+ doi = {10.1145/155090.155119},
+ abstract = {Many compilers rely on branch prediction to improve program performance by identifying frequently executed regions and by aiding in scheduling instructions.Profile-based predictors require a time-consuming and inconvenient compile-profile-compile cycle in order to make predictions. We present a program-based branch predictor that performs well for a large and diverse set of programs written in C and Fortran. In addition to using natural loop analysis to predict branches that control the iteration of loops, we focus on heuristics for predicting non-loop branches, which dominate the dynamic branch count of many programs. The heuristics are simple and require little program analysis, yet they are effective in terms of coverage and miss rate. Although program-based prediction does not equal the accuracy of profile-based prediction, we believe it reaches a sufficiently high level to be useful. Additional type and semantic information available to a compiler would enhance our heuristics.},
+ booktitle = {Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation},
+  pages = {300--313},
+ numpages = {14},
+ location = {Albuquerque, New Mexico, USA},
+ series = {PLDI '93}
+}
diff --git a/title.tex b/title.tex
index 104d621..03e131a 100644
--- a/title.tex
+++ b/title.tex
@@ -28,6 +28,7 @@
\starttabulate[format={|l|l|}]
\NC \bold{Supervisor} \NC John Wickerson \NC \NR
\NC \bold{Examiner} \NC George A. Constantinides \NC \NR
+ \NC \bold{Examiner} \NC Philippa Gardner \NC \NR
\NC \bold{CID} \NC 01062783 \NC \NR
\stoptabulate
\stopalignment