First realistic draft

author: Yann Herklotz <git@yannherklotz.com> 2022-05-05 17:08:12 +0100
committer: Yann Herklotz <git@yannherklotz.com> 2022-05-05 17:08:12 +0100
commit: c9648f2dc56a0c970a90e1539925ae78b0610fbc (patch)
tree: 41a7934ad8d327684e33038b0e7f63d7bfe407dc
parent: b90e207d5a9d1871d5ecde168d39ce91c7691fe1 (diff)
download: lsr22_fvhls-c9648f2dc56a0c970a90e1539925ae78b0610fbc.tar.gz
lsr22_fvhls-c9648f2dc56a0c970a90e1539925ae78b0610fbc.zip
7 files changed, 436 insertions, 157 deletions
diff --git a/Makefile b/Makefile
index 98d14cc..282a3bb 100644
--- a/Makefile
+++ b/Makefile
@@ -4,12 +4,14 @@ MODE ?= main
 
 all: main.pdf
 
-main.pdf: figures/timing-1.pdf figures/timing-2.pdf
+main.pdf: figures/timing-1.pdf figures/timing-2.pdf figures/timing-3.pdf
 
 figures/%.pdf: figures/%.tex
 	latexmk -pdf -shell-escape $<
 	cp $(notdir $@) $@ || true
 
+chapters/scheduling.pdf: figures/timing-3.pdf
+
 # silent structure,structures,pages,resolvers,open source,close source,loading,modules
 %.pdf: %.tex
 	context --mode=$(MODE) --nonstopmode --silent='*' $<
diff --git a/chapters/pipelining.tex b/chapters/pipelining.tex
index 09fae11..b0bdb20 100644
--- a/chapters/pipelining.tex
+++ b/chapters/pipelining.tex
@@ -168,9 +168,10 @@ all, the proof from the paper was actually never fully completed:
   location=middle,
   width=\dimexpr \textwidth - 1cm \relax,
 ]
-  {\noindent\it [Semantic preservation going from unrolled loops in a basic-block representation to
-   the actual loops in a CFG] is intuitively obvious but surprisingly tedious to formalize in full
-   details: indeed, this is the only result in this paper that we have not yet mechanized in Coq.}
+  {\noindent\it [Semantic preservation going from unrolled loops in a basic block representation to
+      the actual loops in a CFG] is intuitively obvious but surprisingly tedious to formalize in
+    full details: indeed, this is the only result in this paper that we have not yet mechanized in
+    Coq.}
 
   \startalignment[flushright]
     --- \cite[alternative=authoryears,right={, Section
diff --git a/chapters/scheduling.tex b/chapters/scheduling.tex
index 3f4acd6..dcbb409 100644
--- a/chapters/scheduling.tex
+++ b/chapters/scheduling.tex
@@ -28,10 +28,10 @@ tools, which generate application specific hardware designs based on a software
 scheduling algorithms often work on basic blocks, which are sequential lists of instructions without
 any branches. However, this limits what kind of reordering the scheduling algorithm can perform. It
 can never move instructions past a branching instruction, even when this might drastically improve
-the performance. Trace scheduling~ is an alternative that addresses this issue, by creating paths
-that might cross branching instructions, and scheduling the instructions in that path. However, the
-main problem with trace scheduling is that it is often infeasible on large programs, and because of
-the large design space, it's sometimes hard to find an optimal
+the performance. Trace scheduling~\cite[fisher81_trace_sched] is an alternative that addresses this
+issue, by creating paths that might cross branching instructions, and scheduling the instructions in
+that path. However, the main problem with trace scheduling is that it is often infeasible on large
+programs, and because of the large design space, it's sometimes hard to find an optimal
 schedule. \index{superblock}Superblock~\cite[hwu93_super] and
 \index{hyperblock}hyperblock~\cite[mahlke92_effec_compil_suppor_predic_execut_using_hyper]
 scheduling are two subsets of trace scheduling, which compromise on the flexibility of trace
@@ -47,51 +47,189 @@ scheduling instructions for custom hardware.
 
 \subsection[key-points]{Key Points}
 
-The goal of this paper is to introduce hyperblock scheduling into CompCert to allow for advanced and
-flexible scheduling operations that could be beneficial to various different back ends. To this end,
-the key points of this paper are the following:
+The goal of this chapter is to introduce hyperblock scheduling into CompCert to allow for advanced
+and flexible scheduling operations that could be beneficial to various different back ends. To this
+end, the key points of this paper are the following:
 
 \startitemize
 \item Description of how hyperblocks are introduced for improved scheduling of instructions, and how
-  these can be built and translated back to instructions without predicates.
-\item Description of the implementation and correctness proof of a \SDC\ scheduling algorithm, which
-  can be adapted to various different scheduling strategies.
+  these can be built and translated to hardware.
+\item Description of the implementation and correctness proof of a \SDC\ scheduling algorithm.  The
+  correctness proof does not rely on the algorithm itself and so can safely be replaced by a
+  different scheduling algorithm with similar properties.
 \stopitemize
 
-\section[sec:scheduling]{Implementation of Hyperblocks in CompCert}
-
-\index{hyperblock}This section describes the structure of hyperblocks in
-\in{Section}[scheduling:hyperblocks]. Then, the structure of two extra intermediate languages that
-were added to CompCert to implement hyperblocks are also shown in \in{Section}[sec:rtlblockdef].
+\section{Progress Summary}
 
 \startplacemarginfigure[reference={fig:compcert_interm},title={New intermediate languages introduced into
-    CompCert.}]
+    CompCert.  Progress bars are shown for each transformation pass, where green represents the
+    estimated part of the work that is completed, and red stands for the remaining work to be done.}]
   \hbox{\starttikzpicture[shorten >=1pt,>=stealth]
     \node       at (-3, 0) (dotsin) {$\cdots$};
     \node       at (10, 0) (dotsout) {$\cdots$};
-    \node[draw] at (-1, 0) (rtl) {\tt RTL};
-    \node[draw] at (2,  0) (rtlblock) {\tt RTLBlock};
-    \node[draw] at (2, -2) (abstrblock) {\tt Abstr};
-    \node[draw] at (6,  0) (rtlpar) {\tt RTLPar};
-    \node[draw] at (6, -2) (abstrpar) {\tt Abstr};
-    \node[draw] at (8,  0) (htl) {\tt HTL};
-
-    \draw[->] (rtl) -- node[above,midway,align=center,font=\small]
-         {basic-block \\[-0.3em] creation} (rtlblock);
-    \draw[->] (rtlblock) -- node[above,midway,font=\small] {scheduling} (rtlpar);
-    \draw[->] (rtlpar) -- (htl);
-    \draw[->,dashed] (rtlblock) -- (abstrblock);
-    \draw[->,dashed] (rtlpar) -- (abstrpar);
-    \draw[dashed,shorten >=0pt] (abstrblock) -- node[above,midway] {$\sim$} (abstrpar);
+    \node[draw] at (-1, 0) (rtl) {\rtl};
+    \node[draw] at (2,  0) (rtlblock) {\rtlblock};
+    \node[draw] at (2, -2) (abstrblock) {\abstr};
+    \node[draw] at (6,  0) (rtlpar) {\rtlpar};
+    \node[draw] at (6, -2) (abstrpar) {\abstr};
+    \node[draw] at (8.5,  0) (htl) {\htl};
+
+    \draw[->] (rtl) --
+         node[above,midway,align=center,font=\small] {basic block \\[-0.3em] creation}
+         node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=VenetianRed,minimum width=1.5cm] {}
+         node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=0.6cm] {}
+         (rtlblock);
+    \draw[->] (rtlblock) --
+         node[above,midway,font=\small] {scheduling}
+         node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=1.5cm] {}
+         (rtlpar);
+    \draw[->] (rtlpar) --
+         node[below,yshift=-3mm,xshift=-5mm,anchor=west,fill=VenetianRed,minimum width=1cm] {}
+         node[below,yshift=-3mm,xshift=-5mm,anchor=west,fill=ForestGreen,minimum width=2.5mm] {}
+         (htl);
+    \draw[->,dashed] (rtlblock) --
+         node[left,yshift=-5mm,xshift=-3mm,anchor=south,fill=VenetianRed,minimum height=1cm] {}
+         node[left,yshift=-5mm,xshift=-3mm,anchor=south,fill=ForestGreen,minimum height=8mm] {}
+         (abstrblock);
+    \draw[->,dashed] (rtlpar) --
+         node[right,yshift=-5mm,xshift=3mm,anchor=south,fill=VenetianRed,minimum height=1cm] {}
+         node[right,yshift=-5mm,xshift=3mm,anchor=south,fill=ForestGreen,minimum height=6mm] {}
+         (abstrpar);
+    \draw[dashed,shorten >=0pt] (abstrblock) --
+         node[above,midway] {$\sim$}
+         node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=VenetianRed,minimum width=1.5cm] {}
+         node[below,yshift=-3mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=1.2cm] {}
+         (abstrpar);
     \draw[->] (rtlblock) to [out=130,in=50,loop,looseness=10]
-    node[above,midway,font=\small] {if-conversion} (rtlblock);
+         node[above,midway,font=\small] {if-conversion}
+         node[below,yshift=6mm,xshift=-0.75cm,anchor=west,fill=VenetianRed,minimum width=1.5cm] {}
+         node[below,yshift=6mm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=0.2cm] {}
+         (rtlblock);
     \draw[->] (rtlpar) to [out=130,in=50,loop,looseness=10]
-    node[above,midway,align=center,font=\small] { operation \\[-0.3em] pipelining} (rtlpar);
+         node[above,midway,align=center,font=\small] { operation \\[-0.3em] pipelining}
+         node[below,yshift=1cm,xshift=-0.75cm,anchor=west,fill=VenetianRed,minimum width=1.5cm] {}
+         node[below,yshift=1cm,xshift=-0.75cm,anchor=west,fill=ForestGreen,minimum width=0.2cm] {}
+         (rtlpar);
     \draw[->] (dotsin) -- (rtl);
     \draw[->] (htl) -- (dotsout);
   \stoptikzpicture}
 \stopplacemarginfigure
 
+This chapter is describing work that has not been fully completed yet.
+\in{Figure}[fig:compcert_interm] describes the current progress on implementing the scheduling pass,
+which relies heavily on the basic block generation, the if-conversion pass, as well as the \htl\
+generation.  The operation pipelining pass is not that critical to scheduling, but allows for the
+support of pipelined operations such as dividers.
+
+In general, the implementation of all the passes is now complete, and so current work is being done
+to complete the correctness proofs of all the passes.  Starting at the basic block creating pass,
+which is further described in \in{Section}[sec:basic-block-generation], the proof is quite complex
+for generating basic blocks.  However, the main idea of how to complete the proof has now been
+understood, so it should just be a matter of time until it is implemented.  Secondly, the
+if-conversion pass has only been implemented and has not been verified yet.  However, it should be
+straightforward to verify it, as the transformation is quite direct.  Next, for the scheduling pass
+the verification translates the input and the output of the scheduling algorithm into the \abstr\
+language and then compares the two languages has mainly been completed.  However, due to changes in
+the \rtlblock\ and \rtlpar\ language semantics, some of these proofs will have to be redone, which
+should not take too long though.  Finally, for the generation of \htl\ from \rtlpar\ the proof has
+not been started yet.  However, the difficult parts of the proof are the translation of loads and
+stores into proper reads and writes from RAM, which has already been completely proven for the \rtl\
+to \htl\ translation.
+
+The operation pipelining pass has only been implemented until now, however, it will likely be
+removed from the current workflow if it proves to take too long to complete.
+
+\section[sec:basic-block-generation]{Basic Block Generation}
+
+\index{basic block}The basic block generation proof is surprisingly difficult, as the semantics of
+\rtl\ and \rtlblock\ do not match nicely.  This is because \rtl\ executes each of its instructions
+sequentially in a \CFG.  \rtlblock, however, executes one basic block in one go, so it is not
+possible to refer to the execution of only one instruction.  The traditional simulation diagram for
+proofs in CompCert looks like the following:
+
+\placefigure[here,none][]{}{\hbox{\starttikzpicture[shorten >=1pt,>=stealth]
+  \node (s2) at (0,0) {$S_{2}$};
+  \node (s1) at (0,2) {$S_{1}$};
+  \node (s2p) at (5,0) {$S_{2}'$};
+  \node (s1p) at (5,2) {$S_{1}'$};
+  \draw (s1) -- node[above] {$\sim$} (s1p);
+  \draw[dashed] (s2) -- node[above] {$\sim$} (s2p);
+  \draw[->] (s1) -- (s2);
+  \draw[->,dashed] (s1p) -- node[right] {$+$} (s2p);
+\stoptikzpicture}}
+
+Which essentially says that given a step in the input language which goes from state $S_{1}$ to
+$S_{2}$ in one step, there should be one or more steps that go from the matching state $S_{1}'$ in
+the translated language to a state $S_{2}'$ which matches with $S_{2}$.  Most the proofs in CompCert
+follow this pattern, as the languages normally get refined over time and produce more instructions
+than before.
+
+However, when generating basic blocks one wants a slightly different simulation diagram:
+
+\placefigure[here,none][]{}{\hbox{\starttikzpicture[shorten >=1pt,>=stealth]
+  \node (s2) at (0,0) {$S_{2}$};
+  \node (s1) at (0,2) {$S_{1}$};
+  \node (s2p) at (5,0) {$S_{2}'$};
+  \node (s1p) at (5,2) {$S_{1}'$};
+  \draw (s1) -- node[above] {$\sim$} (s1p);
+  \draw[dashed] (s2) -- node[above] {$\sim$} (s2p);
+  \draw[->] (s1) -- node[left] {$+$} (s2);
+  \draw[->,dashed] (s1p) -- (s2p);
+\stoptikzpicture}}
+
+This says that assuming one has performed multiple steps in the new language, from matching states
+$S_{1}$ and $S_{1}'$, then there should be one step in the translated language that goes from
+$S_{1}'$ to $S_{2}'$ which should then match $S_{2}$.  This then matches the semantics of how basic
+blocks are executing, as multiple instructions will be grouped into one.  However, in that form such
+a lemma is impossible to even express in CompCert's framework, as the proof is inductive over the
+semantics of the input language (i.e. executing one instruction at a time).
+
+The main trick that needed to be found, and which is what
+\cite[authoryears][six22_formal_verif_super_sched] did to generate superblocks, is to separate the
+above simulation diagram into the following two diagrams, where each diagram now only has one
+transition in the left hand side.
+
+\startplacefigure[location={here,none}]
+  \startfloatcombination[nx=2,distance=2cm]
+    \startplacefigure[location=none]
+      \hbox{\starttikzpicture[shorten >=1pt,>=stealth]
+        \node (s2) at (0,0) {$S_{2}$};
+        \node (s1) at (0,2) {$S_{1}$};
+        \node (s2p) at (3,0) {$S_{2}'$};
+        \node (s1p) at (3,2) {$S_{1}'$};
+        \draw (s1) -- node[above] {$\sim$} (s1p);
+        \draw[dashed] (s2) -- node[above] {$\sim$} (s2p);
+        \draw[->] (s1) -- (s2);
+        \draw[->,dashed] (s1p) -- (s2p);
+      \stoptikzpicture}
+    \stopplacefigure
+    \startplacefigure[location=none]
+      \hbox{\starttikzpicture[shorten >=1pt,>=stealth]
+        \node (s2) at (0,0) {$S_{2}$};
+        \node (s1) at (0,2) {$S_{1}$};
+        \node (s1p) at (3,2) {$S_{1}'$};
+        \draw (s1) -- node[above] {$\sim$} (s1p);
+        \draw[->] (s1) -- (s2);
+        \draw[dashed] (s2) -- node[below,xshift=1mm] {$\sim$} (s1p);
+      \stoptikzpicture}
+    \stopplacefigure
+  \stopfloatcombination
+\stopplacefigure
+
+The diagram on the left represents a control-flow transition, which is performed in lock-step.  This
+means one has encountered a control-flow instruction, and therefore also implies that the basic
+block has finished executing and that the next block needs to be retrieved.  The diagram on the
+right is the interesting one which allows for the multiple transitions in the input language.  It
+requires writing a matching predicate $\sim$ which can match the start of the basic block in
+\rtlblock\ with any state in the original \rtl\ language as long as one has not finished executing
+the basic block and has not encountered a control-flow instruction.  By also showing that some
+metric stored in the $\sim$ relation is decreasing, one can then show that one will not get stuck
+iterating forever, and effectively get the behaviour of executing multiple instructions in the input
+code to execute one big step in the output language.  The other trick is that at each one of these
+steps, one also incrementally builds a proof of the execution from the start of the basic block to
+the current state, so that when a control-flow instruction is encountered, this proof can be used
+show that one can jump over all the input instructions with that block.
+
 \subsection[scheduling:hyperblocks]{Hyperblocks as a Generalisation of Basic Blocks}
 
 Basic blocks are popular in intermediate languages because they describe a list of sequential
@@ -109,17 +247,47 @@ which can also be taken into account by the scheduler. For example, any data dep
 instructions that have mutually exclusive predicates can be removed, as these instructions will
 never be activate at the same time.
 
-\subsection[sec:rtlblockdef]{RTLBlock and RTLPar Intermediate Language Definition}
+\section[implementation-of-if-conversion]{Implementation of If-Conversion}
+
+If-conversion is the conversion that introduces predicated instructions which can make use of the
+hyperblocks. It converts conditional statements into predicated instructions. This transformation
+has a few limitations on the kind of conditional statements that it can translate.  First, only
+conditional statements without loops can be translated, therefore, one must identify cycles in the
+control-flow before performing the if-conversion. Secondly, if-conversion will not always result in
+more efficient code, so it should not be applied to any conditional statements. Instead, it is best
+applied to conditional statements where each branch will take a similar amount of time to execute.
+
+For back ends that do not support predicates, it is also important to eliminate the predicates again
+using a process called reverse if-conversion, which creates branches out of the predicates and
+groups these together again. This allows for hyperblock scheduling to also be used with back ends
+that do not support predicates, which are most the default back ends included in CompCert.
+
+From an implementation point of view, the two steps are relatively simple to implement naïvely,
+however, especially the if-conversion transformation could be improved by adding better heuristics
+for which conditional statements to transform into predicated instructions.  For hyperblocks,
+conditionals where each branch approximately execute in the same number of cycles are an ideal
+candidate for the transformation, as it means that no time is lost due to having to execute extra
+predicated instructions that are false.  However, in addition to that if one has an estimate on the
+probability of picking a branch, then one can introduce more interesting heuristics such as also
+transforming conditionals who have disproportionate branches, but where one is more likely to
+execute the longest branch every time.
+
+However, from a proof perspective they are not the same.  In this case, the basic block generation
+is especially tricky due to the
 
-\lindex{RTLBlock}\lindex{RTLPar}\in{Figure}[fig:compcert_interm] shows the intermediate languages
-introduced to implement hyperblocks in CompCert.  This consists of new RTLBlock and RTLPar
+\section[sec:scheduling]{Scheduling Implementation}
+
+\subsection[sec:rtlblockdef]{\rtlblock\ and \rtlpar\ Intermediate Language Definition}
+
+\lindex{\rtlblock}\lindex{\rtlpar}\in{Figure}[fig:compcert_interm] shows the intermediate languages
+introduced to implement hyperblocks in CompCert.  This consists of new \rtlblock\ and \rtlpar\
 intermediate languages, which implement the sequential and parallel semantics of basic blocks
-respectively. The semantics of RTLBlock and RTLPar are based on the CompCert RTL intermediate
-language. However, instead of mapping from states to instructions, RTLBlock maps from states to
-hyperblocks, and RTLPar maps from states to parallel hyperblocks, which will be described in the
+respectively. The semantics of \rtlblock\ and \rtlpar\ are based on the CompCert \rtl\ intermediate
+language. However, instead of mapping from states to instructions, \rtlblock\ maps from states to
+hyperblocks, and \rtlpar\ maps from states to parallel hyperblocks, which will be described in the
 next sections.
 
-\placeformula \startformula \startmathalignment
+\placeformula[eq:standard] \startformula \startmathalignment
 \NC i\ \ \colon\colon= \NC \ \ \text{\tt RBnop}                                                   \NR
 \NC                  \NC |\ \ \text{\tt RBop}\ p?\ \mathit{op}\ \vec{r}\ d                      \NR
 \NC                  \NC |\ \ \text{\tt RBload}\ p?\ \mathit{chunk}\ \mathit{addr}\ \vec{r}\ d  \NR
@@ -127,7 +295,7 @@ next sections.
 \NC                  \NC |\ \ \text{\tt RBsetpred}\ p?\ c\ \vec{r}\ d                           \NR
 \stopmathalignment \stopformula
 
-\placeformula \startformula \startmathalignment
+\placeformula[eq:cf-instr] \startformula \startmathalignment
 \NC i_{\mathit{cf}}\ \ \colon\colon= \NC \ \ \text{\tt RBcall}\ \mathit{sig}\ f\ \vec{r}\ d\ n \NR
 \NC                      \NC |\ \ \text{\tt RBtailcall}\ \mathit{sig}\ f\ \vec{r}                      \NR
 \NC                      \NC |\ \ \text{\tt RBbuiltin}\ f_{\mathit{ext}}\ \vec{r}\ r\ n                \NR
@@ -138,74 +306,136 @@ next sections.
 \NC                      \NC |\ \ \text{\tt RBpred\_{cf}}\ P\ i_{\mathit{cf}_{1}}\ i_{\mathit{cf}_{2}} \NR
 \stopmathalignment \stopformula
 
-RTLBlock instructions are split into two types of instructions, standard instructions and
+\rtlblock\ instructions are split into two types of instructions, standard instructions and
 control-flow instructions. The standard instructions are the instructions that can be placed into
 hyperblocks, whereas control-flow instructions are the instructions that can end hyperblocks.  The
-standard instructions are shown in \in{Figure}[fig:standard_instruction], whereas the control-flow
-instructions are shown in \in{Figure}[fig:control_flow_instruction].  Most instructions are quite
-similar to their RTL counterparts, however, there are some instructions that have been
-added. \type{RBsetpred} is a standard instruction which sets a predicate equal to an evaluated
-condition. This instruction is used during if-conversion to set the predicate to the value of the
-condition in the conditional statement instruction. To support proper predicated instructions, each
-instruction therefore also contains an optional predicate which has been set beforehand, which
-produces the hyperblock. In additon to that, there is also an extra control-flow instruction called
-\type{RBpred_cf}, which can branch on a predicate an nests more control-flow instructions inside of
-it. This is also necessary for if-conversion, when converting an already converted conditional
-statement, as depending on the predicate, a different control-flow instruction may be necessary.
+standard instructions are shown in \in{Equation}[eq:standard], whereas the control-flow instructions
+are shown in \in{Equation}[eq:cf-instr].  Most instructions are quite similar to their \rtl\
+counterparts, however, there are some instructions that have been added. \type{RBsetpred} is a
+standard instruction which sets a predicate equal to an evaluated condition. This instruction is
+used during if-conversion to set the predicate to the value of the condition in the conditional
+statement instruction. To support proper predicated instructions, each instruction therefore also
+contains an optional predicate which has been set beforehand, which produces the hyperblock. In
+additon to that, there is also an extra control-flow instruction called \type{RBpred_cf}, which can
+branch on a predicate an nests more control-flow instructions inside of it. This is also necessary
+for if-conversion, when converting an already converted conditional statement, as depending on the
+predicate, a different control-flow instruction may be necessary.
 
 \placeformula[fig:hb_def]\startformula\startalign[align={1:left}]
-\NC \blockbb \qquad \eqdef \qquad (\text{\tt list}\ i) \times i_{\mathit{cf}} \NR
-\NC \parbb \qquad \eqdef \qquad (\text{\tt list}\ \text{\tt list}\ \text{\tt list}\ i) \times i_{\mathit{cf}}\NR
+\NC \blockbb \qquad \eqdef \qquad (\text{\tt slist}\ i) \times i_{\mathit{cf}} \NR \NC \parbb \qquad
+\eqdef \qquad (\text{\tt slist}\ (\text{\tt plist}\ (\text{\tt slist}\ i))) \times
+i_{\mathit{cf}}\NR
 \stopalign\stopformula
 
-These instructions are use in RTLBlock as well as in RTLPar. The main difference between these two
-languages is how these instructions are arranged within the hyperblock and the execution semantics
-of the hyperblock in the languages. Let $\blockbb$ be the definition of a hyperblock for RTLBlock,
-and let $\parbb$ be the definition of a hyperblock for RTLPar, then $\blockbb$ and $\parbb$ be
-defined as in \in{Equation}[fig:hb_def]. The Figure shows the different nesting levels of the basic
-blocks in RTLBlock as well as RTLPar, and where the parallel semantics of RTLPar come into
-play. RTLBlock is made of a list of instructions and a control-flow instruction that ends the
-hyperblock. Each instruction in RTLBlock is executed sequentially. RTLPar is made of three separate
-lists. The outer list behaves like the list in RTLBlock, and executes each list inside of it
-sequentially. RTLPar then has a list which executes it's contents in parallel, meaning each
-instruction uses the same state of the registers and memory while executing. Finally, instead of
-directly containing instructions, there is an additional layer of sequentially executing lists of
-instructions, to be able to execute small groups of instructions in sequence, but still in parallel
-to the other instructions in the current parallel group.
-
-RTLPar is structured in this way so that the scheduler has more flexibility, so that it can not only
-parallelise operations in the basic blocks, but that it can also define sequential operations in
-these parallel operations. This is especially useful for the hardware back end, as it means that
+These instructions are use in \rtlblock\ as well as in \rtlpar. The main difference between these
+two languages is how these instructions are arranged within the hyperblock and the execution
+semantics of the hyperblock in the languages. Let $\blockbb$ be the definition of a hyperblock for
+\rtlblock, and let $\parbb$ be the definition of a hyperblock for \rtlpar, then $\blockbb$ and
+$\parbb$ be defined as in \in{Equation}[fig:hb_def]. The Figure shows the different nesting levels
+of the basic blocks in \rtlblock\ as well as \rtlpar, and where the parallel semantics of \rtlpar\
+come into play. \rtlblock\ is made of a list of instructions and a control-flow instruction that
+ends the hyperblock. Each instruction in \rtlblock\ is executed sequentially. \rtlpar\ is made of
+three separate lists. The outer list behaves like the list in \rtlblock, and executes each list
+inside of it sequentially. \rtlpar\ then has a list which executes it's contents in parallel,
+meaning each instruction uses the same state of the registers and memory while executing. Finally,
+instead of directly containing instructions, there is an additional layer of sequentially executing
+lists of instructions, to be able to execute small groups of instructions in sequence, but still in
+parallel to the other instructions in the current parallel group.
+
+\rtlpar\ is structured in this way so that the scheduler has more flexibility, so that it can not
+only parallelise operations in the basic blocks, but that it can also define sequential operations
+in these parallel operations. This is especially useful for the hardware back end, as it means that
 sequential operations that do not quite fill one clock cycle can be executed sequentially, but can
 still be executed in parallel with other independent operations. This optimisation is called
 operation chaining, and is critical to get the most performance out of the hardware.
 
-\subsection[translating-rtlblock-to-rtlpar]{Translating RTLBlock to RTLPar}
-
-The translation from RTLBlock to RTLPar is performed using a scheduling algorithm, which will takes
-hyperblocks and rearranges their instructions to maximise parallelism based on the data
-dependencies. It then arranges the instructions in RTLPar by putting independent instructions into
+\subsection[translating-rtlblock-to-rtlpar]{Translating \rtlblock\ to \rtlpar}
+
+\definecolor[s1col][x=7FC97F]
+\definecolor[s2col][x=BEAED4]
+\definecolor[s3col][x=FDC086]
+\definecolor[s4col][x=FFFF99]
+\definecolor[s5col][x=386CB0]
+
+\define[2]\makedfgstate{\raisebox{-2pt}\hbox{\tikz{\node[circle,draw=black,minimum
+      size=4mm,fill=#1,scale=0.5]{#2};}}}
+\define\sI{\makedfgstate{s1col}{1}}
+\define\sII{\makedfgstate{s2col}{2}}
+\define\sIII{\makedfgstate{s3col}{3}}
+\define\sIV{\makedfgstate{s4col}{4}}
+\define\sV{\makedfgstate{s5col}{5}}
+
+\startplacefigure[reference={fig:op_chain},title={Example of scheduling a hyperblock into multiple
+    clock cycles.}]
+  \startfloatcombination[ny=2,nx=2]
+    \startplacesubfigure[title={RTLBlock hyperblock to be scheduled.}]
+      \startframedtext[frame=off,offset=none,width={0.43\textwidth},style=\rmx]
+        \starthlC
+          /BTEX \sI/ETEX                r2 = r1 + r4;
+          /BTEX \sII/ETEX if(p1)         r1 = r2 + r4;
+          /BTEX \sIII/ETEX if(!p2 && !p1) r3 = r1 * r1;
+          /BTEX \sIV/ETEX if(p2)         r3 = r1 * r4;
+          /BTEX \sV/ETEX if(!p2)        r3 = r3 * r3;
+        \stophlC
+      \stopframedtext
+    \stopplacesubfigure
+    \startplacesubfigure[title={Scheduled RTLPar hyperblock.}]
+      \startframedtext[frame=off,offset=none,width={0.53\textwidth},style=\rmx]
+        \starthlC
+          1: [r2 = r1 + r4; if(p1) r1 = r2 + r4];
+          [if(!p2 && !p1) r3 = r1 * r1]
+          2: [if(p2) r1 * r4]; [if(!p2) r3 * r3]
+        \stophlC
+      \stopframedtext
+    \stopplacesubfigure
+    \startplacesubfigure[title={Data dependencies between each operation of the RTLBlock
+        hyperblock.}]
+      \startframedtext[frame=off,offset=none,width={0.43\textwidth}]
+        \hbox{\starttikzpicture[>=stealth,shorten >=1pt,auto,node distance=1.3cm]
+          \tikzstyle{every state}=[draw=black,thick,minimum size=4mm]
+          \node[state,fill=s1col] (s1) {1};
+          \node[state,fill=s2col] (s2) [below left of=s1] {2};
+          \node[state,fill=s3col] (s3) [right of=s1] {3};
+          \node[state,fill=s4col] (s4) [below right of=s2] {4};
+          \node[state,fill=s5col] (s5) [below of=s3] {5};
+
+          \draw[->] (s1) -- (s2);
+          \draw[->] (s2) -- (s4);
+          %\draw[->] (s3) -- (s4);
+          \draw[->] (s3) -- (s5);
+        \stoptikzpicture}
+      \stopframedtext
+    \stopplacesubfigure
+    \startplacesubfigure[title={Timing diagram showing how the operations in RTLPar will be executed.}]
+      \externalfigure[figures/timing-3.pdf][width=0.5\textwidth]
+    \stopplacesubfigure
+  \stopfloatcombination
+\stopplacefigure
+
+The translation from \rtlblock\ to \rtlpar\ is performed using a scheduling algorithm, which will
+takes hyperblocks and rearranges their instructions to maximise parallelism based on the data
+dependencies. It then arranges the instructions in \rtlpar\ by putting independent instructions into
 the same parallel block. Scheduling is an optimisation that is well suited to translation
 validation, as the algorithm itself can be quite complex, whereas the output is normally relatively
 easy to check for equivalence with the input.  \in{Figure}[fig:compcert_interm] shows that the
 scheduling step comes right after the if-conversion step which originally creates the hyperblocks.
 
-\in{Figure}[fig:op_chain] shows how the scheduling step transforms RTLBlock into RTLPar. The
-RTLBlock hyperblock being scheduled is shown in \in{Figure}[fig:op_chain_a], which contains five
+\in{Figure}[fig:op_chain] shows how the scheduling step transforms \rtlblock\ into \rtlpar. The
+\rtlblock\ hyperblock being scheduled is shown in \in{Figure}{a}[fig:op_chain], which contains five
 predicated operations, comprising two additions and three multiplications. The data dependencies of
-the instructions in the hyperblock are shown in \in{Figure}[fig:op_chain_b]. Curiously, even though
+the instructions in the hyperblock are shown in \in{Figure}{c}[fig:op_chain]. Curiously, even though
 operations and are normally dependent on each other due to a write-after-write conflict, but because
 the predicates are independent the write-after-write conflict can effectively be removed by the
 scheduler.
 
-The scheduler transforms the RTLBlock hyperblock into the RTLPar hyperblock shown in
-\in{Figure}[fig:op_chain_c].  Even though the addition in is dependent on , they can still be put
-into the same cycle as the additions do not take as long to complete as the multiplications. This
-optimisation is called operation chaining. Then, the multiplication in can also be placed in the
-same clock cycle as it does not have any data dependencies. Then, in the next clock cycle, either
-the multiplication in can take place, or the multiplication in will take place, meaning these two
-multiplications can also be placed in the same clock cycle. This gives the final schedule that is
-shown in \in{Figure}[fig:op_chain_d].
+The scheduler transforms the \rtlblock\ hyperblock into the \rtlpar\ hyperblock shown in
+\in{Figure}{b}[fig:op_chain].  Even though the addition in \sII\ is dependent on \sI, they can still
+be put into the same cycle as the additions do not take as long to complete as the
+multiplications. This optimisation is called operation chaining. Then, the multiplication in \sIII\
+can also be placed in the same clock cycle as it does not have any data dependencies. Then, in the
+next clock cycle, either the multiplication in \sIV\ can take place, or the multiplication in \sV\
+will take place, meaning these two multiplications can also be placed in the same clock cycle. This
+gives the final schedule that is shown in \in{Figure}{d}[fig:op_chain].
 
 \section[abstr_interp]{Abstract Interpretation of Hyperblocks}
 
@@ -233,6 +463,34 @@ account.
 
 The symbolic expressions have the following form:
 
+\define\sep{\ |\ }
+
+Resources:
+
+\startformula \rho\ \eqdef\ r \sep p \sep \text{\tt Mem} \stopformula
+
+Expressions:
+
+\startformula e\ \eqdef\ r^{0} \sep P \Rightarrow \text{\tt Op}(\mathit{op}, \vec{e}) \sep P \Rightarrow
+  \text{\tt Load}(\mathit{chunk}, \mathit{mode}, \vec{e}, e_{m}) \stopformula
+
+Memory expressions:
+
+\startformula e_{m}\ \eqdef\ \text{\tt Mem}^{0} \sep P \Rightarrow \text{\tt Store}(\mathit{chunk},
+  \mathit{mode}, \vec{e}, e_{m}, e) \stopformula
+
+Predicate expressions:
+
+\startformula e_{p}\ \eqdef\ p^{0} \sep P \Rightarrow \text{\tt SetPred}(\mathit{cond}, \vec{e}, e_{p})\stopformula
+
+Code:
+
+\startformula m\ \eqdef\ \rho \mapsto (\vec{e} \sep \vec{e_{m}} \sep \vec{e_{p}}) \stopformula
+
+Predicates:
+
+\startformula P\ \eqdef\ (\text{\tt Bool} * p) \sep P \land P \sep P \lor P \sep \top\ \sep \perp \stopformula
+
 There are three main resources that are present in symbolic expressions: registers $r$, which
 represent the value of the registers in the program, predicates $p$, which represent the value of
 the predicates throughout the program, and finally, memory \type{Mem}, representing the state of the
@@ -269,30 +527,31 @@ predicated type and append the first predicated type to the second:
 
 \subsection[example-of-translation]{Example of translation}
 
-%\startformula \begin{aligned}
-%    \label{eq:4}
-%    &\text{\tt r1} =
-%      \begin{cases}
-%        \mono{r1}^{0} + \mono{r4}^{0} + \mono{r4}^{0}, \quad &\mono{p1} \\
-%        \mono{r1}^{0}, \quad &\mono{!p1} \\
-%      \end{cases}\\
-%    &\mono{r2} = \mono{r1}^{0} + \mono{r4}^{0}\\
-%    &\mono{r3} =
-%      \begin{cases}
-%        \left( \mono{r1}^{0} \times \mono{r1}^{0} \right) \times \left( \mono{r1}^{0} \times \mono{r1}^{0} \right),\quad &\mono{!p2 \&\& !p1}\\
-%        \mono{r1}^{0} \times \mono{r4}^{0},\quad &\mono{p2 \&\& !p1}\\
-%        \left( \mono{r1}^{0} + \mono{r4}^{0} + \mono{r4}^{0} \right) \times \mono{r4}^{0},\quad &\mono{!p2 \&\& p1}\\
-%        \mono{r3}^{0} \times \mono{r3}^{0},\quad &\mono{p2 \&\& p1}
-%      \end{cases}
-%  \end{aligned} \stopformula
-
-Using the example shown in \in{Figure}[fig:op_chain], the RTLBlock hyperblock shown in
-\in{Figure}[fig:op_chain_a] is scheduled into the hyperblock RTLPar shown in
-\in{Figure}[fig:op_chain_c]. The first step is to translate the input and output of the scheduler
-into the Abstr intermediate language by calculating the symbolic values that will be stored in each
-of the registers. Symbolically evaluating the RTLBlock hyperblock results in the expressions shown
-in \in{Figure}[fig:symbolic_expressions] for registers \type{r1}, \type{r2} and \type{r3}, where the
-state of the register \type{r1} at the start of the hyperblock is denoted as $\text{\tt r1}^{0}$.
+\placeformula[eq:4]\startformula\startalign[n=1,align={1:left}]
+\NC \text{\tt r1} =
+\startcases
+  \NC \mono{r1}^{0} + \mono{r4}^{0} + \mono{r4}^{0}, \quad \NC \mono{p1} \NR
+  \NC \mono{r1}^{0}, \quad \NC \mono{!p1} \NR
+\stopcases \NR
+\NC \mono{r2} = \mono{r1}^{0} + \mono{r4}^{0}\NR
+\NC\mono{r3} =
+\startcases
+  \NC \left( \mono{r1}^{0} \times \mono{r1}^{0} \right) \times \left( \mono{r1}^{0} \times
+  \mono{r1}^{0} \right),\quad \NC \mono{!p2 \&\& !p1} \NR
+  \NC \mono{r1}^{0} \times \mono{r4}^{0},\quad \NC \mono{p2 \&\& !p1} \NR
+  \NC \left( \mono{r1}^{0} + \mono{r4}^{0} + \mono{r4}^{0} \right) \times \mono{r4}^{0},\quad
+  \NC \mono{!p2 \&\& p1}\NR
+  \NC \mono{r3}^{0} \times \mono{r3}^{0},\quad \NC\mono{p2 \&\& p1}\NR
+\stopcases\NR
+\stopalign \stopformula
+
+Using the example shown in \in{Figure}[fig:op_chain], the \rtlblock\ hyperblock shown in
+\in{Figure}{a}[fig:op_chain] is scheduled into the hyperblock \rtlpar\ shown in
+\in{Figure}{b}[fig:op_chain]. The first step is to translate the input and output of the scheduler
+into the \abstr\ intermediate language by calculating the symbolic values that will be stored in
+each of the registers. Symbolically evaluating the \rtlblock\ hyperblock results in the expressions
+shown in \in{Equation}[eq:4] for registers \type{r1}, \type{r2} and \type{r3}, where the state of
+the register \type{r1} at the start of the hyperblock is denoted as $\text{\tt r1}^{0}$.
 
 This symbolic expression can be generated by sequentially going through the list of predicated
 expressions and applying the update function defined in \in{Section}[abstr_language]. The resulting
@@ -322,13 +581,13 @@ verified SAT solver and a syntactic equivalence check of the actual symbolic exp
 
 \subsection[comparing-symbolic-abstract-expressions]{Comparing Symbolic Abstract Expressions}
 
-Abstract expressions comprise a predicate and a symbolic expression. The translation from RTLBlock
-and RTLPar to Abstract ensures that each register will contain a predicated symbolic expression that
-describes the value that will be assigned to it after evaluating the basic block.  This means that
-if an expression always produces the same result as another expression, that the register will
-always contain the same value. If the RTLBlock program and RTLPar program translate to two Abstract
-languages where all the register's symbolic expressions match, then this means that both programs
-will always produce the same value.
+Abstract expressions comprise a predicate and a symbolic expression. The translation from \rtlblock\
+and \rtlpar\ to \abstr\ ensures that each register will contain a predicated symbolic expression
+that describes the value that will be assigned to it after evaluating the basic block.  This means
+that if an expression always produces the same result as another expression, that the register will
+always contain the same value. If the \rtlblock\ program and \rtlpar\ program translate to two
+\abstr\ representations where all the register's symbolic expressions match, then this means that
+both programs will always produce the same value.
 
 The main comparison between two symbolic expressions can be done using a simple syntactic
 comparison, because this ensures that each of the expressions always evaluate to the same value, and
@@ -434,8 +693,8 @@ in the same language.
 
 \subsubsection[proof-of-correctness-of-scheduling]{Proof of Correctness of Scheduling}
 
-The scheduling pass is formally verified using translation validation, using the Abstr language as a
-point of comparison between RTLBlock and RTLPar. Using the algorithm defined in
+The scheduling pass is formally verified using translation validation, using the \abstr\ language as
+a point of comparison between \rtlblock\ and \rtlpar. Using the algorithm defined in
 \in{Section}[abstr_interp], the formally verified SAT solver described in \in{Section}[verified_sat]
 can be used prove that if the two predicated expression lists are equivalent, that they will always
 evaluate to the same value.
@@ -446,37 +705,11 @@ This section describes the implementation details of the static \SDC\ scheduling
 also describes some of the details of the basic block generation as well as the if-conversion pass
 which generates the predicated instructions.
 
-\subsection[implementation-of-if-conversion]{Implementation of If-Conversion}
-
-If-conversion is the conversion that introduces predicated instructions which can make use of the
-hyperblocks. It converts conditional statements into predicated instructions. This transformation
-has a few limitations on the kind of conditional statements that it can translate.  First, only
-conditional statements without loops can be translated, therefore, one must identify cycles in the
-control-flow before performing the if-conversion. Secondly, if-conversion will not always result in
-more efficient code, so it should not be applied to any conditional statements. Instead, it is best
-applied to conditional statements where each branch will take a similar amount of time to execute.
-
-For back ends that do not support predicates, it is also important to eliminate the predicates again
-using a process called reverse if-conversion, which creates branches out of the predicates and
-groups these together again. This allows for hyperblock scheduling to also be used with back ends
-that do not support predicates, which are most the default back ends included in CompCert.
-
 \subsection[static-sdc-scheduling]{Static \SDC\ Scheduling}
 
-\Word{\infull{SDC}} (\SDC) scheduling~\cite[cong06_sdc] is a flexible algorithm to perform
-scheduling of instructions, especially if the target is hardware directly. It allows for flexible
-definitions of optimisation functions in terms of latencies and dependencies of the operations in
-the basic block.
-
-\section[performance-comparison]{Performance Comparison}
-
-\section[sec:related-work]{Related Work}
-
-\section[conclusion]{Conclusion}
-
-This material is based upon work supported by the under Grant No.~ and Grant No.~. Any opinions,
-findings, and conclusions or recommendations expressed in this material are those of the author and
-do not necessarily reflect the views of the National Science Foundation.
+\SDC\ scheduling~\cite[cong06_sdc] is a flexible algorithm to perform scheduling of instructions,
+especially if the target is hardware directly. It allows for flexible definitions of optimisation
+functions in terms of latencies and dependencies of the operations in the basic block.
 
 \startmode[chapter]
   \section{Bibliography}
diff --git a/figures/timing-3.tex b/figures/timing-3.tex
new file mode 100644
index 0000000..804f8c5
--- /dev/null
+++ b/figures/timing-3.tex
@@ -0,0 +1,23 @@
+\documentclass{standalone}
+
+\usepackage{tikz}
+\usepackage{tikz-timing}
+\usetikzlibrary{patterns}
+\usepackage{newpxtext}
+\usepackage{newpxmath}
+
+\definecolor{s1col}{HTML}{7fc97f}
+\definecolor{s2col}{HTML}{beaed4}
+\definecolor{s3col}{HTML}{fdc086}
+\definecolor{s4col}{HTML}{ffff99}
+\definecolor{s5col}{HTML}{386cb0}
+
+\begin{document}
+      \begin{tikztimingtable}[timing/d/background/.style={fill=white},timing/yunit=15pt]
+        clk & 1L 4{7C} 1C \\
+        r1  & 8D{} [fill=s2col]7D{if(p1) r2 + r4} [fill=white]15D{}\\
+        r2  & 1D{} [fill=s1col]7D{r1 + r4} [fill=white]22D{} \\
+        r3 & 1D{} [fill=s3col]14D{if(!p2 \&\& !p1) r1 * r1} [postaction={pattern=north west lines,pattern
+          color=s5col},fill=s4col]14D{if(p2) r1 * r4 || if(!p2) r3 * r3} [fill=white]1D{}\\
+      \end{tikztimingtable}
+\end{document}
diff --git a/lsr_env.tex b/lsr_env.tex
index 8438c18..1c1f9bf 100644
--- a/lsr_env.tex
+++ b/lsr_env.tex
@@ -20,6 +20,8 @@
     click=yes,
     style=\rm,
 ]
+
+\usecolors[crayola]
 %\placebookmarks[chapter,section,subsection,subsubsection][chapter,section]
 %\setupinteractionscreen[option=bookmark]
 
@@ -339,7 +341,7 @@
 \definevimtyping [hlcoq] [syntax=coq]
 \definevimtyping [hlocaml] [syntax=ocaml]
 \definevimtyping [hlverilog] [syntax=verilog]
-\definevimtyping [hlC] [syntax=c]
+\definevimtyping [hlC] [syntax=c,escape=command]
 
 % ==========================================================================
 % Tikz
@@ -478,8 +480,14 @@
 % Macros
 % ==========================================================================
 
-\define\eqdef{{\strut\hbox to 0pt{\raisebox{0.5\baselineskip}\hbox{\rmx def}\hss}}=}
+\define\eqdef{{\strut\hbox to -2pt{\raisebox{0.4\baselineskip}\hbox{\rmxx def}\hss}}=}
 \define\blockbb{\mathcal{B}_{\rm b}}
 \define\parbb{\mathcal{B}_{\rm p}}
 
+\define\rtlblock{{\sc RtlBlock}}
+\define\rtlpar{{\sc RtlPar}}
+\define\rtl{{\sc Rtl}}
+\define\htl{{\sc Htl}}
+\define\abstr{{\sc Abstr}}
+
 \stopenvironment
diff --git a/lsr_refs.bib b/lsr_refs.bib
index a107605..992410d 100644
--- a/lsr_refs.bib
+++ b/lsr_refs.bib
@@ -371,6 +371,18 @@
 	year = {2008}
 }
 
+@article{fisher81_trace_sched,
+	author = {Fisher, Joseph A.},
+	doi = {10.1109/TC.1981.1675827},
+	journaltitle = {IEEE Transactions on Computers},
+	keywords = {static scheduling,trace scheduling},
+	number = {7},
+	pages = {478--490},
+	title = {Trace Scheduling: A Technique for Global Microcode Compaction},
+	volume = {C-30},
+	year = {1981}
+}
+
 @inproceedings{gajski10_what_hls,
 	author = {Gajski, Dan and Austin, Todd and Svoboda, Steve},
 	booktitle = {Design Automation Conference},
diff --git a/main.tex b/main.tex
index 8988110..f77f086 100644
--- a/main.tex
+++ b/main.tex
@@ -31,7 +31,7 @@
   \component pipelining
   %\component dynamic
   \component schedule
-  \component conclusion
+  %\component conclusion
 \stopbodymatter
 
 \startbackmatter
author	Yann Herklotz <git@yannherklotz.com>	2022-05-05 17:08:12 +0100
committer	Yann Herklotz <git@yannherklotz.com>	2022-05-05 17:08:12 +0100
commit	c9648f2dc56a0c970a90e1539925ae78b0610fbc (patch)
tree	41a7934ad8d327684e33038b0e7f63d7bfe407dc
parent	b90e207d5a9d1871d5ecde168d39ce91c7691fe1 (diff)
download	lsr22_fvhls-c9648f2dc56a0c970a90e1539925ae78b0610fbc.tar.gz lsr22_fvhls-c9648f2dc56a0c970a90e1539925ae78b0610fbc.zip