Diffstat (limited to 'chapters/pipelining.tex')
-rw-r--r--  chapters/pipelining.tex | 73
1 file changed, 51 insertions(+), 22 deletions(-)
diff --git a/chapters/pipelining.tex b/chapters/pipelining.tex
index 7d81da6..a7a54d0 100644
--- a/chapters/pipelining.tex
+++ b/chapters/pipelining.tex
@@ -97,9 +97,9 @@ $\mu$ and representing the unroll factor of the steady state $\delta$.
The goal is to verify the equivalence of the original loop and the pipelined loop using a validation
algorithm, and then prove that the validation algorithm is \emph{sound}, i.e. if it finds the two
loops to be equivalent, this implies that they will behave the same according to the language
-semantics. Essentially, as the loop pipelining algorithm only modifies loops, it is enough to show
-that the input loop $X$ is equivalent to the pipelined version of the loop $Y$. Assuming that the
-validation algorithm is called $\alpha$, we therefore want to show that the following statement
+semantics. Essentially, as the loop pipelining algorithm only modifies loops, it is sufficient to
+show that the input loop $X$ is equivalent to the pipelined version of the loop $Y$. Assuming that
+the validation algorithm is called $\alpha$, we therefore want to show that the following statement
holds, where $X_N$ means that we are considering $N$ iterations of the loop.
\placeformula\startformula
@@ -113,7 +113,7 @@ number of loop iterations $N$ are not always known at compile time, this may bec
property that needs to be checked.
This infinite property can be proven by checking smaller finite properties that loop pipelining
-should ensure. The two key insights by \cite[author][tristan10_simpl_verif_valid_softw_pipel] where
+should ensure. The two key insights by \cite[author][tristan10_simpl_verif_valid_softw_pipel] were
that essentially only the following two properties needed to be checked:
\placeformula[eq:loop1]\startformula
@@ -130,8 +130,8 @@ one more body $\mathcal{S}$ of the pipelined loop. The second property shown in
\in{Equation}[eq:loop2] describes that executing the initial loop $\mu$ times is exactly the same as
executing the prologue $\mathcal{P}$ of the loop followed by the epilogue $\mathcal{E}$. In
addition to some simpler properties, this means that the infinite property can still be
-checked.\footnote{In addition to other more trivial properties, for example that the loop structure
- is correct, and only contains one increment operation on the induction variable.}
+checked.\footnote{This is in addition to other more trivial properties, for example that the loop
+ structure is correct, and only contains one increment operation on the induction variable.}
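To make the prologue $\mathcal{P}$, steady state $\mathcal{S}$ and epilogue $\mathcal{E}$ concrete, the following is a minimal, hypothetical C sketch (not taken from Vericert or CompCert) of a loop split into two stages and software pipelined, with $\mu = 1$ and $\delta = 1$: the prologue starts iteration 0, the steady state overlaps two consecutive iterations, and the epilogue drains the final iteration.

```c
#include <assert.h>

#define N 8

/* Original loop X: each iteration multiplies and then adds. */
static void original(int *a) {
    for (int i = 0; i < N; i++)
        a[i] = a[i] * 2 + 1;
}

/* Software-pipelined version: the prologue P starts iteration 0's
 * first stage, the steady state S overlaps stage 2 of iteration i
 * with stage 1 of iteration i + 1, and the epilogue E drains the
 * pipeline. */
static void pipelined(int *a) {
    int t = a[0] * 2;          /* P: stage 1 of iteration 0      */
    for (int i = 0; i < N - 1; i++) {
        a[i] = t + 1;          /* S: stage 2 of iteration i      */
        t = a[i + 1] * 2;      /* S: stage 1 of iteration i + 1  */
    }
    a[N - 1] = t + 1;          /* E: stage 2 of iteration N - 1  */
}
```

Checking that `original` and `pipelined` agree for one concrete iteration count is exactly an instance of the finite properties above; the validator's job is to establish this symbolically for all $N$ at or above the minimum.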
Finally, the verifier itself needs to be proven correct, meaning the correctness of the symbolic
evaluation needs to be shown to be semantics preserving. If the comparison between two symbolic
@@ -140,22 +140,22 @@ states succeeds, then this must imply that the initial blocks themselves execute
\section{Novelties of verifying software pipelining in Vericert}
As mentioned in the previous section, software pipelining has already been implemented in CompCert
-by \cite[author][tristan10_simpl_verif_valid_softw_pipel]. From an implementation point of view for
+by \cite[author][tristan10_simpl_verif_valid_softw_pipel]. From an implementation point of view,
the actual pipelining algorithm that was implemented was a backtracking iterative modulo scheduling
algorithm paired with modulo variable expansion. However, the results were underwhelming because it
was tested on an out-of-order PowerPC G5 processor, whereas pipelining is more effective on in-order
processors to take full advantage of the static pipeline. In addition to that, alias analysis on
-memory dependencies is also not performed, which is perhaps the best way to improve the performance
-of the optimisation. Finally, the loop scheduling is also only performed on basic blocks, and not
-on extended basic blocks such as superblocks or hyperblocks. These are all short-comings that could
-be easily improved by using our existing {\sc RTLBlock} or {\sc RTLPar} language as the source
-language of the pipelining transformation. Paired with the scheduling step, one could extract more
-parallelism and also support conditionals in loop bodies. Memory aliasing is not directly supported
-yet in the {\sc RTLBlock} abstract interpretation, however, using some simple arithmetic properties
-it should be possible to normalise memory accesses and therefore prove reordering of memory accesses
-that do not alias under certain conditions.
-
-From a proof perspective there is also some improvements that can be made to the work. First of
+memory dependencies is also not performed, although this is perhaps the best way to improve the
+results of the optimisation. Finally, the loop scheduling is also only performed on basic blocks,
+and not on extended basic blocks such as superblocks or hyperblocks. These are all shortcomings
+that could be easily improved by using our existing {\sc RTLBlock} or {\sc RTLPar} language as the
+source language of the pipelining transformation. Paired with the scheduling step, one could
+extract more parallelism and also support conditionals in loop bodies. Memory aliasing is not
+directly supported yet in the {\sc RTLBlock} abstract interpretation; however, using some simple
+arithmetic properties it should be possible to normalise memory accesses and therefore prove
+reordering of memory accesses that do not alias under certain conditions.
+
+From a proof perspective there are also some improvements that can be made to the work. First of
all, the proof from the paper was actually never fully completed:
\startframedtext[
@@ -180,7 +180,7 @@ all, the proof from the paper was actually never fully completed:
\stopframedtext
\noindent The proof of semantic preservation going from unrolled loops with basic blocks back to a
-control-flow graph were tedious, showing that the language was not designed to handle loop
+control-flow graph was tedious, showing that the language was not designed to handle loop
transformations well. However, recent work on the formalisation of polyhedral transformations of
loops~\cite[authoryear][courant21_verif_code_gener_polyh_model] shows that this may be a better
representation for the intermediate language. However, translations to and from the {\sc Loop}
@@ -208,7 +208,7 @@ which the pipeline needs to be filled. The benefit of generating hardware direc
pipeline can already be laid out in the optimal and natural way.
It is less clear why, and especially how, software can be pipelined in the same way as hardware, and
-what the benefits are if the code is going to be executed.\footnote{It is more clear why scheduling
+what the benefits are if the code is going to be executed.\footnote{It is clearer why scheduling
would be important on a VLIW processor, but still not clear how the pipelined loop can be
represented in software without explicit parallelism.} The main difference compared to a hardware
pipeline is that in software the code is executed inherently sequentially. This means that it is
@@ -216,16 +216,45 @@ not possible to lay out a pipeline in the same way, as the next input cannot jus
loop while it is executing. However, a pipeline can be simulated in software, so that it behaves
the same as the hardware pipeline, but it will manually have to unroll and rearrange the loop body
so that all the stages can be executed by following one control-flow. However, the order of
-instructions will be quite different hardware, where the pipeline can be written naturally in
+instructions will be quite different from hardware, where the pipeline can be written naturally in
sequential order of the stages. In software, the pipeline is instead represented by one horizontal
slice of the pipeline, which will be the new loop body and represents the computation that the whole
hardware pipeline is performing at one point in time.
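The "horizontal slice" view can be illustrated with a small, hypothetical C sketch: the computation is split into three notional stages, and the pipelined loop body performs stage 3 of iteration $i$, stage 2 of iteration $i + 1$ and stage 1 of iteration $i + 2$ in a single pass, which is what the whole hardware pipeline would be doing in one clock cycle.

```c
#include <assert.h>

#define N 8

/* Reference loop with three notional stages per iteration:
 * stage 1 loads, stage 2 scales, stage 3 accumulates. */
static int reference(const int *a) {
    int sum = 0;
    for (int i = 0; i < N; i++) {
        int v = a[i];      /* stage 1 */
        int s = v * 3;     /* stage 2 */
        sum += s;          /* stage 3 */
    }
    return sum;
}

/* Pipelined loop body as one horizontal slice: while stage 3
 * finishes iteration i, stage 2 works on i + 1 and stage 1 on
 * i + 2, all within one sequential pass of the loop body. */
static int sliced(const int *a) {
    int sum = 0;
    int v1 = a[0];              /* prologue: fill the pipeline */
    int s2 = v1 * 3;
    int v2 = a[1];
    for (int i = 0; i < N - 2; i++) {
        sum += s2;              /* stage 3, iteration i     */
        s2 = v2 * 3;            /* stage 2, iteration i + 1 */
        v2 = a[i + 2];          /* stage 1, iteration i + 2 */
    }
    sum += s2;                  /* epilogue: drain iteration N - 2 */
    sum += v2 * 3;              /* epilogue: drain iteration N - 1 */
    return sum;
}
```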
-The main question that comes up is are the two pipelining methods actually equivalent? In Vericert
+The main question that comes up is: are the two pipelining methods actually equivalent? In Vericert
we are assuming that performing software pipelining, followed by scheduling and finally followed by
the hardware generation approximates the hardware pipelining that traditional \HLS\ tools perform as
part of their hardware generation.
+\subsection{Hardware Pipelining Can Be Approximated by Software Pipelining}
+
+The hope is that hardware pipelining can be approximated by software pipelining, so that the
+translation can easily be separated. We argue that this is in fact possible, with two additional
+properties that can be added to the software intermediate language. The main two problems that
+software pipelining runs into when directly comparing it to the equivalent hardware pipeline are
+first the duplicated code for the prologue and the epilogue, and secondly the possible loop
+unrolling that is necessary to then allocate the registers correctly.
+
+\paragraph{Removing the duplicate prologue and epilogue.} Disregarding the code duplication that
+results from generating the prologue and the epilogue, the main problem with these pieces of code
+is that one ends up with a loop that must execute a minimum number of iterations. This is not the
+case with a hardware pipeline, and it means that in software, to keep the code correct, the
+original version of the loop needs to be kept around, so that if the iteration count falls below
+this minimum the code can branch to the original loop instead. This does not really affect the
+performance of the code, but it does add a duplicate of every loop to the program, which might
+increase the size of the hardware dramatically. Predicated execution can instead be used to
+selectively enable the instructions that should be running at each point in time, thereby
+simulating the prologue and epilogue without having to duplicate any code.
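As a hypothetical sketch of this idea in C (the predicates and stage split are illustrative, not Vericert's representation), the prologue and epilogue disappear by running one extra pipeline cycle and guarding each stage with a predicate that says whether it is active:

```c
#include <assert.h>

#define N 8

/* Predicated pipelined loop: rather than duplicating code into a
 * prologue and an epilogue, run N + 1 "pipeline cycles" and guard
 * each stage with a predicate.  Stage 1 is active for cycles
 * 0..N-1, stage 2 lags one cycle behind, so the pipeline fills and
 * drains without any duplicated code, and a zero-iteration loop is
 * handled correctly without a fall-back copy of the original loop. */
static void predicated(int *a) {
    int t = 0;
    for (int k = 0; k < N + 1; k++) {
        int p1 = (k < N);          /* stage 1 active?               */
        int p2 = (k >= 1);         /* stage 2 active?               */
        if (p2) a[k - 1] = t + 1;  /* stage 2 of iteration k - 1    */
        if (p1) t = a[k] * 2;      /* stage 1 of iteration k        */
    }
}
```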
+
+\paragraph{Removing the need to unroll the loop.} In addition to that, register allocation will
+sometimes require the loop to be unrolled several times to remain correct. This also means that
+the original loop has to be kept around in case the number of loop iterations is not divisible by
+the unroll factor. However, by using a rotating register
+file~\cite[authoryear][rau92_regis_alloc_softw_pipel_loops] the clash between registers from
+different iterations is removed, thereby eliminating the need to unroll the loop for register
+allocation. This does increase the number of registers; however, this can be optimised further.
+In addition, rotating register files act much like proper hardware registers in pipelines.
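A rotating register file can be simulated in C with modular indexing; the following sketch is purely illustrative (the register count `R` and the two-stage split are assumptions, not Rau's exact scheme). Each cycle, the in-flight value is renamed to `regs[k % R]`, so values from different overlapped iterations never clash and no unrolling is required.

```c
#include <assert.h>

#define N 8
#define R 2   /* one rotating register per in-flight copy of t */

/* Two-stage pipeline where stage 1's result is consumed by stage 2
 * one cycle later.  Modulo variable expansion would unroll the loop
 * R times so each copy gets its own named register; a rotating
 * register file instead renames the value to regs[k % R] each cycle,
 * so the loop stays rolled and no duplicate fall-back loop is needed. */
static void rotating(int *a) {
    int regs[R];
    for (int k = 0; k < N + 1; k++) {
        if (k >= 1) a[k - 1] = regs[(k - 1) % R] + 1;  /* stage 2 */
        if (k < N)  regs[k % R] = a[k] * 2;            /* stage 1 */
    }
}
```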
+
\startmode[chapter]
\section{Bibliography}
\placelistofpublications