1 files changed, 54 insertions, 5 deletions
diff --git a/chapters/pipelining.tex b/chapters/pipelining.tex
index 6c351c8..5aa03dc 100644
--- a/chapters/pipelining.tex
+++ b/chapters/pipelining.tex
@@ -13,7 +13,12 @@
 
 Standard instruction scheduling only addresses parallelisation inside hyperblocks, which are linear
 sections of code.  However, loops are often the most critical sections in code, and scheduling only
-addresses parallelisation within one iteration.
+addresses parallelisation within one iteration.  Traditionally, loop pipelining is performed as part
+of the normal scheduling step in \HLS, as the scheduling algorithm can be generalised to support
+these new constraints.  However, it becomes expensive to check the correctness of a scheduling
+algorithm that transforms the code in this many ways.  It is better to separate loop pipelining into
+it's own translation pass which does a source-to-source translation of the code.  The final result
+should be similar to performing the loop pipelining together with the scheduling.
 
 \section{Loop pipelining example}
 
@@ -53,10 +58,54 @@ addresses parallelisation within one iteration.
   \stopfloatcombination
 \stopplacemarginfigure
 
-\in{Figure}[fig:pipelined-loop] shows an example of pipelining a loop which accumulates values and
-modifies an array.  In \in{Figure}{a}[fig:pipelined-loop], the body of the loop cannot be scheduled
-in less than three cycles, assuming that a load takes two clock cycles.  However, after transforming
-the code into the pipelined version on the right
+\in{Figure}[fig:pipelined-loop] shows an example of \emph{software pipelining} of a loop which
+accumulates values and modifies an array.  In \in{Figure}{a}[fig:pipelined-loop], the body of the
+loop cannot be scheduled in less than three cycles, assuming that a load takes two clock cycles.
+However, after transforming the code into the pipelined version in
+\in{Figure}{b}[fig:pipelined-loop], the number of inter-iteration dependencies have been reduced, by
+moving the store into the next iteration of the loop.  This means that the body of the loop could
+now be scheduled in two clock cycles.
+
+The process of pipelining the loop by being resource constrained can be performed using iterative
+modulo scheduling~\cite[rau94_iterat_sched], followed by modulo variable
+expansion~\cite[lam88_softw_pipel].  The steps performed in the optimisation are the following.
+
+\startitemize[n]
+\item Calculate a minimum \II, which is the lowest possible \II\ based on the resources of the code
+  inside of the loop and the distance and delay of inter-iteration dependencies.
+\item Perform modulo scheduling with an increasing \II\ until one is found that works.  The
+  scheduling is performed by adding instructions to a \MRT~\cite[lam88_softw_pipel], assigning
+  operations to each resource for one iteration.
+\item Once a valid modulo schedule is found, the loop code can be generated from it.  To keep the
+  \II\ that was found in the previous step, modulo variable expansion might be necessary and require
+  unrolling the loop.
+\stopitemize
+
+\section{Verification of pipelining}
+
+Verification of pipelining has already been performed in CompCert by
+\cite[authoryears][tristan10_simpl_verif_valid_softw_pipel].  Assuming that one has a loop body
+$\mathcal{B}$, the pipelined code can be described using the prologue $\mathcal{P}$, the steady
+state $\mathcal{S}$, the epilogue $\mathcal{E}$, and finally some additional variables representing
+the minimum number of iterations that need to be performed to be able to use the pipelined loop
+$\mu$ and representing the unroll factor of the steady state $\delta$.
+
+The goal is to verify the equivalence of the original loop and the pipelined loop using a validation
+algorithm, and then prove that the validation algorithm is \emph{sound}, i.e. if it finds the two
+loops to be equivalent, this implies that they will behave the same according to the language
+semantics.  Essentially, as the loop pipelining algorithm only modifies loops, it is enough to show
+that the input loop $X$ is equivalent to the pipelined version of the loop $Y$.  Assuming that the
+validation algorithm is called $\alpha$, we therefore want to show that the following statement
+holds, where $X_N$ means that we are considering $N$ iterations of the loop.
+\placeformula\startformula
+  \forall N,\quad \alpha(X_N) = \startmathcases
+    \NC \alpha(Y_{N/\delta}; X_{N\%\delta}) \NC N \ge \mu \NR
+    \NC \alpha(X_N) \NC \text{otherwise} \NR
+\stopmathcases\stopformula
+
+This states that for any number of iteration of the loop X, the formula should hold.  As the number
+of loop iterations $N$ are not always known at compile time, this may become an infinite property
+that needs to be checked.
 
 \startmode[section]
   \section{Bibliography}