\environment fonts_env
\environment lsr_env

\startcomponent pipelining

\chapter[sec:pipelining]{Loop Pipelining}

\startsynopsis
  This section describes the future plans of implementing loop pipelining in Vericert, also called
  loop scheduling.  This addresses the final major issue with Vericert, which is efficiently
  handling loops.
\stopsynopsis

Standard instruction scheduling only addresses parallelisation inside hyperblocks, which are linear
sections of code.  However, loops are often the most critical sections in code, and scheduling only
addresses parallelisation within one iteration.  Traditionally, loop pipelining is performed as part
of the normal scheduling step in \HLS, as the scheduling algorithm can be generalised to support
these new constraints.  However, it becomes expensive to check the correctness of a scheduling
algorithm that transforms the code in this many ways.  It is better to separate loop pipelining into
its own translation pass which does a source-to-source translation of the code.  The final result
should be similar to performing the loop pipelining together with the scheduling.

\section{Loop pipelining example}

\startplacemarginfigure[location=here,reference={fig:pipelined-loop},title={Example of pipelining a
    loop.}]
  \startfloatcombination[nx=2,distance=10mm,align=middle]

    \startplacesubfigure[title={Simple loop containing an accumulation of values with an
        inter-iteration dependency.}]
      \startframedtext[frame=off,offset=none,width={0.6\textwidth}]
        \starthlC
          for (int i = 1; i < N; i++) {
              c1 = acc[i-1] * c;
              c2 = x[i] * y[i];
              acc[i] = c1 + c2;
          }
        \stophlC
      \stopframedtext
    \stopplacesubfigure

    \startplacesubfigure[title={Pipelined loop reducing the number of dependencies inside of the
        loop.}]
      \startframedtext[frame=off,offset=none,width={0.6\textwidth}]
        \starthlC
          c1 = acc[0] * c;
          c2 = x[1] * y[1];
          for (int i = 1; i < N-1; i++) {
              acc[i] = c1 + c2;
              c2 = x[i+1] * y[i+1];
              c1 = acc[i+1] * c;
          }
          acc[N-1] = c1 + c2;
        \stophlC
      \stopframedtext
    \stopplacesubfigure

  \stopfloatcombination
\stopplacemarginfigure

\in{Figure}[fig:pipelined-loop] shows an example of \emph{software pipelining} of a loop which
accumulates values and modifies an array.  In \in{Figure}{a}[fig:pipelined-loop], the body of the
loop cannot be scheduled in less than three cycles assuming that a load takes two clock cycles.
However, after transforming the code into the pipelined version in
\in{Figure}{b}[fig:pipelined-loop], the number of inter-iteration dependencies have been reduced by
moving the store into the next iteration of the loop.  This means that the body of the loop could
now be scheduled in two clock cycles.

The process of pipelining the loop by being resource constrained can be done purely as a
source-to-source translation in software using iterative modulo
scheduling~\cite[rau94_iterat_sched], followed by modulo variable
expansion~\cite[lam88_softw_pipel].  The steps performed in the optimisation are the following:

\startitemize[n]
\item Calculate a minimum \II, which is the lowest possible \II\ based on the resources of the code
  inside of the loop and the distance and delay of inter-iteration dependencies.
\item Perform modulo scheduling with an increasing \II\ until one is found that works.  The
  scheduling is performed by adding instructions to a \MRT~\cite[lam88_softw_pipel], assigning
  operations to each resource for one iteration.
\item Once a valid modulo schedule is found, the loop code can be generated from it.  To keep the
  \II\ that was found in the previous step, modulo variable expansion might be necessary and require
  unrolling the loop.
\stopitemize

This is very similar to the way loop scheduling directly in hardware is performed, but it is often
done at the same time as scheduling.

\section{Verification of pipelining}

Verification of pipelining has already been performed in CompCert by
\cite[authoryears][tristan10_simpl_verif_valid_softw_pipel].  Assuming that one has a loop body
$\mathcal{B}$, the pipelined code can be described using the prologue $\mathcal{P}$, the steady
state $\mathcal{S}$, the epilogue $\mathcal{E}$, and finally some additional variables representing
the minimum number of iterations that need to be performed to be able to use the pipelined loop
$\mu$ and representing the unroll factor of the steady state $\delta$.

The goal is to verify the equivalence of the original loop and the pipelined loop using a validation
algorithm, and then prove that the validation algorithm is \emph{sound}, i.e. if it finds the two
loops to be equivalent, this implies that they will behave the same according to the language
semantics.  Essentially, as the loop pipelining algorithm only modifies loops, it is sufficient to
show that the input loop $X$ is equivalent to the pipelined version of the loop $Y$.  Assuming that
the validation algorithm is called $\alpha$, we therefore want to show that the following statement
holds, where $X_N$ means that we are considering $N$ iterations of the loop.

\placeformula\startformula
  \forall N,\quad \alpha(X_N) = \startmathcases
    \NC \alpha(Y_{N/\delta}; X_{N\%\delta}) \NC N \ge \mu \NR
    \NC \alpha(X_N) \NC \text{otherwise} \NR
\stopmathcases\stopformula

This states that for any number of iteration of the loop $X$, the formula should hold.  As the
number of loop iterations $N$ are not always known at compile time, this may become an infinite
property that needs to be checked.

This infinite property can be proven by checking smaller finite properties that loop pipelining
should ensure.  The two key insights by \cite[author][tristan10_simpl_verif_valid_softw_pipel] were
that mainly only the two following properties needed to be checked:

\placeformula[eq:loop1]\startformula
  \alpha(\mathcal{S}; \mathcal{E}) = \alpha(\mathcal{E}; \mathcal{B}^{\delta})
\stopformula

\placeformula[eq:loop2]\startformula
  \alpha(\mathcal{P}; \mathcal{E}) = \alpha(\mathcal{B}^{\mu})
\stopformula

\noindent The first property shown in \in{Equation}[eq:loop1] describes the fact that one can exit
the loop at any point and execute $\mathcal{B}^{\delta}$ iterations of the loop instead of executing
one more body $\mathcal{S}$ of the pipelined loop.  The second property shown in
\in{Equation}[eq:loop2] describes that executing the initial loop $\mu$ times is exactly the same as
executing the prologue $\mathcal{P}$ of the loop followed by the epilogue $\mathcal{E}$.  In
addition to some simpler properties, this means that the infinite property can still be
checked.\footnote{This is in addition to other more trivial properties, for example that the loop
  structure is correct, and only contains one increment operation on the induction variable.}

Finally, the verifier itself needs to be proven correct, meaning the correctness of the symbolic
evaluation needs to be shown to be semantics preserving.  If the comparison between two symbolic
states succeeds, then this must imply that the initial blocks themselves execute in the same way.

\section{Novelties of verifying software pipelining in Vericert}

As mentioned in the previous section, software pipelining has already been implemented in CompCert
by \cite[author][tristan10_simpl_verif_valid_softw_pipel].  From an implementation point of view,
the actual pipelining algorithm that was implemented was a backtracking iterative modulo scheduling
algorithm paired with modulo variable expansion.  However, the results were underwhelming because it
was tested on an out-of-order PowerPC G5 processor, whereas pipelining is more effective on in-order
processors to take full advantage of the static pipeline.  In addition to that, alias analysis on
memory dependencies is also not performed, although this is perhaps the best way to improve the
results of the optimisation.  Finally, the loop scheduling is also only performed on basic blocks,
and not on extended basic blocks such as superblocks or hyperblocks.  These are all short-comings
that could be easily improved by using our existing {\sc RTLBlock} or {\sc RTLPar} language as the
source language of the pipelining transformation.  Paired with the scheduling step, one could
extract more parallelism and also support conditionals in loop bodies.  Memory aliasing is not
directly supported yet in the {\sc RTLBlock} abstract interpretation, however, using some simple
arithmetic properties it should be possible to normalise memory accesses and therefore prove
reordering of memory accesses that do not alias under certain conditions.

From a proof perspective there are also some improvements that can be made to the work.  First of
all, the proof from the paper was actually never fully completed:

\startframedtext[
  frame=off,
  leftframe=on,
  offset=0pt,
  loffset=0.5cm,
  framecolor=darkcyan,
  rulethickness=2pt,
  location=middle,
  width=\dimexpr \textwidth - 1cm \relax,
]
  {\noindent\it [Semantic preservation going from unrolled loops in a basic block representation to
      the actual loops in a CFG] is intuitively obvious but surprisingly tedious to formalize in
    full details: indeed, this is the only result in this paper that we have not yet mechanized in
    Coq.}

  \startalignment[flushright]
    --- \cite[alternative=authoryears,right={, Section
        6.2}][tristan10_simpl_verif_valid_softw_pipel].
  \stopalignment
\stopframedtext

\noindent The proof of semantic preservation going from unrolled loops with basic blocks back to a
control-flow graph was tedious, showing that the language was not designed to handle loop
transformations well.  However, recent work on the formalisation of polyhedral transformations of
loops~\cite[authoryear][courant21_verif_code_gener_polyh_model] shows that this may be a better
representation for the intermediate language.  However, translations to and from the {\sc Loop}
language would still have to be implemented.

In addition to that, the original loop-pipelining implementation did not have a syntactic
representation of basic blocks, making the proofs a bit more tedious than necessary.  This may mean
that the current syntactic representation of basic blocks in {\sc RTLBlock} may simplify the
translations as well.

\section{Hardware and Software Pipelining}

Hardware and software pipelining algorithms share many features, even though the final
implementation of the pipeline is quite different.

Pipelining in hardware is intuitive.  If one has two separate blocks where one is data-dependent on
the other, then it is natural to pass a new input to the first block as soon as it has finished and
passed the result to the second block.  This idea is also the key building block for processor
designs, as it is a straightforward way to improve the throughput of a series of distinct and
dependent operations.  In \HLS\ it is also an important optimisation to efficiently translate loops.
Loops are often the limiting factor of a design, and any improvements in their translation can
reduce the throughput of the entire design.  The \HLS\ tool therefore often automatically creates
pipeline stages for loops and calculates a correct \II\ for the loop, which will be the rate at
which the pipeline needs to be filled.  The benefit of generating hardware directly, is that the
pipeline can already be laid out in the optimal and natural way.

It is less clear why, and especially how, software can be pipelined in the same way as hardware, and
what the benefits are if the code is going to be executed.\footnote{It is clearer why scheduling
  would be important on a VLIW processor, but still not clear how the pipelined loop can be
  represented in software without explicit parallelism.}  The main difference between to a hardware
pipeline is that in software the code is executed inherently sequentially.  This means that it is
not possible to lay out a pipeline in the same way, as the next input cannot just be passed to the
loop while it is executing.  However, a pipeline can be simulated in software, so that it behaves
the same as the hardware pipeline, but it will manually have to unroll and rearrange the loop body
so that all the stages can be executed by following one control-flow.  However, the order of
instructions will be quite different in hardware, where the pipeline can be written naturally in
sequential order of the stages.  In software, the pipeline is instead represented by one horizontal
slice of the pipeline, which will be the new loop body and represents the computation that the whole
hardware pipeline is performing at one point in time.

The main question that comes up is: are the two pipelining methods actually equivalent?  In Vericert
we are assuming that performing software pipelining, followed by scheduling and finally followed by
the hardware generation approximates the hardware pipelining that traditional \HLS\ tools perform as
part of their hardware generation.

\subsection{Hardware Pipelining Can be Approximated by Software Pipelining}

The hope is that hardware pipelining can be approximated by software pipelining, so that the
translation can easily be separated.  We argue that this is in fact possible, with two additional
properties that can be added to the software intermediate language.  The main two problems that
software pipelining runs into when directly comparing it to the equivalent hardware pipeline are
first the duplicated code for the prologue and the epilogue, and secondly the possible loop
unrolling that is necessary to then allocate the registers correctly.

\paragraph{Removing duplicate prologue and epilogue.}  Disregarding the code duplication that
results from generating the prologue and the epilogue, the main problem with having these pieces of
code is that one ends up with a loop that has a minimum number of iterations which it needs to be
executed for.  This is not the case with a hardware pipeline, and it means that in software, to keep
the code correct, the original version of the loop needs to be kept around so that if the iterations
are under the minimum number of iterations the loop can execute, it will branch to the original loop
instead.  This does not really affect the performance of the code, but it does add a duplicate of
every loop to the program, which might increase the size of the hardware dramatically.  Predicated
execution can be used to selectively turn on instructions that should be running at this point in
time, therefore simulating the prologue and epilogue without having to duplicate any code.

\paragraph{Removing the need to unroll the loop.}  In addition to that, sometimes register
allocation will require the loop to be unrolled many times to remain correct.  This also means that
the original loop will have to be kept around in case the number of executions of the loop is not
divisible by the number of times the loop was unrolled.  However, by using a rotating register
file~\cite[rau92_regis_alloc_softw_pipel_loops] the clash between registers from different
iterations is removed, thereby eliminating the need for register allocation.  This does increase the
amount of registers, however, this can further be optimised.  In addition to that, rotating register
files act a lot like proper hardware registers in pipelines.

\startmode[chapter]
  \section{Bibliography}
  \placelistofpublications
\stopmode

\stopcomponent