\environment fonts_env
\environment lsr_env

\startcomponent pipelining

\chapter[sec:pipelining]{Loop Pipelining}

\startsynopsis
  This chapter describes the future plans for implementing loop pipelining, also called loop
  scheduling, in Vericert.  This addresses the final major issue with Vericert, which is
  efficiently handling loops.
\stopsynopsis

Standard instruction scheduling only addresses parallelisation inside hyperblocks, which are linear
sections of code.  However, loops are often the most critical sections of a program, and scheduling
alone only addresses parallelisation within a single iteration.  Traditionally, loop pipelining is
performed as part of the normal scheduling step in \HLS, as the scheduling algorithm can be
generalised to support the additional constraints.  However, it becomes expensive to check the
correctness of a scheduling algorithm that transforms the code in so many ways.  It is better to
separate loop pipelining into its own translation pass which performs a source-to-source
translation of the code.  The final result should be similar to performing the loop pipelining
together with the scheduling.

\section{Loop pipelining example}

\startplacemarginfigure[location=here,reference={fig:pipelined-loop},title={Example of pipelining a
    loop.}]
  \startfloatcombination[nx=2]

    \startplacesubfigure[title={Simple loop containing an accumulation of values with an
        inter-iteration dependency.}]
      \startframedtext[frame=off,offset=none,width={0.6\textwidth}]
        \starthlC
          for (int i = 1; i < N; i++) {
              c1 = acc[i-1] * c;
              c2 = x[i] * y[i];
              acc[i] = c1 + c2;
          }
        \stophlC
      \stopframedtext
    \stopplacesubfigure

    \startplacesubfigure[title={Pipelined loop reducing the number of inter-iteration dependencies
        inside the loop.}]
      \startframedtext[frame=off,offset=none,width={0.6\textwidth}]
        \starthlC
          c1 = acc[0] * c;
          c2 = x[1] * y[1];
          for (int i = 1; i < N-1; i++) {
              acc[i] = c1 + c2;
              c2 = x[i+1] * y[i+1];
              c1 = acc[i] * c;
          }
          acc[N-1] = c1 + c2;
        \stophlC
      \stopframedtext
    \stopplacesubfigure

  \stopfloatcombination
\stopplacemarginfigure

\in{Figure}[fig:pipelined-loop] shows an example of \emph{software pipelining} of a loop which
accumulates values and modifies an array.  In \in{Figure}{a}[fig:pipelined-loop], the body of the
loop cannot be scheduled in fewer than three cycles, assuming that a load takes two clock cycles.
However, after transforming the code into the pipelined version in
\in{Figure}{b}[fig:pipelined-loop], the number of inter-iteration dependencies has been reduced by
moving the store into the next iteration of the loop.  This means that the body of the loop can
now be scheduled in two clock cycles.
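
The following sketch shows one possible schedule for the original loop body under the stated
assumption of a two-cycle load; the exact cycle assignment is illustrative and not the output of
any particular scheduler.

\starthlC
/* One iteration of the original body (load latency = 2):
   cycle 1: issue the loads of acc[i-1], x[i] and y[i]
   cycle 2: the loads complete
   cycle 3: c1 = acc[i-1] * c;  c2 = x[i] * y[i];  acc[i] = c1 + c2;
   The store in cycle 3 produces the value that the load in cycle 1 of
   the next iteration consumes, so the II cannot drop below three.  */
\stophlC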

The process of pipelining the loop under resource constraints can be performed using iterative
modulo scheduling~\cite[rau94_iterat_sched], followed by modulo variable
expansion~\cite[lam88_softw_pipel].  The optimisation performs the following steps; a sketch of the
minimum \II\ bound is given after the list.

\startitemize[n]
\item Calculate a minimum \II, which is the lowest possible \II\ given the resource usage of the
  code inside the loop and the distance and delay of inter-iteration dependencies.
\item Perform modulo scheduling with an increasing \II\ until one is found that works.  The
  scheduling is performed by adding instructions to a \MRT~\cite[lam88_softw_pipel], assigning
  operations to each resource for one iteration.
\item Once a valid modulo schedule is found, the loop code can be generated from it.  To keep the
  \II\ that was found in the previous step, modulo variable expansion might be necessary and require
  unrolling the loop.
\stopitemize
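
As an illustration of the first step, the following C sketch computes the standard lower bound on
the \II\ as the maximum of a resource-constrained bound and a recurrence-constrained bound.  The
function and its types are hypothetical and simplified; a real implementation would extract the
dependency cycles from a data-dependence graph of the loop body.

\starthlC
/* Hypothetical sketch: lower bound on the initiation interval. */
typedef struct {
    int delay;     /* total delay around a dependency cycle    */
    int distance;  /* iteration distance of a dependency cycle */
} dep_cycle;

int min_ii(int n_ops, int n_units, const dep_cycle *cycles, int n_cycles)
{
    /* Resource bound: ceil(operations / functional units). */
    int res_mii = (n_ops + n_units - 1) / n_units;
    /* Recurrence bound: a cycle with delay d and distance p
       forces II >= ceil(d / p). */
    int rec_mii = 1;
    for (int i = 0; i < n_cycles; i++) {
        int c = (cycles[i].delay + cycles[i].distance - 1) / cycles[i].distance;
        if (c > rec_mii)
            rec_mii = c;
    }
    return res_mii > rec_mii ? res_mii : rec_mii;
}
\stophlC

Modulo scheduling then starts at this bound and increments the \II\ whenever it fails to place all
operations in the \MRT.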

\section{Verification of pipelining}

Verification of pipelining has already been performed in CompCert by
\cite[authoryears][tristan10_simpl_verif_valid_softw_pipel].  Assuming that one has a loop body
$\mathcal{B}$, the pipelined code can be described using the prologue $\mathcal{P}$, the steady
state $\mathcal{S}$, the epilogue $\mathcal{E}$, and two additional parameters: $\mu$, the minimum
number of iterations that need to be performed to be able to use the pipelined loop, and $\delta$,
the unroll factor of the steady state.
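
Mapping this decomposition onto the pipelined loop of \in{Figure}{b}[fig:pipelined-loop] gives some
intuition; the assignment below is illustrative.

\starthlC
/* P (prologue):      c1 = acc[0] * c;  c2 = x[1] * y[1];
   S (steady state):  the loop body, here with delta = 1
   E (epilogue):      acc[N-1] = c1 + c2;
   mu:                the smallest trip count for which running
                      P, S and E is valid at all                 */
\stophlC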

The goal is to verify the equivalence of the original loop and the pipelined loop using a validation
algorithm, and then prove that the validation algorithm is \emph{sound}, i.e. if it finds the two
loops to be equivalent, this implies that they will behave the same according to the language
semantics.  Essentially, as the loop pipelining algorithm only modifies loops, it is enough to show
that the input loop $X$ is equivalent to the pipelined version of the loop $Y$.  Assuming that the
validation algorithm is called $\alpha$, we therefore want to show that the following statement
holds, where $X_N$ means that we are considering $N$ iterations of the loop.

\placeformula\startformula
  \forall N,\quad \alpha(X_N) = \startmathcases
    \NC \alpha(Y_{N/\delta}; X_{N\%\delta}) \NC N \ge \mu \NR
    \NC \alpha(X_N) \NC \text{otherwise} \NR
\stopmathcases\stopformula

This states that for any number of iterations of the loop $X$, the formula should hold.  For
example, with $\delta = 2$, $N = 9$ and $9 \ge \mu$, the pipelined loop $Y$ performs four
steady-state iterations, each doing the work of two original iterations, followed by the one
remaining iteration of the original loop.  As the number of loop iterations $N$ is not always
known at compile time, this becomes an infinite family of properties that needs to be checked.

This infinite family of properties can be proven by checking smaller, finite properties that loop
pipelining should ensure.  The key insight of \cite[authors][tristan10_simpl_verif_valid_softw_pipel]
was that only the two following properties need to be checked:

\placeformula\startformula
\alpha(\mathcal{P}; \mathcal{E}) = \alpha(\mathcal{B}^{\mu})
\stopformula

\placeformula\startformula
\alpha(\mathcal{S}; \mathcal{E}) = \alpha(\mathcal{E}; \mathcal{B}^{\delta})
\stopformula

\noindent Together with some simpler side conditions, these two properties suffice: the first
establishes the base case of exactly $\mu$ iterations, while the second trades one steady-state
iteration for $\delta$ iterations of the original body.  This means that the infinite property can
still be checked.
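
An induction on the number of steady-state iterations $k$ then recovers the behaviour of the full
pipelined loop for any trip count of the form $\mu + k\delta$:

\placeformula\startformula
\alpha(\mathcal{B}^{\mu + k\delta}) = \alpha(\mathcal{P}; \mathcal{S}^{k}; \mathcal{E})
\stopformula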

Finally, the verifier itself needs to be proven correct, meaning that the symbolic evaluation needs
to be shown to be semantics-preserving: if the comparison between two symbolic states succeeds,
then this must imply that the original blocks themselves execute in the same way.
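
To give a flavour of such a validator, the following self-contained C sketch symbolically evaluates
two straight-line blocks of three-address code and compares the resulting states structurally.  The
representation and all names are hypothetical and far simpler than what CompCert or Vericert use;
in particular it ignores memory, control flow and normalisation of expressions.

\starthlC
#include <stdbool.h>
#include <stdlib.h>

/* A symbolic value: either the initial value of a register (op == 0)
   or an operation (op != 0) applied to two symbolic operands. */
typedef struct expr {
    int op, reg;
    struct expr *l, *r;
} expr;

enum { NREGS = 8 };
typedef struct { expr *regs[NREGS]; } sym_state;
typedef struct { int dst, op, src1, src2; } instr;

static expr *initial(int reg)
{
    expr *e = calloc(1, sizeof *e);
    e->reg = reg;
    return e;
}

/* Symbolic execution: each instruction builds a new expression over
   the current symbolic contents of its source registers. */
static void sym_exec(sym_state *s, const instr *code, int n)
{
    for (int i = 0; i < n; i++) {
        expr *e = calloc(1, sizeof *e);
        e->op = code[i].op;
        e->l = s->regs[code[i].src1];
        e->r = s->regs[code[i].src2];
        s->regs[code[i].dst] = e;
    }
}

static bool expr_eq(const expr *a, const expr *b)
{
    if (a->op != b->op)
        return false;
    if (a->op == 0)
        return a->reg == b->reg;
    return expr_eq(a->l, b->l) && expr_eq(a->r, b->r);
}

/* The validator accepts two blocks if, starting from the same initial
   symbolic state, they leave equal symbolic values in every register.
   Soundness is the theorem that acceptance implies that the blocks
   execute in the same way. */
bool validate(const instr *b1, int n1, const instr *b2, int n2)
{
    sym_state s1, s2;
    for (int r = 0; r < NREGS; r++)
        s1.regs[r] = s2.regs[r] = initial(r);
    sym_exec(&s1, b1, n1);
    sym_exec(&s2, b2, n2);
    for (int r = 0; r < NREGS; r++)
        if (!expr_eq(s1.regs[r], s2.regs[r]))
            return false;
    return true;
}
\stophlC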

\section{Hardware and Software Pipelining}

Hardware and software pipelining algorithms have a lot in common, even though the way the pipeline
is implemented in the end differs.

Pipelining in hardware is intuitive.  If one has two separate blocks where one is data-dependent on
the other, then it is natural to start executing the first block again as soon as it has finished
and passed its result to the second block.  This idea is also the key building block of processor
designs, as it is a straightforward way to improve the throughput of a series of distinct and
dependent operations.  In \HLS\ it is also an important optimisation for translating loops
efficiently.  As loops are often executed many times, any improvement to them can increase the
throughput of the entire design.  The \HLS\ tool therefore often automatically creates pipeline
stages for loops and calculates a correct \II\ for the loop, which will be the rate at which the
pipeline needs to be filled.  The benefit of generating hardware directly, though, is that the
pipeline can already be laid out in the optimal and natural way.

It is less clear why, and especially how, software can be pipelined in the same way as hardware,
and what the benefits are if the code is going to be executed on a processor.  The main difference
from the hardware pipeline is that software inherently follows only one flow of control through the
program.  This means that it is not possible to lay out a pipeline in the same way as in hardware
to improve the throughput of loops.  However, a pipeline can be simulated in software, so that it
behaves the same as the hardware pipeline, but the loop has to be unrolled manually so that all the
stages can be executed along a single flow of control.  The resulting order of instructions will be
quite different from the order of the pipeline stages in hardware.  In hardware, the pipeline can
be written naturally in sequential order, and the \II\ contract is then fulfilled by setting the
rate at which the pipeline is filled.  In software, by contrast, one horizontal slice of the
pipeline needs to be extracted to become the new loop body, representing the computation that the
whole hardware pipeline performs at one point in time.
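
The following fragment sketches this difference on a hypothetical two-stage computation, assuming
arrays \type{a}, \type{b} and \type{r} and a constant \type{c}; the stage assignment is purely
illustrative.

\starthlC
/* Hardware view: the stages are written in their natural, sequential
   order and run concurrently; the II contract is met by feeding the
   pipeline a new i at the required rate.
       stage 1:  t = a[i] * b[i];
       stage 2:  r[i] = t + c;
   Software view: one horizontal slice of the pipeline becomes the new
   loop body, executing stage 2 of iteration i-1 alongside stage 1 of
   iteration i. */
int t = a[0] * b[0];          /* prologue: fill the pipeline  */
for (int i = 1; i < N; i++) {
    r[i-1] = t + c;           /* stage 2 of iteration i-1     */
    t = a[i] * b[i];          /* stage 1 of iteration i       */
}
r[N-1] = t + c;               /* epilogue: drain the pipeline */
\stophlC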

The main question that arises is whether the two pipelining methods are actually equivalent.  In
Vericert we assume that performing software pipelining, followed by scheduling and finally by
hardware generation, approximates the hardware pipelining that traditional \HLS\ tools perform as
part of their hardware generation.

\startmode[chapter]
  \section{Bibliography}
  \placelistofpublications
\stopmode

\stopcomponent