author    Yann Herklotz <git@yannherklotz.com>  2022-05-03 09:51:01 +0100
committer Yann Herklotz <git@yannherklotz.com>  2022-05-03 09:51:01 +0100
commit    db0b534ae51bbb31ab33f1354ab807c04292838f (patch)
tree      b22f8d388b401aaf328082fb086997da050bbb23
parent    7458e17b2bf5b3c6f342731510f1be589851c3a0 (diff)
download  lsr22_fvhls-db0b534ae51bbb31ab33f1354ab807c04292838f.tar.gz
          lsr22_fvhls-db0b534ae51bbb31ab33f1354ab807c04292838f.zip
Add pipelining text
-rw-r--r--  chapters/pipelining.tex        | 59
-rw-r--r--  chapters/pipelining_notes.org  | 24
-rw-r--r--  lsr_env.tex                    |  1
-rw-r--r--  lsr_refs.bib                   | 76
-rw-r--r--  main.tex                       |  2
5 files changed, 155 insertions, 7 deletions
diff --git a/chapters/pipelining.tex b/chapters/pipelining.tex
index 6c351c8..5aa03dc 100644
--- a/chapters/pipelining.tex
+++ b/chapters/pipelining.tex
@@ -13,7 +13,12 @@
Standard instruction scheduling only addresses parallelisation inside hyperblocks, which are linear
sections of code. However, loops are often the most critical sections in code, and scheduling only
-addresses parallelisation within one iteration.
+addresses parallelisation within one iteration. Traditionally, loop pipelining is performed as part
+of the normal scheduling step in \HLS, as the scheduling algorithm can be generalised to support
+these new constraints. However, it becomes expensive to check the correctness of a scheduling
+algorithm that transforms the code in so many ways. It is better to separate loop pipelining into
+its own translation pass, which performs a source-to-source translation of the code. The final
+result should be similar to performing the loop pipelining together with the scheduling.
\section{Loop pipelining example}
@@ -53,10 +58,54 @@ addresses parallelisation within one iteration.
\stopfloatcombination
\stopplacemarginfigure
-\in{Figure}[fig:pipelined-loop] shows an example of pipelining a loop which accumulates values and
-modifies an array. In \in{Figure}{a}[fig:pipelined-loop], the body of the loop cannot be scheduled
-in less than three cycles, assuming that a load takes two clock cycles. However, after transforming
-the code into the pipelined version on the right
+\in{Figure}[fig:pipelined-loop] shows an example of \emph{software pipelining} of a loop that
+accumulates values and modifies an array. In \in{Figure}{a}[fig:pipelined-loop], the body of the
+loop cannot be scheduled in less than three cycles, assuming that a load takes two clock cycles.
+However, after transforming the code into the pipelined version in
+\in{Figure}{b}[fig:pipelined-loop], the number of inter-iteration dependencies has been reduced by
+moving the store into the next iteration of the loop. This means that the body of the loop can now
+be scheduled in two clock cycles.
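+
+The following source-level sketch illustrates the same idea; it is not the exact code from
+\in{Figure}[fig:pipelined-loop], and it assumes that the output array does not alias the input
+arrays. The store of each iteration is deferred by one iteration, so the prologue performs the
+first accumulation and the epilogue performs the final store.
+
+\starttyping
+/* Illustrative sketch only; not the exact loop from the figure. */
+int loop_orig(int *a, const int *b, const int *c, int n) {
+  int acc = 0;
+  for (int i = 0; i < n; i++) {
+    acc += b[i] * c[i];  /* two loads, each taking two cycles */
+    a[i] = acc;          /* the store has to wait for acc     */
+  }
+  return acc;
+}
+
+int loop_pipelined(int *a, const int *b, const int *c, int n) {
+  int acc = 0;
+  if (n > 0) {
+    acc += b[0] * c[0];     /* prologue: first accumulation, store deferred    */
+    for (int i = 1; i < n; i++) {
+      a[i - 1] = acc;       /* store carried over from the previous iteration  */
+      acc += b[i] * c[i];   /* loads of the current iteration overlap with it  */
+    }
+    a[n - 1] = acc;         /* epilogue: final deferred store                  */
+  }
+  return acc;
+}
+\stoptyping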
+
+Pipelining the loop under resource constraints can be performed using iterative modulo
+scheduling~\cite[rau94_iterat_sched], followed by modulo variable
+expansion~\cite[lam88_softw_pipel]. The optimisation proceeds in the following steps; a simplified
+sketch of the resulting search is given after the list.
+
+\startitemize[n]
+\item Calculate the minimum \II, which is the lowest possible \II\ given the resource usage of the
+  code inside the loop and the distance and delay of its inter-iteration dependencies.
+\item Perform modulo scheduling with an increasing \II\ until a feasible schedule is found. The
+  scheduling is performed by adding instructions to a \MRT~\cite[lam88_softw_pipel], which assigns
+  operations to each resource for one iteration.
+\item Once a valid modulo schedule is found, the loop code can be generated from it. To keep the
+  \II\ that was found in the previous step, modulo variable expansion might be necessary, which may
+  require unrolling the loop.
+\stopitemize
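+
+As a purely illustrative sketch of the first two steps, the following C fragment shows the
+structure of the \II\ search and of a greedy \MRT\ placement. The data structures are invented for
+illustration and ignore many constraints that a real implementation has to handle, such as
+backtracking and recurrence circuits.
+
+\starttyping
+#include <stdbool.h>
+#include <string.h>
+
+#define MAX_II 64
+#define N_RES   4          /* resource classes, e.g. ALU, MUL, LD, ST */
+
+typedef struct {
+  int resource;            /* resource class of the op (below N_RES)  */
+  int earliest;            /* earliest start cycle from dependencies  */
+} op_t;
+
+/* Try to place every operation into a modulo reservation table with
+   `ii` rows; an operation issued at cycle t occupies row t % ii of
+   its resource.  Returns true if all operations fit.                */
+static bool modulo_schedule(const op_t *ops, int n_ops, int ii,
+                            int *start_cycle) {
+  bool mrt[MAX_II][N_RES];
+  memset(mrt, 0, sizeof mrt);
+  for (int i = 0; i < n_ops; i++) {
+    bool placed = false;
+    /* ii consecutive cycles cover every row of the MRT exactly once */
+    for (int t = ops[i].earliest; t < ops[i].earliest + ii; t++) {
+      if (!mrt[t % ii][ops[i].resource]) {
+        mrt[t % ii][ops[i].resource] = true;
+        start_cycle[i] = t;
+        placed = true;
+        break;
+      }
+    }
+    if (!placed)
+      return false;
+  }
+  return true;
+}
+
+/* Start from the minimum II and increase it until a feasible modulo
+   schedule is found; code generation with modulo variable expansion
+   would then be driven by the resulting start cycles.               */
+int find_ii(const op_t *ops, int n_ops, int min_ii, int *start_cycle) {
+  if (min_ii < 1)
+    min_ii = 1;
+  for (int ii = min_ii; ii <= MAX_II; ii++)
+    if (modulo_schedule(ops, n_ops, ii, start_cycle))
+      return ii;
+  return -1;               /* give up and keep the unpipelined loop   */
+}
+\stoptyping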
+
+\section{Verification of pipelining}
+
+Verification of software pipelining has already been performed in CompCert by
+\cite[authoryears][tristan10_simpl_verif_valid_softw_pipel]. Assuming that one has a loop body
+$\mathcal{B}$, the pipelined code can be described using the prologue $\mathcal{P}$, the steady
+state $\mathcal{S}$, the epilogue $\mathcal{E}$, and two additional parameters: $\mu$, the minimum
+number of iterations that need to be performed to be able to use the pipelined loop, and $\delta$,
+the unroll factor of the steady state.
+
+The goal is to verify the equivalence of the original loop and the pipelined loop using a validation
+algorithm, and then prove that the validation algorithm is \emph{sound}, i.e.\ if it finds the two
+loops to be equivalent, then they behave the same according to the language semantics. Essentially,
+as the loop pipelining algorithm only modifies loops, it is enough to show that the input loop $X$
+is equivalent to its pipelined version $Y$. Assuming that the validation algorithm is called
+$\alpha$, we therefore want to show that the following statement holds, where $X_N$ denotes $N$
+iterations of the loop $X$.
+\placeformula\startformula
+ \forall N,\quad \alpha(X_N) = \startmathcases
+ \NC \alpha(Y_{N/\delta}; X_{N\%\delta}) \NC N \ge \mu \NR
+ \NC \alpha(X_N) \NC \text{otherwise} \NR
+\stopmathcases\stopformula
+
+This states that the formula should hold for any number of iterations of the loop $X$. As the
+number of loop iterations $N$ is not always known at compile time, this may become an infinite
+property that needs to be checked.
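+
+As a concrete instance of the property above, purely for illustration, take $\delta = 2$, $\mu = 3$
+and $N = 7$. Since $7 \ge \mu$, and using integer division and remainder for $N/\delta$ and
+$N\%\delta$, the validator has to establish
+\startformula
+  \alpha(X_7) = \alpha(Y_{3}; X_{1}),
+\stopformula
+that is, three iterations of the pipelined loop followed by one iteration of the original loop must
+be validated as equivalent to seven iterations of the original loop.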
\startmode[section]
\section{Bibliography}
diff --git a/chapters/pipelining_notes.org b/chapters/pipelining_notes.org
index 0680800..01ff668 100644
--- a/chapters/pipelining_notes.org
+++ b/chapters/pipelining_notes.org
@@ -49,9 +49,31 @@ A3 x6 = x6 + 1 (int)
| Instr | Stage | Code |
|-------+-------+-------------------------------------------------------------------|
-| 0 | 0 | ~x18 = x6 - 1~, ~x16 = int32[x4+x18*4+0]~, ~x12 = int32[x3+x6*4]~ |
+| 0 | 0 | ~x18 = x6 - 1~, ~x13 = int32[x4+x18*4+0]~, ~x12 = int32[x3+x6*4]~ |
| | 1 | ~x7 = x12 * x13~ |
| | 2 | ~x11 = x8 + x7~ |
| | 3 | ~x6 = x6 + 1~, ~int32[x4+x6*4] = x11~ |
| 1 | 0 | ~x16 = int32[x4+x18*4]~ |
| | 1 | ~x8 = x16 * x1~ |
+
+
+#+begin_src
+x18[0] = x6[0] - 1
+|| x13[0] = int32[x4[0]+x18[0]*4+0]
+|| x12[0] = int32[x3[0]+x6[0]*4]
+|| x7[1] = x12[1] * x13[1]
+|| x11[2] = x8[2] + x7[2]
+|| int32[x4[3]+x6[3]*4] = x11[3]
+|| x6[3] = x6[3] + 1
+
+x16[0] = int32[x4+x18*4]
+|| x8[1] = x16[1] * x1[1]
+
+x18>>
+x16>>
+x12>>
+x13>>
+x7>>
+x8>>
+x11>>
+#+end_src
diff --git a/lsr_env.tex b/lsr_env.tex
index cbedab1..43e33d6 100644
--- a/lsr_env.tex
+++ b/lsr_env.tex
@@ -276,6 +276,7 @@
\abbreviation{ASAP}{as soon as possible}
\abbreviation{ALAP}{as late as possible}
\abbreviation{LUT}{look-up table}
+\abbreviation{MRT}{modulo reservation table}
% ==========================================================================
% Appendices
diff --git a/lsr_refs.bib b/lsr_refs.bib
index f9cf0d3..abfd180 100644
--- a/lsr_refs.bib
+++ b/lsr_refs.bib
@@ -723,6 +723,23 @@
year = {2019}
}
+@inproceedings{lam88_softw_pipel,
+ author = {Lam, M.},
+ title = {Software Pipelining: An Effective Scheduling Technique for VLIW Machines},
+ year = {1988},
+ isbn = {0897912691},
+ publisher = {Association for Computing Machinery},
+ address = {New York, NY, USA},
+ url = {https://doi.org/10.1145/53990.54022},
+ doi = {10.1145/53990.54022},
+ abstract = {This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code.This paper extends previous results of software pipelining in two ways: First, this paper shows that by using an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. It also diminishes the start-up cost of loops with small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement be obtained.The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used for developing a large number of applications in the areas of image, signal and scientific processing.},
+ booktitle = {Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation},
+ pages = {318--328},
+ numpages = {11},
+ location = {Atlanta, Georgia, USA},
+ series = {PLDI '88}
+}
+
@inproceedings{leroy06_formal_certif_compil_back_end,
author = {Leroy, Xavier},
location = {Charleston, South Carolina, USA},
@@ -1067,6 +1084,65 @@
year = {2013}
}
+@inproceedings{rau92_code_gener_schem_sched_loops,
+ author = {Rau, B. Ramakrishna and Schlansker, Michael S. and Tirumalai, P. P.},
+ location = {Portland, Oregon, USA},
+ publisher = {IEEE Computer Society Press},
+ booktitle = {Proceedings of the 25th Annual International Symposium on Microarchitecture},
+ isbn = {0818631759},
+ keywords = {modulo scheduling,code motion,loop scheduling,software pipelining,rotating registers},
+ pages = {158--169},
+ series = {MICRO 25},
+ title = {Code Generation Schema for Modulo Scheduled Loops},
+ year = {1992}
+}
+
+@inproceedings{rau92_regis_alloc_softw_pipel_loops,
+ abstract = {Software pipelining is an important instruction scheduling technique for efficiently overlapping successive iterations of loops and executing them in parallel. This paper studies the task of register allocation for software pipelined loops, both with and without hardware features that are specifically aimed at supporting software pipelines. Register allocation for software pipelines presents certain novel problems leading to unconventional solutions, especially in the presence of hardware support. This paper formulates these novel problems and presents a number of alternative solution strategies. These alternatives are comprehensively tested against over one thousand loops to determine the best register allocation strategy, both with and without the hardware support for software pipelining.},
+ author = {Rau, B. R. and Lee, M. and Tirumalai, P. P. and Schlansker, M. S.},
+ location = {San Francisco, California, USA},
+ publisher = {Association for Computing Machinery},
+ url = {https://doi.org/10.1145/143095.143141},
+ booktitle = {Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation},
+ doi = {10.1145/143095.143141},
+ isbn = {0897914759},
+ keywords = {software pipelining,loop scheduling,register allocation,rotating registers,compiler optimisation},
+ pages = {283--299},
+ series = {PLDI '92},
+ title = {Register Allocation for Software Pipelined Loops},
+ year = {1992}
+}
+
+@inproceedings{rau94_iterat_sched,
+ abstract = {Modulo scheduling is a framework within which a wide variety of algorithms and heuristics may be defined for software pipelining innermost loops. This paper presents a practical algorithm, iterative modulo scheduling, that is capable of dealing with realistic machine models. This paper also characterizes the algorithm in terms of the quality of the generated schedules as well the computational expense incurred.},
+ author = {Rau, B. Ramakrishna},
+ location = {San Jose, California, USA},
+ publisher = {Association for Computing Machinery},
+ url = {https://doi.org/10.1145/192724.192731},
+ booktitle = {Proceedings of the 27th Annual International Symposium on Microarchitecture},
+ doi = {10.1145/192724.192731},
+ isbn = {0897917073},
+ keywords = {software pipelining,loop scheduling,rotating registers,code motion,compiler optimisation,modulo scheduling},
+ pages = {63--74},
+ series = {MICRO 27},
+ title = {Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops},
+ year = {1994}
+}
+
+@article{rau96_iterat_modul_sched,
+ abstract = {Modulo scheduling is a framework within which algorithms for software pipelining innermost loops may be defined. The framework specifies a set of constraints that must be met in order to achieve a legal modulo schedule. A wide variety of algorithms and heuristics can be defined within this framework. Little work has been done to evaluate and compare alternative algorithms and heuristics for modulo scheduling from the viewpoints of schedule quality as well as computational complexity. This, along with a vague and unfounded perception that modulo scheduling is computationally expensive as well as difficult to implement, have inhibited its incorporation into product compilers. This paper presents iterative modulo scheduling, a practical algorithm that is capable of dealing with realistic machine models. The paper also characterizes the algorithm in terms of the quality of the generated schedules as well the computational expense incurred.},
+ author = {Rau, B. Ramakrishna},
+ url = {https://doi.org/10.1007/BF03356742},
+ date = {1996-02-01},
+ issn = {1573-7640},
+ journaltitle = {International Journal of Parallel Programming},
+ keywords = {loop scheduling,software pipelining,code motion,compiler optimisation,rotating registers,modulo scheduling},
+ number = {1},
+ pages = {3--64},
+ title = {Iterative Modulo Scheduling},
+ volume = {24}
+}
+
@inproceedings{schuiki20_llhd,
author = {Schuiki, Fabian and Kurth, Andreas and Grosser, Tobias and Benini, Luca},
location = {London, UK},
diff --git a/main.tex b/main.tex
index cc6492c..8988110 100644
--- a/main.tex
+++ b/main.tex
@@ -29,7 +29,7 @@
\component hls
\component scheduling
\component pipelining
- \component dynamic
+ %\component dynamic
\component schedule
\component conclusion
\stopbodymatter