authorYann Herklotz <git@yannherklotz.com>2022-05-04 20:25:25 +0100
committerYann Herklotz <git@yannherklotz.com>2022-05-04 20:25:25 +0100
commitb90e207d5a9d1871d5ecde168d39ce91c7691fe1 (patch)
tree5ee4b9fc86ac03edc9220c3cebbc316b4f37d830
parent428e424b75999eda16f5c5ccb4ee48763d99c9d1 (diff)
Finish the pipelining chapter
-rw-r--r--chapters/hls.tex24
-rw-r--r--chapters/pipelining.tex136
-rw-r--r--lsr_env.tex10
-rw-r--r--lsr_refs.bib19
-rw-r--r--title.tex9
5 files changed, 149 insertions, 49 deletions
diff --git a/chapters/hls.tex b/chapters/hls.tex
index 90dc238..4090218 100644
--- a/chapters/hls.tex
+++ b/chapters/hls.tex
@@ -782,9 +782,10 @@ inputs for Verilog modules.
Finally, we use 32-bit integers to represent bitvectors rather than arrays of booleans. This is
because Vericert (currently) only supports types represented by 32 bits.
-\subsection[sec:verilog:integrating]{Integrating the Verilog Semantics into 's Model}
+\subsection[sec:verilog:integrating]{Integrating the Verilog Semantics into CompCert's Model}
-\startplacemarginfigure
+\startplacefigure[title={Top-level semantics used in CompCert to initiate the first call to
+ \type{main} as well as to return the final result of the \type{main} function.}]
\startformula \startalign[n=1]
\NC\text{\sc Step }\ \frac{\startmatrix[n=1]
\NC \Gamma_r[\mathit{rst}] = 0 \qquad \Gamma_r[\mathit{fin}] = 0 \NR
@@ -795,13 +796,22 @@ because (currently) only supports types represented by 32 bits.
\NC\text{\sc Finish }\ \frac{\Gamma_r[\mathit{fin}] = 1}{\text{\tt State}\ \mathit{sf}\ m\ \sigma\
\Gamma_r\ \Gamma_a \longrightarrow \text{\tt Returnstate}\ \mathit{sf}\ \Gamma_r[ \mathit{ret}]}\NR
%
- \NC\text{\sc Call }\ \frac{}{\text{\tt Callstate}\ \mathit{sf}\ m\ \vec{r} \longrightarrow
- \text{\tt State}\ \mathit{sf}\ m\ n\ ((\text{\tt init\_params}\ \vec{r}\ a)[\sigma \mapsto n, \mathit{fin} \mapsto 0, \mathit{rst} \mapsto 0])\ \epsilon}\NR
+ \NC\text{\sc Call }\ \frac{}{\startmatrix[n=1]
+ \NC\text{\tt Callstate}\ \mathit{sf}\ m\ \vec{r}\longrightarrow\NR
+ \NC
+ \text{\tt State}\ \mathit{sf}\ m\ n\ ((\text{\tt init\_params}\ \vec{r}\ a)[\sigma \mapsto n,
+ \mathit{fin} \mapsto 0, \mathit{rst} \mapsto 0])\ \epsilon \NR
+ \stopmatrix
+}\NR
%
- \NC\text{\sc Return }\ \frac{}{\text{\tt Returnstate}\ (\text{\tt Stackframe}\ r\ m\ \mathit{pc}\
- \Gamma_r\ \Gamma_a :: \mathit{sf})\ v \longrightarrow \text{\tt State}\ \mathit{sf}\ m\ \mathit{pc}\ (\Gamma_{r} [ \sigma \mapsto \mathit{pc}, r \mapsto v ])\ \Gamma_{a}}\NR
+ \NC\text{\sc Return }\ \frac{}{\startmatrix[n=1]
+ \NC\text{\tt Returnstate}\ (\text{\tt Stackframe}\ r\ m\ \mathit{pc}\
+ \Gamma_r\ \Gamma_a :: \mathit{sf})\ v \NR
+ \NC\longrightarrow \text{\tt State}\ \mathit{sf}\ m\ \mathit{pc}\ (\Gamma_{r} [ \sigma \mapsto
+ \mathit{pc}, r \mapsto v ])\ \Gamma_{a}\NR
+ \stopmatrix}\NR
\stopalign \stopformula
-\stopplacemarginfigure
+\stopplacefigure
The computation model defines a set of states through which execution passes. In this subsection, we
explain how we extend our Verilog semantics with four special-purpose registers in order to
diff --git a/chapters/pipelining.tex b/chapters/pipelining.tex
index 26ed8c3..09fae11 100644
--- a/chapters/pipelining.tex
+++ b/chapters/pipelining.tex
@@ -24,7 +24,7 @@ should be similar to performing the loop pipelining together with the scheduling
\startplacemarginfigure[location=here,reference={fig:pipelined-loop},title={Example of pipelining a
loop.}]
- \startfloatcombination[nx=2]
+ \startfloatcombination[nx=2,distance=10mm,align=middle]
\startplacesubfigure[title={Simple loop containing an accumulation of values with an
inter-iteration dependency.}]
@@ -60,15 +60,16 @@ should be similar to performing the loop pipelining together with the scheduling
\in{Figure}[fig:pipelined-loop] shows an example of \emph{software pipelining} of a loop which
accumulates values and modifies an array. In \in{Figure}{a}[fig:pipelined-loop], the body of the
-loop cannot be scheduled in less than three cycles, assuming that a load takes two clock cycles.
+loop cannot be scheduled in less than three cycles assuming that a load takes two clock cycles.
However, after transforming the code into the pipelined version in
-\in{Figure}{b}[fig:pipelined-loop], the number of inter-iteration dependencies have been reduced, by
+\in{Figure}{b}[fig:pipelined-loop], the number of inter-iteration dependencies has been reduced by
moving the store into the next iteration of the loop. This means that the body of the loop could
now be scheduled in two clock cycles.
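To make the transformation concrete, the loop of the figure can be sketched in C. This is a hypothetical reconstruction (the exact code of the figure is not reproduced here, and the names are invented): an accumulation loop whose store is moved into the following iteration, with an explicit prologue and epilogue.

```c
#include <assert.h>

/* Hypothetical reconstruction of the loop in the figure: each iteration
   loads x[i] (2 cycles), accumulates it, and stores the running sum back.
   The store depends on this iteration's load, so the body needs 3 cycles. */
int accumulate(int *x, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int t = x[i];   /* load (2 cycles) */
        sum += t;
        x[i] = sum;     /* store depends on the load just above */
    }
    return sum;
}

/* Pipelined version: the store is moved into the next iteration, so the
   body now fits in 2 cycles. A prologue peels the first load out of the
   loop, and an epilogue performs the store still in flight at exit. */
int accumulate_pipelined(int *x, int n) {
    int sum = x[0];                 /* prologue: first load and add */
    for (int i = 1; i < n; i++) {
        int t = x[i];               /* load for iteration i */
        x[i - 1] = sum;             /* store left over from iteration i-1 */
        sum += t;
    }
    x[n - 1] = sum;                 /* epilogue: final store */
    return sum;
}
```

Both versions compute the same final array and sum for any `n >= 1`; the prologue and epilogue here correspond to the $\mathcal{P}$ and $\mathcal{E}$ blocks discussed later in the chapter.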
-The process of pipelining the loop by being resource constrained can be performed using iterative
-modulo scheduling~\cite[rau94_iterat_sched], followed by modulo variable
-expansion~\cite[lam88_softw_pipel]. The steps performed in the optimisation are the following.
+The process of pipelining the loop under resource constraints can be done purely as a
+source-to-source translation in software using iterative modulo
+scheduling~\cite[rau94_iterat_sched], followed by modulo variable
+expansion~\cite[lam88_softw_pipel]. The steps performed in the optimisation are the following:
\startitemize[n]
\item Calculate a minimum \II, which is the lowest possible \II\ based on the resources of the code
@@ -81,6 +82,9 @@ expansion~\cite[lam88_softw_pipel]. The steps performed in the optimisation are
unrolling the loop.
\stopitemize
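The first step above can be illustrated with a small sketch. This is illustrative only (the function names and the flattened representation of dependency cycles are not taken from any implementation): the minimum \II\ is the maximum of a resource-constrained bound and a recurrence-constrained bound over the data-dependence graph.

```c
#include <assert.h>

/* Resource bound: with `ops` operations per iteration and `units`
   functional units, at least ceil(ops / units) cycles are needed. */
int res_mii(int ops, int units) {
    return (ops + units - 1) / units;
}

/* Recurrence bound: for each dependency cycle c in the data-dependence
   graph, II >= ceil(delay(c) / distance(c)), where delay(c) is the total
   latency around the cycle and distance(c) the number of loop iterations
   the cycle spans. */
int rec_mii(const int *delay, const int *distance, int ncycles) {
    int mii = 1;
    for (int i = 0; i < ncycles; i++) {
        int ii = (delay[i] + distance[i] - 1) / distance[i];
        if (ii > mii) mii = ii;
    }
    return mii;
}

/* Minimum II: the larger of the two bounds; modulo scheduling then tries
   this II and increments it until a valid schedule is found. */
int min_ii(int ops, int units,
           const int *delay, const int *distance, int ncycles) {
    int r = res_mii(ops, units);
    int c = rec_mii(delay, distance, ncycles);
    return r > c ? r : c;
}
```

For example, a single recurrence with a delay of two cycles and a dependence distance of one iteration, as in the accumulation of the earlier figure, forces a minimum \II\ of two.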
+This is very similar to the way loop pipelining is performed directly in hardware, although there
+it is often done at the same time as scheduling.
+
\section{Verification of pipelining}
Verification of pipelining has already been performed in CompCert by
@@ -109,52 +113,112 @@ number of loop iterations $N$ are not always known at compile time, this may bec
property that needs to be checked.
This infinite property can be proven by checking smaller finite properties that loop pipelining
-should ensure. The two key insights that \cite[authors][tristan10_simpl_verif_valid_softw_pipel]
-found where that mainly only the two following properties needed to be checked:
+should ensure. The two key insights by \cite[author][tristan10_simpl_verif_valid_softw_pipel] were
+that mainly only the following two properties needed to be checked:
-\placeformula\startformula
-\alpha(\mathcal{S}; \mathcal{E}) = \alpha(\mathcal{E}; \mathcal{B}^{\delta})
+\placeformula[eq:loop1]\startformula
+ \alpha(\mathcal{S}; \mathcal{E}) = \alpha(\mathcal{E}; \mathcal{B}^{\delta})
\stopformula
-\placeformula\startformula
-\alpha(\mathcal{S}; \mathcal{E}) = \alpha(\mathcal{E}; \mathcal{B}^{\delta})
+\placeformula[eq:loop2]\startformula
+ \alpha(\mathcal{P}; \mathcal{E}) = \alpha(\mathcal{B}^{\mu})
\stopformula
-\noindent In addition to some simpler properties, this means that the infinite property can still be
-checked.
+\noindent The first property, shown in \in{Equation}[eq:loop1], states that exiting the loop after
+one more body $\mathcal{S}$ of the pipelined loop is the same as exiting it immediately and
+instead executing $\delta$ further iterations of the original loop body $\mathcal{B}$. The second
+property, shown in \in{Equation}[eq:loop2], states that executing the prologue $\mathcal{P}$ of
+the loop followed by the epilogue $\mathcal{E}$ is exactly the same as executing the initial loop
+$\mu$ times. Together with some simpler properties, these finite checks imply the infinite
+property.\footnote{Examples of the more trivial properties are that the loop structure is correct
+  and that it only contains one increment operation on the induction variable.}
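These two properties can also be sanity-checked by concrete execution on a toy example. The following is a hedged illustration: the paper states the properties over the symbolic evaluation $\alpha$, whereas this sketch simply runs the prologue $\mathcal{P}$, steady-state body $\mathcal{S}$, epilogue $\mathcal{E}$ and original body $\mathcal{B}$ of a small accumulation loop on concrete states, with $\mu = \delta = 1$.

```c
#include <assert.h>
#include <string.h>

enum { LEN = 8 };

/* Concrete machine state: induction variable, accumulator, memory. */
struct state { int i, sum, x[LEN]; };

/* B: one body of the original loop: t = x[i]; sum += t; x[i] = sum */
void body(struct state *s) {
    int t = s->x[s->i];
    s->sum += t;
    s->x[s->i] = s->sum;
    s->i++;
}

/* P: pipeline prologue (first load and add, no store yet in flight) */
void prologue(struct state *s) {
    s->sum = s->x[0];
    s->i = 1;
}

/* S: one steady-state body of the pipelined loop: the store belonging to
   the previous iteration happens alongside this iteration's load. */
void steady(struct state *s) {
    int t = s->x[s->i];
    s->x[s->i - 1] = s->sum;
    s->sum += t;
    s->i++;
}

/* E: pipeline epilogue: perform the one store still in flight. */
void epilogue(struct state *s) {
    s->x[s->i - 1] = s->sum;
}

int same(const struct state *a, const struct state *b) {
    return a->i == b->i && a->sum == b->sum
        && memcmp(a->x, b->x, sizeof a->x) == 0;
}
```

Running $\mathcal{P};\mathcal{E}$ from the initial state then yields the same state as one iteration of $\mathcal{B}$, and from any mid-loop state $\mathcal{S};\mathcal{E}$ agrees with $\mathcal{E};\mathcal{B}$, mirroring the two equations above on this instance.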
Finally, the verifier itself needs to be proven correct, meaning the correctness of the symbolic
evaluation needs to be shown to be semantics preserving. If the comparison between two symbolic
states succeeds, then this must imply that the initial blocks themselves execute in the same way.
+\section{Novelties of verifying software pipelining in Vericert}
+
+As mentioned in the previous section, software pipelining has already been implemented in CompCert
+by \cite[author][tristan10_simpl_verif_valid_softw_pipel]. From an implementation point of view,
+the pipelining algorithm that was implemented was a backtracking iterative modulo scheduling
+algorithm paired with modulo variable expansion. However, the results were underwhelming because it
+was tested on an out-of-order PowerPC G5 processor, whereas pipelining is more effective on
+in-order processors, which can take full advantage of the static pipeline. In addition, alias
+analysis on memory dependencies was not performed, even though it is perhaps the best way to
+improve the performance of the optimisation. Finally, the loop scheduling was also only performed
+on basic blocks, and not on extended basic blocks such as superblocks or hyperblocks. These are
+all shortcomings that could be addressed by using our existing {\sc RTLBlock} or {\sc RTLPar}
+language as the source language of the pipelining transformation. Paired with the scheduling step,
+one could extract more parallelism and also support conditionals in loop bodies. Memory aliasing
+is not yet directly supported in the {\sc RTLBlock} abstract interpretation; however, using some
+simple arithmetic properties it should be possible to normalise memory accesses and therefore
+prove that reordering memory accesses that do not alias is correct under certain conditions.
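The kind of arithmetic normalisation alluded to here can be sketched as follows (hypothetical code, not part of Vericert): once every access is normalised to a base register plus a constant byte offset, two accesses with the same base and disjoint byte ranges provably cannot alias, so reordering them is safe; accesses with different bases are conservatively treated as possibly aliasing.

```c
#include <assert.h>
#include <stdbool.h>

/* A memory access normalised to base-register + constant byte offset. */
struct access { int base, offset, width; };

/* Conservative disjointness check: only claims "no alias" when both
   accesses use the same base register and their byte ranges
   [offset, offset + width) do not overlap. Different bases return
   false, since without further analysis the two pointers could still
   refer to the same object. */
bool must_not_alias(struct access a, struct access b) {
    if (a.base != b.base)
        return false;  /* relation between bases unknown: be safe */
    return a.offset + a.width <= b.offset
        || b.offset + b.width <= a.offset;
}
```

A scheduler would then be allowed to swap a load and a store only when `must_not_alias` returns true for the pair.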
+
+From a proof perspective there are also some improvements that can be made to the work. First of
+all, the proof from the paper was actually never fully completed:
+
+\startframedtext[
+ frame=off,
+ leftframe=on,
+ offset=0pt,
+ loffset=0.5cm,
+ framecolor=darkcyan,
+ rulethickness=2pt,
+ location=middle,
+ width=\dimexpr \textwidth - 1cm \relax,
+]
+ {\noindent\it [Semantic preservation going from unrolled loops in a basic-block representation to
+ the actual loops in a CFG] is intuitively obvious but surprisingly tedious to formalize in full
+ details: indeed, this is the only result in this paper that we have not yet mechanized in Coq.}
+
+ \startalignment[flushright]
+ --- \cite[alternative=authoryears,right={, Section
+ 6.2}][tristan10_simpl_verif_valid_softw_pipel].
+ \stopalignment
+\stopframedtext
+
+\noindent The proof of semantic preservation going from unrolled loops with basic blocks back to a
+control-flow graph was tedious, showing that the language was not designed to handle loop
+transformations well. However, recent work on the formalisation of polyhedral transformations of
+loops~\cite[authoryear][courant21_verif_code_gener_polyh_model] suggests that this may be a better
+representation for the intermediate language, although translations to and from the {\sc Loop}
+language would still have to be implemented.
+
+In addition to that, the original loop-pipelining implementation did not have a syntactic
+representation of basic blocks, making the proofs a bit more tedious than necessary. This suggests
+that the current syntactic representation of basic blocks in {\sc RTLBlock} may simplify the
+translations as well.
+
\section{Hardware and Software Pipelining}
-Hardware and software pipelining algorithms have a lot of commonalities, even though the way the
-pipeline is implemented in the end differs.
+Hardware and software pipelining algorithms share many features, even though the final
+implementation of the pipeline is quite different.
Pipelining in hardware is intuitive. If one has two separate blocks where one is data-dependent on
-the other, then it is natural to start executing the first block once it has finished and passed the
-result to the second block. This idea is also the key building block for processor designs, as it
-is a straightforward way to improve the throughput of a series of distinct and dependent operations.
-In \HLS\ it is also an important optimisation to efficiently translate loops. As these are often
-executed many times, any improvements can reduce the throughput of the entire design. The \HLS\
-tool therefore often automatically creates pipeline stages for loops and calculates a correct \II\
-for the loop, which will be the rate at which the pipeline needs to be filled. The benefit though
-of generating hardware directly, is that the pipeline can already be laid out in the optimal and
-natural way.
+the other, then it is natural to pass a new input to the first block as soon as it has finished and
+passed the result to the second block. This idea is also the key building block for processor
+designs, as it is a straightforward way to improve the throughput of a series of distinct and
+dependent operations. In \HLS\ it is also an important optimisation to efficiently translate loops.
+Loops are often the limiting factor of a design, and any improvements in their translation can
+improve the throughput of the entire design. The \HLS\ tool therefore often automatically creates
+pipeline stages for loops and calculates a correct \II\ for the loop, which will be the rate at
+which the pipeline needs to be filled. The benefit of generating hardware directly is that the
+pipeline can already be laid out in the optimal and natural way.
It is less clear why, and especially how, software can be pipelined in the same way as hardware, and
-what the benefits are if the code is going to be executed. The main difference between the hardware
-pipeline is that in software one inherently only following one control-flow through the program.
-This means that it is not possible to lay out a pipeline in the same way as it is done in hardware
-to improve the throughput of loops. However, a pipeline can be simulated in software, so that it
-behaves the same as the hardware pipeline, but it will manually have to unroll the loop so that all
-the stages can be executed by following one control-flow. However, the inherent order of
-instructions will be quite different to the order of the pipeline stages in hardware. In hardware,
-the pipeline can be written naturally in sequential order, and then the \II\ contract is fulfilled
-by setting the rate at which the pipeline is filled. However, in software, one horizontal slice of
-the pipeline needs to be extracted and will be the new loop body, which represents the computation
-that the whole hardware pipeline is performing at one point in time.
+what the benefits are if the code is going to be executed.\footnote{It is clearer why scheduling
+  would be important on a VLIW processor, but still not clear how the pipelined loop can be
+  represented in software without explicit parallelism.} The main difference compared to a
+hardware pipeline is that in software the code is inherently executed sequentially. This means
+that it is not possible to lay out a pipeline in the same way, as the next input cannot just be
+passed to the loop while it is executing. However, a pipeline can be simulated in software, so
+that it behaves the same as the hardware pipeline, but the loop will have to be manually unrolled
+and its body rearranged so that all the stages can be executed by following one control-flow. The
+order of instructions will then be quite different from hardware, where the pipeline can be
+written naturally in the sequential order of the stages. In software, the pipeline is instead
+represented by one horizontal slice of the pipeline, which becomes the new loop body and
+represents the computation that the whole hardware pipeline is performing at one point in time.
The main question that comes up is whether the two pipelining methods are actually equivalent. In Vericert
we are assuming that performing software pipelining, followed by scheduling and finally followed by
diff --git a/lsr_env.tex b/lsr_env.tex
index 53362e5..8438c18 100644
--- a/lsr_env.tex
+++ b/lsr_env.tex
@@ -224,16 +224,18 @@
]
\setuplabeltext[subfigure=]
+% https://www.mail-archive.com/ntg-context@ntg.nl/msg75292.html
+
\newdimen\LeftMarginSize
\LeftMarginSize=\dimexpr \backspace + \rightmarginwidth + \rightmargindistance \relax
\newdimen\TextAreaSize
\TextAreaSize=\dimexpr \textwidth + \rightmarginwidth + \rightmargindistance \relax
\definefloat[marginfigure][marginfigures][figure]
-\setupfloat[marginfigure][location=inner,default=top,width=\TextAreaSize]
+\setupfloat[marginfigure][location={inner},default=top]
\setupcaptions[marginfigure][
- leftmargin=\dimexpr \LeftMarginSize / 2 \relax,
+ leftmargin=1cm,
location=bottom,
- width=\textwidth,
+ width=\dimexpr \textwidth - 2cm \relax,
]
\setuplabeltext[marginfigure=Figure~]
@@ -337,7 +339,7 @@
\definevimtyping [hlcoq] [syntax=coq]
\definevimtyping [hlocaml] [syntax=ocaml]
\definevimtyping [hlverilog] [syntax=verilog]
-\definevimtyping [hlC] [syntax=C]
+\definevimtyping [hlC] [syntax=c]
% ==========================================================================
% Tikz
diff --git a/lsr_refs.bib b/lsr_refs.bib
index abfd180..a107605 100644
--- a/lsr_refs.bib
+++ b/lsr_refs.bib
@@ -305,6 +305,25 @@
year = {2011}
}
+@article{courant21_verif_code_gener_polyh_model,
+  author = {Courant, Nathana{\"e}l and Leroy, Xavier},
+ title = {Verified Code Generation for the Polyhedral Model},
+ year = {2021},
+ issue_date = {January 2021},
+ publisher = {Association for Computing Machinery},
+ address = {New York, NY, USA},
+ volume = {5},
+ number = {POPL},
+ url = {https://doi.org/10.1145/3434321},
+ doi = {10.1145/3434321},
+ abstract = {The polyhedral model is a high-level intermediate representation for loop nests that supports elegantly a great many loop optimizations. In a compiler, after polyhedral loop optimizations have been performed, it is necessary and difficult to regenerate sequential or parallel loop nests before continuing compilation. This paper reports on the formalization and proof of semantic preservation of such a code generator that produces sequential code from a polyhedral representation. The formalization and proofs are mechanized using the Coq proof assistant.},
+ journal = {Proc. ACM Program. Lang.},
+ month = {jan},
+ articleno = {40},
+ numpages = {24},
+ keywords = {Compiler verification, Polyhedral code generation, Polyhedral model}
+}
+
@inbook{coussy08_gaut,
abstract = {This chapter presents GAUT, an academic and open-source high-level synthesis tool dedicated to digital signal processing applications. Starting from an algorithmic bit-accurate specification written in C/C++, GAUT extracts the potential parallelism before processing the allocation, the scheduling and the binding tasks. Mandatory synthesis constraints are the throughput and the clock period while the memory mapping and the I/O timing diagram are optional. GAUT next generates a potentially pipelined architecture composed of a processing unit, a memory unit and a communication with a GALS/LIS interface.},
author = {Coussy, Philippe and Chavet, Cyrille and Bomel, Pierre and Heller, Dominique and Senn, Eric and Martin, Eric},
diff --git a/title.tex b/title.tex
index 4571680..0c7bcc0 100644
--- a/title.tex
+++ b/title.tex
@@ -1,5 +1,7 @@
+\environment fonts_env
+\environment lsr_env
+
\startcomponent title
-\product main
\setuplayout[wide]
@@ -11,7 +13,10 @@
{\bfb\ss Late Stage Review}
\blank[2cm]
{\rma Yann Herklotz Grave}
- \blank[5cm]
+ \blank[3cm]
+
+ {\leavevmode\externalfigure[figures/vericert.pdf][location=local]}
+ \blank[3cm]
% http://wiki.contextgarden.net/Command/currentdate
{\bf Imperial College London}\blank