flashlight22.org


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185

#+title: Formal Verification of High-Level Synthesis
#+author: \underline{Yann Herklotz}, John Wickerson
#+options: H:2 toc:nil
#+columns: %45ITEM %10BEAMER_ENV(Env) %10BEAMER_ACT(Act) %4BEAMER_COL(Col)
#+setupfile: setup.org

** Why HLS is unreliable
** Small Introduction to Coq
** Solution: Formally Verify HLS
*** Vericert
** Current Status of Vericert
* Loop Pipelining
** The Need for Loop Pipelining

# - really useful, or you might say necessary optimisation that HLS performs.

- Main difficulty with having hardware as a target is the need to pipeline loops.

*** Column left
:PROPERTIES:
:BEAMER_COL: 0.45
:END:

#+attr_latex: :options fontsize=\small,escapeinside=||
#+begin_src c
for (int i = 0; i < N; i++) {
|\sA{1}| x18 = i - 1
|\sR{1}| x16 = load[1, x18 * 4]
|\sM{1}| x8 = x16 * x1
|\sR{2}| x12 = load[3, i * 4]
|\sR{3}| x13 = load[2, i * 4]
|\sM{2}| x7 = x12 * x13
|\sA{2}| x11 = x8 + x7
|\sW{1}| store[1, i * 4] = x11
   i = i + 1
}
#+end_src

*** Column right
:PROPERTIES:
:BEAMER_COL: 0.45
:END:

#+begin_export latex
\begin{tikzpicture}[lnode/.style={circle,draw=black,minimum size=4mm,scale=0.8},nlabel/.style={midway,right
,font=\tiny},node distance=1.3cm,shorten >=1pt]
  \node[lnode,fill=s1col] (A1) {1};
  \node[lnode,fill=s2col,below of=A1] (R1) {1};
  \node[lnode,fill=s3col,below of=R1] (M1) {1};
  \node[lnode,fill=s2col,right of=A1] (R2) {2};
  \node[lnode,fill=s2col,right of=R2] (R3) {3};
  \node[lnode,fill=s3col,below of=R2,xshift=0.7cm] (M2) {2};
  \node[lnode,fill=s1col,below right of=M1,xshift=0.4cm] (A2) {2};
  \node[lnode,fill=s4col,below of=A2] (W1) {1};

  \draw[->,thick] (A1) -- (R1);
  \draw[->,thick] (R1) -- node[nlabel] {2} (M1);
  \draw[->,thick] (R2) -- node[nlabel] {2} (M2);
  \draw[->,thick] (R3) -- node[nlabel] {2} (M2);
  \draw[->,thick] (M1) -- (A2);
  \draw[->,thick] (M2) -- (A2);
  \draw[->,thick] (A2) -- (W1);
  \draw[->,thick] (R1) to [out=220,in=180,loop,looseness=2] (W1);
\end{tikzpicture}
#+end_export

** Ideas Behind Static Hardware Loop Pipelining

*** One Possible Workflow

- Generate scheduling constraints for linear code as well as loops.
- Solve for a scheduling using an ILP solver.
- Place the instructions into the cycle that it was assigned to.

#+begin_export latex
\begin{center}
  \begin{tikzpicture}[lnode/.style={circle,draw=black,minimum size=4mm,scale=0.8},node
    distance=1.3cm,shorten >=1pt]
    \node[lnode,fill=s1col] (A1) {1}; \node[lnode,fill=s2col,below of=A1] (R2) {2};
    \node[lnode,fill=s2col,below of=R2] (R3) {3}; \node[lnode,fill=s2col,right of=A1] (R1) {1};
    \node[lnode,fill=s3col,right of=R1] (M2) {2}; \node[lnode,fill=s3col,right of=M2] (M1) {1};
    \node[lnode,fill=s1col,right of=M1] (A2) {2}; \node[lnode,right of=A2] (E1) {};
    \node[lnode,fill=s4col,right of=E1] (W1) {1}; \node[lnode,right of=W1] (E2) {}; \node[above
    right of=R1,xshift=-0.35cm] (S1T) {}; \node[below of=S1T,yshift=-2.5cm] (S1B) {}; \draw[very
    thick] (S1T) -- (S1B); \node[above right of=M1,xshift=-0.35cm] (S2T) {}; \node[below
    of=S2T,yshift=-2.5cm] (S2B) {}; \draw[very thick] (S2T) -- (S2B); \node[above right
    of=E1,xshift=-0.35cm] (S3T) {}; \node[below of=S3T,yshift=-2.5cm] (S3B) {}; \draw[very thick]
    (S3T) -- (S3B);
  \end{tikzpicture}
\end{center}
#+end_export

** Verifying Hardware Pipelining is Difficult

- Normally part of the scheduling step.
- Lose control about how the loops are translated, the fundamental structure of the loop could
  change and would be difficult to identify again.

* Verifying Loop Pipelining
** Ideas Behind Software Loop Pipelining

*** Main Idea

Can perform a source-to-source transformation which generates a pipeline purely in software.

*** Column left
:PROPERTIES:
:BEAMER_COL: 0.45
:END:

#+attr_latex: :options fontsize=\small,escapeinside=||
#+begin_src c
for (int i = 0; i < N; i++) {
|\sA{1}| x18 = i - 1
|\sR{1}| x16 = load[1, x18 * 4]
|\sM{1}| x8 = x16 * x1
|\sR{2}| x12 = load[3, i * 4]
|\sR{3}| x13 = load[2, i * 4]
|\sM{2}| x7 = x12 * x13
|\sA{2}| x11 = x8 + x7
|\sW{1}| store[1, i * 4] = x11
   i = i + 1
}
#+end_src

*** Column right
:PROPERTIES:
:BEAMER_COL: 0.45
:END:

#+attr_latex: :options fontsize=\footnotesize,escapeinside=||
#+begin_src c
for (int i = 0; i < N; i++) {
|\sA{1}| x18[i] = i - 1
|\sR{2}| x12[i] = load[3, i * 4]
|\sR{3}| x13[i] = load[2, i * 4]
|\sM{2}| x7[i-1] = x12[i-1] * x13[i-1]
|\sA{2}| x11[i-2] = x8[i-2] + x7[i-2]
|\sW{1}| store[1, (i-3) * 4] = x11[i-3]

|\sR{1}| x16[i] = load[1, x18[i] * 4]
|\sM{1}| x8[i-1] = x16[i-1] * x1
   i = i + 1
}
#+end_src

** Verifying Software Loop Pipelining

*** Symbolic Execution

Define an $\alpha$, such that $\alpha(\mathcal{C})$ evaluates some code $\mathcal{C}$ and returns
*symbolic states* for all registers.

*** Example

Executing the following code will evaluate to the following symbolic code:

*** Column left
:PROPERTIES:
:BEAMER_COL: 0.45
:END:


*** Column right
:PROPERTIES:
:BEAMER_COL: 0.45
:END:

* Comparing Software and Hardware Loop Pipelining
** Comparing Pipelines in Hardware and Software

*** Pipelines can be identical

- In terms of expressivity, both hardware and software pipelining can express the same loop
  pipelines.

** Resource Usage of Pipelines

Naively, They will both contain

* Wrapping up
** Conclusion

- To verify loop pipelines, it seems easier