summaryrefslogtreecommitdiffstats
path: root/algorithm.tex
diff options
context:
space:
mode:
authorYann Herklotz <git@yannherklotz.com>2021-09-08 18:55:41 +0100
committerYann Herklotz <git@yannherklotz.com>2021-09-08 18:55:41 +0100
commita68d9b19756801fb1633edd17aec2ac9c22d8444 (patch)
tree5deb0f398621309decd0c106ef7b9b67eb06b0dc /algorithm.tex
parentaf5b7dd0370e7f0087b5bc58042baa71288c59bb (diff)
downloadoopsla21_fvhls-a68d9b19756801fb1633edd17aec2ac9c22d8444.tar.gz
oopsla21_fvhls-a68d9b19756801fb1633edd17aec2ac9c22d8444.zip
Some initial changes towards the final submission
Diffstat (limited to 'algorithm.tex')
-rw-r--r--algorithm.tex36
1 files changed, 18 insertions, 18 deletions
diff --git a/algorithm.tex b/algorithm.tex
index c3b2153..06cf110 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -34,26 +34,26 @@ The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\c
[language/.style={fill=white,rounded corners=3pt,minimum height=7mm},
continuation/.style={},
linecount/.style={rounded corners=3pt,dashed}]
- \fill[compcert,rounded corners=3pt] (-1,-0.5) rectangle (11.4,2);
- \fill[formalhls,rounded corners=3pt] (-1,-1) rectangle (11.4,-2);
- \draw[linecount] (-0.95,-0.45) rectangle (3.6,1);
- \draw[linecount] (4,-0.45) rectangle (7.5,1);
+ \fill[compcert,rounded corners=3pt] (-1,-0.5) rectangle (9.9,2);
+ \fill[formalhls,rounded corners=3pt] (-1,-1) rectangle (9.9,-2);
+ %\draw[linecount] (-0.95,-0.45) rectangle (3.6,1);
+ %\draw[linecount] (4,-0.45) rectangle (7.5,1);
\node[language] at (-0.3,0) (clight) {Clight};
\node[continuation] at (1,0) (conta) {$\cdots$};
\node[language] at (2.7,0) (cminor) {CminorSel};
\node[language] at (4.7,0) (rtl) {3AC};
\node[language] at (6.2,0) (ltl) {LTL};
- \node[language,anchor=west] at (8.4,0) (aarch) {aarch64 \small ($\sim$ 12 kloc)};
- \node[language,anchor=west] at (8.4,0.8) (x86) {x86 \small ($\sim$ 14 kloc)};
+ \node[language,anchor=west] at (8.4,0) (aarch) {aarch64};
+ \node[language,anchor=west] at (8.4,0.8) (x86) {x86};
\node[continuation,anchor=west] at (8.4,1.4) (backs) {$\cdots$};
\node[continuation] at (7.3,0) (contb) {$\cdots$};
\node[language] at (4.7,-1.5) (htl) {HTL};
\node[language] at (6.7,-1.5) (verilog) {Verilog};
\node at (0,1.5) {\bf\compcert{}};
\node at (0,-1.5) {\bf\vericert{}};
- \node[anchor=west] at (-0.9,0.7) {\small $\sim$ 27 kloc};
- \node[anchor=west] at (4.1,0.7) {\small $\sim$ 46 kloc};
- \node[anchor=west] at (2,-1.5) {\small $\sim$ 17 kloc};
+ %%\node[anchor=west] at (-0.9,0.7) {\small $\sim$ 27 kloc};
+ %%\node[anchor=west] at (4.1,0.7) {\small $\sim$ 46 kloc};
+ %%\node[anchor=west] at (2,-1.5) {\small $\sim$ 17 kloc};
\node[align=center] at (3.5,-2.4) {\footnotesize RAM\\[-0.5em]\footnotesize insertion};
\draw[->,thick] (clight) -- (conta);
\draw[->,thick] (conta) -- (cminor);
@@ -67,7 +67,7 @@ The .NET framework has been used as a basis for other HLS tools, such as Kiwi~\c
\draw[->,thick] (htl) -- (verilog);
\draw[->,thick] (htl.west) to [out=180,in=150] (4,-2.2) to [out=330,in=270] (htl.south);
\end{tikzpicture}%}
- \caption{\vericert{} as a Verilog back end to \compcert{}. For scale, the approximate lines of code (kloc) are shown for \vericert{}, as well as for the front end and back end of \compcert{}, including any comments and whitespace.}%
+ \caption{\vericert{} as a Verilog back end to \compcert{}.}%
\label{fig:rtlbranch}
\end{figure}
@@ -115,9 +115,9 @@ endmodule
\begin{subfigure}{0.45\linewidth}
\centering
\begin{tikzpicture}
- \node[draw,circle,inner sep=8pt] (s0) at (0,0) {$S_{0}$};
- \node[draw,circle,inner sep=8pt] (s1) at (1.5,-3) {$S_{1}$};
- \node[draw,circle,inner sep=8pt] (s2) at (3,0) {$S_{2}$};
+ \node[draw,circle,inner sep=6pt] (s0) at (0,0) {$S_{\mathit{start}}$};
+ \node[draw,circle,inner sep=8pt] (s1) at (1.5,-3) {$S_{0}$};
+ \node[draw,circle,inner sep=8pt] (s2) at (3,0) {$S_{1}$};
\node (s2s) at ($(s2.west) + (-0.3,1)$) {\texttt{x0/1}};
\node (s2ss) at ($(s2.east) + (0.3,1)$) {\texttt{1x/1}};
\draw[-{Latex[length=2mm,width=1.4mm]}] ($(s0.west) + (-0.3,1)$) to [out=0,in=120] (s0);
@@ -250,8 +250,8 @@ This involves going from a CFG representation of the computation to a finite sta
Hence, an HTL program consists of two maps from states to Verilog statements: the \emph{control logic} map, which expresses state transitions, and the \emph{data-path} map, which expresses computations.
Figure~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The right-hand block is the control logic that computes the next state, while the left-hand block updates all the registers and RAM based on the current program state.
-The HTL language was mainly introduced to make it easier to prove the translation from 3AC to Verilog, as these languages have very different semantics.
-It serves as an intermediate language with similar semantics to 3AC at the top level, using maps to represents what to execute at every state, and similar semantics to Verilog at the low level by already using Verilog statements instead of more abstract instructions.
+The HTL language was mainly introduced to simplify the proof of translation from 3AC to Verilog, as these languages have very different semantics.
+It serves as an intermediate language with similar semantics to 3AC at the top level, using maps to represents what to execute at every state, and similar semantics to Verilog at the lower level by already using Verilog statements instead of more abstract instructions.
Compared to plain Verilog, HTL is simpler to manipulate and analyse, thereby making it easier to prove optimisations like proper RAM insertion.
\begin{figure*}
@@ -375,7 +375,7 @@ Most 3AC instructions correspond to hardware constructs.
% JW: Thanks; please check proposed new text.
For example, line 2 in Figure~\ref{fig:accumulator_rtl} shows a 32-bit register \texttt{x5} being initialised to 3, after which the control flow moves execution to line 3. This initialisation is also encoded in the Verilog generated from HTL at state 8 in both the control logic and data-path always-blocks, shown at lines 33 and 16 respectively in Figure~\ref{fig:accumulator_v}. Simple operator instructions are translated in a similar way. For example, the add instruction is just translated to the built-in add operator, similarly for the multiply operator. All 32-bit instructions can be translated in this way, but some special instructions require extra care. One such is the \texttt{Oshrximm} instruction, which is discussed further in Section~\ref{sec:algorithm:optimisation:oshrximm}. Another is the \texttt{Oshldimm} instruction, which is a left rotate instruction that has no Verilog equivalent and therefore has to be implemented in terms of other operations and proven to be equivalent.
% In addition to any non-32-bit operations, the remaining
-The only 32-bit instructions that we do not translate are those related to function calls (\texttt{Icall}, \texttt{Ibuiltin}, and \texttt{Itailcall}), because we enable inlining by default.
+The only 32-bit instructions that we do not translate are case-statements (\texttt{Ijumptable}) and those instructions related to function calls (\texttt{Icall}, \texttt{Ibuiltin}, and \texttt{Itailcall}), because we enable inlining by default.
\subsubsection{Translating HTL to Verilog}
@@ -398,9 +398,9 @@ Although we would not claim that \vericert{} is a proper `optimising' HLS compil
\subsubsection{Byte- and word-addressable memories}
One big difference between C and Verilog is how memory is represented. Although Verilog arrays use similar syntax to C arrays, they must be treated quite differently. To make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.
-However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}. If a byte-addressable memory was used in the target hardware, which is closer to \compcert{}'s memory model, then a load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle. It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware. Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way. Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores will be multiples of four. If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
+However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}. If a byte-addressable memory was used in the target hardware, which is closer to \compcert{}'s memory model, then a load and store would instead take four clock cycles, because a RAM can only perform one read and write per clock cycle. It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware. Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way. Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores will be multiples of four. Translating from byte-addressed memory to word-addressed memory can then be done by dividing the address by four.
-\subsubsection{Implementation of RAM interface}
+\subsubsection{Implementation of RAM interface}\label{sec:algorithm:optimisation:ram}
The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, which performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.\footnote{Interestingly, the Verilog code shown for the RAM interface must not be modified, because the synthesis tool will only generate a RAM when the code matches a small set of specific patterns.}
%\JW{I tweaked this slightly in an attempt to clarify; please check.} %\NR{Bring forward this sentence to help with flow.}