summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorYann Herklotz <git@yannherklotz.com>2022-05-05 18:13:13 +0100
committerYann Herklotz <git@yannherklotz.com>2022-05-05 18:13:13 +0100
commit968a1603be67e1352078fd7a38109da6bd5c3d6a (patch)
tree61c8d07c423509f7f16c976224da255555fa07e6
parentc9648f2dc56a0c970a90e1539925ae78b0610fbc (diff)
downloadlsr22_fvhls-968a1603be67e1352078fd7a38109da6bd5c3d6a.tar.gz
lsr22_fvhls-968a1603be67e1352078fd7a38109da6bd5c3d6a.zip
Add build file
-rw-r--r--.build.yml17
-rw-r--r--chapters/hls.tex92
-rw-r--r--chapters/introduction.tex6
-rw-r--r--main.tex2
4 files changed, 69 insertions, 48 deletions
diff --git a/.build.yml b/.build.yml
new file mode 100644
index 0000000..de01be4
--- /dev/null
+++ b/.build.yml
@@ -0,0 +1,17 @@
+image: debian/unstable
+packages:
+ - context
+ - texlive
+ - texlive-extra-utils
+ - texlive-latex-extra
+ - latexmk
+sources:
+ - https://git.sr.ht/~ymherklotz/lsr22_fvhls
+secrets:
+ - f1c07b45-32bd-4559-b370-28e59e4c11e1
+tasks:
+ - build: |
+ cd lsr22_fvhls
+ make main.pdf
+artifacts:
+ - lsr22_fvhls/main.pdf
diff --git a/chapters/hls.tex b/chapters/hls.tex
index 4090218..ae162a0 100644
--- a/chapters/hls.tex
+++ b/chapters/hls.tex
@@ -177,7 +177,7 @@ concentrating on the features that are used in the output of Vericert. Verilog
description language (HDL) and is used to design hardware ranging from complete CPUs that are
eventually produced as integrated circuits, to small application-specific accelerators that are
placed on FPGAs. Verilog is a popular language because it allows for fine-grained control over the
-hardware, and also provides high-level constructs to simplify development. \goto{yannherklotz.com}[url(https://yannherklotz.com)]
+hardware, and also provides high-level constructs to simplify development.
Verilog behaves quite differently to standard software programming languages due to it having to
express the parallel nature of hardware. The basic construct to achieve this is the always-block,
@@ -342,19 +342,21 @@ endmodule
\stopplacemarginfigure
\subsection{Translating C to Verilog by Example}
-Fig.~\ref{fig:accumulator_c_rtl} illustrates the translation of a simple program that stores and retrieves values from an array.
-In this section, we describe the stages of the Vericert translation, referring to this program as an example.
+\in{Figure}[fig:accumulator_c_rtl] illustrates the translation of a simple program that stores and
+retrieves values from an array. In this section, we describe the stages of the Vericert
+translation, referring to this program as an example.
\subsubsection{Translating C to 3AC}
The first stage of the translation uses unmodified CompCert to transform the C input, shown in
-Fig.~\ref{fig:accumulator_c}, into a 3AC intermediate representation, shown in
-Fig.~\ref{fig:accumulator_rtl}. As part of this translation, function inlining is performed on all
-functions, which allows us to support function calls without having to support the \mono{Icall} 3AC
-instruction. Although the duplication of the function bodies caused by inlining can increase the
-area of the hardware, it can have a positive effect on latency and is therefore a common HLS
-optimisation~\cite{noronha17_rapid_fpga}. Inlining precludes support for recursive function calls,
-but this feature is not supported in most HLS tools anyway~\cite{davidthomas_asap16}.
+\in{Figure}{c}[fig:accumulator_c_rtl], into a 3AC intermediate representation, shown in
+\in{Figure}{d}[fig:accumulator_c_rtl]. As part of this translation, function inlining is performed
+on all functions, which allows us to support function calls without having to support the
+\mono{Icall} 3AC instruction. Although the duplication of the function bodies caused by inlining
+can increase the area of the hardware, it can have a positive effect on latency and is therefore a
+common HLS optimisation~\cite{noronha17_rapid_fpga}. Inlining precludes support for recursive
+function calls, but this feature is not supported in most HLS tools
+anyway~\cite{davidthomas_asap16}.
\subsubsection{Translating 3AC to HTL}
@@ -364,7 +366,7 @@ with data-path (FSMD) representation~\cite{hwang99_fsmd}. The core idea of the F
is that it separates the control flow from the operations on the memory and registers. Hence, an
HTL program consists of two maps from states to Verilog statements: the \emph{control logic} map,
which expresses state transitions, and the \emph{data-path} map, which expresses computations.
-Fig.~\ref{fig:accumulator_diagram} shows the resulting FSMD architecture. The right-hand block is
+\in{Figure}[fig:accumulator_diagram] shows the resulting FSMD architecture. The right-hand block is
the control logic that computes the next state, while the left-hand block updates all the registers
and RAM based on the current program state.
@@ -375,13 +377,12 @@ similar semantics to Verilog at the lower level by already using Verilog stateme
abstract instructions. Compared to plain Verilog, HTL is simpler to manipulate and analyse, thereby
making it easier to prove optimisations like proper RAM insertion.
-\startplacemarginfigure[title={The FSMD for the example shown in Fig.~\ref{fig:accumulator_c_rtl},
+\startplacemarginfigure[title={The FSMD for the example shown in \in{Figure}[fig:accumulator_c_rtl],
split into a data-path and control logic for the next state calculation. The Update block takes
the current state, current values of all registers and at most one value stored in the RAM, and
calculates a new value that can either be stored back in the RAM or in a
register.},reference={fig:accumulator_diagram}]
-\definecolor[control][x=B3E2CD]
-\definecolor[data][x=FDCDAC]
+ \definecolor[control][x=B3E2CD] \definecolor[data][x=FDCDAC]
\starttikzpicture
\startscope[scale=1.15]
\fill[control,fill opacity=1] (6.5,0) rectangle (12,5);
@@ -492,29 +493,30 @@ becomes an unpacked array of 32-bit integers representing the RAM block. Any lo
temporarily translated to direct accesses to this array, where each address has its offset removed
and is divided by four. In a separate HTL-to-HTL conversion, these direct accesses are then
translated to proper loads and stores that use a RAM interface to communicate with the RAM, shown on
-lines 21, 24 and 28 of Fig.~\ref{fig:accumulator_v}. This pass inserts a RAM block with the
-interface around the unpacked array. Without this interface and without the RAM block, the
+lines 21, 24 and 28 of \in{Figure}{a}[fig:accumulator_c_rtl]. This pass inserts a RAM block with
+the interface around the unpacked array. Without this interface and without the RAM block, the
synthesis tool processing the Verilog hardware description would not identify the array as a RAM,
and would instead implement it using many registers. This interface is shown on lines 9--15 in the
-Verilog code in Fig.~\ref{fig:accumulator_v}. A high-level overview of the architecture and of the
-RAM interface can be seen in Fig.~\ref{fig:accumulator_diagram}.
+Verilog code in \in{Figure}{a}[fig:accumulator_c_rtl]. A high-level overview of the architecture
+and of the RAM interface can be seen in \in{Figure}[fig:accumulator_diagram].
\paragraph{Translating instructions}
Most 3AC instructions correspond to hardware constructs. For example, line 2 in
-Fig.~\ref{fig:accumulator_rtl} shows a 32-bit register \mono{x5} being initialised to 3, after which
-the control flow moves execution to line 3. This initialisation is also encoded in the Verilog
-generated from HTL at state 8 in both the control logic and data-path always-blocks, shown at lines
-33 and 16 respectively in Fig.~\ref{fig:accumulator_v}. Simple operator instructions are translated
-in a similar way. For example, the add instruction is just translated to the built-in add operator,
-similarly for the multiply operator. All 32-bit instructions can be translated in this way, but
-some special instructions require extra care. One such instruction is the \mono{Oshrximm}
-instruction, which is discussed further in Section~\ref{sec:algorithm:optimisation:oshrximm}.
-Another is the \mono{Oshldimm} instruction, which is a left rotate instruction that has no Verilog
-equivalent and therefore has to be implemented in terms of other operations and proven to be
-equivalent. The only 32-bit instructions that we do not translate are case-statements
-(\mono{Ijumptable}) and those instructions related to function calls (\mono{Icall}, \mono{Ibuiltin},
-and \mono{Itailcall}), because we enable \oindex{inlining}inlining by default.
+\in{Figure}{d}[fig:accumulator_c_rtl] shows a 32-bit register \mono{x5} being initialised to 3,
+after which the control flow moves execution to line 3. This initialisation is also encoded in the
+Verilog generated from HTL at state 8 in both the control logic and data-path always-blocks, shown
+at lines 33 and 16 respectively in \in{Figure}{a}[fig:accumulator_c_rtl]. Simple operator
+instructions are translated in a similar way. For example, the add instruction is just translated
+to the built-in add operator, similarly for the multiply operator. All 32-bit instructions can be
+translated in this way, but some special instructions require extra care. One such instruction is
+the \mono{Oshrximm} instruction, which is discussed further in
+\in{Section}[sec:algorithm:optimisation:oshrximm]. Another is the \mono{Oshldimm} instruction,
+which is a left rotate instruction that has no Verilog equivalent and therefore has to be
+implemented in terms of other operations and proven to be equivalent. The only 32-bit instructions
+that we do not translate are case-statements (\mono{Ijumptable}) and those instructions related to
+function calls (\mono{Icall}, \mono{Ibuiltin}, and \mono{Itailcall}), because we enable
+\oindex{inlining}inlining by default.
\subsubsection{Translating HTL to Verilog}
@@ -525,8 +527,8 @@ translated to valid Verilog case-statements. We also require declarations for a
the program, as well as declarations of the inputs and outputs to the module, so that the module can
be used inside a larger hardware design. In addition to translating the maps of Verilog statements,
an always-block that will behave like the RAM also has to be created, which is only modelled
-abstractly at the HTL level. Fig.~\ref{fig:accumulator_v} shows the final Verilog output that is
-generated for our example.
+abstractly at the HTL level. \in{Figure}{a}[fig:accumulator_c_rtl] shows the final Verilog output
+that is generated for our example.
Although this translation seems quite straight\-forward, proving that this translation is correct is
complex. All the implicit assumptions that were made in HTL need to be translated explicitly to
@@ -537,7 +539,7 @@ proofs in upcoming sections.
\subsection{Optimisations}
-Although we would not claim that Vericert is a proper `optimising' HLS compiler yet, we have
+Although we would not claim that Vericert is a proper \quote{optimising} HLS compiler yet, we have
nonetheless made several design choices that aim to improve the quality of the hardware designs it
produces.
@@ -562,18 +564,18 @@ dividing the address by four.
The simplest way to implement loads and stores in Vericert would be to access the Verilog array
directly from within the data-path (i.e., inside the always-block on lines 16--32 of
-Fig.~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at
-several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM
-block, and will resort to using lots of registers instead, ruining the circuit's area and
+\in{Figure}{a}[fig:accumulator_c_rtl]). This would be correct, but when a Verilog array is accessed
+at several program points, the synthesis tool is unlikely to detect that it can be implemented as a
+RAM block, and will resort to using lots of registers instead, ruining the circuit's area and
performance. To avert this, we arrange that the data-path does not access memory directly, but
simply sets the address it wishes to access and then toggles the \mono{u\_en} flag. This activates
-the RAM interface (lines 9--15 of Fig.~\ref{fig:accumulator_v}) on the next falling clock edge,
-which performs the requested load or store. By factoring all the memory accesses out into a separate
-interface, we ensure that the underlying array is only accessed from a single program point in the
-Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM
+the RAM interface (lines 9--15 of \in{Figure}{a}[fig:accumulator_c_rtl]) on the next falling clock
+edge, which performs the requested load or store. By factoring all the memory accesses out into a
+separate interface, we ensure that the underlying array is only accessed from a single program point
+in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM
block.\footnote{Interestingly, the Verilog code shown for the RAM interface must not be modified,
- because the synthesis tool will only generate a RAM when the code matches a small set of
- specific patterns.}
+ because the synthesis tool will only generate a RAM when the code matches a small set of specific
+ patterns.}
Therefore, an extra compiler pass is added from HTL to HTL to extract all the direct accesses to the
Verilog array and replace them by signals that access the RAM interface in a separate
@@ -624,10 +626,10 @@ always be disabled straight after it was used, without having to insert or modif
\stopfloatcombination
\stopplacemarginfigure
-Fig.~\ref{fig:ram_load_store} gives an example of how the RAM interface behaves when values are
+\in{Figure}[fig:ram_load_store] gives an example of how the RAM interface behaves when values are
loaded and stored.
-\subsubsection[sec:hls:algorithm:optimisation:oshrximm]{Implementing the \mono{Oshrximm}
+\subsubsection[sec:algorithm:optimisation:oshrximm]{Implementing the \mono{Oshrximm}
Instruction}
Many of the CompCert instructions map well to hardware, but \mono{Oshrximm} (efficient signed
diff --git a/chapters/introduction.tex b/chapters/introduction.tex
index c7dd2cc..fecfcaa 100644
--- a/chapters/introduction.tex
+++ b/chapters/introduction.tex
@@ -103,9 +103,9 @@ making use of validation to prove it correct.
\desc{\in{Chapter}[sec:pipelining]} describes the current plan to implement loop pipelining in
Vericert by also using a validator.
-\desc{\in{Chapter}[sec:dynamic]} describes a possible extension to Vericert by implementing dynamic
-scheduling instead of the current static scheduling implementation. This relies on ongoing work in
-a fork of CompCert which implements \SSA\ as an intermediate language.
+%\desc{\in{Chapter}[sec:dynamic]} describes a possible extension to Vericert by implementing dynamic
+%scheduling instead of the current static scheduling implementation. This relies on ongoing work in
+%a fork of CompCert which implements \SSA\ as an intermediate language.
\desc{\in{Chapter}[sec:schedule]} describes the current implementation timeline.
diff --git a/main.tex b/main.tex
index f77f086..7bffd10 100644
--- a/main.tex
+++ b/main.tex
@@ -16,6 +16,8 @@
% \showbodyfont
% \showjustification
+% \showgrid
+
\component title
\startfrontmatter