summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorJohn Wickerson <j.wickerson@imperial.ac.uk>2021-04-16 09:56:41 +0000
committeroverleaf <overleaf@localhost>2021-04-16 09:58:43 +0000
commit60ce3969298b29733ded4ae6bcf166174a2a2487 (patch)
treefb5e5f024ba4ed819bd056dce8b311dde238c180
parent57d85e008dec396ecf342c0a1d513eede18440bb (diff)
downloadoopsla21_fvhls-60ce3969298b29733ded4ae6bcf166174a2a2487.tar.gz
oopsla21_fvhls-60ce3969298b29733ded4ae6bcf166174a2a2487.zip
Update on Overleaf.
-rw-r--r--algorithm.tex6
-rw-r--r--verilog.tex2
2 files changed, 4 insertions, 4 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 4e636f8..801c905 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -311,7 +311,7 @@ Finally, the \texttt{mulhs} and \texttt{mulhu} instructions, which fetch the upp
However, 64-bit number representations are currently not supported in the generated hardware, so this operation cannot currently be performed. These instructions are only generated to optimise divides \JW{divisions?} by a constant \JWcouldcut{number} that is not a power of two, so turning off constant propagation will allow these programs to pass without error.
\subsubsection{RAM insertion}
-\JW{I thought this was going in the `optimisation}
+\JW{I think this belongs in the `optimisations' subsection?}
This pass goes from HTL \JWcouldcut{back} to HTL and extracts all the direct accesses to the Verilog array implementing memory and replaces them by signals which access the memory in a separate always-block. This ensures that the synthesis tool correctly identifies the array as being a RAM, so that it is not implemented by logic directly. The translation is performed by going through all the instructions and replacing each load and store expression one after another. Stores can simply be replaced by the necessary wires directly, however, loads using the RAM block take two clock cycles instead of a direct load from an array which only takes one clock cycles. This pass therefore creates a extra state which is inserted after each load.
\subsubsection{Translating HTL to Verilog}
@@ -334,11 +334,11 @@ Although we would not claim that \vericert{} is a proper `optimising' HLS compil
\subsubsection{Byte- and word-addressable memories}
-One big difference between C and Verilog is how memory is represented. Although Verilog arrays might seem to mirror their C counterparts directly, they must be treated quite differently. To reduce the design area and avoid timing issues, it is beneficial if Verilog arrays can be synthesised as RAMs, but this imposes various constraints on how Verilog arrays are used; for instance, RAMs often only allow one read and one write operation per clock cycle. To make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle.
+One big difference between C and Verilog is how memory is represented. Although Verilog arrays might seem to mirror their C counterparts directly, \JW{Or: `Although Verilog arrays use similar syntax to C arrays'} they must be treated quite differently. To reduce the design area and avoid timing issues, it is beneficial if Verilog arrays can be synthesised as RAMs, \JWcouldcut{but this imposes various constraints on how Verilog arrays are used; for instance, RAMs often only allow one read and one write operation per clock cycle.} \JW{I think that bit can be chopped from here because that's the topic of the next paragraph.} To make loads and stores as efficient as possible, the RAM needs to be word-addressable, which means that an entire integer can be loaded or stored in one clock cycle. \JW{I suggest spelling out explicitly that if you had byte-addressable RAMs you'd need to take four clock cycles to read a single word.}
However, the memory model that \compcert{} uses for its intermediate languages is byte-addre\-ssa\-ble~\cite{blazy05_formal_verif_memor_model_c}. It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware. Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way. Since only integer loads and stores are currently supported in \vericert{}, it follows that the addresses given to the loads and stores should be multiples of four. If that is the case, then the translation from byte-addressed memory to word-addressed memory can be done by dividing the address by four.
\subsubsection{Implementation of RAM templates}
-
+\JW{The simplest way to implement loads and stores in \vericert{} would be to access the Verilog arrays directly in the data-path (i.e., within the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to a large number of registers instead, ruining the circuits area and performance. To avert this, we arrange that the data-path simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. Then, on the next falling clock edge, the RAM driver (lines 9--15 of Figure~\ref{fig:accumulator_v}) is activated, whereupon it performs the requested load or store. By factoring all the memory accesses out into a separate driver like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus ensure that the synthesis tool will correctly infer a RAM block.} \JW{I've called that negedge always-block the `RAM driver' in my proposed text above -- that feels like quite a nice a word for it to my mind -- what do you think?}
Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs. If, for example, arrays in Verilog are accessed immediately in the data-path, then the synthesis tool is not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers. This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit. The synthesis tools therefore provide code snippets that they know how to transform into various constructs, including snippets that will generate proper RAMs in the final hardware. This process is called memory inference. The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and logic. We therefore have another pass from HTL to itself which performs the translation from this na\"ive use of arrays to a representation which always allows for memory inference. This pass creates a separate always block to perform the loads and stores to the memory, and adds the necessary data, address and enable signals to communicate with that always-block from other always-blocks. This always-block is shown between lines 10-15 in Figure~\ref{fig:accumulator_v}.
There are two interesting parts to this RAM template. Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design which is triggered on the positive edge of the clock. The main advantage is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they only take two clock cycles and one clock cycle instead, greatly improving their performance. In addition to that, using the negative edge for the clock is supported by many synthesis tools, it therefore does not affect the maximum frequency of the final design.
diff --git a/verilog.tex b/verilog.tex
index e0d1257..b97c6e7 100644
--- a/verilog.tex
+++ b/verilog.tex
@@ -125,7 +125,7 @@ This translation is represented in Figure~\ref{fig:memory_model_transl}, where \
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{diagrams/memory_model.pdf}
- \caption{Change in the memory model during the translation of 3AC to HTL. This is immediately after the assignment to the array}\label{fig:memory_model_transl}
+ \caption{Change in the memory model during the translation of 3AC to HTL. This is immediately after the assignment to the array.\YH{TODO: Update diagram}}\label{fig:memory_model_transl}
\end{figure}
However, in practice, assigning and reading from an array directly in the state machine will not produce a memory in the final hardware design, as the synthesis tool cannot identify the array as having the necessary properties that a RAM needs, even though this is the most natural formulation of memory. Even though theoretically the memory will only be read from once per clock cycle, the synthesis tool cannot ensure that this is true, and will instead create a register for each memory location. This increases the size of the circuit dramatically, as the RAM on the FPGA chip will not be reused. Instead, the synthesis tool expects a specific template that ensures these properties, and will then transform the template into a proper RAM during synthesis. Therefore, a translation has to be performed from the naive use of memory in the state machine, to a proper use of a memory template.