path: root/algorithm.tex
author     John Wickerson <j.wickerson@imperial.ac.uk>   2021-04-16 22:52:24 +0000
committer  overleaf <overleaf@localhost>                 2021-04-16 23:34:56 +0000
commit     bbf0c06c409180665f04346d186c4c5d991ecc15 (patch)
tree       4a4d314ddef575ff015566957da873904c7c3b8a /algorithm.tex
parent     71933526a7c203fb76d54d8f08fea3e132da535c (diff)
Update on Overleaf.
Diffstat (limited to 'algorithm.tex')
-rw-r--r--  algorithm.tex | 6
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/algorithm.tex b/algorithm.tex
index f8514b8..6303173 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -334,12 +334,12 @@ However, the memory model that \compcert{} uses for its intermediate languages i
\subsubsection{Implementation of RAM interface}
The simplest way to implement loads and stores in \vericert{} would be to access the Verilog array directly from within the data-path (i.e., inside the always-block on lines 16--32 of Figure~\ref{fig:accumulator_v}). This would be correct, but when a Verilog array is accessed at several program points, the synthesis tool is unlikely to detect that it can be implemented as a RAM block, and will resort to using lots of registers instead, ruining the circuit's area and performance. To avert this, we arrange that the data-path does not access memory directly, but simply sets the address it wishes to access and then toggles the \texttt{u\_en} flag. This activates the RAM interface (lines 9--15 of Figure~\ref{fig:accumulator_v}) on the next falling clock edge, and the interface then performs the requested load or store. By factoring all the memory accesses out into a separate interface like this, we ensure that the underlying array is only accessed from a single program point in the Verilog code, and thus that the synthesis tool will correctly infer a RAM block. Without the interface, the array would be implemented using registers, which would increase the size of the hardware considerably.
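To give a flavour of what such an interface looks like, the following is a minimal sketch of a self-disabling RAM driver clocked on the negative edge. It is not the code from Figure~\ref{fig:accumulator_v}: the array name, the signal widths and the packaging as a stand-alone module are illustrative, whereas in the generated design the driver is simply another always-block in the same module as the data-path.
\begin{verbatim}
// Sketch of a self-disabling RAM driver (illustrative names and widths).
// The data-path never touches `mem' directly; it only sets addr/d_in/wr_en
// and toggles u_en to request an access.
module ram_driver (
  input             clk,
  input             u_en,   // toggled by the data-path to request an access
  input             wr_en,  // 1 = store, 0 = load
  input      [9:0]  addr,
  input      [31:0] d_in,
  output reg [31:0] d_out,
  output reg        en      // made equal to u_en again once the access is done
);
  reg [31:0] mem [0:1023];  // the underlying array; accessed only here

  always @(negedge clk)
    if (en != u_en) begin   // the RAM is enabled only while the flags differ
      if (wr_en)
        mem[addr] <= d_in;  // store
      else
        d_out <= mem[addr]; // load
      en <= u_en;           // self-disable until u_en is toggled again
    end
endmodule
\end{verbatim}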
-Therefore, an extra compiler pass is added from HTL to HTL to extracts all the direct accesses to the Verilog array and replaces them by signals which access the RAM interface in a separate always-block. The translation is performed by going through all the instructions and replacing each load and store expression one after another. Stores can simply be replaced by the necessary wires directly, however, loads using the RAM block take two clock cycles instead of a direct load from an array which only takes one clock cycles. This pass therefore creates a extra state which is inserted after each load.
+Therefore, an extra compiler pass is added from HTL to HTL to extract all the direct accesses to the Verilog array and replace them with signals that access the RAM interface in a separate always-block. The translation is performed by going through all the instructions and replacing each load and store expression in turn. Stores can simply be replaced by the necessary wires directly; loads, however, take two clock cycles through the RAM interface where a direct load from an array takes only one, so this pass inserts an extra state after each load.
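As an illustration of the shape of the code this pass produces (the state numbers, registers and address below are invented, and the surrounding declarations are omitted), a store is completed within its original state, whereas a load is issued in one state and its result collected in a freshly inserted one:
\begin{verbatim}
// Hypothetical fragment of the data-path always-block after the pass;
// the signals are those of the RAM driver sketched earlier.
always @(posedge clk)
  case (state)
    32'd5: begin           // store: finishes within this single state
      u_en  <= ~u_en;      // request an access
      wr_en <= 1'b1;
      addr  <= 10'd3;
      d_in  <= r1;
      state <= 32'd6;
    end
    32'd6: begin           // load: issue the request ...
      u_en  <= ~u_en;
      wr_en <= 1'b0;
      addr  <= 10'd3;
      state <= 32'd7;      // ... and continue in the inserted state
    end
    32'd7: begin           // inserted state: d_out is valid by now
      r2    <= d_out;
      state <= 32'd8;
    end
  endcase
\end{verbatim}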
%\JW{I've called that negedge always-block the `RAM driver' in my proposed text above -- that feels like quite a nice a word for it to my mind -- what do you think?}\YH{Yes I quite like it!}
%Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs. If, for example, arrays in Verilog are accessed immediately in the data-path, then the synthesis tool is not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers. This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit. The synthesis tools therefore provide code snippets that they know how to transform into various constructs, including snippets that will generate proper RAMs in the final hardware. This process is called memory inference. The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and logic. We therefore have another pass from HTL to itself which performs the translation from this na\"ive use of arrays to a representation which always allows for memory inference. This pass creates a separate always block to perform the loads and stores to the memory, and adds the necessary data, address and enable signals to communicate with that always-block from other always-blocks. This always-block is shown between lines 10-15 in Figure~\ref{fig:accumulator_v}.
-There are two interesting parts to the inserted RAM interface. Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design which is triggered on the positive edge of the clock. The main advantage is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they only take two clock cycles and one clock cycle instead, greatly improving their performance. In addition to that, using the negative edge for the clock is supported by many synthesis tools, it therefore does not affect the maximum frequency of the final design.
+There are two interesting parts to the inserted RAM interface. Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design, which is triggered on the positive edge. The main advantage is that loads and stores then take two clock cycles and one clock cycle respectively, rather than three and two, greatly improving their performance. In addition, clocking the RAM on the negative edge is supported by many synthesis tools and does not affect the maximum frequency of the final design.
Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical. To make the proof simpler, the goal is to create a RAM which disables itself after every use, so that, firstly, the proof can assume that the RAM is disabled at the start and end of every clock cycle, and, secondly, only the state which contains the load or store needs to be modified to ensure this property. With a simple enable signal, the RAM could not disable itself, because this would result in a register being driven from two different locations, once from the RAM interface and once from the data-path. The only other solutions are either to insert extra states that disable the RAM accordingly, thereby eliminating the speed advantage of the negative-edge-triggered RAM, or to check the next state after a load or store and insert the disable into that state. The latter solution can quickly become complicated, though, especially as this next state could itself contain another memory operation, in which case the disable signal should not be added to it. A solution to this problem is to add a second enable signal that is driven only by the self-disabling RAM and is set equal to the enable signal driven by the data-path once the access has been performed. The RAM is then considered enabled whenever the data-path enable and the RAM enable differ.
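Concretely, the arrangement can be pictured as two always-blocks, each being the sole driver of its own flag. The sketch below (declarations omitted, condition name hypothetical) is intended only to illustrate that neither register is ever driven from two locations:
\begin{verbatim}
// Data-path (positive edge): the only driver of u_en.
always @(posedge clk)
  if (state_is_load_or_store)  // hypothetical condition, for illustration
    u_en <= ~u_en;             // toggling makes en != u_en, i.e. RAM enabled

// RAM driver (negative edge): the only driver of en.
always @(negedge clk)
  if (en != u_en) begin
    // ... perform the load or store ...
    en <= u_en;                // self-disable; nothing else ever writes en
  end
\end{verbatim}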
@@ -387,7 +387,7 @@ Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is al
\caption{Timing diagrams showing the execution of loads and stores over multiple clock cycles.}\label{fig:ram_load_store}
\end{figure}
-Figure~\ref{fig:ram_load} shows an example of how the waveforms in the RAM shown in Figure~\ref{fig:accumulator_v} behaves when a value is loaded. To initiate a the load, the data-path enable signal \texttt{u\_en} flag is toggled, the address \texttt{addr} is set and the write enable \texttt{wr\_en} is set to low. This all happens at the positive edge of the clock, in the time slice 1. Then, on the next negative edge of the clock, at time slice 2, as the \texttt{u\_en} is now different to the RAM enable \texttt{en}, this means that the RAM is enabled. A load is then performed by assigning the data out register to the value stored at the address in the RAM and the \texttt{en} is set to the same value as \texttt{u\_en} to disable the RAM again. Finally, on the next positive edge of the clock, the value in \texttt{d\_out} is assigned to the destination register \texttt{r}. An example of a store is shown in Figure~\ref{fig:ram_store}, which instead assigns the \texttt{d\_in} register with the value to be stored. The store is then performed on the negative edge of the clock and is therefore complete by the next positive edge.
+Figure~\ref{fig:ram_load} shows how the waveforms of the RAM in Figure~\ref{fig:accumulator_v} behave when a value is loaded. To initiate a load, the data-path toggles the enable flag \texttt{u\_en}, sets the address \texttt{addr} and sets the write enable \texttt{wr\_en} low. This all happens at the positive edge of the clock, in time slice 1. Then, on the next negative edge of the clock, at time slice 2, \texttt{u\_en} differs from the RAM enable \texttt{en}, so the RAM is enabled. A load is then performed by assigning the data-out register to the value stored at the given address, and \texttt{en} is set to the same value as \texttt{u\_en} to disable the RAM again. Finally, on the next positive edge of the clock, the value in \texttt{d\_out} is assigned to the destination register \texttt{r}. An example of a store is shown in Figure~\ref{fig:ram_store}; a store instead assigns the \texttt{d\_in} register with the value to be stored. The store is then performed on the negative edge of the clock and is therefore complete by the next positive edge.
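In code terms, the load in Figure~\ref{fig:ram_load} corresponds to the following schedule, written using the register names of the sketches above (\texttt{a} and \texttt{r} stand for the address and destination register, and are not generated names):
\begin{verbatim}
// Positive edge (time slice 1): the data-path issues the load.
//   u_en <= ~u_en;  addr <= a;  wr_en <= 1'b0;
// Negative edge (time slice 2): en != u_en, so the RAM driver
// answers and disables itself.
//   d_out <= mem[addr];  en <= u_en;
// Next positive edge: the data-path consumes the result.
//   r <= d_out;
\end{verbatim}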
\subsubsection{Implementing the \texttt{Oshrximm} instruction}