Add text

author: Yann Herklotz <git@yannherklotz.com> 2021-04-14 22:24:43 +0100
committer: Yann Herklotz <git@yannherklotz.com> 2021-04-14 22:24:43 +0100
commit: 6363b3998d65dc7a4e45cec0db9b41f69f45fb31 (patch)
tree: 851d52ed23f9fc1d22e9752cb8a6644c220565f3 /algorithm.tex
parent: a855aa4ff2211421db9d11b3270b69d2aa8b18f4 (diff)
1 files changed, 14 insertions, 7 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 83b879a..1d8201d 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -116,7 +116,8 @@ module main(reset, clk, finish, return_val);
   // RAM Template
   always @(negedge clk)
     if ({u_en != en}) begin
-      if (wr_en) stack[addr] <= d_in; else d_out <= stack[addr];
+      if (wr_en) stack[addr] <= d_in;
+      else d_out <= stack[addr];
       en <= u_en;
     end
   // Data-path
@@ -331,9 +332,11 @@ However, the memory model that \compcert{} uses for its intermediate languages i
 
 \subsubsection{Implementation of RAM templates}
 
-Verilog arrays can be used in a variety of ways, however, these do not all produce optimal hardware designs.  Synthesis tools provide code snippets that they know how to transform into various constructs, including proper RAMs in the final hardware.  If, instead of using these templates, the array was accessed immediately in the data-path, then the synthesis tool would not be able to identify it as having the right properties for a RAM, and would instead implement the array using registers.  This is extremely expensive, and for large memories this can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it would also affect the performance of the circuit.  The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and signalling.  Another pass from HTL to HTL then performs the translation from this na\"ive representation to using a proper RAM template, adding the necessary data, address and enable signals to make the synthesis tool infer that block as a proper RAM.
+Verilog arrays can be used in a variety of ways, but not all of them produce optimal hardware designs.  If, for example, an array in Verilog is accessed directly in the data-path, then the synthesis tool is not able to identify it as having the right properties for a RAM, and instead implements the array using registers.  This is extremely expensive: for large memories it can easily blow up the area usage of the FPGA, and because of the longer wires that are needed, it also affects the performance of the circuit.  Synthesis tools therefore provide code snippets that they know how to transform into various constructs, including snippets that will generate proper RAMs in the final hardware.  This process is called memory inference.  The initial translation from 3AC to HTL converts loads and stores to direct accesses to the memory, as this preserves the same behaviour without having to insert more registers and logic.  We therefore have another pass from HTL to itself which translates this na\"ive use of arrays into a representation which always allows for memory inference.  This pass creates a separate always-block to perform the loads and stores to the memory, and adds the necessary data, address and enable signals to communicate with that always-block from other always-blocks.  This always-block is shown on lines 10--15 of Figure~\ref{fig:accumulator_v}.
 
-There are two interesting parts to the memory template that is used for the stack of the main function.  Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design, which is triggered on the positive edge of the clock.  The main advantage is that instead of loads and stores taking three and two clock cycles respectively, they only take two and one clock cycle instead, greatly improving their performance.  In addition to that, using the negative edge for the clock is supported by many synthesis tools, it therefore does not affect the maximum frequency of the final design.  Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical.  To make the proof simpler, the goal is to create a RAM which disables itself after every use, so that firstly, the proof can assume that the RAM is disabled at the start and end of every clock cycle, and secondly so that only the state which contains the load and store need to be modified to ensure this property.  Using a simple enable signal, it would not be possible to disable it in the RAM itself, as well as enabling it in the datapath, as this would result in a register being driven twice from two different locations.  We can instead generate a second enable signal that is set by the user, and the original enable signal is then updated by the RAM to be equal to the value that the user set.  This means that the RAM should be enabled whenever the two signals are different, and disabled otherwise.
+There are two interesting parts to this RAM template.  Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design, which is triggered on the positive edge of the clock.  The main advantage is that instead of loads and stores taking three clock cycles and two clock cycles respectively, they only take two clock cycles and one clock cycle, greatly improving their performance.  In addition, using the negative edge of the clock is supported by many synthesis tools, and it therefore does not affect the maximum frequency of the final design.
+
+Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical.  To make the proof simpler, the goal is to create a RAM which disables itself after every use, so that firstly, the proof can assume that the RAM is disabled at the start and end of every clock cycle, and secondly, only the state which contains the load or store needs to be modified to ensure this property.  With a single enable signal this is not possible: the signal would have to be enabled from the data-path and disabled from the RAM, which would result in a register being driven from two different locations.  The only other solutions are either to insert extra states that disable the RAM accordingly, thereby eliminating the speed advantage of the negative-edge-triggered RAM, or to check the next state after a load or store and insert disables into that state.  The latter solution can quickly become complicated though, especially as this next state could contain another memory operation, in which case the disable signal should not be added to that state.  Instead, we use two enable signals: \texttt{u\_en} is set by the data-path, and \texttt{en} is controlled by the self-disabling RAM, which always sets it equal to \texttt{u\_en}.  The RAM is then considered enabled whenever the two signals are different, and disabled otherwise.
 
 \begin{figure}
   \centering
@@ -341,8 +344,9 @@ There are two interesting parts to the memory template that is used for the stac
     \begin{tikztimingtable}[timing/d/background/.style={fill=white}]
       \small clk & 2L 3{6C} \\
       \small u\_en & 2D{u\_en} 18D{$\overline{\text{u\_en}}$}\\
-      \small en & 8D{u\_en} 12D{$\overline{\text{u\_en}}$}\\
       \small addr & 2U 18D{3} \\
+      \small wr\_en & 2U 18L \\
+      \small en & 8D{u\_en} 12D{$\overline{\text{u\_en}}$}\\
       \small d\_out & 8U 12D{0xDEADBEEF} \\
       \small r & 14U 6D{0xDEADBEEF} \\
       \extracode
@@ -353,15 +357,16 @@ There are two interesting parts to the memory template that is used for the stac
         \vertlines[help lines]{2,8,14}
       \end{pgfonlayer}
     \end{tikztimingtable}
-    \caption{Timing diagram for loads.  The \texttt{u\_en} signal is toggled which enables the RAM, then d\_out is set to be the value stored at the address in the RAM, which is finally assigned to the register \texttt{r}.}
+    \caption{Timing diagram for loads.  The \texttt{u\_en} signal is toggled, which enables the RAM; \texttt{d\_out} is then set to the value stored at the address in the RAM, which is finally assigned to the register \texttt{r}.}\label{fig:ram_load}
   \end{subfigure}\hfill%
   \begin{subfigure}[b]{0.48\linewidth}
     \begin{tikztimingtable}[timing/d/background/.style={fill=white}]
       \small clk & 2L 2{7C} \\
       \small u\_en & 2D{u\_en} 14D{$\overline{\text{u\_en}}$}\\
-      \small en & 9D{u\_en} 7D{$\overline{\text{u\_en}}$}\\
       \small addr & 2U 14D{3} \\
+      \small wr\_en & 2U 14H \\
       \small d\_in & 2U 14D{0xDEADBEEF} \\
+      \small en & 9D{u\_en} 7D{$\overline{\text{u\_en}}$}\\
       \small stack[addr] & 9U 7D{0xDEADBEEF} \\
       \extracode
       \node[help lines] at (2,2.25) {\tiny 1};
@@ -370,11 +375,13 @@ There are two interesting parts to the memory template that is used for the stac
         \vertlines[help lines]{2,9}
       \end{pgfonlayer}
     \end{tikztimingtable}
-    \caption{Timing diagram for stores.  The \texttt{u\_en} signal is toggled to enable the RAM, together with the address \texttt{addr} and the data to store \texttt{d\_in}.  On the negative edge the data is stored into the RAM.}
+    \caption{Timing diagram for stores.  The \texttt{u\_en} signal is toggled to enable the RAM, together with the address \texttt{addr} and the data to store \texttt{d\_in}.  On the negative edge the data is stored into the RAM.}\label{fig:ram_store}
   \end{subfigure}
   \caption{Timing diagrams showing the execution of loads and stores over multiple clock cycles.}\label{fig:ram_load_store}
 \end{figure}
 
+Figure~\ref{fig:ram_load} shows an example of how the waveforms in the RAM shown in Figure~\ref{fig:accumulator_v} behave when a value is loaded.  To initiate the load, the data-path enable signal \texttt{u\_en} is toggled, the address \texttt{addr} is set and the write enable \texttt{wr\_en} is set to low.  This all happens at the positive edge of the clock, in time slice 1.  Then, on the next negative edge of the clock, at time slice 2, \texttt{u\_en} is different to the RAM enable \texttt{en}, which means that the RAM is enabled.  A load is then performed by assigning the value stored at the address in the RAM to the data-out register \texttt{d\_out}, and \texttt{en} is set to the same value as \texttt{u\_en} to disable the RAM again.  Finally, on the next positive edge of the clock, the value in \texttt{d\_out} is assigned to the destination register \texttt{r}.  An example of a store is shown in Figure~\ref{fig:ram_store}, which instead sets \texttt{wr\_en} to high and assigns the value to be stored to the \texttt{d\_in} register.  The store is then performed on the negative edge of the clock and is therefore complete by the next positive edge.
+
 \subsubsection{Implementing the \texttt{Oshrximm} instruction}
 
 % Mention that this optimisation is not performed sometimes (clang -03).
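
Note on the patch above: the self-disabling enable scheme it describes can be checked with a small cycle-level model. The following Python sketch is an illustration only, not part of the commit and not the authors' implementation. The signal names (`u_en`, `en`, `wr_en`, `addr`, `d_in`, `d_out`, `stack`) are taken from the RAM template in the diff; the class `SelfDisablingRam` and the helper `request` are hypothetical names introduced here, and clock edges are modelled as explicit method calls rather than simulated Verilog events.

```python
class SelfDisablingRam:
    """Cycle-level model of the RAM template: enabled whenever u_en != en."""

    def __init__(self, size):
        self.stack = [0] * size
        self.u_en = 0   # toggled only by the data-path
        self.en = 0     # written only by the RAM itself
        self.wr_en = 0
        self.addr = 0
        self.d_in = 0
        self.d_out = 0

    def enabled(self):
        return self.u_en != self.en

    def negedge(self):
        """Models the `always @(negedge clk)` block of the template."""
        if self.u_en != self.en:
            if self.wr_en:
                self.stack[self.addr] = self.d_in
            else:
                self.d_out = self.stack[self.addr]
            # Copying u_en into en makes the two equal again,
            # which disables the RAM until the next toggle.
            self.en = self.u_en


def request(ram, addr, wr_en, d_in=0):
    """Hypothetical data-path action on a positive edge: toggle u_en, set inputs."""
    ram.u_en ^= 1
    ram.addr = addr
    ram.wr_en = wr_en
    ram.d_in = d_in


ram = SelfDisablingRam(8)

# Store: issued on a positive edge, performed on the following negative edge.
request(ram, addr=3, wr_en=1, d_in=0xDEADBEEF)
enabled_before_store = ram.enabled()
ram.negedge()
disabled_after_store = not ram.enabled()

# Load: issued on a positive edge, d_out is updated on the negative edge,
# and the destination register r would take d_out on the next positive edge.
request(ram, addr=3, wr_en=0)
ram.negedge()
r = ram.d_out

# An idle negative edge leaves the RAM disabled and its state unchanged.
ram.negedge()
```

Because each of `u_en` and `en` is written from exactly one place in this model, it mirrors the point made in the patch: no register is driven from two locations, yet the RAM is disabled again after every access.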