Add algorith oshrximm section

author: Yann Herklotz <git@yannherklotz.com> 2020-11-19 22:23:38 +0000
committer: Yann Herklotz <git@yannherklotz.com> 2020-11-19 22:23:53 +0000
commit: 7693bd9c9ba215fa132b696a0de77881ba59c214 (patch)
tree: 80afa094e84e2170b3776e3ed62e285b74598592 /algorithm.tex
parent: cfdd7c67242a9211dc572070c33e7af758ed5745 (diff)
download: oopsla21_fvhls-7693bd9c9ba215fa132b696a0de77881ba59c214.tar.gz
oopsla21_fvhls-7693bd9c9ba215fa132b696a0de77881ba59c214.zip
1 files changed, 23 insertions, 11 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 65c2cb1..c4f59c5 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -262,24 +262,36 @@ However, the memory model that \compcert{} uses for its intermediate languages i
 
 % Mention that this optimisation is not performed sometimes (clang -03).
 
-\vericert{} performs some optimisations at the level of the instructions that are generated, so that the hardware performs the instructions as quickly as possible and so that the maximum frequency at which the hardware can run is increased.  One of the main constructs that cripples performance of the generated hardware is the instantiation of divider circuits.  Since \vericert{} requires the result of a divide operation to be ready within a single clock cycle, the divide circuit needs to be entirely combinational.  This is inefficient in terms of area, but also in terms of latency, because it means that the maximum frequency of the hardware must be reduced dramatically so that the divide circuit has enough time to finish.
+\vericert{} performs some optimisations at the level at which instructions from 3AC are converted to Verilog statements.  One of the main constructs that cripples performance of the generated hardware is the instantiation of divider circuits.  Since \vericert{} requires the result of a divide operation to be ready within a single clock cycle, the divide circuit needs to be entirely combinational.  This is inefficient in terms of area, but also in terms of latency, because it means that the maximum frequency of the hardware must be reduced dramatically so that the divide circuit has enough time to finish.
 
 %\JP{Multi-cycle paths might be something worth exploring in future work: fairly error-prone/dangerous for hand-written code, but might be interesting in generated code.}\YH{Definitely is on the list for next things to look into, will make divide so much more efficient.}
 
 %These small optimisations were found to be the most error prone, and guaranteeing that the new representation is equivalent to representation used in the \compcert{} semantics is difficult without proving this for all possible inputs.
 
-But dividing by a constant can often be optimised to a more efficient operation, especially if the denominator is a power of two.  In \compcert{}, the \texttt{Oshrximm} instruction is a strength-reduced version of a signed divide operation, performing the following operation, which is transformed to our optimal representation on the right, where $\div$ stands for integer signed division and $>>$ stands for a logical right shift:
+But dividing by a constant can often be optimised to a more efficient operation, especially if the denominator is a power of two.  In \compcert{}, the \texttt{Oshrximm} instruction is a strength-reduced version of a signed divide operation, performing the following operation, where $\div$ stands for integer signed division:
 
-\begin{align*}
-  &\forall x, y \in \mathbb{Z},\ \ 0 \leq y < 31,\ \ -2^{31} \leq x < 2^{31},\\
-  &x \div 2^y =
-  \begin{cases}
-    \left\lfloor \frac{x}{2^y} \right\rfloor = x >> y,& \text{if } x \geq 0\\
-    \left\lceil \frac{x}{2^y} \right\rceil = - \left\lfloor \frac{-x}{2^y} \right\rfloor = - ( - x >> y ),& \text{otherwise}.
-  \end{cases}\\
-\end{align*}
+\begin{equation*}
+  \texttt{Oshrximm } x\ y = x \div 2^{y}
+\end{equation*}
 
-The \compcert{} semantics for the \texttt{Oshrximm} instruction expresses its operation exactly as shown in the equation above, even though in hardware the computation that would be performed would be different.  In \vericert{}, if the same operation would be implemented using Verilog operators, it is not guaranteed to be optimised correctly by the synthesis tools that convert the Verilog into a circuit.  To guarantee an output that does not include divides, we therefore have to express it in Verilog using shifts, and then prove that this representation is equivalent to the divide representation used in the \compcert{} semantics.  While conducting the proof, we discovered quite a few bugs in our initial implementation of optimisations, which rounded to $-\infty$ instead of 0.
+As these are the semantics of the \texttt{Oshrximm} instruction, it means that the Verilog statement that is supposed to implement this instruction needs to have the exact same semantics.  The na\"{i}ve, implementation would be to just implement it using the Verilog division operator and rely on the hardware synthesis tool to convert that to an efficient implementation.  It is quite rare that this will succeed with signed division though, as it cannot be implemented using a single simple shift instruction and the synthesis tool will often fall back to using a division circuit.  However, the observation can be made that signed division can be implemented using shifts as shown below,  where $>>$ stands for a logical right shift:
+
+
+\begin{remark}
+  $\forall x, y \in \mathbb{Z},\ \ 0 \leq y < 31,\ \ -2^{31} \leq x < 2^{31},\\$
+
+  \begin{equation*}
+    x \div 2^y =
+    \begin{cases}
+      \left\lfloor \frac{x}{2^y} \right\rfloor = x >> y,& \text{if } x \geq 0\\
+      \left\lceil \frac{x}{2^y} \right\rceil = - \left\lfloor \frac{-x}{2^y} \right\rfloor = - ( - x >> y ),& \text{otherwise}.
+    \end{cases}\\
+  \end{equation*}
+\end{remark}
+
+Once this equivalence about the shifts and division operator is proven correct, it can be used to implement the \texttt{Oshrximm} using the efficient shift version instead of how the \compcert{} semantics described it.  When proving this equivalence initially this actually found a bug in our original implementation, as division will round to 0 whereas a na\"{i}ve shift will round to $-\infty$.
+
+%The \compcert{} semantics for the \texttt{Oshrximm} instruction expresses its operation exactly as shown in the equation above, even though in hardware the computation that would be performed would be different.  In \vericert{}, if the same operation would be implemented using Verilog operators, it is not guaranteed to be optimised correctly by the synthesis tools that convert the Verilog into a circuit.  To guarantee an output that does not include divides, we therefore have to express it in Verilog using shifts, and then prove that this representation is equivalent to the divide representation used in the \compcert{} semantics.  While conducting the proof, we discovered quite a few bugs in our initial implementation of optimisations, which rounded to $-\infty$ instead of 0.
author	Yann Herklotz <git@yannherklotz.com>	2020-11-19 22:23:38 +0000
committer	Yann Herklotz <git@yannherklotz.com>	2020-11-19 22:23:53 +0000
commit	7693bd9c9ba215fa132b696a0de77881ba59c214 (patch)
tree	80afa094e84e2170b3776e3ed62e285b74598592 /algorithm.tex
parent	cfdd7c67242a9211dc572070c33e7af758ed5745 (diff)
download	oopsla21_fvhls-7693bd9c9ba215fa132b696a0de77881ba59c214.tar.gz oopsla21_fvhls-7693bd9c9ba215fa132b696a0de77881ba59c214.zip