summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
-rw-r--r--evaluation.tex2
-rw-r--r--verilog.tex2
2 files changed, 2 insertions, 2 deletions
diff --git a/evaluation.tex b/evaluation.tex
index 9cf8d9e..aee2992 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -19,7 +19,7 @@ We were able to use 27 of the 30 programs; three had to be discarded (\texttt{co
%In summary, we evaluate 27 programs from the latest Polybench suite.
We configured \polybench{}'s parameters so that only integer types are used. We use \polybench{}'s smallest datasets for each program to ensure that data can reside within on-chip memories of the FPGA, avoiding any need for off-chip memory accesses.
-\vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including LegUp, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. Implementing pipelined operators such as the divide and modulus operator can be solved by scheduling the instructions so that these can execute in parallel, which is the main optimisation that needs to be added to \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is replaced with our own implementation that uses repeated division and multiplications by 2. Figure~\ref{fig:polybench-div} shows the results of comparing Vericert with optimised LegUp 4.0 on the \polybench{} benchmarks, where divisions have been left intact. Figure~\ref{fig:polybench-nodiv} performs the comparison where the division/modulo operations have been replaced by the iterative algorithm.
+\vericert{} implements divisions and modulo operations in C using the corresponding built-in Verilog operators. These built-in operators are designed to complete within a single clock cycle, and this causes substantial penalties in clock frequency. Other HLS tools, including LegUp, supply their own multi-cycle division/modulo implementations, and we plan to do the same in future versions of \vericert{}. Implementing pipelined operators such as the divide and modulus operator can be solved by scheduling the instructions so that these can execute in parallel, which is the main optimisation that needs to be added to \vericert{}. In the meantime, we have prepared an alternative version of the benchmarks in which each division/modulo operation is replaced with our own implementation that uses repeated division and multiplications by 2. Figure~\ref{fig:polybench-div} shows the results of comparing \vericert{} with optimised LegUp 4.0 on the \polybench{} benchmarks, where divisions have been left intact. Figure~\ref{fig:polybench-nodiv} performs the comparison where the division/modulo operations have been replaced by the iterative algorithm.
\paragraph{Synthesis setup} The Verilog that is generated by \vericert{} or \legup{} is provided to Xilinx Vivado v2017.1~\cite{xilinx_vivad_desig_suite}, which synthesises it to a netlist, before placing-and-routing this netlist onto a Xilinx XC7Z020 FPGA device that contains approximately 85000 LUTs.
diff --git a/verilog.tex b/verilog.tex
index d144f90..2c474e6 100644
--- a/verilog.tex
+++ b/verilog.tex
@@ -193,7 +193,7 @@ The Verilog semantics do not define a memory model for Verilog, as this is not n
%\JW{It's not completely clear what the relationship is between your work and those works. The use of `only' suggests that you've re-done a subset of work that has already been done -- is that the right impression?}\YH{Hopefully that's more clear.}
-This translation is represented in Figure~\ref{fig:memory_model_transl}, where \compcert{} defines a map from blocks to maps from memory address to memory contents. Each block represents an area in memory, for example, a block can represent a global variable or a stack for a function. Instead, our Verilog semantics define two finitely sized arrays of optional values, one for the blocking assignments map $\Gamma_{\rm a}$ and one for the nonblocking assignments map $\Delta_{\rm a}$. The optional values are present to ensure correct merging of the two association maps at the end of the clock cycle. During our translation we only convert block 0 to a Verilog memory, and ensure that it is the only block that is present. This means that the block necessarily represents the stack of the main function. The invariant that then has to hold in the proofs, is that block 0 should be equivalent to the merged representation of the $\Gamma_{rm a}$ and $\Delta_{rm a}$ maps.
+This translation is represented in Figure~\ref{fig:memory_model_transl}, where \compcert{} defines a map from blocks to maps from memory address to memory contents. Each block represents an area in memory, for example, a block can represent a global variable or a stack for a function. Instead, our Verilog semantics define two finitely sized arrays of optional values, one for the blocking assignments map $\Gamma_{\rm a}$ and one for the nonblocking assignments map $\Delta_{\rm a}$. The optional values are present to ensure correct merging of the two association maps at the end of the clock cycle. During our translation we only convert block 0 to a Verilog memory, and ensure that it is the only block that is present. This means that the block necessarily represents the stack of the main function. The invariant that then has to hold in the proofs, is that block 0 should be equivalent to the merged representation of the $\Gamma_{\rm a}$ and $\Delta_{\rm a}$ maps.
%However, in practice, assigning and reading from an array directly in the state machine will not produce a memory in the final hardware design, as the synthesis tool cannot identify the array as having the necessary properties that a RAM needs, even though this is the most natural formulation of memory. Even though theoretically the memory will only be read from once per clock cycle, the synthesis tool cannot ensure that this is true, and will instead create a register for each memory location. This increases the size of the circuit dramatically, as the RAM on the FPGA chip will not be reused. Instead, the synthesis tool expects a specific interface that ensures these properties, and will then transform the interface into a proper RAM during synthesis. Therefore, a translation has to be performed from the naive use of memory in the state machine, to a proper use of a memory interface.