verilog.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180

\section{A Formal Semantics for Verilog}\label{sec:verilog}

\newcommand{\alwaysblock}{always-block}

This section describes the Verilog semantics that was chosen for the target language, including the changes that were made to the semantics to make it a suitable HLS target.  The Verilog standard is quite large~\cite{06_ieee_stand_veril_hardw_descr_languag,05_ieee_stand_veril_regis_trans_level_synth}, but the syntax and semantics can be reduced to a small subset that \vericert{} needs to target.

The Verilog semantics we use is ported to Coq from a semantics written in HOL4 by \citet{loow19_proof_trans_veril_devel_hol} and used to prove the translation from HOL4 to Verilog~\cite{loow19_verif_compil_verif_proces}. % which was used to create a formal translation from a logic representation encoded in the HOL4~\cite{slind08_brief_overv_hol4} theorem prover into an equivalent Verilog design. 
This semantics is quite practical as it is restricted to a small subset of Verilog, which can nonetheless be used to model the hardware constructs required for HLS.  The main features that are excluded are continuous assignment and combinational \alwaysblock{}s; these are modelled in other semantics such as that by~\citet{meredith10_veril}. %however, these are not necessarily needed, but require more complicated event queues and execution model.

The semantics of Verilog differs from regular programming languages, as it is used to describe hardware directly, which is inherently parallel, rather than an algorithm, which is usually sequential.  The main construct in Verilog is the \alwaysblock{}.
A module can contain multiple \alwaysblock{}s, all of which run in parallel.  These \alwaysblock{}s further contain statements such as if-statements or assignments to variables.  We support only \emph{synchronous} logic, which means that the \alwaysblock{} is triggered on (and only on) the rising edge of a clock signal.
%\NR{We should mention that variables cannot be driven by multiple \alwaysblock{}s, since one might get confused with data races when relating to concurrent processes in software.} \JW{Given the recent discussion on Teams, it seems to me that we perhaps don't need to mention here what happens if a variable is driven multiple times per clock cycle, especially since Vericert isn't ever going to do that.}

The semantics combines the big-step and small-step styles. The overall execution of the hardware is described using a small-step semantics, with one small step per clock cycle; this is appropriate because hardware is routinely designed to run for an unlimited number of clock cycles and the big-step style is ill-suited to describing infinite executions. Then, within each clock cycle, a big-step semantics is used to execute all the statements.
An example of a rule for executing an \alwaysblock{} is shown below, where $\Sigma$ is the state of the registers in the module and $s$ is the statement inside the \alwaysblock{}:

\begin{equation*}
  \inferrule[Always]{(\Sigma, s)\downarrow_{\text{stmnt}} \Sigma'}{(\Sigma, \yhkeyword{always @(posedge clk) } s) \downarrow_{\text{always}^{+}} \Sigma'}
\end{equation*}

\noindent This rule says that assuming the statement $s$ in the \alwaysblock{} runs with state $\Sigma$ and produces the new state $\Sigma'$, the \alwaysblock{} will result in the same final state.  %Since only clocked \alwaysblock{} are supported, and one step in the semantics correspond to one clock cycle, it means that this rule is run once per clock cycle for each \alwaysblock{}.

Two types of assignments are supported in \alwaysblock{}s: nonblocking and blocking assignment.  Nonblocking assignments all take effect simultaneously at the end of the clock cycle, %and atomically.
while blocking assignments happen instantly so that later assignments in the clock cycle can pick them up.  To model both of these assignments, the state $\Sigma$ has to be split into two maps: $\Gamma$, which contains the current values of all variables, and $\Delta$, which contains the values that will be assigned at the end of the clock cycle. %, we can therefore say that $\Sigma = (\Gamma, \Delta)$. 
Nonblocking assignment can therefore be expressed as follows:
\begin{equation*}
  \inferrule[Nonblocking Reg]{\yhkeyword{name}\ d = \yhkeyword{OK}\ n \\ (\Gamma, e) \downarrow_{\text{expr}} v}{((\Gamma, \Delta), d\ \yhkeyword{ <= } e) \downarrow_{\text{stmnt}} (\Gamma, \Delta [n \mapsto v])}\\
\end{equation*}

\noindent where assuming that $\downarrow_{\text{expr}}$ evaluates an expression $e$ to a value $v$, the nonblocking assignment $d\ \yhkeyword{ <= } e$ updates the future state of the variable $d$ with value $v$.

Finally, the following rule dictates how the whole module runs in one clock cycle:
\begin{equation*}
  \inferrule[Module]{(\Gamma, \epsilon, \vec{m})\ \downarrow_{\text{module}^{+}} (\Gamma', \Delta') \\ (\Gamma'\ //\ \Delta', \epsilon, \vec{m}) \downarrow_{\text{module}^{-}} (\Gamma'', \Delta'')}{(\Gamma, \yhkeyword{module } \yhconstant{main} \yhkeyword{(...);}\ \vec{m}\ \yhkeyword{endmodule}) \downarrow_{\text{program}} (\Gamma''\ //\ \Delta'')}
\end{equation*}
where $\Gamma$ is the initial state of all the variables, and $\vec{m}$ is a list of variable declarations and \alwaysblock{}s that $\downarrow_{\text{module}}$ evaluates sequentially to obtain $(\Gamma', \Delta')$. The final state is obtained by merging these maps using the $//$ operator, which gives priority to the right-hand operand in a conflict. This rule ensures that the nonblocking assignments overwrite at the end of the clock cycle any blocking assignments made during the cycle.

\subsection{Changes to the Semantics}

Four changes were made to the semantics proposed by \citet{loow19_proof_trans_veril_devel_hol} to make it suitable as a HLS target.

\paragraph{Adding array support}
The main change is the addition of support for arrays, which are needed to model RAM in Verilog.  RAM is needed to model the stack in C efficiently, without having to declare a variable for each possible stack location. % In the original semantics, RAMs (as well as inputs and outputs to the module) could be modelled using a function from variable names (strings) to values, which could be modified accordingly to model inputs to the module.  This is quite an abstract description of memory and can also be expressed as an array of bitvectors instead, which is the path we took. This requires the addition of array operators to the semantics and correct reasoning of loads and stores to the array in different \alwaysblock{}s simultaneously.  
Consider the following Verilog code:
\begin{center}
\begin{minted}{verilog}
reg [31:0] x[1:0];
always @(posedge clk) begin
  x[0] = 1;
  x[1] <= 1;
end
\end{minted}
\end{center}
which modifies one array element using blocking assignment and then a second using nonblocking assignment. If the existing semantics were used to update the array, then during the merge, the entire array \texttt{x} from the nonblocking association map would replace the entire array from the blocking association map.  This would replace \texttt{x[0]} with its original value and therefore behave incorrectly. Accordingly, we modified the maps so they record updates on a per-el\-em\-ent basis. Our state $\Gamma$ is therefore split up into $\Gamma_{r}$ for instantaneous updates to variables, and $\Gamma_{a}$ for instantaneous updates to arrays; $\Delta$ is split similarly. The merge function then ensures that only the modified indices get updated when $\Gamma_{a}$ is merged with the nonblocking map equivalent $\Delta_{a}$.

\paragraph{Adding declarations} Explicit support for declaring inputs, outputs and internal variables was added to the semantics to make sure that the generated Verilog also contains the correct declarations.  This adds some guarantees to the generated Verilog and ensures that it synthesises and simulates correctly.

\paragraph{Removing support for external inputs to modules} Support for receiving external inputs was removed from the semantics for simplicity, as these are not needed for an HLS target. The main module in Verilog models the main function in C, and since the inputs to a C function shouldn't change during its execution, there is no need for external inputs for Verilog modules.  

\paragraph{Simplifying representation of bitvectors} Finally, we use 32-bit integers to represent bitvectors rather than arrays of Booleans. This is because \vericert{} (currently) only supports types represented by 32 bits.

\subsection{Integrating the Verilog Semantics into \compcert{}'s Model}
\label{sec:verilog:integrating}

\begin{figure*}
  \centering
  \begin{minipage}{1.0\linewidth}
    \begin{gather*}
      \inferrule[Step]{\Gamma_r\ !\ \textit{rst} = \yhconstant{Some } 0 \\ \Gamma_r\ !\ \textit{fin} = \yhconstant{Some } 0 \\ \Gamma_r\ !\ \sigma = \yhconstant{Some } v \\ (m, (\Gamma_r, \Gamma_a))\ \downarrow_{\text{program}} (\Gamma_r', \Gamma_a') \\ \Gamma_r'\ !\ \sigma = \yhconstant{Some } v'}{\yhconstant{State } \textit{sf }\ m\ v\ \Gamma_r\ \Gamma_a \longrightarrow \yhconstant{State } \textit{sf }\ m\ v'\ \Gamma_r'\ \Gamma_a'}\\
      %
      \inferrule[Finish]{\Gamma_r\ !\ \textit{fin} = \yhconstant{Some } 1 \\ \Gamma_r\ !\ \textit{ret} = \yhconstant{Some } r}{\yhconstant{State } \textit{sf }\ m\ \sigma\ \Gamma_r\ \Gamma_a \longrightarrow \yhconstant{Returnstate } \textit{sf }\ r}\\
      %
      \inferrule[Call]{ }{\yhconstant{Callstate } \textit{sf }\ m\ \vec{r} \longrightarrow \yhconstant{State } \textit{sf }\ m\ n\ ((\yhfunction{init\_params}\ \vec{r}\ a)[\sigma \mapsto n, \textit{fin} \mapsto 0, \textit{rst} \mapsto 0])\ \epsilon}\\
      %
      \inferrule[Return]{ }{\yhconstant{Returnstate } (\yhconstant{Stackframe } r\ m\ \textit{pc }\ \Gamma_r\ \Gamma_a :: \textit{sf})\ v \longrightarrow \yhconstant{State } \textit{sf }\ m\ \textit{pc }\ (\Gamma_{r} [ \textit{st} \mapsto \textit{pc}, r \mapsto v ])\ \Gamma_{a}}
    \end{gather*}
  \end{minipage}
  \caption{Top-level small-step semantics for Verilog modules in \compcert{}'s computational framework.}%
  \label{fig:inferrence_module}
\end{figure*}

The \compcert{} computation model defines a set of states through which execution passes. In this subsection, we explain how we extend our Verilog semantics with five special-purpose  registers in order to integrate it into \compcert{}.

\compcert{} executions pass through three main states:
\begin{description}
  \item[\texttt{State} \textit{sf} $m$ $v$ $\Gamma_{r}$ $\Gamma_{a}$] The main state when executing a function, with stack frame \textit{sf}, current module $m$, current state $v$ and variable states $\Gamma_{r}$ and $\Gamma_{a}$.
  \item[\texttt{Callstate} \textit{sf} $m$ $\vec{r}$] The state that is reached when a function is called, with the current stack frame \textit{sf}, current module $m$ and arguments $\vec{r}$.
  \item[\texttt{Returnstate} \textit{sf} $v$] The state that is reached when a function returns back to the caller, with stack frame \textit{sf} and return value $v$.
\end{description}

To support this computational model, we extend the Verilog module we generate with the following five registers and a RAM block:

\begin{description}
  \item[program counter] The program counter can be modelled using a register that keeps track of the state, denoted as $\sigma$.
  \item[function entry point] When a function is called, the entry point denotes the first instruction that will be executed. This can be modelled using a reset signal that sets the state accordingly, denoted as \textit{rst}.
  \item[return value] The return value can be modelled by setting a finished flag to 1 when the result is ready, and putting the result into a 32-bit output register. These are denoted as \textit{fin} and \textit{rtrn} respectively.
  \item[stack] The function stack can be modelled as a RAM block, which is implemented using an array in the module, and denoted as \textit{stk}.
\end{description}

Figure~\ref{fig:inferrence_module} shows the inference rules for moving between the computational states.  The first, \textsc{Step}, is the normal rule of execution.  It defines one step in the \texttt{State} state, assuming that the module is not being reset, that the finish state has not been reached yet, that the current and next state are $v$ and $v'$, and that the module runs from state $\Gamma$ to $\Gamma'$ using the \textsc{Step} rule.  The \textsc{Finish} rule returns the final value of running the module and is applied when the \textit{fin} register is set; the return value is then taken from the \textit{ret} register.

Note that there is no step from \texttt{State} to \texttt{Callstate}; this is because function calls are not supported, and it is therefore impossible in our semantics to ever reach a \texttt{Callstate} except for the initial call to main. So the \textsc{Call} rule is only used at the very beginning of execution; likewise, the \textsc{Return} rule is only matched for the final return value from the main function. %as there is no rule that allocates another stack frame \textit{sf} except for the initial call to main.  
Therefore, in addition to the rules shown in Figure~\ref{fig:inferrence_module}, an initial state and final state need to be defined:

\begin{gather*}
  \inferrule[Initial]{\yhfunction{is\_internal}\ (P.\texttt{main})}{\yhfunction{initial\_state}\ (\yhconstant{Callstate } []\ (P.\texttt{main})\ [])}\qquad
  \inferrule[Final]{ }{\yhfunction{final\_state}\ (\yhconstant{Returnstate } []\ n)\ n}
\end{gather*}

\noindent where the initial state is the \texttt{Callstate} with an empty stack frame and no arguments for the \texttt{main} function of program $P$, where this \texttt{main} function needs to be in the current translation unit.  The final state results in the program output of value $n$ when reaching a \texttt{Returnstate} with an empty stack frame.

\subsection{Memory Model}

The Verilog semantics do not define a memory model for Verilog, as this is not needed for a hardware description language.  There is no preexisting architecture that Verilog will produce, it can describe any memory layout that is needed.  Instead of having specific semantics for memory, the semantics only need to support the language features that can produce these different memory layouts, these being Verilog arrays.  We therefore define semantics for updating Verilog arrays using blocking and nonblocking assignment.  We then have to prove that the C memory model that \compcert{} uses matches with the interpretation of arrays that are used in Verilog.  The \compcert{} memory model is infinite, whereas our representation of arrays in Verilog is inherently finite.  There have already been various efforts to define a finite memory model for \compcert{}, such as CompCertS~\cite{besson18_compc}, CompCertELF~\cite{wang20_compc} and CompCertTSO~\cite{sevcik13_compc}, however, we only define the translation from \compcert{}'s standard infinite memory model to finitely sized arrays that can be represented in Verilog.

This translation is represented in Figure~\ref{fig:memory_model_transl}, where \compcert{} defines a map from blocks to maps from memory address to memory contents.  Instead, our Verilog semantics define two finitely sized arrays of optional values, one for the blocking assignments map $\Gamma_{\rm a}$ and one for the nonblocking assignments map $\Delta_{\rm a}$.  The optional values are present to ensure correct merging of the two association maps at the end of the clock cycle.

\begin{figure}
  \centering
  \includegraphics[width=0.5\linewidth]{diagrams/memory_model.pdf}
  \caption{Change in the memory model during the translation of 3AC to HTL.  This is immediately after the assignment to the array}\label{fig:memory_model_transl}
\end{figure}

However, in practice, assigning and reading from an array directly in the state machine will not produce a memory in the final hardware design, as the synthesis tool cannot identify the array as having the necessary properties that a RAM needs, even though this is the most natural formulation of memory.  Even though theoretically the memory will only be read from once per clock cycle, the synthesis tool cannot ensure that this is true, and will instead create a register for each memory location.  This increases the size of the circuit dramatically, as the RAM on the FPGA chip will not be reused.  Instead, the synthesis tool expects a specific template that ensures these properties, and will then transform the template into a proper RAM during synthesis.  Therefore, a translation has to be performed from the naive use of memory in the state machine, to a proper use of a memory template.

\begin{figure}
  \centering
  \begin{minipage}{1.0\linewidth}
    \begin{gather*}
      \inferrule[None]{ }{(\Gamma,\Delta, \yhconstant{None}) \downarrow_{\text{ram}} \Delta}\qquad
%
      \inferrule[Idle]{\Gamma_{\rm r}\ !\ \textit{r.en} = \Gamma_{\rm r}\ !\ \textit{r.u\_{en}}}{((\Gamma_{\rm r}, \Gamma_{\rm a}), \Delta, \yhconstant{Some}\ r) \downarrow_{\text{ram}} \Delta}\\
%
      \inferrule[Read]{\Gamma_{\rm r}\ !\ \textit{r.en} \ne \Gamma_{\rm r}\ !\ \textit{r.u\_en} \\ \Gamma_{\rm r}\ !\ \textit{r.wr\_en} = 0}{((\Gamma_{\rm r}, \Gamma_{\rm a}), (\Delta_{\rm r}, \Delta_{\rm a}), \yhconstant{Some}\ r) \downarrow_{\text{ram}} (\Delta_{\rm r}[\textit{r.en} \mapsto \textit{r.u\_en}, \textit{r.d\_out} \mapsto (\Gamma_{\rm a}\ !\ \textit{r.mem})\ !\ \textit{r.addr}], \Delta_{\rm a}) }\\
%
      \inferrule[Write]{\Gamma_{\rm r}\ !\ \textit{r.en} \ne \Gamma_{\rm r}\ !\ \textit{r.u\_en} \\ \Gamma_{\rm r}\ !\ \textit{r.wr\_en} = 1}{((\Gamma_{\rm r}, \Gamma_{\rm a}), (\Delta_{\rm r}, \Delta_{\rm a}), \yhconstant{Some}\ r) \downarrow_{\text{ram}} (\Delta_{\rm r}[\textit{r.en} \mapsto \textit{r.u\_en}], \Delta_{\rm a}[\textit{r.mem} \mapsto (\Gamma_{\rm a}\ !\ \textit{r.mem})[\textit{r.addr} \mapsto \textit{r.d\_in}]]) }
    \end{gather*}
  \end{minipage}
  \caption{Specification for memory implementation in HTL, which is then implemented by equivalent Verilog code.}\label{fig:htl_ram_spec}
\end{figure}

\begin{figure}
  \begin{subfigure}[b]{0.48\linewidth}
\begin{minted}{verilog}
  reg [31:0] mem [SIZE-1:0];
  always @(negedge clk)
    if (en != u_en) begin
      if (wr_en) mem[addr] <= d_in;
      else d_out <= mem[addr];
      en <= u_en;
    end
\end{minted}
    \caption{Implementation of memory specification in Verilog of its HTL specification.}\label{fig:verilog_ram_impl}
  \end{subfigure}\hfill%
  \begin{subfigure}[b]{0.48\linewidth}
    \includegraphics[width=\linewidth]{diagrams/load_waveform.pdf}
    \caption{Example wave form for a load being performed by the Verilog code.}
  \end{subfigure}
  \caption{RAM representation in Verilog and a trace of it's execution.}
\end{figure}

This memory template can be represented using the following semantics shown in Figure~\ref{fig:htl_ram_spec}, which is then translated to the equivalent Verilog implementation shown in Figure~\ref{fig:verilog_ram_impl}.  There are two interesting parts to the memory template that is used for the stack of the main function.  Firstly, the memory updates are triggered on the negative edge of the clock, out of phase with the rest of the design, which is triggered on the positive edge of the clock.  The main advantage is that instead of loads and stores taking three and two clock cycles respectively, they only take two and one clock cycle instead, greatly improving their performance.  In addition to that, using the negative edge for the clock is supported by many synthesis tools, it therefore does not affect the maximum frequency of the final design.  Secondly, the logic in the enable signal of the RAM (\texttt{en != u\_en}) is also atypical.  To make the proof simpler, the goal is to create a RAM which disables itself after every use, so that firstly, the proof can assume that the RAM is disabled at the start and end of every clock cycle, and secondly so that only the state which contains the load and store need to be modified to ensure this property.  Using a simple enable signal, it would not be possible to disable it in the RAM itself, as well as enabling it in the datapath, as this would result in a register being driven twice from two different locations.  We can instead generate a second enable signal that is set by the user, and the original enable signal is then updated by the RAM to be equal to the value that the user set.  This means that the RAM should be enabled whenever the two signals are different, and disabled otherwise.

%\begin{figure}
%  \centering
%  \begin{subfigure}[t]{0.48\linewidth}
%    \includegraphics[width=\linewidth]{diagrams/store_waveform.pdf}
%    \caption{Store waveform.}
%  \end{subfigure}\hfill%
%  \begin{subfigure}[t]{0.48\linewidth}
%    \includegraphics[width=\linewidth]{diagrams/load_waveform.pdf}
%    \caption{Load waveform.}
%  \end{subfigure}
%\end{figure}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% TeX-command-extra-options: "-shell-escape"
%%% End: