\section{Designing a verified HLS tool}

%\JW{The first part of this section (up to 2.1) is good but needs tightening up. Ultimately, the point of this section is to explain that there's an existing verified compiler called CompCert which has a bunch of stages, and we need to make a decision about where to tap into that pipeline. Too early and we miss out on some helpful optimisations; too late and we've ended up too target-specific. What if you put a few more stages into Figure 1 -- there aren't actually that many missing anyway. Suppose you add in Cminor between C\#minor and 3AC. Then the high-level structure of your argument in this subsection could be: (1) why Cminor is too early, (2) why LTL is too late, and then maybe (3) why 3AC is just right. The Goldilocks structure, haha!}

This section covers the main architecture of the HLS tool and the way in which the back end was added to \compcert{}.  It also works through an example of converting a simple C program into hardware, expressed in the Verilog language.
%\JW{I've experimented with adding a few paragraph headings to the text below -- see what you think. The advantage of headings is that it can make the text easier to read quickly.}\YH{Yes I think it works well actually, makes the sections clearer.}

\paragraph{Choice of source language}\YH{Could combine this with ``Choice of implementation language'' maybe, as there is a bit of overlap.} \JW{Although the answers to these questions happen to overlap, I think the questions are distinct and it is logical to ask them separately.}
First of all, C was chosen as the input language of \vericert{} because it is the most widely supported language for HLS, and most major HLS tools also use it as an input.  As a lot of existing HLS code is written in C, supporting C rather than a custom domain-specific language makes \vericert{} more practical.  An alternative was to support LLVM IR as an input language; however, to get a full work flow from a higher-level language to hardware, a front end from that language to LLVM IR would also have to be verified.  Finally, a language similar to Occam was also considered, as it has inherent parallel constructs; however, this would not qualify as high-level synthesis, because the parallelism would have to be expressed manually. %\JW{Can we mention one or two alternatives that we considered?}

\paragraph{Choice of target language}
Next, Verilog~\cite{06_ieee_stand_veril_hardw_descr_languag} is a hardware description language that can be synthesised into logic gates, which can then either be placed onto a field-programmable gate array (FPGA) or turned into an application-specific integrated circuit (ASIC).  Verilog was chosen as the output language for \vericert{} because it is one of the most popular hardware description languages and there already exist a few formal semantics for it that could be used as a target~\cite{loow19_verif_compil_verif_proces,meredith10_veril}. Another possible target was Bluespec, a higher-level hardware description language for which there exists a formally verified translation to circuits using K\^{o}ika~\cite{bourgeat20_essen_blues}; however, targeting this language would not be trivial, as it is not meant to be targeted by an automatic tool.  Finally, a custom circuit language could also have been targeted and then translated to Verilog in an unverified way; however, some guarantees would be lost and it would not be possible to completely trust the output. %\JW{Can we mention one or two alternatives that we considered? Bluespec or Chisel or one of Adam Chlipala's languages, perhaps?}

\paragraph{Choice of implementation language}
The framework that was chosen for the front end was \compcert{}, as it is a mature framework for simulation proofs about intermediate languages, in addition to already providing a validated parser~\cite{jourdan12_valid_lr_parser} from C into the internal representation of Clight.  Other frameworks were also considered, such as Vellvm~\cite{zhao12_formal_llvm_inter_repres_verif_progr_trans}, as LLVM IR in particular is often used by HLS tools anyway; however, these would require more work to support a higher-level language such as C as input, or even to provide a parser for LLVM IR.\@

\begin{figure}
  \centering
  \resizebox{0.47\textwidth}{!}{
  \begin{tikzpicture}
    [language/.style={fill=white,rounded corners=3pt,minimum height=7mm},
    continuation/.style={}]
    \fill[compcert,rounded corners=3pt] (-1,-1) rectangle (9,1.5);
    \fill[formalhls,rounded corners=3pt] (-1,-1.5) rectangle (9,-2.5);
    \node[language] at (-0.3,0) (clight) {Clight};
    \node[continuation] at (1,0) (conta) {$\cdots$};
    \node[language] at (2.7,0) (cminor) {CminorSel};
    \node[language] at (4.7,0) (rtl) {3AC};
    \node[language] at (6.2,0) (ltl) {LTL};
    \node[language] at (8.4,0) (ppc) {PPC};
    \node[continuation] at (7.3,0) (contb) {$\cdots$};
    \node[language] at (4.7,-2) (dfgstmd) {HTL};
    \node[language] at (6.7,-2) (verilog) {Verilog};
    \node at (0,1) {\compcert{}};
    \node at (0,-2) {Vericert};
    \draw[->] (clight) -- (conta);
    \draw[->] (conta) -- (cminor);
    \draw[->] (cminor) -- (rtl);
    \draw[->] (rtl) -- (ltl);
    \draw[->] (ltl) -- (contb);
    \draw[->] (contb) -- (ppc);
    \draw[->] (rtl) -- (dfgstmd);
    \draw[->] (dfgstmd) -- (verilog);
  \end{tikzpicture}}
  \caption{Verilog back end to \compcert{}, branching off at the three address code (3AC).  The 3AC is first transformed into a state machine (HTL), which is then transformed into a hardware description of that state machine in Verilog.}%
  \label{fig:rtlbranch}
\end{figure}

\paragraph{Architecture of \vericert{}}
The main work flow of \vericert{} is shown in Figure~\ref{fig:rtlbranch}, which distinguishes the parts of the translation that are performed in \compcert{} from those that have been added by \vericert{}.

\compcert{} is made up of 11 intermediate languages in between the Clight input and the assembly output.  These intermediate languages each serve a different purpose and contain various optimisations.  When designing a new back end for \compcert{}, it is crucial to know where to branch off, so as to benefit from all the useful optimisations that \compcert{} performs while avoiding those that are not useful, such as optimisations that are specific to the target CPU architecture.  One example of the latter is register allocation, which is unnecessary because there is no fixed number of registers that needs to be targeted.

To choose where to branch off, each intermediate language in \compcert{} can be evaluated to see if it is suitable to be transformed into hardware.  Existing HLS compilers often use LLVM IR as an intermediate representation when performing HLS-specific optimisations, as each instruction can be mapped quite well to hardware that performs the same behaviour.  Looking at the intermediate languages in \compcert{} shown in Figure~\ref{fig:rtlbranch}, there are many languages to choose from.  Clight and CminorSel are abstract syntax tree (AST) representations of the C code, and so do not resemble an assembly-like language such as LLVM IR.\@  The languages from LTL to PPC do contain basic blocks, which are desirable when doing HLS, but starting from LTL the number of registers is limited.  Register allocation limits the number of registers when translating 3AC into LTL, and stores variables on the stack if required.  This is not needed when performing HLS, as there are many more registers available, and it is preferable to use these instead of RAM whenever possible.

\compcert{}'s three-address code (3AC)\footnote{Three-address code (3AC) is also known as register transfer language (RTL) in the \compcert{} literature; however, 3AC is used in this paper instead so as not to confuse it with register-transfer level (RTL), which is another name for the final hardware target of the HLS tool.} is the intermediate language that resembles LLVM IR the most, as it also has an infinite number of pseudo-registers and each instruction maps well to hardware.  3AC is represented as a control flow graph (CFG) in \compcert{}.  Each instruction is a node in the graph and links to the instructions that follow it.  This CFG then describes how the computation should proceed, and is a good representation for performing optimisations as well as local transformations.  However, one difference between LLVM IR and 3AC is that 3AC uses operations of the target architecture and performs architecture-specific optimisations as well, which is not the case in LLVM IR, where all the instructions are quite abstract.  This can be mitigated by making \compcert{} target a specific architecture such as x86\_32, where most operations translate quite well into hardware.  In addition to that, many optimisations that are also useful for HLS are performed in 3AC, which means that if it is supported as the input language, the HLS algorithm benefits from the same optimisations.  It is therefore a good candidate to be chosen as the input language to the HLS back end. The complete flow that \vericert{} takes is shown in Figure~\ref{fig:rtlbranch}.

\begin{figure}
  \centering
  \begin{subfigure}[b]{0.49\linewidth}
\begin{minted}{c}
int main() {
  int x[3] = {1, 2, 3};
  int sum = 0;
  for (int i = 0;
       i < 3;
       i++)
    sum += x[i];
  return sum;
}
\end{minted}
    \caption{Input C code.}\label{fig:accumulator_c}
  \end{subfigure}\hspace*{-4mm}
  \begin{subfigure}[b]{0.49\linewidth}
\begin{minted}[fontsize=\footnotesize]{c}
main() {
 15:  x8 = 1
 14:  int32[stack(0)] = x8
 13:  x7 = 2
 12:  int32[stack(4)] = x7
 11:  x6 = 3
 10:  int32[stack(8)] = x6
  9:  x2 = 0
  8:  x1 = 0
  7:  x5 = stack(0) (int)
  6:  x4 = int32[x5 + x1 * 4 + 0]
  5:  x2 = x2 + x4 + 0 (int)
  4:  x1 = x1 + 1 (int)
  3:  if (x1 <s 3) goto 7 else goto 2
  2:  x3 = x2
  1:  return x3
}
\end{minted}
    \caption{3AC produced by \compcert{}.}\label{fig:accumulator_rtl}
  \end{subfigure}
  \caption{Using \compcert{} to translate a simple program from C to three address code (3AC).}\label{fig:accumulator_c_rtl}
\end{figure}

\subsection{Translating C to Verilog, by example}

Using the simple accumulator program shown in Figure~\ref{fig:accumulator_c} as a worked example, this section describes the main translation that is performed in Vericert to go from the behavioural description in C into a hardware design in Verilog.

\subsubsection{Translating C to 3AC}

The first step of the translation is to use \compcert{} to transform the input C code into the 3AC shown in Figure~\ref{fig:accumulator_rtl}. As part of this, \compcert{} performs optimisations such as constant propagation and dead-code elimination.  Function inlining is also performed, which allows us to support function calls without having to support the \texttt{Icall} 3AC instruction.  The duplication of the function bodies caused by inlining does affect the total area of the hardware \JW{Is that definitely true? Was discussing this with Nadesh and George recently, and I ended up not being so sure. Inlining could actually lead to \emph{reduced} resource usage because once everything has been inlined, the (big) scheduling problem could then be solved quite optimally. Certainly inlining is known to increase register pressure, but that's not really an issue here. If we're  not sure, we could just say that inlining everything leads to bloated Verilog files and the inability to support recursion, and leave it at that.}\YH{I think that is true, just because we don't do scheduling.  With scheduling I think that's true, inlining actually becomes quite good.}, however it improves the latency of the hardware.  In addition to that, inlining removes the possibility of supporting recursive function calls, however, this is a feature that isn't supported in most other HLS tools either.

\subsubsection{Translating 3AC to HTL}

%   + TODO Explain the main mapping in a short simple way

%   + TODO Clarify connection between CFG and FSMD

%   + TODO Explain how memory is mapped

The first translation performed in Vericert is from 3AC to a new hardware translation language (HTL), which is one step towards being completely translated to hardware described in Verilog.  The main translation that is performed is going from a CFG representation of the computation to a finite state machine with datapath (FSMD)~\cite{hwang99_fsmd}\JW{I feel like this could use some sort of citation, but I'm not sure what. I guess this is all from "Hardware Design 101", right?}\YH{I think I found a good one actually, which goes over the basics.} representation in HTL.\@  The core idea of the FSMD representation is that it separates the control flow from the operations on the memory and registers, so that the state transitions can be translated into a simple finite state machine (FSM) and each state then contains data operations that update the memory and registers.  Figure~\ref{fig:accumulator_diagram} shows the resulting architecture of the FSMD. \JW{I think it would be worth having a sentence to explain how the C model of memory is translated to a hardware-centric model of memory. For instance, in C we have global variables/arrays, stack-allocated variables/arrays, and heap-allocated variables/arrays (anything else?). In Verilog we have registers and RAM blocks. So what's the correspondence between the two worlds? Globals and heap-allocated are not handled, stack-allocated variables become registers, and stack-allocated arrays become RAM blocks? Am I close?}\YH{Stack allocated variables become RAM as well, so that we can deal with addresses easily and take addresses of any variable.}  Hardware does not have the same memory model as C, the memory model therefore needs to be translated in the following way.  Global variables are not translated in Vericert at the moment, however, the stack of the main function will become the RAM seen in Figure~\ref{fig:accumulator_diagram}.  
Variables that have their address taken will therefore be stored in the RAM, as well as any arrays or structs defined in the function.  Variables that do not have their address taken will be kept in registers.

\begin{figure*}
  \centering
  \includegraphics[scale=0.3,trim={10cm 8cm 5cm 5cm},clip=true]{data/accumulator_fsmd2.pdf}
  \caption{The FSMD for our running example. \JW{Maybe replace `State' with `Current State'? And maybe `Calculate State' could be clearer as `Calculate Next State'?} \JW{Is it worth distinguishing between the different types of box? We have a RAM box which is a single hardware block, then the Registers and State boxes both indicate a collection of registers, and finally the Update and Calculate boxes indicate combinational logic. Perhaps the combinational logic bits could be visualised as clouds? (Or if clouds are a bit of a faff to draw, then rounded rectangles, or something?)} \JW{Can we label `data path' and `control path' on the diagram? Looks like the diagram is nicely split into these two parts already.} \JW{Can state 15 (or should it be state 16??) have a dangling incoming arrow to indicate that it is the start state? And perhaps state 1 could have a double outline to indicate that it is an `accepting' state? Since there's space above the `Calculate State' box, I'd be mildly in favour of expanding that box a bit so that it included all 15 states explicitly (snaking back and forth).}}\label{fig:accumulator_diagram}
\end{figure*}

The translation from 3AC to HTL is straightforward, as each 3AC instruction either matches up quite well to a hardware construct, or does not have to be handled by the translation, such as function calls.
%At each instruction, the control flow is separated from the data computation and is then added to the control logic and data-flow map respectively.
%\JW{I suspect that you could safely chop that sentence.}
For example, in state 15 in Figure~\ref{fig:accumulator_rtl}, the register \texttt{x8} is initialised to 1, after which the control flow moves to state 14.  This is encoded in HTL by initialising a 32-bit register \texttt{reg\_8} to 1 in the data-flow section, and also adding a transition to state 14 in the control logic section.  Simple operator instructions are translated in a similar way.  For example, in state 5, the value loaded from the array is added to the current value of the accumulated sum, which is simply translated to an addition of the equivalent registers in the HTL code.
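
The split between control logic and datapath can be illustrated with a small software model.  The following C code is only a sketch written for this explanation, not output of the tool: the type \texttt{fsmd\_t} and the functions \texttt{next\_state}, \texttt{datapath} and \texttt{run\_accumulator} are hypothetical names, and the states and registers are taken from the 3AC in Figure~\ref{fig:accumulator_rtl}.  Each loop iteration models one clock cycle, in which the next state is computed from the current register values before the datapath updates them.

\begin{minted}[fontsize=\footnotesize]{c}
#include <stdint.h>

/* Hypothetical software model of the FSMD (illustration only). */
typedef struct {
  uint32_t state;      /* current state register */
  uint32_t x[9];       /* pseudo-registers x1..x8, index 0 unused */
  uint32_t stack[3];   /* word-addressed RAM holding the array */
  int finished;        /* set once the final state is reached */
  uint32_t return_val;
} fsmd_t;

/* Control logic: computes the next state from the current one. */
static uint32_t next_state(const fsmd_t *m) {
  if (m->state == 3)   /* signed comparison, as in the 3AC */
    return ((int32_t)m->x[1] < 3) ? 7 : 2;
  if (m->state == 1)   /* final state: stay here */
    return 1;
  return m->state - 1; /* straight-line code falls through */
}

/* Datapath: performs the register or memory update of a state. */
static void datapath(fsmd_t *m) {
  switch (m->state) {
  case 15: m->x[8] = 1; break;
  case 14: m->stack[0] = m->x[8]; break;
  case 13: m->x[7] = 2; break;
  case 12: m->stack[1] = m->x[7]; break;
  case 11: m->x[6] = 3; break;
  case 10: m->stack[2] = m->x[6]; break;
  case 9:  m->x[2] = 0; break;
  case 8:  m->x[1] = 0; break;
  case 7:  m->x[5] = 0; break;  /* base address of the stack */
  case 6:  m->x[4] = m->stack[(m->x[5] + m->x[1] * 4) / 4]; break;
  case 5:  m->x[2] = m->x[2] + m->x[4]; break;
  case 4:  m->x[1] = m->x[1] + 1; break;
  case 3:  break;               /* only the control logic acts */
  case 2:  m->x[3] = m->x[2]; break;
  case 1:  m->finished = 1; m->return_val = m->x[3]; break;
  default: break;
  }
}

/* Runs the model from the entry state until it finishes. */
uint32_t run_accumulator(void) {
  fsmd_t m = {0};
  m.state = 15;
  while (!m.finished) {
    uint32_t next = next_state(&m); /* read registers before update */
    datapath(&m);
    m.state = next;
  }
  return m.return_val;
}
\end{minted}

Under these assumptions, \texttt{run\_accumulator} returns the sum of the three array elements, which matches the value returned by the C program in Figure~\ref{fig:accumulator_c}.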

\paragraph{Key challenge: signedness} Note that the comparison in state 3 is signed.  By default, all operators and registers in Verilog and HTL are unsigned, so to force an operation to treat the bits as signed, both operands have to be explicitly cast to signed.  In addition to that, Verilog resizes expressions to the largest needed size by default, which can affect the result of the computation.  This implicit resizing is not supported by the Verilog semantics we used, so there would be a mismatch between the Verilog semantics and the actual behaviour of Verilog according to the standard.  To avoid this issue, braces are used to stop the Verilog simulator or synthesis tool from resizing anything inside the braces.  Instead, explicit resizing is used in the semantics, and operations can only be performed on two registers that have the same size.
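
To see why the signedness of a comparison matters, consider the same 32-bit patterns compared in both ways.  The C functions below are a hypothetical illustration (the names \texttt{lt\_unsigned} and \texttt{lt\_signed} are not part of \vericert{}); the first mirrors Verilog's default unsigned comparison, and the second mirrors a comparison in which both operands are wrapped in \texttt{\$signed}.

\begin{minted}[fontsize=\footnotesize]{c}
#include <stdint.h>

/* Compares the bit patterns as unsigned numbers. */
int lt_unsigned(uint32_t a, uint32_t b) {
  return a < b;
}

/* Compares the same bit patterns as two's complement numbers.
   The conversion to int32_t is assumed to wrap around, which is
   the behaviour of mainstream compilers. */
int lt_signed(uint32_t a, uint32_t b) {
  return (int32_t)a < (int32_t)b;
}
\end{minted}

For instance, the pattern \texttt{0xFFFFFFFF} is smaller than 3 when interpreted as the signed value $-1$, but larger than 3 when interpreted as an unsigned value, so using the wrong comparison in state 3 would change the number of loop iterations for negative values.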

\subsubsection{Translating HTL to Verilog}

Finally, we have to translate the HTL code into proper Verilog and prove that it behaves the same as the 3AC according to the Verilog semantics.  The Verilog output is modelled as a complete abstract syntax tree (AST) instead of being an abstract map over the instructions that are executed. \JW{I find `an abstract map over the instructions that are executed' hard to unpick. How about ``Whereas HTL is a language that is specifically designed to represent the FSMDs we are interested in, Verilog is a general-purpose HDL. So the challenge here is to translate our FSMD representation into a Verilog AST.''}  However, as all the instructions are already expressed in Verilog, only the maps need to be translated to valid Verilog, and correct declarations for all the variables in the program need to be added as well.

This translation seems quite straightforward; however, proving that it is correct is not that simple, as all the implicit assumptions that were made in HTL need to be made explicit in Verilog, and the resulting Verilog needs to have the same behaviour according to the semantics.
Figure~\ref{fig:accumulator_v} shows the final Verilog output that is generated.  In general, the structure is similar to \JWcouldcut{the structure}\JW{that} of the HTL code, however, the control and datapath maps have been translated to case statements \JWcouldcut{that serve the same purpose}.  The other main addition to the code is the initialisation of all the variables in the code to the correct bitwidths and the declaration of the inputs and outputs to the module, so that the module can be used inside a larger hardware design.  

\begin{figure}
  \centering
  \begin{subfigure}[b]{0.49\linewidth}
\begin{minted}[fontsize=\tiny]{verilog}
module main(reset, clk, finish, return_val);
  reg [31:0] stack [2:0];
  input [0:0] clk, reset;
  output reg [31:0] return_val;
  output reg [0:0] finish;
  reg [31:0] reg_8, reg_4, state,
             reg_6, reg_1, reg_9,
             reg_5, reg_3, reg_7;
  always @(posedge clk)
    case (state)
      32'd15: reg_9 <= 32'd1;
      32'd14: stack[32'd0] <= reg_9;
      32'd13: reg_8 <= 32'd2;
      32'd12: stack[32'd1] <= reg_8;
      32'd11: reg_7 <= 32'd3;
      32'd10: stack[32'd2] <= reg_7;
      32'd9: reg_3 <= 32'd0;
      32'd8: reg_1 <= 32'd0;
      32'd7: reg_6 <= 32'd0;
      32'd6: reg_5 <= stack[{{{reg_6 + 32'd0}
        + {reg_1 * 32'd4}} / 32'd4}];
      32'd5: reg_3 <= {reg_3 + {reg_5 + 32'd0}};
      32'd4: reg_1 <= {reg_1 + 32'd1};
      32'd3: ;
      32'd2: reg_4 <= reg_3;
      32'd1: begin
        finish = 1'd1;
        return_val = reg_4;
      end
      default:;
    endcase
\end{minted}
    \caption{Verilog always block describing the datapath of the module.}\label{fig:accumulator_v_1}
  \end{subfigure}\hfill%
  \begin{subfigure}[b]{0.49\linewidth}
\begin{minted}[fontsize=\tiny]{verilog}
  always @(posedge clk)
    if ({reset == 1'd1})
      state <= 32'd16;
    else
      case (state)
        32'd15: state <= 32'd14;
        32'd14: state <= 32'd13;
        32'd13: state <= 32'd12;
        32'd12: state <= 32'd11;
        32'd11: state <= 32'd10;
        32'd10: state <= 32'd9;
        32'd9: state <= 32'd8;
        32'd8: state <= 32'd7;
        32'd7: state <= 32'd6;
        32'd6: state <= 32'd5;
        32'd5: state <= 32'd4;
        32'd4: state <= 32'd3;
        32'd3: state <=
          ({$signed(reg_1) < $signed(32'd3)}
             ? 32'd7 : 32'd2);
        32'd2: state <= 32'd1;
        32'd1: ;
        default:;
      endcase
endmodule
\end{minted}
    \caption{Verilog always block describing the control logic of the module.}\label{fig:accumulator_v_2}
  \end{subfigure}
  \caption{Accumulator example using \vericert{} to translate the 3AC to a state machine expressed in Verilog. \JW{If space permits, it would probably be preferable to have this code in a single column, as splitting a single module across two subfigures is a bit jarring.}}\label{fig:accumulator_v}
\end{figure}

\JW{What do you think about moving the following two paragraphs to the Proof section?}
The translation from maps to case statements is done by turning each node of the tree into a case expression with the statements in each being the same.  The main difficulty for the proof is that a structure that can be directly accessed is transformed into an inductive structure where a certain number of constructors need to be called to get to the correct case.  The proof of the translation from maps to case statements follows by induction over the list of elements in the map and the fact that each key in this list is unique.  In addition to that, the statement that is currently being evaluated is guaranteed by the correctness of the list of elements to be in that list.  The latter fact therefore eliminates the base case, as an empty list does not contain the element we know is in the list.  The other two cases follow from the fact that either the key is equal to the evaluated value of the case expression, or it is not.  In the first case we can then evaluate the statement and get the state after the case expression, as the uniqueness of the key tells us that the key cannot show up in the list again.  In the other case we can just apply the inductive hypothesis and remove the current case from the case statement, as it did not match.
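
The shape of this induction can be sketched in C.  The following code is a hypothetical model, not the Coq definition used in the proof: a case statement is represented as an array of \texttt{case\_arm} values, and \texttt{eval\_case} walks it in the same three cases as the proof (empty list, matching head, non-matching head).

\begin{minted}[fontsize=\footnotesize]{c}
#include <stddef.h>
#include <stdint.h>

/* One arm of a case statement: a key and an identifier that
   stands in for the statement to execute (hypothetical model). */
typedef struct {
  uint32_t key;
  int stmt;
} case_arm;

/* Returns the statement of the first arm whose key equals v,
   or -1 if no arm matches (the default case). */
int eval_case(const case_arm *arms, size_t n, uint32_t v) {
  if (n == 0)
    return -1;                 /* base case: empty list */
  if (arms[0].key == v)
    return arms[0].stmt;       /* the head matches */
  return eval_case(arms + 1, n - 1, v); /* skip the head */
}
\end{minted}

A direct lookup in the map corresponds to finding the key in one step, whereas \texttt{eval\_case} has to skip every arm that precedes the matching one, which is the mismatch that the induction bridges.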

Another problem with the representation of the state as an actual register is that we have to make sure that the state does not overflow.  Currently, the state register will always be 32 bits, meaning the maximum number of states can only be $2^{32} - 1$.  We therefore have to prove that the state value will never go over that value.  This means that during the translation we have to check for each state that it can fit into an integer. \JW{So I guess this means that Vericert will refuse to compile a program with 5 million instructions? If so, might be worth making that explicit, and maybe even making `Size of state register' one of your Key Challenges?}

\subsection{Optimisations}

Although \vericert{} is not yet a proper `optimising' HLS compiler, we have implemented a few optimisations that aim to improve the quality of the hardware designs it produces.

\subsubsection{Byte- and word-addressable memories}

One big difference between C and Verilog is how memory is represented.  In hardware, efficient RAMs are not as available as in software, and need to be explicitly implemented by declaring two-dimensional arrays with specific properties.  A major limitation is that RAMs often only allow one read and one write per clock cycle, \JWcouldcut{for example if implementing single port RAM, which is the most common type of RAM}.  To make loads and stores as efficient as possible, the RAM needs to be \JWcouldcut{implemented as being} word-addressable, so that an entire integer can be loaded or stored in one clock cycle.
However, the memory model that \compcert{} uses for its intermediate languages~\cite{blazy05_formal_verif_memor_model_c} is byte-addressable.  It therefore has to be proven that the byte-addressable memory behaves in the same way as the word-addressable memory in hardware.  Any modifications of the bytes in the \compcert{} memory model also have to be shown to modify the word-addressable memory in the same way.  As only integer loads and stores are currently supported in our HLS back end, it follows that the addresses given to the loads and stores should be divisible by four.  If that is the case, then the translation from byte-addressed memory to word-addressed memory could be done by dividing the address by four and subtracting \JWcouldcut{by} the base address of the memory. \JW{Why does `subtracting the base address of the memory' have anything to do with whether the memory is byte or word addressed? Don't you have to do that either way? Or perhaps you're saying that your memory is not only word-addressed, but it also starts at address 0 rather than some random address like you'd get in software?}
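
As a rough illustration of this address translation, the C function below maps a byte address to a word index.  It is a sketch under stated assumptions rather than the translation implemented in \vericert{}: the name \texttt{byte\_to\_word\_index} is hypothetical, the base address is assumed to be given in bytes and is subtracted before dividing, and accesses that are misaligned or out of bounds are simply rejected.

\begin{minted}[fontsize=\footnotesize]{c}
#include <stdint.h>

/* Translates a byte address into an index into a word-addressed
   RAM with num_words 32-bit entries.  Returns 1 and stores the
   index on success, or 0 if the access is not a valid aligned
   word access. */
int byte_to_word_index(uint32_t byte_addr, uint32_t base,
                       uint32_t num_words, uint32_t *index) {
  if (byte_addr < base)
    return 0;                    /* below the start of the RAM */
  uint32_t offset = byte_addr - base;
  if (offset % 4 != 0)
    return 0;                    /* not divisible by four */
  if (offset / 4 >= num_words)
    return 0;                    /* beyond the end of the RAM */
  *index = offset / 4;
  return 1;
}
\end{minted}

With a base address of 0 and a RAM of three words, the byte address 8 maps to index 2, which corresponds to the third element of the array in the running example.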

\subsubsection{Reset signals}

\YH{This section could maybe go into the proof section instead.} 

\JW{Yeah I agree that the rest of this paragraph describes a detail of the proof and should be in the Proof section.}
Even though function calls are not supported by \vericert{}, the initial function call that \compcert{} uses to enter the main function does need to be supported in the Verilog semantics, as explained further in Section~\ref{sec:verilog}.  The reset signal therefore has to be reasoned about correctly in the semantics and in the initial function call to ensure that the initial state of the module is set correctly, as in 3AC the entry point of the function is part of the function definition.

\subsubsection{Implementing the \texttt{Oshrximm} instruction}

% Mention that this optimisation is not performed sometimes (clang -03).

Vericert performs some optimisations at the level of the instructions that are generated, so that the hardware performs the instructions as quickly as possible and so that the maximum frequency at which the hardware can run is increased.  One of the main constructs that cripple performance of the generated hardware is the instantiation of divider circuits in the hardware.  In the case of Vericert, it requires the result of the divide operation to be ready in the same clock cycle, meaning the divide circuit needs to be implemented fully combinationally.  This is inefficient in terms of hardware size, but also in terms of latency, because it means that the maximum frequency of the hardware needs to be reduced dramatically so that the divide circuit has enough time to finish.

These small optimisations were found to be the most error-prone, and guaranteeing that the new representation is equivalent to the representation used in the \compcert{} semantics is difficult without proving this for all possible inputs.

Dividing by a constant can often be optimised to a more efficient operation, especially if the denominator is a power of two.  In \compcert{}, the \texttt{Oshrximm} instruction does exactly this, and a normal signed divide operation can be replaced by the \texttt{Oshrximm} instruction.  It performs the following operation, shown together with our shift-based representation on the right, where $\div$ stands for integer signed division:

\begin{align*}
  &\forall x, y \in \mathbb{Z},\ \ 0 \leq y < 31,\ \ -2^{31} \leq x < 2^{31},\\
  &x \div 2^y =
  \begin{cases}
    \left\lfloor \frac{x}{2^y} \right\rfloor = x \gg y,& \text{if } x \geq 0\\
    \left\lceil \frac{x}{2^y} \right\rceil = - \left\lfloor \frac{-x}{2^y} \right\rfloor = - ( (- x) \gg y ),& \text{otherwise}.
  \end{cases}
\end{align*}

The \compcert{} semantics for the \texttt{Oshrximm} instruction express its operation exactly as shown in the equation above, even though in hardware the computation that would be performed would be different.  In \vericert{}, if the same operation were implemented using Verilog division operators, it would not be guaranteed to be optimised correctly by the synthesis tools that convert the Verilog into a circuit.  To guarantee an output that does not include divides, we therefore have to express it in Verilog using shifts, and then prove that this representation is equivalent to the divide representation used in the \compcert{} semantics.  This proof discovered quite a few bugs in our initial implementation of the optimisation, which rounded towards $-\infty$ instead of towards 0.
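
The difference between the two rounding modes can be checked on small values.  The C functions below are a sketch for illustration and are not the Verilog generated by \vericert{}: \texttt{shrx\_div} and \texttt{shrx\_shift} are hypothetical names, the shift amount is assumed to satisfy $0 \leq y < 31$, and the negation is performed on unsigned values so that it also works for $x = -2^{31}$.

\begin{minted}[fontsize=\footnotesize]{c}
#include <stdint.h>

/* Reference: signed division by 2^y, rounding towards zero.
   Assumes 0 <= y < 31, so that 2^y fits in an int32_t. */
int32_t shrx_div(int32_t x, unsigned y) {
  return x / (int32_t)(1u << y);
}

/* Shift-only version of the same operation: negative inputs are
   negated, shifted, and negated again.  All arithmetic is done
   modulo 2^32 on unsigned values; the final conversion back to
   int32_t is assumed to wrap around. */
int32_t shrx_shift(int32_t x, unsigned y) {
  uint32_t ux = (uint32_t)x;
  if (x >= 0)
    return (int32_t)(ux >> y);
  uint32_t neg = 0u - ux;            /* -x modulo 2^32 */
  return (int32_t)(0u - (neg >> y)); /* -((-x) >> y) */
}
\end{minted}

For $x = -7$ and $y = 1$, both functions return $-3$, whereas a plain arithmetic right shift of $-7$ by one position would give $-4$, which is the kind of rounding error described above.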

%\JW{I wonder if Section 2 could benefit from a `Some Key Challenges' subsection, where you highlight several interesting bits of the translation process, each with their own paragraph heading. These could be something like:\begin{enumerate}\item Discrepancy between C and Verilog w.r.t. signedness \item Deciding between byte- and word-addressable memories \item Adding reset signals \item Implementing the Oshrximm instruction correctly \end{enumerate} For the causal reader, this would immediately signal two things: (1) you can skip this subsection on your initial pass, and (2) proving the HLS tool correct was a non-trivial undertaking.}

% - Explain main differences between translating C to software and to hardware.

%   + This can be done by going through the simple example.


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% TeX-command-extra-options: "-shell-escape"
%%% End: