* PLDI 2021 Paper #543 Reviews and Comments :PROPERTIES: :CUSTOM_ID: pldi-2021-paper-543-reviews-and-comments :END: Paper #543 Formal Verification of High-Level Synthesis * Review #543A :PROPERTIES: :CUSTOM_ID: review-543a :END: ** Overall merit :PROPERTIES: :CUSTOM_ID: overall-merit :END: B. I support accepting this paper but will not champion it. ** Reviewer expertise :PROPERTIES: :CUSTOM_ID: reviewer-expertise :END: X. *Expert* = "I have written a paper on one or more of this paper's topics" ** Paper summary :PROPERTIES: :CUSTOM_ID: paper-summary :END: "High-level synthesis" (HLS) is a term of art for compiling C programs to hardware designs. This paper presents the first mechanized proof of soundness for an HLS compiler. The implementation and proof are carried out in Coq and build on CompCert, targeting Verilog. Evaluation shows that most designs from a standard benchmark suite can be compiled, with a significant overall performance penalty. ** Comments for author :PROPERTIES: :CUSTOM_ID: comments-for-author :END: Cool project, presented very well! I understand HLS is indeed getting popular in the real world, and it's important to expand compiler verification to cover HLS. This paper unquestionably extends the CompCert proofs to cover correctness of hardware generation, following proof-assistant community norms about modularity and trusted-base size. The evaluation seems appropriately scoped. The main problem I have is that the work feels preliminary for the simplicity of the compilation strategy. We more or less get the equivalent of one machine-language instruction run per cycle, which doesn't sound so bad compared to traditional processors, but, considering the high costs of fabricating a custom chip, we would expect superior performance per price just by running machine code on commodity processors! Your HLS is basically generating a custom processor with one "instruction" per line of code in the C IR, which is totally not the right way to get high performance from hardware accelerators, the most important reason being what you note near the end of the paper, that you don't take advantage of cross-instruction parallelism at all. However, you are also incurring repeated synchronization overhead from tracking control state, even for subcomputations that would fit well as combinational logic. It's quite possible all the above is fixable using the standard HLS optimizations that the paper alludes to as future work, but I feel a truly convincing evaluation would need to show some of those optimizations. I'm not convinced the current HLSCert is going to generate Verilog that outperforms running machine code on commodity processors. The proof structure may need to change fundamentally to accommodate smarter compilation. Still, it makes sense that essentially all ideas introduced here are sound and provide a novel and proper foundation for extending to optimizations, so I'm giving the paper a small benefit of the doubt in my rating. ** Other apparent weaknesses in the design :PROPERTIES: :CUSTOM_ID: other-apparent-weaknesses-in-the-design :END: You inline all function calls, which not only increases area but also precludes separate compilation, which is often viewed as essential for efficient hardware synthesis (which is often much slower than for comparably sized software). I suppose your approach still allows modularity in hardware designs, where each module has private state and public methods (functions), but I wonder if you are evaluating on any such examples. 
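To make my earlier point about the compilation strategy concrete, here is a hypothetical sketch -- invented names, in no way HLSCert's actual output -- of the "custom processor" style I am describing: each 3AC instruction gets its own FSM state and thus its own clock cycle, even where adjacent operations would fit into a single cycle as combinational logic.

#+begin_src verilog
// Hypothetical sketch of a "one instruction per clock cycle" FSMD,
// roughly the shape of circuit the paper's strategy produces.
// Computes ret = (a + 1) * 2 in three cycles instead of one.
module fsmd_example(input clk, input rst, input [31:0] a,
                    output reg done, output reg [31:0] ret);
  reg [1:0] state;
  reg [31:0] x;
  always @(posedge clk)
    if (rst) begin state <= 2'd0; done <= 1'b0; end
    else case (state)
      2'd0: begin x   <= a + 32'd1; state <= 2'd1; end // x = a + 1
      2'd1: begin ret <= x << 1;    state <= 2'd2; end // ret = x * 2
      2'd2: done <= 1'b1;                              // hold and finish
    endcase
endmodule
#+end_src

A single continuous assignment (=ret = (a + 1) << 1=) would do the same work in one cycle; the state register and the cycle-per-instruction discipline are pure synchronization overhead here, multiplied across the whole program.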
I can sympathize with the difficulty of getting Verilog compilers to identify RAMs. IMO, this is not a significant demerit of the current approach as part of a research prototype. ** Evaluation :PROPERTIES: :CUSTOM_ID: evaluation :END: If the paper is accepted, I will be tempted to require a revision showing a performance comparison with software on cutting-edge processors, which IMO is more important to establish than the gap with some open-source HLS tool. I'm struck that your compilation strategy is rather close to partially evaluating a hardware design for a processor, with respect to a fixed machine-code program. The Clash library for Haskell can actually be coaxed into doing that kind of partial evaluation! What would be different in that approach, considering implementation and proof? It would make for another worthwhile comparison as evaluation. ** Size of contribution :PROPERTIES: :CUSTOM_ID: size-of-contribution :END: It's quite dangerous to eyeball a finished proof-assistant development and decide how much work it /deserved/ to be and what new research contributions were required. Still, I'm tempted to call the proofs reported in this paper "routine though tedious." I'm not surprised that one of your biggest time sinks was proving correspondence between different versions of bitvector operations! Less obvious is the gap between memory models in CompCert and Verilog, but now it makes sense. It is not clear that there are strong research contributions in the proofs, just engineering, which seems worth doing even if there are not surprises in the results. ** Small stuff :PROPERTIES: :CUSTOM_ID: small-stuff :END: Line 148 links to an anonymized GitHub repository. Please don't include such links when submitting to conferences that allow code supplements. It's too easy to modify a GitHub repo after the conference deadline, without leaving any trace in the history. Figure 3 seems to indicate that register values can't affect state transitions, but the example in Figure 2 shows just such an influence. What's going on? Figure 4 feels unnecessarily complicated, given that you only compile function bodies that themselves contain no function calls. The complexity that justified CompCert's state vocabulary is gone. You discuss how HLSCert fails if more than $2^{32}$ states are needed. That would be a pretty huge C program, wouldn't it, assuming each state corresponds to at least one machine-language instruction taking at least one byte? You'd need machine code of several gigabytes, which I suppose could arise, but it doesn't sound like such an important restriction in practice. - Line 621, "rather of" should be "rather than". - Line 831, "from CompCert.These assumptions" is missing a space after the period. ** Post author-response comments :PROPERTIES: :CUSTOM_ID: post-author-response-comments :END: I agree that packing lists of instructions in single cycles should fit well with your proof structure. However, other important optimizations would include pipelining and copying subcomputations so that several can run in parallel. It is not clear how much your proof framework would need to change to support those optimizations. * Review #543B :PROPERTIES: :CUSTOM_ID: review-543b :END: ** Overall merit :PROPERTIES: :CUSTOM_ID: overall-merit-1 :END: B. I support accepting this paper but will not champion it. ** Reviewer expertise :PROPERTIES: :CUSTOM_ID: reviewer-expertise-1 :END: Y. 
*Knowledgeable* = "I follow the literature on this paper's topic(s) but may have missed some relevant developments" ** Paper summary :PROPERTIES: :CUSTOM_ID: paper-summary-1 :END: The paper deals with designing a verified HLS tool which provides confidence in the translation of C to the widely used Verilog HDL. Such a verified HLS tool is imperative to avoid the alternative workarounds, which are either unfaithful (i.e. the ones based on simulating a test bench) or need computationally expensive TV techniques which might need domain-specific (hence non-trivial) adjustments. The authors designed the HLS tool with verification as their key design goal and highlighted the critical design decisions which need to be incorporated early in the design phase. These include the choice of the theorem prover, the framework for simulation proofs, etc., with proper rationale behind those choices. The design is based on extending CompCert with a newly added IR and supporting a Verilog backend. The TCB includes the (borrowed) semantics used for the source and target languages and the Coq proof assistant - which are all well established. The system is tested against the HLS tool LegUp, using the Polybench/C benchmark suite, and evaluated based on metrics like (1) how many clock cycles the design takes to complete, (2) clock frequency, (3) area efficiency, and (4) compilation time. The results show that HLSCert is a promising effort towards the first mechanically verified HLS tool. The tool is (somewhat) comparable to its (unverified) competitor based on the aforementioned metrics, with room for significant improvements. ** Strengths :PROPERTIES: :CUSTOM_ID: strengths :END: 1. Well written 2. The limitations are clearly mentioned: these include the allowed C subset, the single-cycle division/modulo operations, and the missing memory & scheduling optimizations along with verification of their correctness.\\ 3. The related work is neatly depicted using the Venn diagram. Thanks for the addition! 4. Good that you went the extra mile in ironing out the unevenness of the test candidates by preparing an alternative version of the benchmarks that uses repeated divisions and multiplications by 2. ** A few Questions/Concerns :PROPERTIES: :CUSTOM_ID: a-few-questionsconcerns :END: I have one broad concern related to the practicality of the system. I am sure that you have also thought about it and have plans to mitigate it. The concern can be elaborated in two scenarios. 1. Related to the choice of the benchmark: Is the selection based on the fact that the system in its current form does not support globals? The Polybench/C benchmark has limited (or maybe zero) usage of globals, whereas the other relevant benchmarks like CHStone and Dhrystone have substantial usage of globals, often used to statically initialize inputs. On a related note, is there a specific reason (like a fundamental limitation) for skipping global support? Supporting them is definitely recommended if that can enhance the reach of the system to realistic programs. The LegUp system under comparison, with its support for functions, global variables and pointer arithmetic, is comparatively better suited for more realistic C programs and hence the above-mentioned benchmarks. I understand that the current work is a WIP and a big step towards the broader goal, but it makes sense to be as complete as possible, especially when it is trivial to achieve that. 2.
Related to the absence of certified optimizations: As highlighted in your paper, a simple state-machine translation is inefficient, and a system which executes designs fast, at high frequency, and with size/power efficiency is desirable. Thus it is imperative to add optimizations related to scheduling, pipelined resources, memory analysis, and the IR. One of the highlighted reasons for the setback in the comparison is the fact that optimizations are not supported. I understand the plans for implementing a certified version of those in the future, and in my opinion (please correct me if I am wrong) it should be as challenging as certifying the instruction translation from C to Verilog. ** nits :PROPERTIES: :CUSTOM_ID: nits :END: 1. 831: space missing 2. line 563 (and other similar places): Please explain the notations (like delta). It's challenging to follow otherwise. 3. line 216: Please specify a means (e.g. a color code, like orange for new additions) to separate the newly added parts. 4. Which framework are you using for function inlining? Can we have constructs in the system (other than recursion) which might inhibit inlining? What about the use of function pointers, or the presence of static variables and goto statements? ** Post author-response comments :PROPERTIES: :CUSTOM_ID: post-author-response-comments-1 :END: Thanks to the authors for your response. I have carefully read all of it. It did improve my perception of some parts of the paper. With the absence of support for globals and of certified optimizations, it looks more like work in progress. However, I believe the work has the potential of becoming the state of the art (more so because of the wide acceptance of the choices of the source and target languages) or could encourage (as a baseline) similar efforts. As the paper stands now, I will keep the score intact. * Review #543C :PROPERTIES: :CUSTOM_ID: review-543c :END: ** Overall merit :PROPERTIES: :CUSTOM_ID: overall-merit-2 :END: D. I will argue against accepting this paper. ** Reviewer expertise :PROPERTIES: :CUSTOM_ID: reviewer-expertise-2 :END: X. *Expert* = "I have written a paper on one or more of this paper's topics" ** Paper summary :PROPERTIES: :CUSTOM_ID: paper-summary-2 :END: The paper adapts CompCert's proof-carrying code model to the development of an HLS tool. The CompCert implementation is used, and a hook is added just after the 3AC (three-address code) IR (before regalloc and after some CompCert optimizations) to translate it to HTL and finally Verilog.\\ They call their tool HLSCert. There is very limited support in HLSCert, and this is also evident in the order-of-magnitude slowdowns wrt uncertified HLS tools. The formal semantics of Clight are from CompCert and the formal semantics of Verilog are adapted from previous work by Lööw and Myreen. As I see it, the primary contribution of the paper is an experience study of what was required to plug a different (formally verified) backend into the compiler toolchain, a backend that caters to Verilog as the target language. ** Strengths :PROPERTIES: :CUSTOM_ID: strengths-1 :END: 1. Important and relevant problem 2. Clear description of the objectives, design decisions, and experience studies 3. I enjoyed studying Figure 9 ** Weaknesses :PROPERTIES: :CUSTOM_ID: weaknesses :END: 1. No support for optimization; shows in the performance/area results wrt LegUp 2. No discussion of the reduction in the size of the "trusted computing base".
What is the SLOC for the Verilog + Clight + simulation specifications (and how much shorter is it than the actual implementation)? Without this discussion, how does one assess the value of the verification effort? There could be bugs in the spec itself, and unless there is evidence that the spec is much shorter than the implementation, it would be hard to justify this effort. I am guessing that adding support for optimizations will significantly increase the gap between spec and implementation complexity, and the verification effort would be much more meaningful then. 3. Most of the ideas are a rehash of the CompCert work in a new setting. Some of the proof discussion was not clearly motivated/justified, I thought (see the comments section) ** Comments for author :PROPERTIES: :CUSTOM_ID: comments-for-author-2 :END: As clearly motivated in Section 1, verified HLS is a desirable capability. It would be great to have such a verified HLS tool that provides strong correctness guarantees. My primary criticism is regarding the absence of optimization support. This is also evident in the experimental results. As I see it, an order-of-magnitude performance/area degradation is not very compelling. More fundamentally, the evaluation of the verification effort requires us to understand the difference between the size and complexity of the implementation and the size and complexity of the spec (which includes the Clight semantics, the Verilog semantics, and the encoding of the observable equivalence properties). Without this discussion, it seems hard to evaluate the verification effort. Optimization logic is usually particularly amenable to verification efforts. This is so because optimizations do not really change the "meaning" of the implementation, but make the implementation much more complex. But unfortunately, this paper does not tackle this space. Also, I was surprised that the authors dismiss translation validation (TV) by saying that "it does not provide watertight guarantees unless the validator itself has been mechanically proven correct". Didn't CompCert use TV for some parts of its compiler pipeline (I recall register allocation)? Won't the authors need to use TV as well, especially if they need to implement support for optimizations? As it stands now, the paper just seems like a straightforward adaptation of "mechanical proof infrastructure" from CompCert to the HLS problem. This is not very interesting in itself --- there need to be more insights and better results to merit a PLDI paper, IMO. Some more questions: 1. 480-482: Why does HLSCert require the result of the divide circuit to be entirely combinational? Given the high costs, I would have definitely preferred a better solution to the problem. 2. 616-618: "we use 32-bit integers to represent bitvectors rather than arrays of booleans". What is the difference between a 32-bit (symbolic) integer and a bitvector? I could not understand this statement at all --- could you please elaborate in your author response? 3. Sections 4.2 and 4.3: Why are you doing a forward simulation and then arguing about deterministic semantics to show a backward simulation? Why not directly show a backward simulation? Please answer this in your author response. ** Post author-response comments :PROPERTIES: :CUSTOM_ID: post-author-response-comments-2 :END: Thank you for the comprehensive response -- I was especially happy to see the TCB discussion in your response. I see two issues with the current version of the paper: 1.
It is perhaps not appropriate to take the improved area/performance results (described in the author response) into account, as the response is primarily meant for clarifications, not additional results. 2. The work becomes significantly more interesting in the presence of significant optimizations like pipelining and/or parallelism, together with a discussion of the challenges around supporting something like that. However, such optimizations are currently missing. While it is a good start to have the first verified HLS tool, support for some optimizations and some interesting discussion of the challenges around such support would make the paper a much stronger contender. We hope the authors would consider submitting a revised version in the future (to an appropriate venue). * Review #543D :PROPERTIES: :CUSTOM_ID: review-543d :END: ** Overall merit :PROPERTIES: :CUSTOM_ID: overall-merit-3 :END: B. I support accepting this paper but will not champion it. ** Reviewer expertise :PROPERTIES: :CUSTOM_ID: reviewer-expertise-3 :END: Y. *Knowledgeable* = "I follow the literature on this paper's topic(s) but may have missed some relevant developments" ** Paper summary :PROPERTIES: :CUSTOM_ID: paper-summary-3 :END: The paper presents a verified HLS compiler for translating a subset of C programs directly into Verilog. The approach extends CompCert from the 3-address-code layer downwards into new intermediate languages, ending in an existing Verilog semantics adapted from HOL4. The evaluation shows that the generated hardware is substantially (an order of magnitude) larger and slower than that of existing optimising, but unverified, compilers. ** Comments for author :PROPERTIES: :CUSTOM_ID: comments-for-author-3 :END: The overall approach makes sense, and I also agree with the motivation -- I think this is an important and fruitful area to apply formal compiler correctness to. My main concern with this paper is that it might be a bit too early. The first verified chain is a good milestone, but the performance is so bad that I would not really call it viable yet. Since there are other hardware verification approaches that do work (although maybe not directly HLS), reasonable performance is important. I'm still slightly in favour of acceptance, but I'm on the fence. If this showed good performance even within 50% or so of existing compilers, it would be a strong accept. It did not become entirely clear to me from the paper whether this is due to the approach (I think it's fine), or due to some phases/major optimisations missing that are just required to get reasonable hardware. The main points that stuck out for me are the lack of register allocation[*] and the probably not hardware-idiomatic treatment of RAM (although the work is attempting to deal with this issue, the model and generated code might just still be too far from what is "good" for hardware). I'd be curious to hear more from the authors about what exactly is missing, because if the paper is published as is, people who are not well disposed to verification will take it as an example that verification is infeasible to apply to this problem, which I think is neither true nor in the authors' interest. [*]: you make the point that registers are cheap in FPGAs, but infinite registers will just blow up the size, and even slight pressure on the register file to make the compiler reuse registers (without spilling on the stack) might make a difference.
Details: - Have you investigated whether function inlining is leading to some of the size blow-up in your Verilog output? - Comparison to existing compilers: for the evaluation, would there be the possibility of attempting to switch off certain or all optimisations in the existing compiler you compare to, to get an additional set of data points? (You still need the original one.) This might also help to pinpoint where some of the overhead comes from. - p2: "We also considered using a language with built-in parallel constructs that map well to parallel hardware, such as occam [40], Spatial [30] or Scala [2], but found these languages too niche." This sounds disingenuous to me. Surely, the main criterion was that CompCert exists and can be connected to? Trying to verify the correctness of a parallel Scala-to-HW compiler sounds like a massive undertaking to me without existing verification infrastructure, and would be discarded because of that, not because the language is not as widely used (and if it really was better suited, maybe wide use would not be such a strong argument). I don't think it would hurt to just say that. - p 7, line 612: "Removing support for external inputs to modules". If these are removed, you do not discuss how I/O is modelled and how it becomes part of the final correctness statement. Surely your hardware can compute more than one final value? It'd be fairly easy to generate and would not need to be that large. - Theorem 1: you call this "backward simulation". This is in accordance with CompCert terminology, which unfortunately is the exact opposite of established refinement and data refinement terminology. This (the harder direction) is a classic forward simulation. You are not really to blame for this, but I think you should spell out which terminology you are using. Maybe say "in CompCert terminology" or similar, citing the paper where Xavier Leroy introduces it. - p 9, line 978 "avoiding any need for off-chip memory access": Why? Can your compiler not deal with these or would they obscure results? Might external RAM not alleviate some of the hardware size pressure? ** Post author-response comments :PROPERTIES: :CUSTOM_ID: post-author-response-comments-3 :END: Thank you for your response; it is great to see that performance can be increased so easily. It is unfortunate that we are not allowed to take new results into account, which this will have to count as. I would encourage the authors to take the feedback gathered here into account and resubmit with the new results to a different venue. * Response by Yann Herklotz [[mailto:yann.herklotz15@imperial.ac.uk][yann.herklotz15@imperial.ac.uk]] :PROPERTIES: :CUSTOM_ID: response-by-yann-herklotz-yann.herklotz15imperial.ac.uk :END: Thanks very much for such kind and detailed reviews! All reviewers were concerned by the area and performance of the hardware designs generated by HLSCert. We'll begin our rebuttal with three items of good news on that front, and then respond to individual questions. 1. *HLSCert's designs are actually much more area-efficient than our current evaluation suggests.* The synthesis tool that we used in our evaluation (Intel Quartus) was able to infer RAM blocks in the LegUp designs but not in the HLSCert designs; this led to the HLSCert designs consuming about 21x the area of the LegUp designs on Intel FPGAs, and having much worse clock frequencies too.
But we have found that by switching to the Xilinx Vivado synthesis tool and targeting a Xilinx FPGA instead, RAM inference succeeds for both LegUp and HLSCert designs. This gives a much fairer area comparison. HLSCert designs now consume only *about 1.7x the area* of LegUp designs. We propose to use Vivado instead of Quartus when revising our evaluation section. (Interestingly, another synthesis tool called Yosys can infer RAMs in HLSCert designs but not in LegUp designs, making HLSCert designs consume about 0.1x the area of LegUp designs!) 2. *HLSCert's designs actually obtain much higher clock frequencies than our current evaluation suggests.* Not only does RAM inference reduce area; it also leads to much shorter critical paths. So, in the absence of division, HLSCert designs now obtain *about the same clock frequency (1-1.5x)* as LegUp designs. 3. *Much of LegUp's remaining performance advantage is down to generic (not HLS-specific) LLVM optimisations that CompCert doesn't perform.* Following an investigation suggested by Reviewer D, we have found that HLSCert only gains about a 1.1x improvement in cycle count from CompCert's built-in optimisations, whereas LegUp gains about a 2x improvement thanks to invoking the LLVM optimiser at =-O3=. All this is before any HLS-specific optimisations kick in. This discrepancy -- which rather highlights the gap between a state-of-the-art verified compiler (CompCert) and a state-of-the-art mainstream compiler (LLVM =-O3=) -- explains much of the overall 4.5x improvement in cycle count that LegUp has over HLSCert. LegUp's remaining performance advantage can be mostly explained by an optimisation called /operation chaining/, whereby multiple dependent operations are packed into a single clock cycle (instead of generating a chain of registers across multiple cycles). We are currently working to implement (and eventually verify) this optimisation in HLSCert. With CompCert/LLVM optimisations and operation chaining /disabled/, HLSCert designs have only *about 1.3x the cycle count* of LegUp designs. When taking the updated clock frequencies into account too, HLSCert designs are only *about 1.6x slower* in terms of execution time. This drops to 1.3x slower when we enable an as-yet-unverified scheduling pass that is currently under development. When operation chaining is enabled in LegUp, HLSCert designs remain around 3-4x slower. Of course, as several reviewers remarked, there is plenty of room for improvement before HLSCert can be considered an /optimising/ HLS compiler. Still, we believe that what we have achieved so far -- the first mechanised HLS compiler -- is an important milestone in itself that sets a new standard for HLS research, and upon which future researchers will be well-positioned to build. *** Responses to Reviewer A :PROPERTIES: :CUSTOM_ID: responses-to-reviewer-a :END: #+begin_quote You inline all function calls. #+end_quote This is in keeping with existing HLS tools, which [[https://dl.acm.org/doi/10.1145/2629547][make liberal use of inlining]]. We are currently working to extend HLSCert with support for function calls; once this is done, we look forward to experimenting with inlining thresholds. #+begin_quote The proof structure may need to change fundamentally to accommodate smarter compilation. #+end_quote We believe that the proof structure should allow for the addition of simple optimisations in different passes to get comparable quality-of-results to LegUp. Naturally, changes will need to be made -- e.g.
replacing "one instruction per clock cycle" with "list of instructions per clock cycle" to enable scheduling -- but we don't anticipate fundamental redesigns. These changes can be done in other intermediate languages so that our original proofs don't need to change. #+begin_quote Figure 3 seems to indicate that register values can't affect state transitions, but the example in Figure 2 shows just such an influence. What's going on? #+end_quote Yes, indeed the registers should be able to affect the control flow. We will rectify that in the diagram. #+begin_quote Figure 4 feels unnecessarily complicated, given that you only compile function bodies that themselves contain no function calls. #+end_quote Indeed, it might be simpler if they were not present -- they do not really add much by themselves, but they are needed to interact with the rest of the CompCert model and are part of the assumptions that we make. We will try and clarify this section, and maybe simplify the diagram. #+begin_quote You discuss how HLSCert fails if more than $2^{32}$ states are needed. That would be a pretty huge C program, wouldn't it? #+end_quote Quite! This rather extreme example is just meant to illustrate the fact that HLSCert, despite being verified, is not guaranteed to produce an output for every valid input. From a practical standpoint, no modern HLS tools would be able to handle such a huge number of states. *** Responses to Reviewer B :PROPERTIES: :CUSTOM_ID: responses-to-reviewer-b :END: #+begin_quote Is the benchmark selection based on the fact that the system in its current form does not support globals? #+end_quote The choice of benchmark is mostly because CHStone uses large static arrays to initialise the test bench. Since we currently don't support separating a file into a test bench that is run in software and a function that is synthesised into hardware, we need to synthesise the large static array into a memory and initialise it, and this takes too long to simulate. We are looking into how to circumvent this issue, by maybe having better support for initialising large static memories. #+begin_quote Why do you not support global variables? #+end_quote It's on our todo list. CompCert stores each global variable in a separate memory. We have already implemented translation of memories into Verilog arrays as part of our handling of the stack. We expect that supporting global variables will be a matter of translating multiple memories, and engineering a way to discern from each pointer which array to index into. #+begin_quote Which framework are you using for function inlining? Can we have constructs in the system (other than recursion) which might inhibit inlining? #+end_quote We have modified the existing inliner in CompCert to inline all functions, and we should be therefore able to inline any non-recursive function. #+begin_quote What about the use of function pointers, or the presence of static variables and goto statements? #+end_quote For function pointers and static variables, HLSCert will error-out at compile time as these cannot be translated. However, gotos are translated correctly. *** Responses to Reviewer C :PROPERTIES: :CUSTOM_ID: responses-to-reviewer-c :END: #+begin_quote As it stands now, the paper just seems like a straightforward adaptation of "mechanical proof infrastructure" from CompCert to the HLS problem. 
#+end_quote We believe that HLS is a topic of great and growing importance, and that the first mechanically verified HLS tool is a valuable contribution that future researchers will be keen to build upon. We think that our paper can bring insights about hardware design to the PL community, and can also enlighten the hardware design community about the feasibility of a fully verified HLS tool. #+begin_quote Why does HLSCert require the result of the divide circuit to be entirely combinational? #+end_quote This is the default behaviour of the synthesis tool when given a division operator in Verilog. In such a scenario, the synthesis tool is obliged to complete the division in one cycle, as other parts of the circuit may rely on this fact. This could be fixed by creating (and verifying) a pipelined divider module. #+begin_quote I was surprised that the authors dismiss translation validation. Will you use translation validation to implement support for optimizations? #+end_quote We only intended to dismiss /unverified/ translation validation. We are indeed working on using translation validation to implement optimisations. We will further clarify this when revising the paper. #+begin_quote Sections 4.2 and 4.3: Why are you doing a forward simulation and then arguing about deterministic semantics to show a backward simulation? Why not directly show a backward simulation? #+end_quote We suspect that this confusion results from CompCert's terminology diverging from established terminology: please see the last part of [[https://pldi21.hotcrp.com/review/543D][Review #D]]. We'll be sure to clarify this terminology when revising the paper. #+begin_quote What is the size of the trusted computing base for HLSCert? #+end_quote The implementation alone is 1184 loc, though as the reviewer remarks, this will increase considerably as we add optimisations. The main trusted part of HLSCert is the Verilog semantics (405 loc), which is mostly inherited from Lööw and Myreen. The specification of Clight is inherited from CompCert and is well-established, but if we include that too (584 loc), then the total trusted base is 989 loc. (All loc figures are without spaces.) This is similar to the size of the implementation, though we're not sure how meaningful it is to equate one line of specification with one line of implementation. Table 1 includes all these numbers (though the Verilog semantics, quoted there as 739 loc, includes a few proofs about the semantics -- we'll fix that). #+begin_quote What is the difference between a 32-bit (symbolic) integer and a bitvector? #+end_quote A bitvector is a list of Booleans. We instead use the internal CompCert integer type, which has two fields: a number and a bitwidth (32). *** Responses to Reviewer D :PROPERTIES: :CUSTOM_ID: responses-to-reviewer-d :END: #+begin_quote How is I/O modelled? How does it feature in the final correctness statement? #+end_quote As we do not support modules in the Verilog semantics yet (one reason we inline all functions), we translate arguments to the main function into inputs to the module, and in the semantics we model the transition from Callstate to State (cf. Figure 4) as taking the input values the module was "called with" and assigning them to registers. There are calling conventions that need to be respected here, such as not changing the inputs or raising the reset signal during execution. As long as these are respected, the correctness theorem will hold.
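To illustrate this convention, here is a minimal sketch of a generated top-level interface -- the port names and the one-line body are invented for this example and do not match our actual output:

#+begin_src verilog
// Hedged sketch of the calling convention described above.
module main(input clk, input rst,
            input [31:0] arg0,       // C argument, mapped to an input port
            output reg fin,          // raised once execution has finished
            output reg [31:0] ret);  // holds the C return value
  always @(posedge clk)
    if (rst) begin            // reset plays the role of Callstate:
      fin <= 1'b0;            // clear the finished flag before the
      ret <= 32'd0;           // computation begins
    end else if (!fin) begin
      ret <= arg0 + 32'd1;    // stand-in for the compiled function body
      fin <= 1'b1;            // signal completion and hold the result
    end
endmodule
#+end_src

The convention is then that the environment holds =arg0= stable and keeps =rst= low for the duration of execution, and reads =ret= once =fin= goes high; the correctness theorem is stated under these assumptions.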
#+begin_quote Would there be the possibility of attempting to switch off certain or all optimisations in the existing compiler you compare to for getting an additional set of data points? #+end_quote Thanks for the suggestion! We have done exactly this at the top of our rebuttal. #+begin_quote Why do you need to avoid off-chip memory access? Can your compiler not deal with these or would they obscure results? Might external RAM not alleviate some of the hardware size pressure? #+end_quote Off-chip memory currently cannot be used because it requires general interaction with other Verilog modules from C. However, the stack, and in future any other memories, will be stored in on-chip RAM blocks, if inferred correctly by the synthesis tool. #+begin_quote Dismissing occam, Spatial and Scala as too niche sounds disingenuous to me. Surely, the main criterion was that CompCert exists and can be connected to? #+end_quote You are quite right: the existence of CompCert (and other verification infrastructure for C) was the big factor. We'll rephrase. #+begin_quote Have you investigated whether function inlining is leading to some of the size blow-up in your Verilog output? #+end_quote We believe that almost all of the blame for the area blow-up can be laid on the synthesis tool failing to infer RAM blocks in the HLSCert designs. After switching to a synthesis tool that infers RAM blocks in both HLSCert and LegUp designs, the area blow-up goes away.
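For reference, the pattern that synthesis tools look for when inferring a RAM block is a single array with synchronous read and write, roughly as in the following sketch (illustrative only, not our generated code):

#+begin_src verilog
// Illustrative sketch of a RAM-inferable Verilog pattern: one array,
// synchronous write and registered read in a single clocked block.
// Tools like Vivado map this onto a block RAM rather than registers.
module ram_sketch(input clk, input we,
                  input [7:0] addr, input [31:0] din,
                  output reg [31:0] dout);
  reg [31:0] mem [0:255];        // 256 x 32-bit array
  always @(posedge clk) begin
    if (we) mem[addr] <= din;    // synchronous write
    dout <= mem[addr];           // synchronous (registered) read
  end
endmodule
#+end_src

Deviations from such a template -- an asynchronous read, for instance -- can cause a tool to fall back to implementing the array in registers, which is consistent with the blow-up we observed under Quartus.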