From c77cd6168a805255e44988b41d2911182ec99327 Mon Sep 17 00:00:00 2001
From: Yann Herklotz
Date: Fri, 26 Feb 2021 12:56:08 +0000
Subject: Add sync and reviews

* PLDI 2021 Paper #543 Reviews and Comments
  :PROPERTIES:
  :CUSTOM_ID: pldi-2021-paper-543-reviews-and-comments
  :END:
Paper #543 Formal Verification of High-Level Synthesis

* Review #543A
  :PROPERTIES:
  :CUSTOM_ID: review-543a
  :END:
** Overall merit
   :PROPERTIES:
   :CUSTOM_ID: overall-merit
   :END:
B. I support accepting this paper but will not champion it.

** Reviewer expertise
   :PROPERTIES:
   :CUSTOM_ID: reviewer-expertise
   :END:
X. *Expert* = "I have written a paper on one or more of this paper's
topics"

** Paper summary
   :PROPERTIES:
   :CUSTOM_ID: paper-summary
   :END:
"High-level synthesis" (HLS) is a term of art for compiling C programs
to hardware designs. This paper presents the first mechanized proof of
soundness for an HLS compiler. The implementation and proof are carried
out in Coq and build on CompCert, targeting Verilog. Evaluation shows
that most designs from a standard benchmark suite can be compiled, with
a significant overall performance penalty.

** Comments for author
   :PROPERTIES:
   :CUSTOM_ID: comments-for-author
   :END:
Cool project, presented very well! I understand HLS is indeed getting
popular in the real world, and it's important to expand compiler
verification to cover HLS. This paper unquestionably extends the
CompCert proofs to cover correctness of hardware generation, following
proof-assistant community norms about modularity and trusted-base size.
The evaluation seems appropriately scoped.

The main problem I have is that the work feels preliminary, given the
simplicity of the compilation strategy. We more or less get the
equivalent of one machine-language instruction run per cycle, which
doesn't sound so bad compared to traditional processors, but,
considering the high costs of fabricating a custom chip, we would expect
superior performance per price just by running machine code on commodity
processors! Your HLS is basically generating a custom processor with one
"instruction" per line of code in the C IR, which is totally not the
right way to get high performance from hardware accelerators, the most
important reason being what you note near the end of the paper: that you
don't take advantage of cross-instruction parallelism at all. However,
you are also incurring repeated synchronization overhead from tracking
control state, even for subcomputations that would fit well as
combinational logic.

It's quite possible all the above is fixable using the standard HLS
optimizations that the paper alludes to as future work, but I feel a
truly convincing evaluation would need to show some of those
optimizations. I'm not convinced the current HLSCert is going to
generate Verilog that outperforms running machine code on commodity
processors. The proof structure may need to change fundamentally to
accommodate smarter compilation.

Still, essentially all the ideas introduced here seem sound and provide
a novel and proper foundation for extending to optimizations, so I'm
giving the paper a small benefit of the doubt in my rating.

* Other apparent weaknesses in the design
  :PROPERTIES:
  :CUSTOM_ID: other-apparent-weaknesses-in-the-design
  :END:
You inline all function calls, which not only increases area but also
precludes separate compilation, often viewed as essential given that
hardware synthesis is much slower than compilation of comparably sized
software.
I suppose your approach still allows
modularity in hardware designs, where each module has private state and
public methods (functions), but I wonder if you are evaluating on any
such examples.

I can sympathize with the difficulty of getting Verilog compilers to
identify RAMs. IMO, this is not a significant demerit of the current
approach as part of a research prototype.

* Evaluation
  :PROPERTIES:
  :CUSTOM_ID: evaluation
  :END:
If the paper is accepted, I will be tempted to require a revision
showing a performance comparison with software on cutting-edge
processors, which IMO is more important to establish than the gap with
some open-source HLS tool.

I'm struck that your compilation strategy is rather close to partially
evaluating a hardware design for a processor, with respect to a fixed
machine-code program. The Clash library for Haskell can actually be
coaxed into doing that kind of partial evaluation! What would be
different in that approach, considering implementation and proof? It
would make for another worthwhile comparison in the evaluation.

* Size of contribution
  :PROPERTIES:
  :CUSTOM_ID: size-of-contribution
  :END:
It's quite dangerous to eyeball a finished proof-assistant development
and decide how much work it /deserved/ to be and what new research
contributions were required. Still, I'm tempted to call the proofs
reported in this paper "routine though tedious." I'm not surprised that
one of your biggest time sinks was proving correspondence between
different versions of bitvector operations! Less obvious is the gap
between memory models in CompCert and Verilog, but now it makes sense.

It is not clear that there are strong research contributions in the
proofs, just engineering, which seems worth doing even if there are no
surprises in the results.

* Small stuff
  :PROPERTIES:
  :CUSTOM_ID: small-stuff
  :END:
Line 148 links to an anonymized GitHub repository.
Please don't include +such links when submitting to conferences that allow code supplements. +It's too easy to modify a GitHub repo after the conference deadline, +without leaving any trace in the history. + +Figure 3 seems to indicate that register values can't affect state +transitions, but the example in Figure 2 shows just such an influence. +What's going on? + +Figure 4 feels unnecessarily complicated, given that you only compile +function bodies that themselves contain no function calls. The +complexity that justified CompCert's state vocabulary is gone. + +You discuss how HLSCert fails if more than $2^{32}$ states are needed. +That would be a pretty huge C program, wouldn't it, assuming each state +corresponds to at least one machine-language instruction taking at least +one byte? You'd need machine code of several gigabytes, which I suppose +could arise, but it doesn't sound like such an important restriction in +practice. + +- Line 621, "rather of" should be "rather than". +- Line 831, "from CompCert.These assumptions" is missing a space after + the period. + +** Post author-response comments + :PROPERTIES: + :CUSTOM_ID: post-author-response-comments + :END: +I agree that packing lists of instructions in single cycles should fit +well with your proof structure. However, other important optimizations +would include pipelining and copying subcomputations so that several can +run in parallel. It is not clear how much your proof framework would +need to change to support those optimizations. + +* Review #543B + :PROPERTIES: + :CUSTOM_ID: review-543b + :END: +** Overall merit + :PROPERTIES: + :CUSTOM_ID: overall-merit-1 + :END: +B. I support accepting this paper but will not champion it. + +** Reviewer expertise + :PROPERTIES: + :CUSTOM_ID: reviewer-expertise-1 + :END: +Y. 
*Knowledgeable* = "I follow the literature on this paper's topic(s)
but may have missed some relevant developments"

** Paper summary
   :PROPERTIES:
   :CUSTOM_ID: paper-summary-1
   :END:
The paper deals with designing a verified HLS tool which provides
confidence in the translation of C to the widely used Verilog HDL. Such
a verified HLS tool is imperative to avoid the alternative workarounds,
which are either unfaithful (i.e. the ones based on simulating a test
bench) or need computationally expensive TV (translation validation)
techniques, which might require domain-specific (hence non-trivial)
adjustments.

The authors designed the HLS tool with verification as their key design
goal and highlighted the critical design decisions which need to be
incorporated early in the design phase. These include the choice of
theorem prover, the framework for simulation proofs, etc., with proper
rationale behind those choices.

The design is based on extending CompCert with a newly added IR and
supporting a Verilog backend. The TCB includes the (borrowed) semantics
used for the source and target languages and the Coq proof assistant,
which are all well established.

The system is tested against the HLS tool LegUp, using the Polybench/C
benchmark suite, and evaluated on metrics such as (1) how many clock
cycles the design takes to complete, (2) clock frequency, (3) area
efficiency, and (4) compilation time.

The results show that HLSCert is a promising effort towards the first
mechanically verified HLS tool. The tool is (somewhat) comparable to its
(unverified) competitor on the aforementioned metrics, with room for
significant improvements.

** Comments for author
   :PROPERTIES:
   :CUSTOM_ID: comments-for-author-1
   :END:
* Strengths
  :PROPERTIES:
  :CUSTOM_ID: strengths
  :END:

1. Well written
2. The limitations are clearly mentioned: these include the allowed C
   subset, the single-cycle division/modulo operations, and the missing
   memory and scheduling optimizations (and verification of their
   correctness).\\
3. The related work is neatly depicted using the Venn diagram. Thanks
   for the addition!
4. Good that you went the extra mile in ironing out the unevenness of
   the test candidates by preparing an alternative version of the
   benchmarks that uses repeated divisions and multiplications by 2.

* A few Questions/Concerns
  :PROPERTIES:
  :CUSTOM_ID: a-few-questionsconcerns
  :END:
I have one broad concern related to the practicality of the system. I am
sure that you have also thought about it and plan to mitigate it. The
concern can be elaborated in two scenarios.

1. Related to the choice of benchmark: Is the selection based on the
   fact that the system in its current form does not support globals?
   The Polybench/C benchmark has limited (or maybe zero) usage of
   globals, whereas other relevant benchmarks like CHStone and Dhrystone
   have substantial usage of globals, often used to statically
   initialize inputs.

On a related note, is there a specific reason (like a fundamental
limitation) for skipping global support? Supporting them is definitely
recommended if that can extend the reach of the system to realistic
programs.

The LegUp system under comparison, with its support for functions,
global variables and pointer arithmetic, is comparatively better suited
to more realistic C programs and hence to the above-mentioned
benchmarks. I understand that the current work is a WIP and a big step
towards the broader goal, but it makes sense to be as complete as
possible, especially when it is trivial to achieve that.

2. Related to the absence of certified optimizations: As highlighted in
   your paper, a simple state-machine translation is inefficient, and a
   system that produces designs that execute fast, at high frequency,
   and with size/power efficiency is desirable. Thus it is imperative to
   add optimizations related to scheduling, pipelined resources, memory
   analysis and IR transformations. One of the highlighted reasons for
   the comparison setback is the fact that optimizations are not
   supported. I understand the plans for implementing certified versions
   of those in future, and in my opinion (please correct me if I am
   wrong) it should be as challenging as certifying the instruction
   translation from C to Verilog.

* nits
  :PROPERTIES:
  :CUSTOM_ID: nits
  :END:

1. 831: space missing
2. line 563 (and other similar places): please explain the notation
   (like delta). It's challenging to follow otherwise.
3. line 216: please specify a means (e.g. a colour code like orange for
   new additions) to separate the newly added parts.
4. Which framework are you using for function inlining? Can we have
   constructs in the system (other than recursion) which might inhibit
   inlining? What about the use of function pointers, or the presence of
   static variables and goto statements?

** Post author-response comments
   :PROPERTIES:
   :CUSTOM_ID: post-author-response-comments-1
   :END:
Thanks to the authors for the response; I have carefully read all of it.
It did improve my perception of some parts of the paper.

With the absence of support for globals and of certified optimizations,
it looks more like a work in progress. However, I believe the work has
the potential of becoming the state of the art (more so because of the
wide acceptance of the choices of source and target languages), or could
serve (as a baseline) to encourage similar efforts.

As the paper stands now, I will keep the score intact.

* Review #543C
  :PROPERTIES:
  :CUSTOM_ID: review-543c
  :END:
** Overall merit
   :PROPERTIES:
   :CUSTOM_ID: overall-merit-2
   :END:
D. I will argue against accepting this paper.

** Reviewer expertise
   :PROPERTIES:
   :CUSTOM_ID: reviewer-expertise-2
   :END:
X. *Expert* = "I have written a paper on one or more of this paper's
topics"

** Paper summary
   :PROPERTIES:
   :CUSTOM_ID: paper-summary-2
   :END:
The paper adapts CompCert's proof-carrying code model to the development
of an HLS tool. The CompCert implementation is used, and a hook is added
just after the 3AC (three-address code) IR (before regalloc and after
some CompCert optimizations) to translate it to HTL and finally
Verilog.\\
They call their tool HLSCert. There is very limited support in HLSCert,
and this is also evident in the order-of-magnitude slowdowns wrt
uncertified HLS tools. The formal semantics of Clight are from CompCert
and the formal semantics of Verilog are adapted from previous work by
Lööw and Myreen. As I see it, the primary contribution of the paper is
an experience study on what was required to plug a different (formally
verified) backend into the compiler toolchain, a backend that caters to
Verilog as the target language.

* Strengths
  :PROPERTIES:
  :CUSTOM_ID: strengths-1
  :END:

1. Important and relevant problem
2. Clear description of the objectives, design decisions, and experience
   studies
3. I enjoyed studying Figure 9

* Weaknesses
  :PROPERTIES:
  :CUSTOM_ID: weaknesses
  :END:

1. No support for optimization; this shows in the performance/area
   results wrt LegUp
2. No discussion of the reduction in the size of the "trusted computing
   base". What is the SLOC for the Verilog + Clight + simulation
   specifications (and how much shorter is it than the actual
   implementation)? Without this discussion, how does one assess the
   value of the verification effort?
There could be bugs in the spec
   itself, and unless there is evidence that the spec is much shorter
   than the implementation, it would be hard to justify this effort. I
   am guessing that adding support for optimizations will significantly
   increase the gap between spec and implementation complexity, and the
   verification effort would be much more meaningful then.
3. Most of the ideas are a rehash of the CompCert work in a new setting.
   Some of the proof discussion was not clearly motivated/justified, I
   thought (see comments section)

** Comments for author
   :PROPERTIES:
   :CUSTOM_ID: comments-for-author-2
   :END:
As clearly motivated in Section 1, verified HLS is a desirable
capability. It would be great to have such a verified HLS tool that
provides strong correctness guarantees.

My primary criticism is regarding the absence of optimization support.
This is also evident in the experimental results. As I see it, an
order-of-magnitude performance/area degradation is not very compelling.

More fundamentally, the evaluation of the verification effort requires
us to understand the difference between the size and complexity of the
implementation and the size and complexity of the spec (which includes
the Clight semantics, the Verilog semantics, and the encoding of the
observable equivalence properties). Without this discussion, it seems
hard to evaluate the verification effort.

Optimization logic is usually particularly amenable to verification
efforts. This is so because optimizations do not really change the
"meaning" of the implementation, but make the implementation much more
complex. But unfortunately, this paper does not tackle this space.

Also, I was surprised that the authors dismiss translation validation
(TV) by saying that "it does not provide watertight guarantees unless
the validator itself has been mechanically proven correct". Didn't
CompCert use TV for some parts of its compiler pipeline (I recall
register allocation)? Won't the authors need to use TV as well,
especially if they implement support for optimizations?

As it stands now, the paper just seems like a straightforward adaptation
of "mechanical proof infrastructure" from CompCert to the HLS problem.
This is not very interesting in itself --- there need to be more
insights and better results to merit a PLDI paper IMO.

Some more questions:

1. 480-482: why does HLSCert require the result of the divide circuit
   to be entirely combinational? Given the high costs, I would
   definitely have preferred a better solution to the problem.
2. 616-618: "we use 32-bit integers to represent bitvectors rather than
   arrays of booleans". What is the difference between a 32-bit
   (symbolic) integer and a bitvector? I could not understand this
   statement at all --- could you please elaborate in your author
   response?
3. Sections 4.2 and 4.3: why are you doing a forward simulation and
   then arguing about deterministic semantics to show backward
   simulation? Why not directly show a backward simulation? Please
   answer this in your author response.

** Post author-response comments
   :PROPERTIES:
   :CUSTOM_ID: post-author-response-comments-2
   :END:
Thank you for the comprehensive response -- I was especially happy to
see the TCB discussion in your response. I see two issues with the
current version of the paper:

1. It is perhaps not appropriate to take the improved area/performance
   results (described in the author response) into account, as the
   response is primarily meant for clarifications, not additional
   results.

2. The work becomes significantly more interesting in the presence of
   significant optimizations like pipelining and/or parallelism, and a
   discussion of the challenges around supporting something like that.
   However, such optimizations are currently missing.

While it is a good start to have the first verified HLS synthesis tool,
support for some optimizations and some interesting discussion of the
challenges around such support would make the paper a much stronger
contender. We hope the authors will consider submitting a revised
version in the future (to an appropriate venue).

* Review #543D
  :PROPERTIES:
  :CUSTOM_ID: review-543d
  :END:
** Overall merit
   :PROPERTIES:
   :CUSTOM_ID: overall-merit-3
   :END:
B. I support accepting this paper but will not champion it.

** Reviewer expertise
   :PROPERTIES:
   :CUSTOM_ID: reviewer-expertise-3
   :END:
Y. *Knowledgeable* = "I follow the literature on this paper's topic(s)
but may have missed some relevant developments"

** Paper summary
   :PROPERTIES:
   :CUSTOM_ID: paper-summary-3
   :END:
The paper presents a verified HLS compiler for translating a subset of C
programs directly into Verilog. The approach extends CompCert from the
3-address code layer downwards into new intermediate languages, ending
in an existing Verilog semantics adapted from HOL4.

The evaluation shows that the generated hardware is substantially (an
order of magnitude) larger and slower than that of existing optimising,
but unverified, compilers.

** Comments for author
   :PROPERTIES:
   :CUSTOM_ID: comments-for-author-3
   :END:
The overall approach makes sense, and I also agree with the motivation
-- I think this is an important and fruitful area to apply formal
compiler correctness to.

My main concern with this paper is that it might be a bit too early. The
first verified chain is a good milestone, but the performance is so bad
that I would not really call it viable yet. Since there are other
hardware verification approaches that do work (although maybe not
directly HLS), reasonable performance is important. I'm still slightly
in favour of acceptance, but I'm on the fence.
If this showed good
performance even within 50% or so of existing compilers, it would be a
strong accept.

It did not become entirely clear to me from the paper whether this is
due to the approach (I think it's fine), or due to some phases/major
optimisations missing that are just required to get reasonable hardware.

The main point that stuck out for me is not using register allocation[*]
and the probably not hardware-idiomatic treatment of RAM (although the
work is attempting to deal with this issue, the model and generated code
might just still be too far from what is "good" for hardware). I'd be
curious to hear more from the authors about what exactly is missing,
because if the paper is published as is, people who are not well
disposed to verification will take it as an example that verification is
infeasible to apply to this problem, which I think is neither true nor
in the authors' interest.

[*]: you make the point that registers are cheap in FPGAs, but infinite
registers will just blow up the size, and even slight pressure on the
register file to make the compiler reuse registers (without spilling on
the stack) might make a difference.

Details:

- have you investigated whether function inlining is leading to some of
  the size blow-up in your Verilog output?

- comparison to existing compilers: for the evaluation, would there be
  the possibility of attempting to switch off certain or all
  optimisations in the existing compiler you compare to, to get an
  additional set of data points? (You still need the original one.) This
  might also help to pinpoint where some of the overhead comes from.

- p2: "We also considered using a language with built-in parallel
  constructs that map well to parallel hardware, such as occam [40],
  Spatial [30] or Scala [2], but found these languages too niche.". This
  sounds disingenuous to me. Surely, the main criterion was that
  CompCert exists and can be connected to?
Trying to verify correctness
  of a parallel Scala-to-HW compiler sounds like a massive undertaking
  to me without existing verification infrastructure, and would be
  discarded because of that, not because the language is not used as
  widely (and if it really was better suited, maybe wide use would not
  be such a strong argument). I don't think it would hurt to just say
  that.

- p 7, line 612: "Removing support for external inputs to modules". You
  are not discussing how I/O is modelled and part of the final
  correctness statement then, if these are removed. Surely your hardware
  can compute more than one final value? It'd be fairly easy to generate
  and would not need to be that large.

- Theorem 1: you call this "backward simulation". This is in accordance
  with CompCert terminology, which unfortunately is the exact opposite
  of established refinement and data refinement terminology. This (the
  harder direction) is a classic forward simulation. You are not really
  to blame for this, but I think you should spell out which terminology
  you are using. Maybe say "in CompCert terminology" or similar, citing
  the paper where Xavier Leroy introduces it.

- p 9, line 978, "avoiding any need for off-chip memory access": Why?
  Can your compiler not deal with these or would they obscure results?
  Might external RAM not alleviate some of the hardware size pressure?

** Post author-response comments
   :PROPERTIES:
   :CUSTOM_ID: post-author-response-comments-3
   :END:
Thank you for your response; it is great to see that performance can be
increased so easily. It is unfortunate that we are not allowed to take
into account new results, which this will have to count as.

I would encourage the authors to take the feedback gathered here into
account and resubmit with the new results to a different venue.
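The terminology clash raised in Review D's Theorem 1 remark can be made
precise. The following LaTeX sketch states the diagram in question in
generic notation (transition systems with states $s_i$, $t_i$, a
matching relation $\sim$, and an observable event $e$; the symbols are
ours, not the paper's):

```latex
% Forward simulation in classic refinement terminology: every source
% step can be matched by zero or more target steps while preserving
% the matching relation.
\[
  s_1 \sim t_1 \;\land\; s_1 \xrightarrow{e} s_2
  \;\Longrightarrow\;
  \exists\, t_2.\;\; t_1 \xrightarrow{e}{}^{*} t_2 \;\land\; s_2 \sim t_2
\]
% CompCert instead names directions by behaviour inclusion: the diagram
% above shows source behaviours are preserved ("forward" in CompCert's
% sense), and a "backward" simulation -- every target behaviour is
% allowed by the source -- is then recovered by additionally proving
% the target semantics deterministic.
```

Under a deterministic target semantics the two notions coincide, which,
as we understand it, is exactly why the paper's proof proceeds via the
forward diagram and then argues determinism.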
+ +** Response by Yann Herklotz +[[mailto:yann.herklotz15@imperial.ac.uk][yann.herklotz15@imperial.ac.uk]] + :PROPERTIES: + :CUSTOM_ID: response-by-yann-herklotz-yann.herklotz15imperial.ac.uk + :END: +Thanks very much for such kind and detailed reviews! + +All reviewers were concerned by the area and performance of the hardware +designs generated by HLSCert. We'll begin our rebuttal with three items +of good news on that front, and then respond to individual questions. + +1. *HLSCert's designs are actually much more area-efficient than our + current evaluation suggests.* The synthesis tool that we used in our + evaluation (Intel Quartus) was able to infer RAM blocks in the LegUp + designs but not in the HLSCert designs; this led to the HLSCert + designs consuming about 21x the area of the LegUp designs on Intel + FPGAs, and having much worse clock frequencies too. But we have found + that by switching to the Xilinx Vivado synthesis tool and targeting a + Xilinx FPGA instead, RAM inference succeeds for both LegUp and + HLSCert designs. This gives a much fairer area comparison. HLSCert + designs now consume only *about 1.7x the area* of LegUp designs. We + propose to use Vivado instead of Quartus when revising our evaluation + section. (Interestingly, another synthesis tool called Yosys can + infer RAMs in HLSCert designs but not in LegUp designs, making + HLSCert designs consume about 0.1x the area of LegUp designs!) + +2. *HLSCert's designs actually obtain much higher clock frequencies than + our current evaluation suggests.* Not only does RAM inference reduce + area; it also leads to much shorter critical paths. So, in the + absence of division, HLSCert designs now obtain *about the same clock + frequency (1-1.5x)* as LegUp designs. + +3. 
*Much of LegUp's remaining performance advantage is down to generic
   (not HLS-specific) LLVM optimisations that CompCert doesn't perform.*
   Following an investigation suggested by Reviewer D, we have found
   that HLSCert only gains about a 1.1x improvement in cycle count from
   CompCert's built-in optimisations, whereas LegUp gains about a 2x
   improvement thanks to invoking the LLVM optimiser at =-O3=. All this
   is before any HLS-specific optimisations kick in. This discrepancy --
   which rather highlights the gap between a state-of-the-art verified
   compiler (CompCert) and a state-of-the-art mainstream compiler (LLVM
   =-O3=) -- explains much of the overall 4.5x advantage in cycle count
   that LegUp has over HLSCert.

LegUp's remaining performance advantage can be mostly explained by an
optimisation called /operation chaining/, whereby multiple dependent
operations are packed into a single clock cycle (instead of generating a
chain of registers across multiple cycles). We are currently working to
implement (and eventually verify) this optimisation in HLSCert.

With CompCert/LLVM optimisations and operation chaining /disabled/,
HLSCert designs have only *about 1.3x the cycle count* of LegUp designs.
When taking the updated clock frequencies into account too, HLSCert
designs are only *about 1.6x slower* in terms of execution time. This
drops to 1.3x slower when we enable an as-yet-unverified scheduling pass
that is currently under development. When operation chaining is enabled
in LegUp, HLSCert designs remain around 3-4x slower.

Of course, as several reviewers remarked, there is plenty of room for
improvement before HLSCert can be considered an /optimising/ HLS
compiler. Still, we believe that what we have achieved so far -- the
first mechanised HLS compiler -- is an important milestone in itself
that sets a new standard for HLS research, and upon which future
researchers will be well-positioned to build.

*** Responses to Reviewer A
    :PROPERTIES:
    :CUSTOM_ID: responses-to-reviewer-a
    :END:

#+begin_quote
  You inline all function calls.
#+end_quote

This is in keeping with existing HLS tools, which
[[https://dl.acm.org/doi/10.1145/2629547][make liberal use of
inlining]]. We are currently working to extend HLSCert with support for
function calls; once this is done, we look forward to experimenting with
inlining thresholds.

#+begin_quote
  The proof structure may need to change fundamentally to accommodate
  smarter compilation.
#+end_quote

We believe that the proof structure should allow for the addition of
simple optimisations in different passes to get comparable
quality-of-results to LegUp. Naturally, changes will need to be made --
e.g. replacing "one instruction per clock cycle" with "list of
instructions per clock cycle" to enable scheduling -- but we don't
anticipate fundamental redesigns. These changes can be done in other
intermediate languages so that our original proofs don't need to change.

#+begin_quote
  Figure 3 seems to indicate that register values can't affect state
  transitions, but the example in Figure 2 shows just such an influence.
  What's going on?
#+end_quote

Yes, indeed the registers should be able to affect the control flow. We
will rectify that in the diagram.

#+begin_quote
  Figure 4 feels unnecessarily complicated, given that you only compile
  function bodies that themselves contain no function calls.
#+end_quote

Indeed, it might be simpler if they were not present -- they do not
really add much by themselves, but they are needed to interact with the
rest of the CompCert model and are part of the assumptions that we make.
We will try and clarify this section, and maybe simplify the diagram.

#+begin_quote
  You discuss how HLSCert fails if more than $2^{32}$ states are needed.
  That would be a pretty huge C program, wouldn't it?
#+end_quote

Quite!
This rather extreme example is just meant to illustrate the fact
that HLSCert, despite being verified, is not guaranteed to produce an
output for every valid input. From a practical standpoint, no modern HLS
tool would be able to handle such a huge number of states.

*** Responses to Reviewer B
    :PROPERTIES:
    :CUSTOM_ID: responses-to-reviewer-b
    :END:

#+begin_quote
  Is the benchmark selection based on the fact that the system in its
  current form does not support globals?
#+end_quote

The choice of benchmark is mostly because CHStone uses large static
arrays to initialise the test bench. Since we currently don't support
separating a file into a test bench that is run in software and a
function that is synthesised into hardware, we need to synthesise the
large static array into a memory and initialise it, and this takes too
long to simulate. We are looking into how to circumvent this issue,
perhaps by having better support for initialising large static memories.

#+begin_quote
  Why do you not support global variables?
#+end_quote

It's on our todo list. CompCert stores each global variable in a
separate memory. We have already implemented translation of memories
into Verilog arrays as part of our handling of the stack. We expect that
supporting global variables will be a matter of translating multiple
memories, and engineering a way to discern from each pointer which array
to index into.

#+begin_quote
  Which framework are you using for function inlining? Can we have
  constructs in the system (other than recursion) which might inhibit
  inlining?
#+end_quote

We have modified the existing inliner in CompCert to inline all
functions, and we should therefore be able to inline any non-recursive
function.

#+begin_quote
  What about the use of function pointers, or the presence of static
  variables and goto statements?
#+end_quote

For function pointers and static variables, HLSCert will error out at
compile time, as these cannot be translated. However, gotos are
translated correctly.

*** Responses to Reviewer C
    :PROPERTIES:
    :CUSTOM_ID: responses-to-reviewer-c
    :END:

#+begin_quote
  As it stands now, the paper just seems like a straightforward
  adaptation of "mechanical proof infrastructure" from CompCert to the
  HLS problem.
#+end_quote

We believe that HLS is a topic of great and growing importance, and that
the first mechanically verified HLS tool is a valuable contribution that
future researchers will be keen to build upon. We think that our paper
can bring insights about hardware design to the PL community, and can
also enlighten the hardware design community about the feasibility of a
fully verified HLS tool.

#+begin_quote
  Why does HLSCert require the result of the divide circuit to be
  entirely combinational?
#+end_quote

This is the default behaviour of the synthesis tool when given a
division operator in Verilog. In such a scenario, the synthesis tool is
obliged to complete the division in one cycle, as other parts of the
circuit may rely on this fact. This could be fixed by creating (and
verifying) a pipelined divider module.

#+begin_quote
  I was surprised that the authors dismiss translation validation. Will
  you use translation validation to implement support for optimizations?
#+end_quote

We only intended to dismiss /unverified/ translation validation. We are
indeed working on using translation validation to implement
optimisations. We will clarify this further when revising the paper.

#+begin_quote
  Sections 4.2 and 4.3 : why are you doing a forward simulation and then
  arguing about deterministic semantics to show backward simulation. Why
  not directly show a backward simulation?
#+end_quote

We suspect that this confusion results from CompCert's terminology
diverging from established terminology: please see the last part of
[[https://pldi21.hotcrp.com/review/543D][Review #D]]. We'll be sure to
clarify this terminology when revising the paper.

#+begin_quote
  What is the size of the trusted computing base for HLSCert?
#+end_quote

The implementation alone is 1184 loc, though as the reviewer remarks,
this will increase considerably as we add optimisations. The main
trusted part of HLSCert is the Verilog semantics (405 loc), which is
mostly inherited from Lööw and Myreen. The specification of Clight is
inherited from CompCert and is well-established, but if we include that
too (584 loc), then the total trusted base is 989 loc. (All loc figures
are without spaces.) This is similar to the size of the implementation,
though we're not sure how meaningful it is to equate one line of
specification with one line of implementation. Table 1 includes all
these numbers (though the Verilog semantics, quoted there as 739 loc,
includes a few proofs about the semantics -- we'll fix that).

#+begin_quote
  What is the difference between a 32-bit (symbolic) integer and a
  bitvector?
#+end_quote

A bitvector is a list of Booleans. We instead use the internal CompCert
integer type, which has two fields: a number and a bitwidth (32).

*** Responses to Reviewer D
    :PROPERTIES:
    :CUSTOM_ID: responses-to-reviewer-d
    :END:

#+begin_quote
  How is I/O modelled? How does it feature in the final correctness
  statement?
#+end_quote

As we do not support modules in the Verilog semantics yet (one reason we
inline all functions), we translate arguments to the main function into
inputs to the module, and in the semantics we model the transition from
Callstate to State (cf. Figure 4) as taking the input values the module
was "called with" and assigning them to registers.
There are calling conventions that need to be respected here, such as
not changing the inputs or raising the reset signal during execution. As
long as these are respected, the correctness theorem will hold.

#+begin_quote
  Would there be the possibility of attempting to switch off certain or
  all optimisations in the existing compiler you compare to for getting
  an additional set of data points?
#+end_quote

Thanks for the suggestion! We have done exactly this at the top of our
rebuttal.

#+begin_quote
  Why do you need to avoid off-chip memory access? Can your compiler not
  deal with these or would they obscure results? Might external RAM not
  alleviate some of the hardware size pressure?
#+end_quote

Off-chip memory currently cannot be used because it requires general
interaction with other Verilog modules from C. However, the stack, and
in future any other memories, will be stored in on-chip RAM blocks,
provided the synthesis tool infers them correctly.

#+begin_quote
  Dismissing occam, Spatial and Scala as too niche sounds disingenuous
  to me. Surely, the main criterion was that CompCert exists and can be
  connected to?
#+end_quote

You are quite right: the existence of CompCert (and other verification
infrastructure for C) was the big factor. We'll rephrase.

#+begin_quote
  Have you investigated whether function inlining is leading to some of
  the size blow-up in your verilog output?
#+end_quote

We believe that almost all of the blame for the area blow-up can be laid
on the synthesis tool failing to infer RAM blocks in the HLSCert
designs. After switching to a synthesis tool that infers RAM blocks in
both HLSCert and LegUp designs, the area blow-up goes away.
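For context on RAM inference, the sketch below shows the kind of generic
Verilog template that FPGA synthesis tools typically recognise and map
to an on-chip block RAM: a register array with a synchronous write and a
synchronous read. This is an illustrative example only, not HLSCert's
actual output; the module, port, and parameter names are ours.

#+begin_src verilog
// Generic single-port synchronous RAM template (a sketch, not HLSCert
// output). Most synthesis tools infer a block RAM from this shape.
module inferred_ram
  #(parameter WIDTH = 32,   // word width in bits
    parameter DEPTH = 256)  // number of words
   (input                      clk,
    input                      we,    // write enable
    input  [$clog2(DEPTH)-1:0] addr,
    input  [WIDTH-1:0]         din,
    output reg [WIDTH-1:0]     dout);

  reg [WIDTH-1:0] mem [0:DEPTH-1];

  always @(posedge clk) begin
    if (we)
      mem[addr] <= din;
    // A registered (synchronous) read is what allows the tool to map
    // the array onto a block RAM rather than distributed logic.
    dout <= mem[addr];
  end
endmodule
#+end_src

When the generated design deviates from such a pattern (e.g. by reading
the array combinationally), many tools fall back to implementing the
memory in registers and multiplexers, which is one plausible source of
the area blow-up discussed above.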