* PLDI 2021 Paper #543 Reviews and Comments :PROPERTIES: :CUSTOM_ID: pldi-2021-paper-543-reviews-and-comments :END: Paper #543 Formal Verification of High-Level Synthesis * Review #543A :PROPERTIES: :CUSTOM_ID: review-543a :END: ** Overall merit :PROPERTIES: :CUSTOM_ID: overall-merit :END: B. I support accepting this paper but will not champion it. ** Reviewer expertise :PROPERTIES: :CUSTOM_ID: reviewer-expertise :END: X. *Expert* = "I have written a paper on one or more of this paper's topics" ** Paper summary :PROPERTIES: :CUSTOM_ID: paper-summary :END: "High-level synthesis" (HLS) is a term of art for compiling C programs to hardware designs. This paper presents the first mechanized proof of soundness for an HLS compiler. The implementation and proof are carried out in Coq and build on CompCert, targeting Verilog. Evaluation shows that most designs from a standard benchmark suite can be compiled, with a significant overall performance penalty. ** Comments for author :PROPERTIES: :CUSTOM_ID: comments-for-author :END: Cool project, presented very well! I understand HLS is indeed getting popular in the real world, and it's important to expand compiler verification to cover HLS. This paper unquestionably extends the CompCert proofs to cover correctness of hardware generation, following proof-assistant community norms about modularity and trusted-base size. The evaluation seems appropriately scoped. The main problem I have is that the work feels preliminary for the simplicity of the compilation strategy. We more or less get the equivalent of one machine-language instruction run per cycle, which doesn't sound so bad compared to traditional processors, but, considering the high costs of fabricating a custom chip, we would expect superior performance per price just by running machine code on commodity processors! Your HLS is basically generating a custom processor with one "instruction" per line of code in the C IR, which is totally not the right way to get high performance from hardware accelerators, the most important reason being what you note near the end of the paper, that you don't take advantage of cross-instruction parallelism at all. However, you are also incurring repeated synchronization overhead from tracking control state, even for subcomputations that would fit well as combinational logic. It's quite possible all the above is fixable using the standard HLS optimizations that the paper alludes to as future work, but I feel a truly convincing evaluation would need to show some of those optimizations. I'm not convinced the current HLSCert is going to generate Verilog that outperforms running machine code on commodity processors. The proof structure may need to change fundamentally to accommodate smarter compilation. Still, it makes sense that essentially all ideas introduced here are sound and provide a novel and proper foundation for extending to optimizations, so I'm giving the paper a small benefit of the doubt in my rating. ** Other apparent weaknesses in the design :PROPERTIES: :CUSTOM_ID: other-apparent-weaknesses-in-the-design :END: You inline all function calls, which not only increases area but also precludes separate compilation, which is often viewed as essential for efficient hardware synthesis (which is often much slower than for comparably sized software). I suppose your approach still allows modularity in hardware designs, where each module has private state and public methods (functions), but I wonder if you are evaluating on any such examples. 
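To make my earlier point about the compilation strategy concrete, here is a hypothetical sketch -- invented names, in no way HLSCert's actual output -- of the "custom processor" style I am describing: each 3AC instruction gets its own FSM state and thus its own clock cycle, even where adjacent operations would fit into a single cycle as combinational logic.

#+begin_src verilog
// Hypothetical sketch of a "one instruction per clock cycle" FSMD,
// roughly the shape of circuit the paper's strategy produces.
// Computes ret = (a + 1) * 2 in three cycles instead of one.
module fsmd_example(input clk, input rst, input [31:0] a,
                    output reg done, output reg [31:0] ret);
  reg [1:0] state;
  reg [31:0] x;
  always @(posedge clk)
    if (rst) begin state <= 2'd0; done <= 1'b0; end
    else case (state)
      2'd0: begin x   <= a + 32'd1; state <= 2'd1; end // x = a + 1
      2'd1: begin ret <= x << 1;    state <= 2'd2; end // ret = x * 2
      2'd2: done <= 1'b1;                              // hold and finish
    endcase
endmodule
#+end_src

A single continuous assignment (=ret = (a + 1) << 1=) would do the same work in one cycle; the state register and the cycle-per-instruction discipline are pure synchronization overhead here, multiplied across the whole program.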
I can sympathize with the difficulty of getting Verilog compilers to identify RAMs. IMO, this is not a significant demerit of the current approach as part of a research prototype. ** Evaluation :PROPERTIES: :CUSTOM_ID: evaluation :END: If the paper is accepted, I will be tempted to require a revision showing a performance comparison with software on cutting-edge processors, which IMO is more important to establish than the gap with some open-source HLS tool. I'm struck that your compilation strategy is rather close to partially evaluating a hardware design for a processor, with respect to a fixed machine-code program. The Clash library for Haskell can actually be coaxed into doing that kind of partial evaluation! What would be different in that approach, considering implementation and proof? It would make for another worthwhile comparison as evaluation. ** Size of contribution :PROPERTIES: :CUSTOM_ID: size-of-contribution :END: It's quite dangerous to eyeball a finished proof-assistant development and decide how much work it /deserved/ to be and what new research contributions were required. Still, I'm tempted to call the proofs reported in this paper "routine though tedious." I'm not surprised that one of your biggest time sinks was proving correspondence between different versions of bitvector operations! Less obvious is the gap between memory models in CompCert and Verilog, but now it makes sense. It is not clear that there are strong research contributions in the proofs, just engineering, which seems worth doing even if there are not surprises in the results. ** Small stuff :PROPERTIES: :CUSTOM_ID: small-stuff :END: Line 148 links to an anonymized GitHub repository. Please don't include such links when submitting to conferences that allow code supplements. It's too easy to modify a GitHub repo after the conference deadline, without leaving any trace in the history. Figure 3 seems to indicate that register values can't affect state transitions, but the example in Figure 2 shows just such an influence. What's going on? Figure 4 feels unnecessarily complicated, given that you only compile function bodies that themselves contain no function calls. The complexity that justified CompCert's state vocabulary is gone. You discuss how HLSCert fails if more than $2^{32}$ states are needed. That would be a pretty huge C program, wouldn't it, assuming each state corresponds to at least one machine-language instruction taking at least one byte? You'd need machine code of several gigabytes, which I suppose could arise, but it doesn't sound like such an important restriction in practice. - Line 621, "rather of" should be "rather than". - Line 831, "from CompCert.These assumptions" is missing a space after the period. ** Post author-response comments :PROPERTIES: :CUSTOM_ID: post-author-response-comments :END: I agree that packing lists of instructions in single cycles should fit well with your proof structure. However, other important optimizations would include pipelining and copying subcomputations so that several can run in parallel. It is not clear how much your proof framework would need to change to support those optimizations. * Review #543B :PROPERTIES: :CUSTOM_ID: review-543b :END: ** Overall merit :PROPERTIES: :CUSTOM_ID: overall-merit-1 :END: B. I support accepting this paper but will not champion it. ** Reviewer expertise :PROPERTIES: :CUSTOM_ID: reviewer-expertise-1 :END: Y. 
*Knowledgeable* = "I follow the literature on this paper's topic(s) but may have missed some relevant developments" ** Paper summary :PROPERTIES: :CUSTOM_ID: paper-summary-1 :END: The paper deals with designing a verified HLS tool which provides confidence in the translation of C to the widely used Verilog HDL. Such a verified HLS tool is imperative to avoid the alternative workarounds, which are either unfaithful (i.e. the ones based on simulating a test bench) or need computationally expensive TV techniques which might need domain-specific (hence non-trivial) adjustments. The authors designed the HLS tool with verification as their key design goal and highlighted the critical design decisions which need to be incorporated early in the design phase. These include the choice of the theorem prover, the framework for simulation proofs, etc., with proper rationale behind those choices. The design is based on extending CompCert with a newly added IR and supporting a Verilog backend. The TCB includes the (borrowed) semantics used for the source and target languages and the Coq proof assistant - which are all well established. The system is tested against the HLS tool LegUp, using the Polybench/C benchmark suite, and evaluated based on metrics like (1) how many clock cycles the design takes to complete, (2) clock frequency, (3) area efficiency, and (4) compilation time. The results show that HLSCert is a promising effort towards the first mechanically verified HLS tool. The tool is (somewhat) comparable to its (unverified) competitor based on the aforementioned metrics, with room for significant improvements. ** Strengths :PROPERTIES: :CUSTOM_ID: strengths :END: 1. Well written 2. The limitations are clearly mentioned: these include the allowed C subset, the single-cycle division/modulo operations, and the missing memory & scheduling optimizations along with verification of their correctness.\\ 3. The related work is neatly depicted using the Venn diagram. Thanks for the addition! 4. Good that you went the extra mile in ironing out the unevenness of the test candidates by preparing an alternative version of the benchmarks that uses repeated divisions and multiplications by 2. ** A few Questions/Concerns :PROPERTIES: :CUSTOM_ID: a-few-questionsconcerns :END: I have one broad concern related to the practicality of the system. I am sure that you have also thought about it and have plans to mitigate it. The concern can be elaborated in two scenarios. 1. Related to the choice of the benchmark: Is the selection based on the fact that the system in its current form does not support globals? The Polybench/C benchmark has limited (or maybe zero) usage of globals, whereas the other relevant benchmarks like CHStone and Dhrystone have substantial usage of globals, often used to statically initialize inputs. On a related note, is there a specific reason (like a fundamental limitation) for skipping global support? Supporting them is definitely recommended if that can enhance the reach of the system to realistic programs. The LegUp system under comparison, with its support for functions, global variables and pointer arithmetic, is comparatively better suited for more realistic C programs and hence the above-mentioned benchmarks. I understand that the current work is a WIP and a big step towards the broader goal, but it makes sense to be as complete as possible, especially when it is trivial to achieve that. 2.
Related to the absence of certified optimizations: As highlighted in your paper, a simple state-machine translation is inefficient, and a system which executes designs fast, at high frequency, and with size/power efficiency is desirable. Thus it is imperative to add optimizations related to scheduling, pipelined resources, memory analysis, and the IR. One of the highlighted reasons for the setback in the comparison is the fact that optimizations are not supported. I understand the plans for implementing a certified version of those in the future, and in my opinion (please correct me if I am wrong) it should be as challenging as certifying the instruction translation from C to Verilog. ** nits :PROPERTIES: :CUSTOM_ID: nits :END: 1. 831: space missing 2. line 563 (and other similar places): Please explain the notations (like delta). It's challenging to follow otherwise. 3. line 216: Please specify a means (e.g. a color code, like orange for new additions) to separate the newly added parts. 4. Which framework are you using for function inlining? Can we have constructs in the system (other than recursion) which might inhibit inlining? What about the use of function pointers, or the presence of static variables and goto statements? ** Post author-response comments :PROPERTIES: :CUSTOM_ID: post-author-response-comments-1 :END: Thanks to the authors for your response. I have carefully read all of it. It did improve my perception of some parts of the paper. With the absence of support for globals and of certified optimizations, it looks more like work in progress. However, I believe the work has the potential of becoming the state of the art (more so because of the wide acceptance of the choices of the source and target languages) or could encourage (as a baseline) similar efforts. As the paper stands now, I will keep the score intact. * Review #543C :PROPERTIES: :CUSTOM_ID: review-543c :END: ** Overall merit :PROPERTIES: :CUSTOM_ID: overall-merit-2 :END: D. I will argue against accepting this paper. ** Reviewer expertise :PROPERTIES: :CUSTOM_ID: reviewer-expertise-2 :END: X. *Expert* = "I have written a paper on one or more of this paper's topics" ** Paper summary :PROPERTIES: :CUSTOM_ID: paper-summary-2 :END: The paper adapts CompCert's proof-carrying code model to the development of an HLS tool. The CompCert implementation is used, and a hook is added just after the 3AC (three-address code) IR (before regalloc and after some CompCert optimizations) to translate it to HTL and finally Verilog.\\ They call their tool HLSCert. There is very limited support in HLSCert, and this is also evident in the order-of-magnitude slowdowns wrt uncertified HLS tools. The formal semantics of Clight are from CompCert and the formal semantics of Verilog are adapted from previous work by Lööw and Myreen. As I see it, the primary contribution of the paper is an experience study of what was required to plug a different (formally verified) backend into the compiler toolchain, a backend that caters to Verilog as the target language. ** Strengths :PROPERTIES: :CUSTOM_ID: strengths-1 :END: 1. Important and relevant problem 2. Clear description of the objectives, design decisions, and experience studies 3. I enjoyed studying Figure 9 ** Weaknesses :PROPERTIES: :CUSTOM_ID: weaknesses :END: 1. No support for optimization; shows in the performance/area results wrt LegUp 2. No discussion of the reduction in the size of the "trusted computing base".
What is the SLOC for the Verilog + Clight + simulation specifications (and how much shorter is it than the actual implementation)? Without this discussion, how does one assess the value of the verification effort? There could be bugs in the spec itself, and unless there is evidence that the spec is much shorter than the implementation, it would be hard to justify this effort. I am guessing that adding support for optimizations will significantly increase the gap between spec and implementation complexity, and the verification effort would be much more meaningful then. 3. Most of the ideas are a rehash of the CompCert work in a new setting. Some of the proof discussion was not clearly motivated/justified, I thought (see the comments section) ** Comments for author :PROPERTIES: :CUSTOM_ID: comments-for-author-2 :END: As clearly motivated in Section 1, verified HLS is a desirable capability. It would be great to have such a verified HLS tool that provides strong correctness guarantees. My primary criticism is regarding the absence of optimization support. This is also evident in the experimental results. As I see it, an order-of-magnitude performance/area degradation is not very compelling. More fundamentally, the evaluation of the verification effort requires us to understand the difference between the size and complexity of the implementation and the size and complexity of the spec (which includes the Clight semantics, the Verilog semantics, and the encoding of the observable equivalence properties). Without this discussion, it seems hard to evaluate the verification effort. Optimization logic is usually particularly amenable to verification efforts. This is so because optimizations do not really change the "meaning" of the implementation, but make the implementation much more complex. But unfortunately, this paper does not tackle this space. Also, I was surprised that the authors dismiss translation validation (TV) by saying that "it does not provide watertight guarantees unless the validator itself has been mechanically proven correct". Didn't CompCert use TV for some parts of its compiler pipeline (I recall register allocation)? Won't the authors need to use TV as well, especially if they need to implement support for optimizations? As it stands now, the paper just seems like a straightforward adaptation of "mechanical proof infrastructure" from CompCert to the HLS problem. This is not very interesting in itself --- there need to be more insights and better results to merit a PLDI paper, IMO. Some more questions: 1. 480-482: Why does HLSCert require the result of the divide circuit to be entirely combinational? Given the high costs, I would have definitely preferred a better solution to the problem. 2. 616-618: "we use 32-bit integers to represent bitvectors rather than arrays of booleans". What is the difference between a 32-bit (symbolic) integer and a bitvector? I could not understand this statement at all --- could you please elaborate in your author response? 3. Sections 4.2 and 4.3: Why are you doing a forward simulation and then arguing about deterministic semantics to show a backward simulation? Why not directly show a backward simulation? Please answer this in your author response. ** Post author-response comments :PROPERTIES: :CUSTOM_ID: post-author-response-comments-2 :END: Thank you for the comprehensive response -- I was especially happy to see the TCB discussion in your response. I see two issues with the current version of the paper: 1.
It is perhaps not appropriate to take the improved area/performance results (described in the author response) into account, as the response is primarily meant for clarifications, not additional results. 2. The work becomes significantly more interesting in the presence of significant optimizations like pipelining and/or parallelism, together with a discussion of the challenges around supporting something like that. However, such optimizations are currently missing. While it is a good start to have the first verified HLS tool, support for some optimizations and some interesting discussion of the challenges around such support would make the paper a much stronger contender. We hope the authors would consider submitting a revised version in the future (to an appropriate venue). * Review #543D :PROPERTIES: :CUSTOM_ID: review-543d :END: ** Overall merit :PROPERTIES: :CUSTOM_ID: overall-merit-3 :END: B. I support accepting this paper but will not champion it. ** Reviewer expertise :PROPERTIES: :CUSTOM_ID: reviewer-expertise-3 :END: Y. *Knowledgeable* = "I follow the literature on this paper's topic(s) but may have missed some relevant developments" ** Paper summary :PROPERTIES: :CUSTOM_ID: paper-summary-3 :END: The paper presents a verified HLS compiler for translating a subset of C programs directly into Verilog. The approach extends CompCert from the 3-address-code layer downwards into new intermediate languages, ending in an existing Verilog semantics adapted from HOL4. The evaluation shows that the generated hardware is substantially (an order of magnitude) larger and slower than that of existing optimising, but unverified, compilers. ** Comments for author :PROPERTIES: :CUSTOM_ID: comments-for-author-3 :END: The overall approach makes sense, and I also agree with the motivation -- I think this is an important and fruitful area to apply formal compiler correctness to. My main concern with this paper is that it might be a bit too early. The first verified chain is a good milestone, but the performance is so bad that I would not really call it viable yet. Since there are other hardware verification approaches that do work (although maybe not directly HLS), reasonable performance is important. I'm still slightly in favour of acceptance, but I'm on the fence. If this showed good performance even within 50% or so of existing compilers, it would be a strong accept. It did not become entirely clear to me from the paper whether this is due to the approach (I think it's fine), or due to some phases/major optimisations missing that are just required to get reasonable hardware. The main points that stuck out for me are the lack of register allocation[*] and the probably not hardware-idiomatic treatment of RAM (although the work is attempting to deal with this issue, the model and generated code might just still be too far from what is "good" for hardware). I'd be curious to hear more from the authors about what exactly is missing, because if the paper is published as is, people who are not well disposed to verification will take it as an example that verification is infeasible to apply to this problem, which I think is neither true nor in the authors' interest. [*]: you make the point that registers are cheap in FPGAs, but infinite registers will just blow up the size, and even slight pressure on the register file to make the compiler reuse registers (without spilling on the stack) might make a difference.
Details: - Have you investigated whether function inlining is leading to some of the size blow-up in your Verilog output? - Comparison to existing compilers: for the evaluation, would there be the possibility of attempting to switch off certain or all optimisations in the existing compiler you compare to, to get an additional set of data points? (You still need the original one.) This might also help to pinpoint where some of the overhead comes from. - p2: "We also considered using a language with built-in parallel constructs that map well to parallel hardware, such as occam [40], Spatial [30] or Scala [2], but found these languages too niche." This sounds disingenuous to me. Surely, the main criterion was that CompCert exists and can be connected to? Trying to verify the correctness of a parallel Scala-to-HW compiler sounds like a massive undertaking to me without existing verification infrastructure, and would be discarded because of that, not because the language is not as widely used (and if it really was better suited, maybe wide use would not be such a strong argument). I don't think it would hurt to just say that. - p 7, line 612: "Removing support for external inputs to modules". If these are removed, you do not discuss how I/O is modelled and how it becomes part of the final correctness statement. Surely your hardware can compute more than one final value? It'd be fairly easy to generate and would not need to be that large. - Theorem 1: you call this "backward simulation". This is in accordance with CompCert terminology, which unfortunately is the exact opposite of established refinement and data refinement terminology. This (the harder direction) is a classic forward simulation. You are not really to blame for this, but I think you should spell out which terminology you are using. Maybe say "in CompCert terminology" or similar, citing the paper where Xavier Leroy introduces it. - p 9, line 978 "avoiding any need for off-chip memory access": Why? Can your compiler not deal with these or would they obscure results? Might external RAM not alleviate some of the hardware size pressure? ** Post author-response comments :PROPERTIES: :CUSTOM_ID: post-author-response-comments-3 :END: Thank you for your response; it is great to see that performance can be increased so easily. It is unfortunate that we are not allowed to take new results into account, which this will have to count as. I would encourage the authors to take the feedback gathered here into account and resubmit with the new results to a different venue. * Response by Yann Herklotz [[mailto:yann.herklotz15@imperial.ac.uk][yann.herklotz15@imperial.ac.uk]] :PROPERTIES: :CUSTOM_ID: response-by-yann-herklotz-yann.herklotz15imperial.ac.uk :END: Thanks very much for such kind and detailed reviews! All reviewers were concerned by the area and performance of the hardware designs generated by HLSCert. We'll begin our rebuttal with three items of good news on that front, and then respond to individual questions. 1. *HLSCert's designs are actually much more area-efficient than our current evaluation suggests.* The synthesis tool that we used in our evaluation (Intel Quartus) was able to infer RAM blocks in the LegUp designs but not in the HLSCert designs; this led to the HLSCert designs consuming about 21x the area of the LegUp designs on Intel FPGAs, and having much worse clock frequencies too.
But we have found that by switching to the Xilinx Vivado synthesis tool and targeting a Xilinx FPGA instead, RAM inference succeeds for both LegUp and HLSCert designs. This gives a much fairer area comparison. HLSCert designs now consume only *about 1.7x the area* of LegUp designs. We propose to use Vivado instead of Quartus when revising our evaluation section. (Interestingly, another synthesis tool called Yosys can infer RAMs in HLSCert designs but not in LegUp designs, making HLSCert designs consume about 0.1x the area of LegUp designs!) 2. *HLSCert's designs actually obtain much higher clock frequencies than our current evaluation suggests.* Not only does RAM inference reduce area; it also leads to much shorter critical paths. So, in the absence of division, HLSCert designs now obtain *about the same clock frequency (1-1.5x)* as LegUp designs. 3. *Much of LegUp's remaining performance advantage is down to generic (not HLS-specific) LLVM optimisations that CompCert doesn't perform.* Following an investigation suggested by Reviewer D, we have found that HLSCert only gains about a 1.1x improvement in cycle count from CompCert's built-in optimisations, whereas LegUp gains about a 2x improvement thanks to invoking the LLVM optimiser at =-O3=. All this is before any HLS-specific optimisations kick in. This discrepancy -- which rather highlights the gap between a state-of-the-art verified compiler (CompCert) and a state-of-the-art mainstream compiler (LLVM =-O3=) -- explains much of the overall 4.5x improvement in cycle count that LegUp has over HLSCert. LegUp's remaining performance advantage can be mostly explained by an optimisation called /operation chaining/, whereby multiple dependent operations are packed into a single clock cycle (instead of generating a chain of registers across multiple cycles). We are currently working to implement (and eventually verify) this optimisation in HLSCert. With CompCert/LLVM optimisations and operation chaining /disabled/, HLSCert designs have only *about 1.3x the cycle count* of LegUp designs. When taking the updated clock frequencies into account too, HLSCert designs are only *about 1.6x slower* in terms of execution time. This drops to 1.3x slower when we enable an as-yet-unverified scheduling pass that is currently under development. When operation chaining is enabled in LegUp, HLSCert designs remain around 3-4x slower. Of course, as several reviewers remarked, there is plenty of room for improvement before HLSCert can be considered an /optimising/ HLS compiler. Still, we believe that what we have achieved so far -- the first mechanised HLS compiler -- is an important milestone in itself that sets a new standard for HLS research, and upon which future researchers will be well-positioned to build. *** Responses to Reviewer A :PROPERTIES: :CUSTOM_ID: responses-to-reviewer-a :END: #+begin_quote You inline all function calls. #+end_quote This is in keeping with existing HLS tools, which [[https://dl.acm.org/doi/10.1145/2629547][make liberal use of inlining]]. We are currently working to extend HLSCert with support for function calls; once this is done, we look forward to experimenting with inlining thresholds. #+begin_quote The proof structure may need to change fundamentally to accommodate smarter compilation. #+end_quote We believe that the proof structure should allow for the addition of simple optimisations in different passes to get comparable quality-of-results to LegUp. Naturally, changes will need to be made -- e.g.
replacing "one instruction per clock cycle" with "list of instructions per clock cycle" to enable scheduling -- but we don't anticipate fundamental redesigns. These changes can be done in other intermediate languages so that our original proofs don't need to change. #+begin_quote Figure 3 seems to indicate that register values can't affect state transitions, but the example in Figure 2 shows just such an influence. What's going on? #+end_quote Yes, indeed the registers should be able to affect the control flow. We will rectify that in the diagram. #+begin_quote Figure 4 feels unnecessarily complicated, given that you only compile function bodies that themselves contain no function calls. #+end_quote Indeed, it might be simpler if they were not present -- they do not really add much by themselves, but they are needed to interact with the rest of the CompCert model and are part of the assumptions that we make. We will try and clarify this section, and maybe simplify the diagram. #+begin_quote You discuss how HLSCert fails if more than $2^{32}$ states are needed. That would be a pretty huge C program, wouldn't it? #+end_quote Quite! This rather extreme example is just meant to illustrate the fact that HLSCert, despite being verified, is not guaranteed to produce an output for every valid input. From a practical standpoint, no modern HLS tools would be able to handle such a huge number of states. *** Responses to Reviewer B :PROPERTIES: :CUSTOM_ID: responses-to-reviewer-b :END: #+begin_quote Is the benchmark selection based on the fact that the system in its current form does not support globals? #+end_quote The choice of benchmark is mostly because CHStone uses large static arrays to initialise the test bench. Since we currently don't support separating a file into a test bench that is run in software and a function that is synthesised into hardware, we need to synthesise the large static array into a memory and initialise it, and this takes too long to simulate. We are looking into how to circumvent this issue, by maybe having better support for initialising large static memories. #+begin_quote Why do you not support global variables? #+end_quote It's on our todo list. CompCert stores each global variable in a separate memory. We have already implemented translation of memories into Verilog arrays as part of our handling of the stack. We expect that supporting global variables will be a matter of translating multiple memories, and engineering a way to discern from each pointer which array to index into. #+begin_quote Which framework are you using for function inlining? Can we have constructs in the system (other than recursion) which might inhibit inlining? #+end_quote We have modified the existing inliner in CompCert to inline all functions, and we should be therefore able to inline any non-recursive function. #+begin_quote What about the use of function pointers, or the presence of static variables and goto statements? #+end_quote For function pointers and static variables, HLSCert will error-out at compile time as these cannot be translated. However, gotos are translated correctly. *** Responses to Reviewer C :PROPERTIES: :CUSTOM_ID: responses-to-reviewer-c :END: #+begin_quote As it stands now, the paper just seems like a straightforward adaptation of "mechanical proof infrastructure" from CompCert to the HLS problem. 
#+end_quote We believe that HLS is a topic of great and growing importance, and that the first mechanically verified HLS tool is a valuable contribution that future researchers will be keen to build upon. We think that our paper can bring insights about hardware design to the PL community, and can also enlighten the hardware design community about the feasibility of a fully verified HLS tool. #+begin_quote Why does HLSCert require the result of the divide circuit to be entirely combinational? #+end_quote This is the default behaviour of the synthesis tool when given a division operator in Verilog. In such a scenario, the synthesis tool is obliged to complete the division in one cycle, as other parts of the circuit may rely on this fact. This could be fixed by creating (and verifying) a pipelined divider module. #+begin_quote I was surprised that the authors dismiss translation validation. Will you use translation validation to implement support for optimizations? #+end_quote We only intended to dismiss /unverified/ translation validation. We are indeed working on using translation validation to implement optimisations. We will further clarify this when revising the paper. #+begin_quote Sections 4.2 and 4.3: Why are you doing a forward simulation and then arguing about deterministic semantics to show a backward simulation? Why not directly show a backward simulation? #+end_quote We suspect that this confusion results from CompCert's terminology diverging from established terminology: please see the last part of [[https://pldi21.hotcrp.com/review/543D][Review #D]]. We'll be sure to clarify this terminology when revising the paper. #+begin_quote What is the size of the trusted computing base for HLSCert? #+end_quote The implementation alone is 1184 loc, though as the reviewer remarks, this will increase considerably as we add optimisations. The main trusted part of HLSCert is the Verilog semantics (405 loc), which is mostly inherited from Lööw and Myreen. The specification of Clight is inherited from CompCert and is well-established, but if we include that too (584 loc), then the total trusted base is 989 loc. (All loc figures are without spaces.) This is similar to the size of the implementation, though we're not sure how meaningful it is to equate one line of specification with one line of implementation. Table 1 includes all these numbers (though the Verilog semantics, quoted there as 739 loc, includes a few proofs about the semantics -- we'll fix that). #+begin_quote What is the difference between a 32-bit (symbolic) integer and a bitvector? #+end_quote A bitvector is a list of Booleans. We instead use the internal CompCert integer type, which has two fields: a number and a bitwidth (32). *** Responses to Reviewer D :PROPERTIES: :CUSTOM_ID: responses-to-reviewer-d :END: #+begin_quote How is I/O modelled? How does it feature in the final correctness statement? #+end_quote As we do not support modules in the Verilog semantics yet (one reason we inline all functions), we translate arguments to the main function into inputs to the module, and in the semantics we model the transition from Callstate to State (cf. Figure 4) as taking the input values the module was "called with" and assigning them to registers. There are calling conventions that need to be respected here, such as not changing the inputs or raising the reset signal during execution. As long as these are respected, the correctness theorem will hold.
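To illustrate this convention, here is a minimal sketch of a generated top-level interface -- the port names and the one-line body are invented for this example and do not match our actual output:

#+begin_src verilog
// Hedged sketch of the calling convention described above.
module main(input clk, input rst,
            input [31:0] arg0,       // C argument, mapped to an input port
            output reg fin,          // raised once execution has finished
            output reg [31:0] ret);  // holds the C return value
  always @(posedge clk)
    if (rst) begin            // reset plays the role of Callstate:
      fin <= 1'b0;            // clear the finished flag before the
      ret <= 32'd0;           // computation begins
    end else if (!fin) begin
      ret <= arg0 + 32'd1;    // stand-in for the compiled function body
      fin <= 1'b1;            // signal completion and hold the result
    end
endmodule
#+end_src

The convention is then that the environment holds =arg0= stable and keeps =rst= low for the duration of execution, and reads =ret= once =fin= goes high; the correctness theorem is stated under these assumptions.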
#+begin_quote Would there be the possibility of attempting to switch off certain or all optimisations in the existing compiler you compare to for getting an additional set of data points? #+end_quote Thanks for the suggestion! We have done exactly this at the top of our rebuttal. #+begin_quote Why do you need to avoid off-chip memory access? Can your compiler not deal with these or would they obscure results? Might external RAM not alleviate some of the hardware size pressure? #+end_quote Off-chip memory currently cannot be used because it requires general interaction with other Verilog modules from C. However, the stack, and in future any other memories, will be stored in on-chip RAM blocks, if inferred correctly by the synthesis tool. #+begin_quote Dismissing occam, Spatial and Scala as too niche sounds disingenuous to me. Surely, the main criterion was that CompCert exists and can be connected to? #+end_quote You are quite right: the existence of CompCert (and other verification infrastructure for C) was the big factor. We'll rephrase. #+begin_quote Have you investigated whether function inlining is leading to some of the size blow-up in your Verilog output? #+end_quote We believe that almost all of the blame for the area blow-up can be laid on the synthesis tool failing to infer RAM blocks in the HLSCert designs. After switching to a synthesis tool that infers RAM blocks in both HLSCert and LegUp designs, the area blow-up goes away.
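For reference, the pattern that synthesis tools look for when inferring a RAM block is a single array with synchronous read and write, roughly as in the following sketch (illustrative only, not our generated code):

#+begin_src verilog
// Illustrative sketch of a RAM-inferable Verilog pattern: one array,
// synchronous write and registered read in a single clocked block.
// Tools like Vivado map this onto a block RAM rather than registers.
module ram_sketch(input clk, input we,
                  input [7:0] addr, input [31:0] din,
                  output reg [31:0] dout);
  reg [31:0] mem [0:255];        // 256 x 32-bit array
  always @(posedge clk) begin
    if (we) mem[addr] <= din;    // synchronous write
    dout <= mem[addr];           // synchronous (registered) read
  end
endmodule
#+end_src

Deviations from such a template -- an asynchronous read, for instance -- can cause a tool to fall back to implementing the array in registers, which is consistent with the blow-up we observed under Quartus.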