From c77cd6168a805255e44988b41d2911182ec99327 Mon Sep 17 00:00:00 2001
From: Yann Herklotz
Date: Fri, 26 Feb 2021 12:56:08 +0000
Subject: Add sync and reviews

* PLDI 2021 Paper #543 Reviews and Comments
  :PROPERTIES:
  :CUSTOM_ID: pldi-2021-paper-543-reviews-and-comments
  :END:
Paper #543 Formal Verification of High-Level Synthesis

* Review #543A
  :PROPERTIES:
  :CUSTOM_ID: review-543a
  :END:
** Overall merit
   :PROPERTIES:
   :CUSTOM_ID: overall-merit
   :END:
B. I support accepting this paper but will not champion it.

** Reviewer expertise
   :PROPERTIES:
   :CUSTOM_ID: reviewer-expertise
   :END:
X. *Expert* = "I have written a paper on one or more of this paper's
topics"

** Paper summary
   :PROPERTIES:
   :CUSTOM_ID: paper-summary
   :END:
"High-level synthesis" (HLS) is a term of art for compiling C programs
to hardware designs. This paper presents the first mechanized proof of
soundness for an HLS compiler. The implementation and proof are carried
out in Coq and build on CompCert, targeting Verilog. Evaluation shows
that most designs from a standard benchmark suite can be compiled, with
a significant overall performance penalty.

** Comments for author
   :PROPERTIES:
   :CUSTOM_ID: comments-for-author
   :END:
Cool project, presented very well! I understand HLS is indeed getting
popular in the real world, and it's important to expand compiler
verification to cover HLS. This paper unquestionably extends the
CompCert proofs to cover correctness of hardware generation, following
proof-assistant community norms about modularity and trusted-base size.
The evaluation seems appropriately scoped.

The main problem I have is that the work feels preliminary, given the
simplicity of the compilation strategy. We more or less get the
equivalent of one machine-language instruction run per cycle, which
doesn't sound so bad compared to traditional processors, but,
considering the high costs of fabricating a custom chip, we would expect
superior performance per price just by running machine code on commodity
processors! Your HLS is basically generating a custom processor with one
"instruction" per line of code in the C IR, which is totally not the
right way to get high performance from hardware accelerators, the most
important reason being what you note near the end of the paper: that you
don't take advantage of cross-instruction parallelism at all. However,
you are also incurring repeated synchronization overhead from tracking
control state, even for subcomputations that would fit well as
combinational logic.

It's quite possible all the above is fixable using the standard HLS
optimizations that the paper alludes to as future work, but I feel a
truly convincing evaluation would need to show some of those
optimizations. I'm not convinced the current HLSCert is going to
generate Verilog that outperforms running machine code on commodity
processors. The proof structure may need to change fundamentally to
accommodate smarter compilation.

Still, essentially all the ideas introduced here seem sound and provide
a novel and proper foundation for extending to optimizations, so I'm
giving the paper a small benefit of the doubt in my rating.

* Other apparent weaknesses in the design
  :PROPERTIES:
  :CUSTOM_ID: other-apparent-weaknesses-in-the-design
  :END:
You inline all function calls, which not only increases area but also
precludes separate compilation, often viewed as essential given that
hardware synthesis is much slower than compilation of comparably sized
software.
I suppose your approach still allows
modularity in hardware designs, where each module has private state and
public methods (functions), but I wonder if you are evaluating on any
such examples.

I can sympathize with the difficulty of getting Verilog compilers to
identify RAMs. IMO, this is not a significant demerit of the current
approach as part of a research prototype.

* Evaluation
  :PROPERTIES:
  :CUSTOM_ID: evaluation
  :END:
If the paper is accepted, I will be tempted to require a revision
showing a performance comparison with software on cutting-edge
processors, which IMO is more important to establish than the gap with
some open-source HLS tool.

I'm struck that your compilation strategy is rather close to partially
evaluating a hardware design for a processor, with respect to a fixed
machine-code program. The Clash library for Haskell can actually be
coaxed into doing that kind of partial evaluation! What would be
different in that approach, considering implementation and proof? It
would make for another worthwhile comparison in the evaluation.

* Size of contribution
  :PROPERTIES:
  :CUSTOM_ID: size-of-contribution
  :END:
It's quite dangerous to eyeball a finished proof-assistant development
and decide how much work it /deserved/ to be and what new research
contributions were required. Still, I'm tempted to call the proofs
reported in this paper "routine though tedious." I'm not surprised that
one of your biggest time sinks was proving correspondence between
different versions of bitvector operations! Less obvious is the gap
between memory models in CompCert and Verilog, but now it makes sense.

It is not clear that there are strong research contributions in the
proofs, just engineering, which seems worth doing even if there are no
surprises in the results.

* Small stuff
  :PROPERTIES:
  :CUSTOM_ID: small-stuff
  :END:
Line 148 links to an anonymized GitHub repository.
Please don't include +such links when submitting to conferences that allow code supplements. +It's too easy to modify a GitHub repo after the conference deadline, +without leaving any trace in the history. + +Figure 3 seems to indicate that register values can't affect state +transitions, but the example in Figure 2 shows just such an influence. +What's going on? + +Figure 4 feels unnecessarily complicated, given that you only compile +function bodies that themselves contain no function calls. The +complexity that justified CompCert's state vocabulary is gone. + +You discuss how HLSCert fails if more than $2^{32}$ states are needed. +That would be a pretty huge C program, wouldn't it, assuming each state +corresponds to at least one machine-language instruction taking at least +one byte? You'd need machine code of several gigabytes, which I suppose +could arise, but it doesn't sound like such an important restriction in +practice. + +- Line 621, "rather of" should be "rather than". +- Line 831, "from CompCert.These assumptions" is missing a space after + the period. + +** Post author-response comments + :PROPERTIES: + :CUSTOM_ID: post-author-response-comments + :END: +I agree that packing lists of instructions in single cycles should fit +well with your proof structure. However, other important optimizations +would include pipelining and copying subcomputations so that several can +run in parallel. It is not clear how much your proof framework would +need to change to support those optimizations. + +* Review #543B + :PROPERTIES: + :CUSTOM_ID: review-543b + :END: +** Overall merit + :PROPERTIES: + :CUSTOM_ID: overall-merit-1 + :END: +B. I support accepting this paper but will not champion it. + +** Reviewer expertise + :PROPERTIES: + :CUSTOM_ID: reviewer-expertise-1 + :END: +Y. 
*Knowledgeable* = "I follow the literature on this paper's topic(s)
but may have missed some relevant developments"

** Paper summary
   :PROPERTIES:
   :CUSTOM_ID: paper-summary-1
   :END:
The paper deals with designing a verified HLS tool which provides
confidence in the translation of C to the widely used Verilog HDL. Such
a verified HLS tool is imperative to avoid the alternative workarounds,
which are either unfaithful (i.e. the ones based on simulating a test
bench) or need computationally expensive TV (translation validation)
techniques, which might require domain-specific (hence non-trivial)
adjustments.

The authors designed the HLS tool with verification as their key design
goal and highlighted the critical design decisions which need to be
incorporated early in the design phase. These include the choice of
theorem prover, the framework for simulation proofs, etc., with proper
rationale behind those choices.

The design is based on extending CompCert with a newly added IR and
supporting a Verilog backend. The TCB includes the (borrowed) semantics
used for the source and target languages and the Coq proof assistant,
which are all well established.

The system is tested against the HLS tool LegUp, using the Polybench/C
benchmark suite, and evaluated on metrics such as (1) how many clock
cycles the design takes to complete, (2) clock frequency, (3) area
efficiency, and (4) compilation time.

The results show that HLSCert is a promising effort towards the first
mechanically verified HLS tool. The tool is (somewhat) comparable to its
(unverified) competitor on the aforementioned metrics, with room for
significant improvements.

** Comments for author
   :PROPERTIES:
   :CUSTOM_ID: comments-for-author-1
   :END:
* Strengths
  :PROPERTIES:
  :CUSTOM_ID: strengths
  :END:

1. Well written
2. The limitations are clearly mentioned: these include the allowed C
   subset, the single-cycle division/modulo operations, and the missing
   memory and scheduling optimizations (and verification of their
   correctness).\\
3. The related work is neatly depicted using the Venn diagram. Thanks
   for the addition!
4. Good that you went the extra mile in ironing out the unevenness of
   the test candidates by preparing an alternative version of the
   benchmarks that uses repeated divisions and multiplications by 2.

* A few Questions/Concerns
  :PROPERTIES:
  :CUSTOM_ID: a-few-questionsconcerns
  :END:
I have one broad concern related to the practicality of the system. I am
sure that you have also thought about it and plan to mitigate it. The
concern can be elaborated in two scenarios.

1. Related to the choice of benchmark: Is the selection based on the
   fact that the system in its current form does not support globals?
   The Polybench/C benchmark has limited (or maybe zero) usage of
   globals, whereas other relevant benchmarks like CHStone and Dhrystone
   have substantial usage of globals, often used to statically
   initialize inputs.

On a related note, is there a specific reason (like a fundamental
limitation) for skipping global support? Supporting them is definitely
recommended if that can extend the reach of the system to realistic
programs.

The LegUp system under comparison, with its support for functions,
global variables and pointer arithmetic, is comparatively better suited
to more realistic C programs and hence to the above-mentioned
benchmarks. I understand that the current work is a WIP and a big step
towards the broader goal, but it makes sense to be as complete as
possible, especially when it is trivial to achieve that.

2. Related to the absence of certified optimizations: As highlighted in
   your paper, a simple state-machine translation is inefficient, and a
   system that produces designs that execute fast, at high frequency,
   and with size/power efficiency is desirable. Thus it is imperative to
   add optimizations related to scheduling, pipelined resources, memory
   analysis and IR transformations. One of the highlighted reasons for
   the comparison setback is the fact that optimizations are not
   supported. I understand the plans for implementing certified versions
   of those in future, and in my opinion (please correct me if I am
   wrong) it should be as challenging as certifying the instruction
   translation from C to Verilog.

* nits
  :PROPERTIES:
  :CUSTOM_ID: nits
  :END:

1. 831: space missing
2. line 563 (and other similar places): please explain the notation
   (like delta). It's challenging to follow otherwise.
3. line 216: please specify a means (e.g. a colour code like orange for
   new additions) to separate the newly added parts.
4. Which framework are you using for function inlining? Can we have
   constructs in the system (other than recursion) which might inhibit
   inlining? What about the use of function pointers, or the presence of
   static variables and goto statements?

** Post author-response comments
   :PROPERTIES:
   :CUSTOM_ID: post-author-response-comments-1
   :END:
Thanks to the authors for the response; I have carefully read all of it.
It did improve my perception of some parts of the paper.

With the absence of support for globals and of certified optimizations,
it looks more like a work in progress. However, I believe the work has
the potential of becoming the state of the art (more so because of the
wide acceptance of the choices of source and target languages), or could
serve (as a baseline) to encourage similar efforts.

As the paper stands now, I will keep the score intact.

* Review #543C
  :PROPERTIES:
  :CUSTOM_ID: review-543c
  :END:
** Overall merit
   :PROPERTIES:
   :CUSTOM_ID: overall-merit-2
   :END:
D. I will argue against accepting this paper.

** Reviewer expertise
   :PROPERTIES:
   :CUSTOM_ID: reviewer-expertise-2
   :END:
X. *Expert* = "I have written a paper on one or more of this paper's
topics"

** Paper summary
   :PROPERTIES:
   :CUSTOM_ID: paper-summary-2
   :END:
The paper adapts CompCert's proof-carrying code model to the development
of an HLS tool. The CompCert implementation is used, and a hook is added
just after the 3AC (three-address code) IR (before regalloc and after
some CompCert optimizations) to translate it to HTL and finally
Verilog.\\
They call their tool HLSCert. There is very limited support in HLSCert,
and this is also evident in the order-of-magnitude slowdowns wrt
uncertified HLS tools. The formal semantics of Clight are from CompCert
and the formal semantics of Verilog are adapted from previous work by
Lööw and Myreen. As I see it, the primary contribution of the paper is
an experience study on what was required to plug a different (formally
verified) backend into the compiler toolchain, a backend that caters to
Verilog as the target language.

* Strengths
  :PROPERTIES:
  :CUSTOM_ID: strengths-1
  :END:

1. Important and relevant problem
2. Clear description of the objectives, design decisions, and experience
   studies
3. I enjoyed studying Figure 9

* Weaknesses
  :PROPERTIES:
  :CUSTOM_ID: weaknesses
  :END:

1. No support for optimization; this shows in the performance/area
   results wrt LegUp
2. No discussion of the reduction in the size of the "trusted computing
   base". What is the SLOC for the Verilog + Clight + simulation
   specifications (and how much shorter is it than the actual
   implementation)? Without this discussion, how does one assess the
   value of the verification effort?
There could be bugs in the spec
   itself, and unless there is evidence that the spec is much shorter
   than the implementation, it would be hard to justify this effort. I
   am guessing that adding support for optimizations will significantly
   increase the gap between spec and implementation complexity, and the
   verification effort would be much more meaningful then.
3. Most of the ideas are a rehash of the CompCert work in a new setting.
   Some of the proof discussion was not clearly motivated/justified, I
   thought (see comments section)

** Comments for author
   :PROPERTIES:
   :CUSTOM_ID: comments-for-author-2
   :END:
As clearly motivated in Section 1, verified HLS is a desirable
capability. It would be great to have such a verified HLS tool that
provides strong correctness guarantees.

My primary criticism is regarding the absence of optimization support.
This is also evident in the experimental results. As I see it, an
order-of-magnitude performance/area degradation is not very compelling.

More fundamentally, the evaluation of the verification effort requires
us to understand the difference between the size and complexity of the
implementation and the size and complexity of the spec (which includes
the Clight semantics, the Verilog semantics, and the encoding of the
observable equivalence properties). Without this discussion, it seems
hard to evaluate the verification effort.

Optimization logic is usually particularly amenable to verification
efforts. This is so because optimizations do not really change the
"meaning" of the implementation, but make the implementation much more
complex. But unfortunately, this paper does not tackle this space.

Also, I was surprised that the authors dismiss translation validation
(TV) by saying that "it does not provide watertight guarantees unless
the validator itself has been mechanically proven correct". Didn't
CompCert use TV for some parts of its compiler pipeline (I recall
register allocation)? Won't the authors need to use TV as well,
especially if they implement support for optimizations?

As it stands now, the paper just seems like a straightforward adaptation
of "mechanical proof infrastructure" from CompCert to the HLS problem.
This is not very interesting in itself --- there need to be more
insights and better results to merit a PLDI paper IMO.

Some more questions:

1. 480-482: why does HLSCert require the result of the divide circuit
   to be entirely combinational? Given the high costs, I would
   definitely have preferred a better solution to the problem.
2. 616-618: "we use 32-bit integers to represent bitvectors rather than
   arrays of booleans". What is the difference between a 32-bit
   (symbolic) integer and a bitvector? I could not understand this
   statement at all --- could you please elaborate in your author
   response?
3. Sections 4.2 and 4.3: why are you doing a forward simulation and
   then arguing about deterministic semantics to show backward
   simulation? Why not directly show a backward simulation? Please
   answer this in your author response.

** Post author-response comments
   :PROPERTIES:
   :CUSTOM_ID: post-author-response-comments-2
   :END:
Thank you for the comprehensive response -- I was especially happy to
see the TCB discussion in your response. I see two issues with the
current version of the paper:

1. It is perhaps not appropriate to take the improved area/performance
   results (described in the author response) into account, as the
   response is primarily meant for clarifications, not additional
   results.

2. The work becomes significantly more interesting in the presence of
   significant optimizations like pipelining and/or parallelism, and a
   discussion of the challenges around supporting something like that.
   However, such optimizations are currently missing.

While it is a good start to have the first verified HLS synthesis tool,
support for some optimizations and some interesting discussion of the
challenges around such support would make the paper a much stronger
contender. We hope the authors will consider submitting a revised
version in the future (to an appropriate venue).

* Review #543D
  :PROPERTIES:
  :CUSTOM_ID: review-543d
  :END:
** Overall merit
   :PROPERTIES:
   :CUSTOM_ID: overall-merit-3
   :END:
B. I support accepting this paper but will not champion it.

** Reviewer expertise
   :PROPERTIES:
   :CUSTOM_ID: reviewer-expertise-3
   :END:
Y. *Knowledgeable* = "I follow the literature on this paper's topic(s)
but may have missed some relevant developments"

** Paper summary
   :PROPERTIES:
   :CUSTOM_ID: paper-summary-3
   :END:
The paper presents a verified HLS compiler for translating a subset of C
programs directly into Verilog. The approach extends CompCert from the
3-address code layer downwards into new intermediate languages, ending
in an existing Verilog semantics adapted from HOL4.

The evaluation shows that the generated hardware is substantially (an
order of magnitude) larger and slower than that of existing optimising,
but unverified, compilers.

** Comments for author
   :PROPERTIES:
   :CUSTOM_ID: comments-for-author-3
   :END:
The overall approach makes sense, and I also agree with the motivation
-- I think this is an important and fruitful area to apply formal
compiler correctness to.

My main concern with this paper is that it might be a bit too early. The
first verified chain is a good milestone, but the performance is so bad
that I would not really call it viable yet. Since there are other
hardware verification approaches that do work (although maybe not
directly HLS), reasonable performance is important. I'm still slightly
in favour of acceptance, but I'm on the fence.
If this showed good
performance even within 50% or so of existing compilers, it would be a
strong accept.

It did not become entirely clear to me from the paper whether this is
due to the approach (I think it's fine), or due to some phases/major
optimisations missing that are just required to get reasonable hardware.

The main point that stuck out for me is not using register allocation[*]
and the probably not hardware-idiomatic treatment of RAM (although the
work is attempting to deal with this issue, the model and generated code
might just still be too far from what is "good" for hardware). I'd be
curious to hear more from the authors about what exactly is missing,
because if the paper is published as is, people who are not well
disposed to verification will take it as an example that verification is
infeasible to apply to this problem, which I think is neither true nor
in the authors' interest.

[*]: you make the point that registers are cheap in FPGAs, but infinite
registers will just blow up the size, and even slight pressure on the
register file to make the compiler reuse registers (without spilling on
the stack) might make a difference.

Details:

- have you investigated whether function inlining is leading to some of
  the size blow-up in your Verilog output?

- comparison to existing compilers: for the evaluation, would there be
  the possibility of attempting to switch off certain or all
  optimisations in the existing compiler you compare to, to get an
  additional set of data points? (You still need the original one.) This
  might also help to pinpoint where some of the overhead comes from.

- p2: "We also considered using a language with built-in parallel
  constructs that map well to parallel hardware, such as occam [40],
  Spatial [30] or Scala [2], but found these languages too niche.". This
  sounds disingenuous to me. Surely, the main criterion was that
  CompCert exists and can be connected to?
Trying to verify correctness
  of a parallel Scala-to-HW compiler sounds like a massive undertaking
  to me without existing verification infrastructure, and would be
  discarded because of that, not because the language is not used as
  widely (and if it really was better suited, maybe wide use would not
  be such a strong argument). I don't think it would hurt to just say
  that.

- p 7, line 612: "Removing support for external inputs to modules". You
  are not discussing how I/O is modelled and part of the final
  correctness statement then, if these are removed. Surely your hardware
  can compute more than one final value? It'd be fairly easy to generate
  and would not need to be that large.

- Theorem 1: you call this "backward simulation". This is in accordance
  with CompCert terminology, which unfortunately is the exact opposite
  of established refinement and data refinement terminology. This (the
  harder direction) is a classic forward simulation. You are not really
  to blame for this, but I think you should spell out which terminology
  you are using. Maybe say "in CompCert terminology" or similar, citing
  the paper where Xavier Leroy introduces it.

- p 9, line 978, "avoiding any need for off-chip memory access": Why?
  Can your compiler not deal with these or would they obscure results?
  Might external RAM not alleviate some of the hardware size pressure?

** Post author-response comments
   :PROPERTIES:
   :CUSTOM_ID: post-author-response-comments-3
   :END:
Thank you for your response; it is great to see that performance can be
increased so easily. It is unfortunate that we are not allowed to take
into account new results, which this will have to count as.

I would encourage the authors to take the feedback gathered here into
account and resubmit with the new results to a different venue.
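The terminology clash raised in Review D's Theorem 1 remark can be made
precise. The following LaTeX sketch states the diagram in question in
generic notation (transition systems with states $s_i$, $t_i$, a
matching relation $\sim$, and an observable event $e$; the symbols are
ours, not the paper's):

```latex
% Forward simulation in classic refinement terminology: every source
% step can be matched by zero or more target steps while preserving
% the matching relation.
\[
  s_1 \sim t_1 \;\land\; s_1 \xrightarrow{e} s_2
  \;\Longrightarrow\;
  \exists\, t_2.\;\; t_1 \xrightarrow{e}{}^{*} t_2 \;\land\; s_2 \sim t_2
\]
% CompCert instead names directions by behaviour inclusion: the diagram
% above shows source behaviours are preserved ("forward" in CompCert's
% sense), and a "backward" simulation -- every target behaviour is
% allowed by the source -- is then recovered by additionally proving
% the target semantics deterministic.
```

Under a deterministic target semantics the two notions coincide, which,
as we understand it, is exactly why the paper's proof proceeds via the
forward diagram and then argues determinism.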
+ +** Response by Yann Herklotz +[[mailto:yann.herklotz15@imperial.ac.uk][yann.herklotz15@imperial.ac.uk]] + :PROPERTIES: + :CUSTOM_ID: response-by-yann-herklotz-yann.herklotz15imperial.ac.uk + :END: +Thanks very much for such kind and detailed reviews! + +All reviewers were concerned by the area and performance of the hardware +designs generated by HLSCert. We'll begin our rebuttal with three items +of good news on that front, and then respond to individual questions. + +1. *HLSCert's designs are actually much more area-efficient than our + current evaluation suggests.* The synthesis tool that we used in our + evaluation (Intel Quartus) was able to infer RAM blocks in the LegUp + designs but not in the HLSCert designs; this led to the HLSCert + designs consuming about 21x the area of the LegUp designs on Intel + FPGAs, and having much worse clock frequencies too. But we have found + that by switching to the Xilinx Vivado synthesis tool and targeting a + Xilinx FPGA instead, RAM inference succeeds for both LegUp and + HLSCert designs. This gives a much fairer area comparison. HLSCert + designs now consume only *about 1.7x the area* of LegUp designs. We + propose to use Vivado instead of Quartus when revising our evaluation + section. (Interestingly, another synthesis tool called Yosys can + infer RAMs in HLSCert designs but not in LegUp designs, making + HLSCert designs consume about 0.1x the area of LegUp designs!) + +2. *HLSCert's designs actually obtain much higher clock frequencies than + our current evaluation suggests.* Not only does RAM inference reduce + area; it also leads to much shorter critical paths. So, in the + absence of division, HLSCert designs now obtain *about the same clock + frequency (1-1.5x)* as LegUp designs. + +3. 
*Much of LegUp's remaining performance advantage is down to generic
   (not HLS-specific) LLVM optimisations that CompCert doesn't perform.*
   Following an investigation suggested by Reviewer D, we have found
   that HLSCert only gains about a 1.1x improvement in cycle count from
   CompCert's built-in optimisations, whereas LegUp gains about a 2x
   improvement thanks to invoking the LLVM optimiser at =-O3=. All this
   is before any HLS-specific optimisations kick in. This discrepancy --
   which rather highlights the gap between a state-of-the-art verified
   compiler (CompCert) and a state-of-the-art mainstream compiler (LLVM
   =-O3=) -- explains much of the overall 4.5x advantage in cycle count
   that LegUp has over HLSCert.

LegUp's remaining performance advantage can be mostly explained by an
optimisation called /operation chaining/, whereby multiple dependent
operations are packed into a single clock cycle (instead of generating a
chain of registers across multiple cycles). We are currently working to
implement (and eventually verify) this optimisation in HLSCert.

With CompCert/LLVM optimisations and operation chaining /disabled/,
HLSCert designs have only *about 1.3x the cycle count* of LegUp designs.
When taking the updated clock frequencies into account too, HLSCert
designs are only *about 1.6x slower* in terms of execution time. This
drops to 1.3x slower when we enable an as-yet-unverified scheduling pass
that is currently under development. When operation chaining is enabled
in LegUp, HLSCert designs remain around 3-4x slower.

Of course, as several reviewers remarked, there is plenty of room for
improvement before HLSCert can be considered an /optimising/ HLS
compiler. Still, we believe that what we have achieved so far -- the
first mechanised HLS compiler -- is an important milestone in itself
that sets a new standard for HLS research, and upon which future
researchers will be well-positioned to build.

*** Responses to Reviewer A
    :PROPERTIES:
    :CUSTOM_ID: responses-to-reviewer-a
    :END:

#+begin_quote
  You inline all function calls.
#+end_quote

This is in keeping with existing HLS tools, which
[[https://dl.acm.org/doi/10.1145/2629547][make liberal use of
inlining]]. We are currently working to extend HLSCert with support for
function calls; once this is done, we look forward to experimenting with
inlining thresholds.

#+begin_quote
  The proof structure may need to change fundamentally to accommodate
  smarter compilation.
#+end_quote

We believe that the proof structure should allow for the addition of
simple optimisations in different passes to get comparable
quality-of-results to LegUp. Naturally, changes will need to be made --
e.g. replacing "one instruction per clock cycle" with "list of
instructions per clock cycle" to enable scheduling -- but we don't
anticipate fundamental redesigns. These changes can be done in other
intermediate languages so that our original proofs don't need to change.

#+begin_quote
  Figure 3 seems to indicate that register values can't affect state
  transitions, but the example in Figure 2 shows just such an influence.
  What's going on?
#+end_quote

Yes, indeed the registers should be able to affect the control flow. We
will rectify that in the diagram.

#+begin_quote
  Figure 4 feels unnecessarily complicated, given that you only compile
  function bodies that themselves contain no function calls.
#+end_quote

Indeed, it might be simpler if they were not present -- they do not
really add much by themselves, but they are needed to interact with the
rest of the CompCert model and are part of the assumptions that we make.
We will try and clarify this section, and maybe simplify the diagram.

#+begin_quote
  You discuss how HLSCert fails if more than $2^{32}$ states are needed.
  That would be a pretty huge C program, wouldn't it?
#+end_quote

Quite!
This rather extreme example is just meant to illustrate the fact
that HLSCert, despite being verified, is not guaranteed to produce an
output for every valid input. From a practical standpoint, no modern HLS
tool would be able to handle such a huge number of states.

*** Responses to Reviewer B
    :PROPERTIES:
    :CUSTOM_ID: responses-to-reviewer-b
    :END:

#+begin_quote
  Is the benchmark selection based on the fact that the system in its
  current form does not support globals?
#+end_quote

The choice of benchmark is mostly because CHStone uses large static
arrays to initialise the test bench. Since we currently don't support
separating a file into a test bench that is run in software and a
function that is synthesised into hardware, we need to synthesise the
large static array into a memory and initialise it, and this takes too
long to simulate. We are looking into how to circumvent this issue,
perhaps by having better support for initialising large static memories.

#+begin_quote
  Why do you not support global variables?
#+end_quote

It's on our todo list. CompCert stores each global variable in a
separate memory. We have already implemented translation of memories
into Verilog arrays as part of our handling of the stack. We expect that
supporting global variables will be a matter of translating multiple
memories, and engineering a way to discern from each pointer which array
to index into.

#+begin_quote
  Which framework are you using for function inlining? Can we have
  constructs in the system (other than recursion) which might inhibit
  inlining?
#+end_quote

We have modified the existing inliner in CompCert to inline all
functions, and we should therefore be able to inline any non-recursive
function.

#+begin_quote
  What about the use of function pointers, or the presence of static
  variables and goto statements?
#+end_quote

For function pointers and static variables, HLSCert will error out at
compile time, as these cannot be translated. However, gotos are
translated correctly.

*** Responses to Reviewer C
    :PROPERTIES:
    :CUSTOM_ID: responses-to-reviewer-c
    :END:

#+begin_quote
  As it stands now, the paper just seems like a straightforward
  adaptation of "mechanical proof infrastructure" from CompCert to the
  HLS problem.
#+end_quote

We believe that HLS is a topic of great and growing importance, and that
the first mechanically verified HLS tool is a valuable contribution that
future researchers will be keen to build upon. We think that our paper
can bring insights about hardware design to the PL community, and can
also enlighten the hardware design community about the feasibility of a
fully verified HLS tool.

#+begin_quote
  Why does HLSCert require the result of the divide circuit to be
  entirely combinational?
#+end_quote

This is the default behaviour of the synthesis tool when given a
division operator in Verilog. In such a scenario, the synthesis tool is
obliged to complete the division in one cycle, as other parts of the
circuit may rely on this fact. This could be fixed by creating (and
verifying) a pipelined divider module.

#+begin_quote
  I was surprised that the authors dismiss translation validation. Will
  you use translation validation to implement support for optimizations?
#+end_quote

We only intended to dismiss /unverified/ translation validation. We are
indeed working on using translation validation to implement
optimisations. We will clarify this further when revising the paper.

#+begin_quote
  Sections 4.2 and 4.3 : why are you doing a forward simulation and then
  arguing about deterministic semantics to show backward simulation. Why
  not directly show a backward simulation?
#+end_quote

We suspect that this confusion results from CompCert's terminology
diverging from established terminology: please see the last part of
[[https://pldi21.hotcrp.com/review/543D][Review #D]]. We'll be sure to
clarify this terminology when revising the paper.

#+begin_quote
  What is the size of the trusted computing base for HLSCert?
#+end_quote

The implementation alone is 1184 loc, though as the reviewer remarks,
this will increase considerably as we add optimisations. The main
trusted part of HLSCert is the Verilog semantics (405 loc), which is
mostly inherited from Lööw and Myreen. The specification of Clight is
inherited from CompCert and is well-established, but if we include that
too (584 loc), then the total trusted base is 989 loc. (All loc figures
are without spaces.) This is similar to the size of the implementation,
though we're not sure how meaningful it is to equate one line of
specification with one line of implementation. Table 1 includes all
these numbers (though the Verilog semantics, quoted there as 739 loc,
includes a few proofs about the semantics -- we'll fix that).

#+begin_quote
  What is the difference between a 32-bit (symbolic) integer and a
  bitvector?
#+end_quote

A bitvector is a list of Booleans. We instead use the internal CompCert
integer type, which has two fields: a number and a bitwidth (32).

*** Responses to Reviewer D
    :PROPERTIES:
    :CUSTOM_ID: responses-to-reviewer-d
    :END:

#+begin_quote
  How is I/O modelled? How does it feature in the final correctness
  statement?
#+end_quote

As we do not support modules in the Verilog semantics yet (one reason we
inline all functions), we translate arguments to the main function into
inputs to the module, and in the semantics we model the transition from
Callstate to State (cf. Figure 4) as taking the input values the module
was "called with" and assigning them to registers.
There are calling conventions that need to be respected here, such as
not changing the inputs or raising the reset signal during execution. As
long as these are respected, the correctness theorem will hold.

#+begin_quote
  Would there be the possibility of attempting to switch off certain or
  all optimisations in the existing compiler you compare to for getting
  an additional set of data points?
#+end_quote

Thanks for the suggestion! We have done exactly this at the top of our
rebuttal.

#+begin_quote
  Why do you need to avoid off-chip memory access? Can your compiler not
  deal with these or would they obscure results? Might external RAM not
  alleviate some of the hardware size pressure?
#+end_quote

Off-chip memory currently cannot be used because it requires general
interaction with other Verilog modules from C. However, the stack, and
in future any other memories, will be stored in on-chip RAM blocks,
provided the synthesis tool infers them correctly.

#+begin_quote
  Dismissing occam, Spatial and Scala as too niche sounds disingenuous
  to me. Surely, the main criterion was that CompCert exists and can be
  connected to?
#+end_quote

You are quite right: the existence of CompCert (and other verification
infrastructure for C) was the big factor. We'll rephrase.

#+begin_quote
  Have you investigated whether function inlining is leading to some of
  the size blow-up in your verilog output?
#+end_quote

We believe that almost all of the blame for the area blow-up can be laid
on the synthesis tool failing to infer RAM blocks in the HLSCert
designs. After switching to a synthesis tool that infers RAM blocks in
both HLSCert and LegUp designs, the area blow-up goes away.
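For context on RAM inference, the sketch below shows the kind of generic
Verilog template that FPGA synthesis tools typically recognise and map
to an on-chip block RAM: a register array with a synchronous write and a
synchronous read. This is an illustrative example only, not HLSCert's
actual output; the module, port, and parameter names are ours.

#+begin_src verilog
// Generic single-port synchronous RAM template (a sketch, not HLSCert
// output). Most synthesis tools infer a block RAM from this shape.
module inferred_ram
  #(parameter WIDTH = 32,   // word width in bits
    parameter DEPTH = 256)  // number of words
   (input                      clk,
    input                      we,    // write enable
    input  [$clog2(DEPTH)-1:0] addr,
    input  [WIDTH-1:0]         din,
    output reg [WIDTH-1:0]     dout);

  reg [WIDTH-1:0] mem [0:DEPTH-1];

  always @(posedge clk) begin
    if (we)
      mem[addr] <= din;
    // A registered (synchronous) read is what allows the tool to map
    // the array onto a block RAM rather than distributed logic.
    dout <= mem[addr];
  end
endmodule
#+end_src

When the generated design deviates from such a pattern (e.g. by reading
the array combinationally), many tools fall back to implementing the
memory in registers and multiplexers, which is one plausible source of
the area blow-up discussed above.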