OOPSLA 2021 Paper #86 Reviews and Comments
===========================================================================
Paper #86 Formal Verification of High-Level Synthesis


Review #86A
===========================================================================
* Updated: 28 Jun 2021 1:28:50pm AoE

Overall merit
-------------
2. Weak reject

Review confidence
-----------------
2. Medium

Paper summary
-------------
This paper considers how to produce a certified (formally verified) translator from C to Verilog. The authors modify the CompCert verified C compiler by introducing a new intermediate language and a compiler pass to that language, and then a further pass to Verilog. This yields a verified toolchain from C to Verilog, from which unverified synthesis to a netlist is performed. The authors evaluate against an open-source tool, finding significant performance gaps.

Strengths and Weaknesses
------------------------
Strengths:
- First verified toolchain from C to Verilog
- Lots of small lessons for others wishing to target hardware description languages from the CompCert chain

Weaknesses:
- Much of the development is seemingly straightforward, e.g. applying a seemingly straightforward translation algorithm. Overcoming the major performance problems seems to imply having to solve some hard problems that the current design doesn't account for. The paper would be strengthened by being clearer here.
- In this sense the work is highly interesting, but it seems to have punted many of the hard problems down the line, or at least provides insufficient evidence to decide how hard those problems will be and to what degree the current design can easily accommodate their solutions.
- Some limitations are buried in the paper (e.g. having to run benchmarks on the smallest data set sizes due to the inability to access external storage, etc.)

Comments for author
-------------------
This is a pleasant paper that tackles an interesting problem.  Being the first verified compiler from C to Verilog is of course impressive. However, the paper also builds on much prior work, notably the Verilog semantics of Lööw and Myreen.

Much of the development also appears somewhat workmanlike; however, there are clearly general lessons to be learned for others wanting to tackle high-level synthesis using CompCert. In particular, the early discussion on design choices and the commentary throughout was enlightening.

My main concern with the work is that it seems to have some hard limitations at present and it is unclear whether its current design will require radical changes to overcome those limitations. For instance, to support efficient division (without massive blow-up in output size) would that require being able to re-use code, and to parallelise execution in a way that doesn't seem immediately supported by the present translation approach? Will existing limitations like relying on CompCert inlining come to bite here? If implementing scheduling will be required to do efficient division, for instance, how big a change might that be to the present development? The paper leaves these questions unanswered. 

We have seen many verified toolchains require significant rework to address limitations in the past (CompCert itself being a good example, but also Bedrock more recently). Might we expect the same here? Therefore, might one question whether the approach taken is overly simplistic? The paper would do well to address these kinds of questions.

On the other hand, the result is clearly of significance. Therefore, if the authors can convince me that the current design can accommodate solutions to its main drawbacks without major changes, then I could be convinced in favour.

More minor comments follow:
- You mention early on issues in prior tools when straying outside the subset of C they support. Yet your approach cannot provide guarantees to C programs with UB etc. Do existing HLS tools target proper C semantics with all of its dragons of UB and so on? Clarifying that would help.
- Similarly, the disadvantages of translation validation approaches seem kind of weak when compared to the advantage that they offer of allowing much higher performance. Are there examples of known bugs introduced by unsoundness of translation validation implementations? That aspect deserves more discussion IMO to properly argue the benefit of a certified toolchain over translation validation.
- The size of your development seems to be about 11K lines of Coq.  A comparison to the size of the CompCert front-end on which you rely could be instructive, especially in Fig 1, to get an idea of how much was added to CompCert compared to its original size.
- The discussion of targeting a word-addressable memory would be improved by mentioning whether any existing CompCert backends also target word-addressable architectures.
- Giving an idea of how big the Verilog to netlist synthesis tools are would also help to shed light on how much of an unverified backend your approach relies on.
- One wonders whether the current size efficiency numbers are somewhat illusory. If you implement efficient division etc. then presumably size would blow-up without also solving other challenges currently unaddressed. Some discussion here would be helpful. To put it another way, is HLSCert size-efficient only because it is relatively time-inefficient in its generated output?
- The need for the extra registers to do with function call and return was very puzzling since they are only ever used once each and their values therefore must stay constant for almost all of the execution. Are these here to support function calls in future? 
- Do typical C programs written for high-level synthesis run forever? i.e. diverge? If so, HLSCert can guarantee only that they do diverge. But surely one might have some programs that take external input and produce external outputs while running forever? Or would that be integrated into some larger Verilog development? In general, a discussion of "external I/O" would be helpful and whether integrating that using CompCert's I/O traces would make sense.
- To put that another way: how do I use HLSCert to build a useful embedded device?

Questions for authors’ response
---------------------------------
- Would addressing the major performance limitations require large changes to the current approach? 
- How would one use HLSCert in a practical development of an embedded device? e.g. in which an FPGA is interacting with some external environment?

Reviewer's Comments after Authors' Response
-------------------------------------------
Thank you for the clarifications regarding how the approach could be extended to overcome its limitations. While I would have appreciated a little more detail here, I do hope that any future version of the paper will clearly explain both the current limitations and why they can be overcome without changing the overall approach presented here. Especially important is to explain convincingly how you would not only determine which instructions can be executed in parallel, but also how you would then verify that result, and why doing so is not likely to be a major additional effort on top of the present result.



Review #86B
===========================================================================
* Updated: 19 Jun 2021 9:21:32am AoE

Overall merit
-------------
4. Accept

Review confidence
-----------------
3. High

Paper summary
-------------
This paper is about a new backend for CompCert that generates Verilog, i.e., hardware descriptions. The effort involves adapting an existing semantics for Verilog and introducing a new intermediate form akin to the "FSM plus datapath" representation often used in high-level synthesis (HLS) compilers. There are few optimizations, but there is a mapping to an FSM and insertion of a synthesizable RAM to hold C stack values.

Strengths and Weaknesses
------------------------
The main strength: this is the first work I've seen to successfully bring CompCert-style verified compilation to hardware generation from high-level (software) languages. Getting this kind of verification right is an extremely important and timely problem because these compilers are notoriously incorrect but are nonetheless growing in popularity. A secondary strength is that the series of representations and lemmas is clearly described, and the marriage with an existing Verilog semantics is an expedient way to make things work.

Starting simple is good, and the primary reason to accept this paper would be that future work can build on it to produce more "serious" optimizing HLS compilers. However, the main weakness is that the actual hardware generation is *so* simple that it seems to leave the conceptually challenging problems in HLS verification mostly unsolved. In particular, there is no parallelism in the generated design, every instruction takes a single cycle, and all resources are duplicated for every instruction---so the semantics of the generated hardware are very close to the original software semantics. I worry that the main challenge that makes HLS compilers so ridiculously error-prone is the gap between software-like semantics and hardware-like semantics, where the latter entails, principally, cycle-level timing, parallelism, and resource allocation. So while the "proof engineering" in this paper is great, by sidestepping these core hardware concerns in the generated designs, I worry that the main scientific challenges are left unaddressed.

Comments for author
-------------------
Thanks for the well-written submission to OOPSLA! There is a lot of clear exposition in this paper, and it's addressing a super important problem that the community absolutely needs to pay more attention to. I also appreciate the clean breakdown between existing CompCert responsibilities and the new responsibilities for the HLSCert backend, which makes the entire system conceptually simpler.

To expand on the discussion above, my chief concern here is that the work may be a little too early for prime time because the HLS compilation is, as yet, still so simple. As I mentioned above, it's appropriate to start simple and build more realistic HLS in follow-on work. But my worry is that it may yet be *too* early for this work to have addressed the biggest challenges in verified generation of hardware, which have to do with bridging the semantic gap between sequential software languages and fine-grained parallel hardware designs. The issue is that the current hardware generation greatly resembles standard CPU execution, whereas full-blown commercial HLS tools can generate more diverse hardware. Specifically, consider these limitations:

* There is no parallelism. We execute the entire program one instruction at a time via a single, global FSM. Harnessing fine-grained parallelism is one of the main reasons to generate a hardware accelerator, and this basic approach does not address it.
* Every 3AC instruction executes in a single cycle (AFAICT). As a result, the execution works like a proper CPU with a custom instruction set that implements the 3AC "ISA," which admits straightforward reasoning about equivalence. But splitting and merging operations across cycles is one of the key techniques in generating high-performance hardware. (This limitation, for what it's worth, is the key limiter that makes the evaluation complex and requires caveats about dividers—mapping instructions to cycles is a naive assumption that leads to underutilized and long clock cycles. With this one limitation addressed, the evaluation would not need so many qualifiers to distinguish division-having from division-free benchmarks, for instance.)
* There is no resource sharing. Every multiply instruction generates a separate multiplier circuit, for instance. Having a single hardware structure that implements multiple logical computations is also a key advantage of hardware over software. However, the current intermediate representations would have trouble representing this optimization because they do not seem to admit a "level of indirection" between instructions and the hardware resources that implement them.

Taken together, these limitations mean that HLSCert can avoid many of the fundamental sources of complexity that come with hardware-like semantics. This is the source of my concern that, as a starting point, the design may still be a little too basic. However, my score leans toward acceptance currently because I don't want to diminish the significance of the "proof engineering" accomplishment in this paper to prove equivalence to Verilog in the first place, even if it's extremely simple Verilog—there is a good chance that this approach can be the foundation of great work that extends it in fundamental ways.

---

One strategic comment: I worry a little bit that too much of the focus of this verification effort was on arithmetic equivalence. Line 825 says:

> The largest proof is the correctness proof for the HTL generation, which required equivalence proofs between all integer operations supported by CompCert and those supported in hardware.

Exhaustively proving equivalence between C's available arithmetic operators and Verilog's available arithmetic operators does indeed seem like a daunting task! However, these individual mathematical equivalencies do not seem as broadly important as the shift in execution model from sequential software semantics to granular hardware semantics. It seems like further limitations in this area would have been acceptable in exchange for more substantial engagement with the semantics of hardware.

---

Some minor limitations that, just for clarity, I do not think are a very big deal and are fine to leave to future work:

* No function calls (everything's inlined). (With the caveat that I think this is a marginally bigger deal than how the paper's text makes it sound... function inlining will become intractable for large HLS designs, meaning that some sharing will be necessary.)
* Repurposing the x86_32 backend for CompCert's earlier stages rather than building a more customized target for flexible hardware generation (lines 164-166). (In particular, types with widths other than 32 seem important to support eventually.)
* All memories have a 32-bit access size. (Fine to get started, but certainly future versions will need to support other sizes to be practical.)

---

A few evaluation comments:

* The setup is good and clearly described. LegUp is a perfectly reasonable baseline, and Vivado for FPGA synthesis is also reasonable (although this tool may fare better in an ASIC comparison).
* Line 914 says “The top graphs of Figure 8 and Figure 9 compare the cycle counts of the 27 programs executed by HLSCert and the different optimisation levels of LegUp.” But the text and figure captions say this is execution time, not cycle counts. (I think this is just a writing mistake?)
* The evaluation probably needs a note that the LegUp implementations were *pragma-free*, i.e., they did not contain compiler directives that unlock parallelism and pipelining, etc. HLS programs without pragmas tend to fall back to sequential implementations that resemble the naive hardware that HLSCert targets; however, tools like LegUp are capable of generating much higher-performance hardware, given the right code annotations. I still think it's OK to compare to unoptimized LegUp code, but the text needs a disclaimer to note that the peak performance achievable with LegUp (and other tools) will undoubtedly be greater.

Other minor comments:

* IMO, the signedness discussion around line 275 wasn't terribly germane to the paper's contributions and could have been left to an implementation section.
* I had a lot of trouble understanding the RAM enable signal discussion (`en != u_en`, around line 368). I can't quite tell if this would be a universally reasonable thing to do in an HLS tool or whether it's related to a quirk of the formal verification.
* Section 2.2.3's discussion of `Oshrximm` struck me as odd: it seems clear that this instruction was introduced specifically to avoid proper division when the divisor is a power of two, so it's not exactly shocking that the eventual implementation also requires left shift. I wasn't quite sure what the insight being conveyed here was.

---

# Post-Rebuttal Addendum

I appreciate the additional clarity about the challenges and the implementation of scheduling! I still think that tacking on resource sharing will be a big job, but I am more convinced now that the work in HLSCert is a good foundation for that follow-on work. I am therefore more solidly in favor of acceptance.

Questions for authors’ response
---------------------------------
* Can you please discuss how more advanced hardware generation would fit into the HLSCert framework (or what changes to the basic approach would be necessary), in particular to verify hardware with substantial parallelism?
* Can you help crystallize what the fundamental challenges in verification were for this new Verilog backend vs. other backends in CompCert?



Review #86C
===========================================================================
* Updated: 28 Jun 2021 10:11:31pm AoE

Overall merit
-------------
3. Weak accept

Review confidence
-----------------
2. Medium

Paper summary
-------------
The paper describes a verified tool for high level synthesis (HLS), which
automatically compiles software into hardware. It addresses an existing
problem that HLS tools can't always guarantee that the hardware they generate
is correct (as well as other problems). The tool extends CompCert with a
Verilog back end. There is some evaluation, mostly covering performance,
which at the moment shows there is some overhead at run time, although this
is perhaps largely due to HLSCert not yet implementing several optimisations
that other tools do.

Strengths and Weaknesses
------------------------
There is a mechanical verification of HLS, increasing confidence that the
hardware generated is correct, unlike existing work, and the correctness
is w.r.t. an existing semantics for Verilog.
There is good evaluation with a suitable benchmark suite. Although the
results show the hardware is a lot slower at the moment, there is reason
to believe the performance problems can be addressed over time.
I didn't really get a sense of how the C is translated into Verilog,
however, so for example if I was designing a high level language specifically
with the intention of using it for verified hardware generation, I'm not
sure this paper would be enough to allow me to do that on its own.

Comments for author
-------------------
I liked this work overall. Although I don't have much familiarity with
hardware synthesis (so my confidence is a bit lower), I'd like to see it
accepted as a convincing example of how advanced type systems and
state-of-the-art software verification methods can have practical
applications, without the user of the tool having to be familiar with,
e.g., Coq.

Perhaps there is limited space, and readers more familiar with HLS will
not need it, but before getting into the formal semantics of Verilog I would
have found a brief informal introduction to be useful. Is there some kind
of "hello world" for Verilog that would give an idea of the challenges
involved when translating from C?

As noted above, I'd like to know more about how the translation works, and
I'm more interested in this than the details of the proof in Coq (for which
I'd be happy just to see the statements of the lemmas and be referred to the
Coq code).  Perhaps you could give an outline of how constructs in C map to
Verilog?  Maybe that would also help my understanding of how Verilog works.
Similarly, a small but realistic C program that HLSCert can run would be
useful to see, to get a sense of what the tool can do (like in Fig 2, but
without the Verilog output, and a bit larger).

428 Do the semantics of Verilog affect how you write C programs, and the
kind of C programs you accept? I assume they would.

857 To what extent can the Verilog semantics be trusted? It's certainly nice
that it's only 431 lines, which means there is at least some possibility of
believing correctness just from looking at the code!

867 I guess it's implicit in the work, but I'm surprised not to see a
question of correctness here.

1099 "though it is only 2x slower if inefficient divisions are removed" -
that is quite an improvement on 27x slower! Is there some reason why you
wouldn't just do this?

Reviewer's Comments after Authors' Response
-------------------------------------------
Thanks for the detailed response - I think the changes you propose will certainly make the paper more accessible to the OOPSLA audience. It's an exciting topic and a practical application of formal verification, so I'll be happy to see it accepted.



Review #86D
===========================================================================
* Updated: 25 Jun 2021 9:38:19pm AoE

Overall merit
-------------
5. Strong accept

Review confidence
-----------------
2. Medium

Paper summary
-------------
This paper presents a compiler from a C subset to
Verilog. The compiler builds on top of CompCert and is
formally verified in Coq using the semantics of CompCert's C
and of Lööw and Myreen's Verilog.

Strengths and Weaknesses
------------------------
# Strengths

* To the best of my knowledge, no such compiler has been
proven correct before

* The correctness in this case is extremely important
because HLS tools have been shown to be buggy in practice, but
also (for commercial tools) to sometimes yield better designs
in a limited time budget than writing directly in Verilog.

* The development is realistic and useful in that it
interfaces with existing infrastructure: (1) the CompCert
infrastructure as the source and (2) the Verilog semantics of
a HOL-verified synthesizer as the back end (though this
semantics is somewhat modified)

* The paper is well written, including the discussion of the
design rationale and the key strategy for HLS, such as the
state mapping.

* Formal development is included with the submission.

* In addition to the formal proof itself, the paper presents an
evaluation of the performance of the generated Verilog code,
showing that it is good enough to be interesting.

# Minor weaknesses

* HLSCert is simpler than industrial HLS compilers, avoiding
most of the complex optimizations.

* Analysis of memory to partition RAM is not performed (C is
not even a good starting language if we wish to do that
well)

* The synthesis uses a uniform clock, so operations like
division are still forced to complete in a single clock cycle
and drive up the delay of the entire design. There are no
complex scheduling analyses and transformations.

* In practice, translation validation may be more practical
for users at the current state of development (though even
there we are lacking practical tools with a small trusted
proof base that certify a given individual compilation
input-output pair).

Questions for authors’ response
---------------------------------
* What are the main difficulties in handling static global
variables; can't you use RAM and generate initialization
code?

* A C compiler can have very concrete information about
addresses of globals; could you use this to generate
finer-grained arrays and would this help performance?

* I presume you cannot use the formally verified Verilog
compiler by Lööw and Myreen 2019 on whose semantics you
based your system. Is this only due to the array support you
added, or are there other reasons as well?

Reviewer's Comments after Authors' Response
-------------------------------------------
Thank you for your insightful response. I continue to appreciate this pioneering work in formally linking the worlds of C and hardware. I hope that the authors take the requested revisions seriously; spending more effort on situating the scope and the future potential of the contribution will surely help the readers appreciate the paper.



R2 Response by Yann Herklotz <yann.herklotz15@imperial.ac.uk>
---------------------------------------------------------------------------
## Overview

Thanks very much for such kind and detailed reviews!

We would like to begin by addressing two main points that multiple reviewers were concerned about.

### Q: What were the main challenges of developing HLSCert?

Two of the main challenges that we encountered when developing our HLS backend were:

1. **Memory interaction:** There are various issues with memory interaction, because the memory model in our Verilog semantics is finite and concrete, whereas the memory model in CompCert is abstract and infinite.  Moreover, our memory is word-addressed for efficiency reasons, whereas CompCert's memory is byte-addressed.  Therefore, proving the translation of the addresses correct, as well as proving the equivalence of the two different memory representations, was challenging.
2. **Integrating the Verilog semantics and HTL state-machine semantics into CompCert:**  The Verilog semantics operates quite differently to the usual intermediate languages in the backend.  All the 3AC languages, for example, use a map from CFG nodes to instructions.  An instruction can therefore be selected using an abstract program pointer. On the other hand, in the Verilog semantics the whole design is executed at every clock cycle, because hardware is inherently parallel. The program pointer is part of the design as well, not just part of an abstract state.  This makes the semantics of Verilog simpler, but comparing it to the semantics of 3AC becomes more challenging, as one has to map the abstract notion of the state to concrete values in registers. A small sketch of this difference is given below this list.
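
For concreteness, here is a minimal software model of this execution style (our own illustration in C, not code from the development): the whole `step` function runs once per simulated clock cycle, and the program pointer is just another register in the state record.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustration only: the entire design is evaluated once per clock cycle,
 * and the "program pointer" is an ordinary register in the state, unlike
 * 3AC, where it is an abstract index into a map from CFG nodes to
 * instructions. */
struct design_state {
    uint32_t st;            /* state register: the concrete program pointer */
    uint32_t r1, r2, r3;    /* data registers */
    uint32_t finish, ret;   /* signals exposed to the calling environment */
};

/* One clock cycle: every register is updated according to the FSM. */
static void step(struct design_state *d) {
    switch (d->st) {
    case 1: d->r3 = d->r1 + d->r2; d->st = 2; break;   /* one 3AC instruction */
    case 2: d->ret = d->r3; d->finish = 1; break;      /* signal completion   */
    }
}

int main(void) {
    struct design_state d = { .st = 1, .r1 = 20, .r2 = 22 };
    while (!d.finish)
        step(&d);                        /* the design runs until it raises finish */
    printf("%u\n", (unsigned)d.ret);     /* prints 42 */
    return 0;
}
```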

### Q: Will the design of HLSCert have to change fundamentally in order to enable further performance improvements? 

We believe not. The main optimisation that we currently lack is scheduling (i.e. mapping operations to clock cycles, placing operations in parallel where possible). This is a challenging optimisation to prove correct, but we've made a start already. Scheduling has already been implemented successfully in CompCert at a lower level [[Tristan et al. 2008](https://doi.org/10.1145/1328438.1328444), [Six et al. 2020](https://doi.org/10.1145/3428197)], but it has not been implemented at the 3AC level. To implement scheduling in HLSCert, we would add a new intermediate language, straight after 3AC, where a program is not a "map of instructions", but a "map of lists of instructions" (where each list contains instructions that can be executed in parallel). Scheduling would be performed (using translation validation) at this new intermediate language. After this, translation to the HTL language should be straightforward because HTL is already general enough to support several instructions executing in parallel in a single clock cycle. The new translation to HTL would be able to reuse the current proofs of translation of instructions as they are, and would only need minor modifications at the top-level of the proof to translate the groups of instructions into one state together.
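
For intuition only, the shape of that change can be pictured in C-like terms as follows (the real definitions are Coq data types, and the names below are ours, purely for illustration):

```c
#include <stddef.h>

/* Sketch of the two program shapes, not the actual Coq definitions. */
typedef struct instr Instr;      /* one 3AC instruction, left opaque here */

/* Current 3AC shape: each CFG node holds exactly one instruction. */
typedef struct {
    Instr *instr;                /* the single instruction at this node */
    int    successor;            /* next CFG node */
} Node3AC;

/* Proposed post-scheduling shape: each node holds a bundle of instructions
 * that the scheduler has decided can execute in the same clock cycle. */
typedef struct {
    Instr **bundle;              /* instructions executed in parallel */
    size_t  count;
    int     successor;
} NodeScheduled;
```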

Reviewers also expressed some concerns about how pipelined operators and resource sharing could be represented and implemented, as it may seem like this could warrant a rewrite of the backend.  However, instructions with levels of indirection into other blocks, using wires to communicate, have already been implemented for the loads and stores, which access a separate RAM block.  Pipelined operators and resource sharing could be implemented in the same way, by providing a separate block like a divider in the same way as the RAM block and then inserting start and wait signals to interact with that divider.  Indeed, proving that a pipelined divider actually performs division correctly might be difficult, but that is orthogonal to inserting the right signals during the translation that can make use of the pipelined divider, which should already be accommodated.  Scheduling (as described above) already provides the location of where these signals should be inserted.

## Changes

If the paper is accepted, we intend to use some of our additional pages as follows:

- In response to Reviewer C's request, we will add a more intuitive high-level explanation of how the translation from C to Verilog works. The reviewer is right that readers familiar with HLS would probably skip this, but we want our paper to be accessible to readers not familiar with HLS too.
- In response to Reviewer A's request, we will add an explanation of how HLSCert could be extended to support scheduling and resource sharing, to emphasise that we do not expect having to rewrite the backend to incorporate these optimisations.
- In response to Reviewer A, we will also compare the size of the HLSCert development to that of the CompCert front end, adding that information to Figure 1.

## Response to Reviewer A

> How would one use HLSCert in a practical development of an embedded device?

The main design flow using HLSCert would be to describe a standalone hardware module using C, which can then be introduced into larger designs with a top-level that can interact with any IO of the board. HLSCert will produce control signals so that the output can be retrieved from the module.  This is also how LegUp and many other HLS tools work by default.

HLSCert does not currently have good support for external inputs. This limitation is one reason why we currently see HLSCert more as an academic prototype than a production-ready tool. The other reasons (lack of support for global variables, lack of parallelisation) have already been explained in our paper; when revising, we will be more explicit about this limitation on external inputs. 

Technically, HLSCert will happily compile a program whose main function takes extra parameters, e.g.:

```c
int main(int a, int b) { return a + b; }
```

This method can be used to supply external inputs, and is similar to how most HLS tools handle I/O by default as well. The problem is that such a program is technically malformed, and so the CompCert correctness theorem does not guarantee anything about its behaviour. 

One more principled way for HLSCert to support external inputs would be to augment its generated designs with a built-in Verilog module for reading data from an AXI stream. CompCert's builtin functions, combined with its I/O traces, would be appropriate for this, as this means that the actual implementation of the AXI stream could be unverified and very flexible, but the traces would still allow for reasoning about its execution. This is very much on our TODO list.
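
As a sketch of what this might look like from the C side (the function name and signature below are hypothetical, not part of the current tool):

```c
/* Hypothetical sketch: `axi_stream_read` stands for a builtin that a future
 * HLSCert could map to an unverified AXI-stream reader module, with each call
 * recorded as an event on CompCert's I/O trace. */
extern int axi_stream_read(void);

/* Host-side stub so the sketch also runs as ordinary C; in the generated
 * hardware this would instead be wired to the AXI interface. */
int axi_stream_read(void) { return 1; }

int main(void) {
    int acc = 0;
    for (int i = 0; i < 8; i++)
        acc += axi_stream_read();   /* external input, visible in the trace */
    return acc;                     /* returned via the module's output port */
}
```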

> You mention early on issues in prior tools when straying outside the subset of C they support. Yet your approach cannot provide guarantees to C programs with UB etc. Do existing HLS tools target proper C semantics with all of its dragons of UB and so on? Clarifying that would help.

Current HLS tools often have their own versions of C with non-standard semantics.  One example is that Vivado HLS will have unexpected behaviour when [updating a pointer multiple times](https://forums.xilinx.com/t5/High-Level-Synthesis-HLS/Pointer-synthesis-in-Vivado-HLS-v20-1/m-p/1117009), which is described in their user manual on [page 86](https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_2/ug902-vivado-high-level-synthesis.pdf#_OPENTOPIC_TOC_PROCESSING_d97e15496) and so is not seen as a bug.  The benefit of HLSCert is that it reuses the formal C semantics of a software compiler and proves that this semantics is handled correctly.  A verifier like [VST](https://vst.cs.princeton.edu/) can then be used to prove properties about the C code, without having to worry that the semantics of the HLS tool might be different.
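
For illustration (our own minimal example, not the program from the forum post), consider a function that accesses the same pointer more than once; standard C, and hence CompCert and HLSCert, gives it a precise meaning, whereas some HLS tools document special requirements or deviating behaviour for this kind of multiple pointer access:

```c
/* Illustrative only: two writes through the same pointer. Under C semantics
 * the observed final value is 2; HLSCert inherits exactly this behaviour
 * from CompCert. */
void update_twice(volatile int *p) {
    *p = 1;     /* first write through the pointer */
    *p = 2;     /* second write wins               */
}
```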

> Similarly, the disadvantages of translation validation approaches seem kind of weak when compared to the advantage that they offer of allowing much higher performance. Are there examples of known bugs introduced by unsoundness of translation validation implementations? That aspect deserves more discussion IMO to properly argue the benefit of a certified toolchain over translation validation.

The main problem with TV-based approaches is the run-time performance of the validation itself; for instance, TV-based approaches often fail to find solutions when given large programs. That said, we are not fundamentally against TV, as individual optimisations are often easier to validate, and we are currently looking into adding TV-based optimisations.

> The discussion of targeting a word-addressable memory would be improved by mentioning whether any existing CompCert backends also target word-addressable architectures.

As far as we know, no existing CompCert backend uses word-addressable memory.  Existing backends have support for loading a word from memory, but the underlying representation of the memory is still always byte-addressable, and therefore not directly equivalent to the memory used in Verilog.
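
A minimal sketch of the address translation involved, assuming 32-bit words and aligned accesses (this is a model of the proof obligation, not code from the development):

```c
#include <stdint.h>

/* CompCert reasons about byte addresses, while the generated RAM is
 * word-addressed, so an aligned access at byte offset `byte_addr` lands at
 * word index byte_addr / 4. Part of the proof relates these two views. */
static inline uint32_t word_index(uint32_t byte_addr) {
    return byte_addr >> 2;                  /* divide by the 4-byte word size */
}

static inline void store_word(uint32_t *ram, uint32_t byte_addr, uint32_t value) {
    ram[word_index(byte_addr)] = value;     /* byte-addressed view of a word RAM */
}
```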

> Giving an idea of how big the Verilog to netlist synthesis tools are would also help to shed light on how much of an unverified backend your approach relies on.

It's hard to measure the size of netlist synthesis tools, as these are usually closed-source. There does exist very recent work by [Lööw [2021]](https://dblp.org/rec/conf/cpp/Loow21.html) on a verified netlist synthesis tool called Lutsig, which uses the Verilog semantics by [Lööw and Myreen [2019]](https://dblp.org/rec/conf/icse/LoowM19.html) like we do, so it would be compatible with the output of HLSCert (once array support is added to Lutsig). This could further decrease our reliance on unverified tools. 

> One wonders whether the current size efficiency numbers are somewhat illusory. If you implement efficient division etc. then presumably size would blow-up without also solving other challenges currently unaddressed. Some discussion here would be helpful. To put it another way, is HLSCert size-efficient only because it is relatively time-inefficient in its generated output?

Efficient division would indeed take up more space; however, it is normally paired with resource sharing, meaning only one divider is used for the whole design.  This is supported by the scheduling step (see the first part of our response), which could also handle the pipelined design of the divider circuit, thereby allowing multiple divisions to happen simultaneously.

> The need for the extra registers to do with function call and return was very puzzling since they are only ever used once each and their values therefore must stay constant for almost all of the execution. Are these here to support function calls in future?

These signals are needed for the external interface of the main function, even when there is no support for function calls.  As hardware executes indefinitely, a signal is needed to indicate whether the current computation has finished or not.  This is also used in the proof to prove that the return value will be set correctly whenever the "finished" flag is set.

> Do typical C programs written for high-level synthesis run forever? i.e. diverge? If so, HLSCert can guarantee only that they do diverge. But surely one might have some programs that take external input and produce external outputs while running forever? Or would that be integrated into some larger Verilog development? In general, a discussion of "external I/O" would be helpful and whether integrating that using CompCert's I/O traces would make sense. To put that another way: how do I use HLSCert to build a useful embedded device?

Just to add to what we wrote in the first section of our response: typical HLS programs do not run forever, as they are not normally the very top-level of the hardware design, which contains the signals interacting with the IO of the specific target board.  Instead, they will be executed an indefinite number of times by the top-level design.  Therefore, the existing correctness theorem in CompCert should suffice.

However, when more interaction with hardware from C is integrated, it might be interesting to see if extending CompCert's traces to reason better about I/O interaction with that hardware would be viable.

## Response to Reviewer B

> Can you please discuss how more advanced hardware generation would fit into the HLSCert framework (or what changes to the basic approach would be necessary), in particular to verify hardware with substantial parallelism?

Please see the first section of our response.

> Can you help crystallize what the fundamental challenges in verification were for this new Verilog backend vs. other backends in CompCert?

Please see the first section of our response.

> I had a lot of trouble understanding the RAM enable signal discussion (`en != u_en`, around line 368). I can't quite tell if this would be a universally reasonable thing to do in an HLS tool or whether it's related to a quirk of the formal verification.

The `en != u_en` trick is not something you'd really need to worry about in an unverified tool, because the most straightforward thing to do would be just to disable the RAM in the 'next' state. The problem in our verified setting is that we'd have to prove which state is the 'next' one, and this is a little fiddly. So we use our toggling trick to avoid it.
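
For intuition, here is a minimal software model of the toggling protocol (an illustration only, not the generated Verilog): the controller flips `u_en` to request an access, and the RAM serves a request exactly when `en != u_en`, then copies `u_en` into `en`, so no state ever has to explicitly turn the RAM off in the following cycle.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct ram_iface {
    bool en, u_en;                  /* the two enable bits compared by the RAM */
    uint32_t addr, rd_data;
};

/* Controller side: issue a read request by flipping u_en. */
static void request_read(struct ram_iface *r, uint32_t addr) {
    r->addr = addr;
    r->u_en = !r->u_en;
}

/* RAM side, run once per simulated clock cycle. */
static void ram_step(struct ram_iface *r, const uint32_t *mem) {
    if (r->en != r->u_en) {         /* pending request? */
        r->rd_data = mem[r->addr];
        r->en = r->u_en;            /* mark the request as served */
    }
}

int main(void) {
    uint32_t mem[4] = {10, 20, 30, 40};
    struct ram_iface r = {0};
    request_read(&r, 2);
    ram_step(&r, mem);                       /* request served this cycle */
    ram_step(&r, mem);                       /* no new request: RAM stays idle */
    printf("%u\n", (unsigned)r.rd_data);     /* prints 30 */
    return 0;
}
```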


## Response to Reviewer C

> As noted above, I'd like to know more about how the translation works, and I'm more interested in this than the details of the proof in Coq (for which I'd be happy just to see the statements of the lemmas and be referred to the Coq code). Perhaps you could give an outline of how constructs in C map to Verilog? Maybe that would also help my understanding of how Verilog works. Similarly, a small but realistic C program that HLSCert can run would be useful to see to get a sense of what the tool can do (Like in Fig 2, but without the Verilog output, and a bit larger).

One way to get a sense of what HLSCert is capable of, beyond the small example we provide in our paper, would be to browse the [PolyBench/C benchmark](https://sourceforge.net/projects/polybench/files/), which contains several non-trivial C programs, all of which HLSCert can compile (except `correlation`, `gramschmidt` and `deriche`), with slight changes that are mentioned in our Results section (replacing divisions by our iterative implementation and replacing unsigned equality by signed equality).

> 428 Do the semantics of Verilog affect how you write C programs, and the kind of C programs you accept? I assume they would.

HLSCert accepts the same standard C language as CompCert (although there are a few features we haven't implemented yet). So in one sense, no: the semantics of Verilog doesn't affect which C programs HLSCert can accept. But perhaps you are asking more about performance? In which case, yes: expert HLS users often write their C programs in such a way as to make the HLS tool generate efficient Verilog designs. They also make liberal use of directives (pragmas) in their source code to give special instructions to the HLS tool, e.g. to tell it which parts are safe to parallelise. We have concentrated on correctness in our work, but as the HLSCert project evolves, our focus will shift increasingly towards performance (while maintaining correctness of course).

> 857 To what extent can the Verilog semantics be trusted? It's certainly nice that it's only 431 lines, which means there is at least some possibility of believing correctness just from looking at the code!

We think the Verilog semantics is pretty solid. Verilog is a well-established language that has been formalised a few times; the formalisation that we inherit in this paper was published [by Lööw and Myreen](https://dblp.org/rec/conf/icse/LoowM19.html) and has already been road-tested in a verified HOL4-to-Verilog compiler presented at [PLDI 2019](https://dblp.org/rec/conf/pldi/LoowKTMNAF19.html). All of these formalisations focus on the core Verilog language, which is very stable.

> 867 I guess it's implicit in the work, but I'm surprised not to see a question of correctness here.

Yes, it would have been interesting to evaluate HLSCert using the Csmith-generated testcases used by [Herklotz et al. [2021]](https://dblp.org/rec/conf/fccm/HerklotzDRW21.html) to test the correctness of several HLS tools.  The only feature that is preventing such an evaluation is that Csmith generates many global variables, which are currently not supported.

> 1099 "though it is only 2x slower if inefficient divisions are removed" - that is quite an improvement on 27x slower! Is there some reason why you wouldn't just do this?

It's certainly doable. Of course, we'd have to prove the correctness of the transformation that replaces all the divisions in a program with iterated shifts and subtractions. To be honest, our focus as we continue to develop HLSCert is not really on division _per se_, but on efficient scheduling of operations; once we have got this working we expect to obtain an efficient pipelined division operator as a corollary.
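
For reference, the replacement is a shift-and-subtract routine along the following lines (a sketch for unsigned 32-bit operands; the exact code used in the benchmarks may differ in detail):

```c
#include <stdint.h>

/* Restoring (shift-and-subtract) division: one iteration per bit. */
static uint32_t div_shift_sub(uint32_t n, uint32_t d) {
    uint32_t q = 0, r = 0;
    if (d == 0) return 0;                  /* defensive: avoid dividing by zero */
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);    /* bring down the next bit of n */
        if (r >= d) {                      /* subtract the divisor if it fits */
            r -= d;
            q |= 1u << i;
        }
    }
    return q;                              /* the remainder is left in r */
}

int main(void) {
    return div_shift_sub(42u, 6u) == 7u ? 0 : 1;   /* quick sanity check */
}
```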

## Response to Reviewer D

> What are the main difficulties in handling static global variables; can't you use RAM and generate initialization code?

Yes, that's exactly how they would be handled. We just need a slight generalisation of what we have at the moment, so that we can instantiate multiple RAMs (one per global variable).

> A C compiler can have very concrete information about addresses of globals; could you use this to generate finer-grained arrays and would this help performance?

Yes, definitely. If the arrays are globals, they can be accessed in parallel, because the addresses can be proven to be independent. This would allow for much more efficient code that uses global arrays instead of arrays on the stack.  In CompCert, pointers that index different globals belong to different blocks, meaning one can relatively easily identify pointers into different memory blocks.
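
As a small illustration (hypothetical for now, since global variables are not yet supported), two global arrays whose accesses CompCert's block-based memory model keeps provably separate, and which could therefore be placed in separate RAMs and accessed in the same clock cycle:

```c
#define N 64

static int a[N];    /* each global lives in its own CompCert memory block */
static int b[N];

int dot_like(void) {
    int acc = 0;
    for (int i = 0; i < N; i++)
        acc += a[i] * b[i];   /* loads from a and b hit different blocks */
    return acc;
}

int main(void) { return dot_like(); }
```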

> I presume you cannot use the formally verified Verilog compiler by Lööw and Myreen 2019 on whose semantics you based your system. Is this only due to the array support you added, or are there other reasons as well?

The compiler by [Lööw and Myreen [2019]](https://dblp.org/rec/conf/icse/LoowM19.html) takes a circuit description in HOL and translates it to Verilog. In other words, it translates between two low-level descriptions of a hardware design. Our compiler, on the other hand, translates from high-level software down to low-level hardware.

In terms of interfacing with other tools, we think there's more promise in connecting to Lööw's _other_ compiler, which is called [Lutsig](https://dblp.org/rec/conf/cpp/Loow21.html). Preliminary investigations in collaboration with Lööw suggest that Lutsig has the potential to take a Verilog design from HLSCert and synthesise a netlist (i.e., an even lower-level circuit description) from it, all in a fully-verified manner.



Comment @A1 by Reviewer B
---------------------------------------------------------------------------
## Required Revisions

Please incorporate the extra details and qualifications from the author response into the paper, especially in the introduction. We recommend being as forthright as possible about the current limitations and challenges involved in adding more advanced HLS optimizations right at the beginning of the paper. The reviewers are, on the whole, excited about the foundation laid in this paper, but we also think it's important to be as clear as possible about what the paper does not yet do in order to motivate future work.