\section{Overview of the testing system} \NR{We should also discuss where this section fits.} Three major commercial HLS tools (Vivado HLS, LegUp HLS, and Intel HLS) are tested, each in multiple versions. The testing flow we introduce and implement is shown in Fig.~\ref{?} and consists of six steps: random program generation, pre-processing, directive/label adding, HLS tool processing, result extraction and comparison, and reduction. The helper functions implementing these steps are written in C. The arrows in the figure represent the connections between the steps; these connections are implemented as bash or batch scripts, depending on the HLS tool. Besides connecting the steps, the scripts are also responsible for generating TCL scripts or Makefiles where needed. Vivado HLS versions 2019.2, 2019.1, and 2018.3, as well as LegUp HLS version 4.0, are run under Linux, so bash scripts direct their testing flows. Since LegUp HLS version 7.5 does not have command-line support, it is installed on Windows and launched through its GUI; only test cases that triggered result discrepancies under version 4.0 have been re-run on version 7.5, so the full testing flow does not apply to it. Intel HLS runs under Windows, so a batch script is used.

Starting with random program generation: valid, random C/C++ programs are essential for the quality of the test cases fed into the HLS tools. We use Csmith, developed at the University of Utah, to generate them. Csmith instruments each program with hashing functions that fold the values of its variables into one single result, so that any change in a variable's value is reflected in that result; this is extremely useful in the later result-comparison stage. Csmith also provides both built-in command-line options and a probability file for tuning the properties and structure of the generated programs, which guarantees a wide variety of test cases. Note that Csmith can create programs that fail to terminate or do not produce a result, so it is useful to pre-check whether a program yields a valid result before it is fed into the HLS tools (a sketch of this pre-check is given below).

Once a C/C++ program has been generated, it undergoes a pre-processing step. As each HLS tool supports a different synthesizable subset of the language, the pre-processing step differs from tool to tool. For instance, Intel HLS does not work correctly with Csmith's hashing functions, so each generated program is processed to replace the original hashing with a simple XOR hash (also sketched below). This replacement greatly eases the synthesis and simulation flow for Intel HLS, but the downside is that some bugs can go undetected. Vivado HLS and LegUp HLS cope with Csmith's hashing functions, so no replacement is needed for them. The pre-processing step not only adjusts the syntax while maximally preserving the original functionality, but also extracts necessary information about the generated program: for instance, the number and names of its functions, the number of its for-loops, and the names of its array variables are used to apply suitable directives/pragmas automatically in the directive-selecting step.
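As an illustration of the termination pre-check, the following is a minimal sketch; the file names, the ten-second budget, and the \texttt{produces\_result} helper are assumptions made for illustration, and in our setup this check is performed by the shell scripts rather than by this exact code.

\begin{verbatim}
/* Minimal sketch of the termination pre-check (illustrative names).
 * The generated program is compiled and run under a timeout; only
 * programs that terminate with a result proceed to the HLS tools. */
#include <stdlib.h>

int produces_result(void)
{
    /* -I points at the Csmith runtime headers */
    if (system("gcc -I$CSMITH_HOME/runtime test.c -o test_bin") != 0)
        return 0;              /* generated program failed to compile */
    /* coreutils `timeout` kills a non-terminating run */
    return system("timeout 10 ./test_bin > golden_result.txt") == 0;
}

int main(void)
{
    return produces_result() ? 0 : 1;   /* 0: safe to feed to HLS */
}
\end{verbatim}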
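Similarly, the hashing replacement performed for Intel HLS can be pictured as follows; this is a minimal sketch with illustrative names, not the exact code our pre-processor emits.

\begin{verbatim}
/* Sketch of the simple XOR hash that stands in for Csmith's
 * CRC-based hashing in the Intel HLS flow (illustrative names).
 * Every traced variable is folded into a single word, so a change
 * in any variable's final value perturbs the one printed result. */
#include <stdio.h>
#include <stdint.h>

static uint64_t checksum = 0;

static void hash_value(uint64_t v)
{
    checksum ^= v;     /* replaces the CRC update inserted by Csmith */
}

int main(void)
{
    uint64_t g_a = 42, g_b = 7;    /* stand-ins for Csmith's globals */
    hash_value(g_a);
    hash_value(g_b);
    printf("checksum = %llu\n", (unsigned long long)checksum);
    return 0;
}
\end{verbatim}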
The types and quantities of the directives/pragmas to be applied are selected at random, subject to validity checks against the extracted information described above. For Vivado HLS and LegUp HLS, the selection is done by scripts, whereas for Intel HLS it is done by a C program. The reason for this distinction is that Intel HLS requires the pragmas to be added directly to the program, while for the other two tools the selected pragmas are written to a TCL script or Makefile, which is applied when the HLS tool runs. \JW{To discuss tomorrow: since Intel HLS requires pragmas in the program, why not do this for all three tools? Is there a benefit in pulling the directives out into a separate TCL file?} After the pragmas have been selected, the C/C++ program is processed again to add labels at specific positions (a sketch is given below). For example, if a program contains five for-loops and the loop-pipelining optimization is to be applied to the second loop, a label must be added where the second loop starts, leaving the other loops unchanged.

After the labels have been added, the program is compiled and executed to obtain the golden C result, which is later compared against the RTL result. GCC version 9.3.0 is used to compile and execute the programs for Vivado HLS, whereas version 4.8.2 is used for LegUp HLS\YH{Maybe the same version of GCC can be used for all tools, for consistency reasons} \JW{Let's discuss this tomorrow.}. Intel HLS uses its own i++ compiler for C++ programs. Once the program has successfully produced a result, it can finally be fed into the HLS tool for synthesis, translation, and simulation. Three outcomes are possible at this stage: a matching C/RTL result, a mismatching C/RTL result, or a crash. In theory, an HLS tool can translate every C/C++ program whose syntax it supports into RTL, and the RTL result should be equivalent to the C/C++ result; however, we found that this is not always the case, which is the very reason for implementing this testing method.

The extraction and comparison stages extract the RTL result from the command line, a log file, or a transcript, compare it with the golden C result, and save both the numerical result and the outcome of the comparison to a complete result file. This outcome determines whether the reduction process should start. Reduction proceeds only when a test case fails, which does not include crashes\YH{How come crashes are not included to be reduced?} \JW{Let's discuss this tomorrow.}. The reducer iteratively comments out one functional line at a time and sends the modified program back through the HLS tool (also sketched below); in this way, the functionality of the program is maximally preserved while it is reduced towards the minimal program that still triggers the bug. Although this process can shrink the program to some extent, manual work is still required, since fully automated reduction requires more effort.\YH{The reviews will probably mention C-Reduce, we should therefore probably try to at least get it working in the tool flow.}
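The label-insertion step can be pictured as follows; the function name, label name, and the TCL directive shown in the comment are illustrative assumptions, not taken verbatim from our implementation.

\begin{verbatim}
/* Sketch of label insertion (illustrative names). Suppose loop
 * pipelining was randomly selected for the second loop: a label is
 * added only where that loop starts, and the first loop is left
 * unchanged. For Vivado HLS, a TCL directive such as
 *     set_directive_pipeline "f/second_loop"
 * can then refer to the labelled loop when the tool runs. */
void f(int a[8], int b[8])
{
    for (int i = 0; i < 8; i++)     /* first loop: no label needed */
        a[i] += 1;

    second_loop:                    /* label added by our helper */
    for (int i = 0; i < 8; i++)
        b[i] = a[i] * 2;
}
\end{verbatim}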
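The reduction loop itself can be sketched as follows; \texttt{still\_fails} is an assumed placeholder for re-running the full HLS flow and comparing results, which in our setup is handled by the scripts.

\begin{verbatim}
#include <stdio.h>

#define MAX_LINES 4096
#define MAX_LEN   1024

/* Assumed helper: re-runs the HLS flow on `path` and returns 1 if
 * the C/RTL discrepancy is still triggered. */
int still_fails(const char *path);

/* Sketch of the line-commenting reducer: each functional line is
 * tentatively commented out; if the bug disappears, the line is
 * restored, so the final file still triggers the bug. */
void reduce(const char *path)
{
    static char lines[MAX_LINES][MAX_LEN];
    int keep[MAX_LINES], n = 0;

    FILE *in = fopen(path, "r");
    if (!in)
        return;
    while (n < MAX_LINES && fgets(lines[n], MAX_LEN, in))
        keep[n++] = 1;
    fclose(in);

    for (int i = 0; i < n; i++) {
        keep[i] = 0;                      /* tentatively drop line i */
        FILE *out = fopen("candidate.c", "w");
        for (int j = 0; j < n; j++)
            fprintf(out, "%s%s", keep[j] ? "" : "// ", lines[j]);
        fclose(out);
        if (!still_fails("candidate.c"))
            keep[i] = 1;                  /* line was needed: restore */
    }
}
\end{verbatim}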
Once all the test cases have finished, a checker is executed. It automatically analyzes and displays a summary: the total number of tests performed, the number that did not produce a result, and the number that produced a wrong result. Depending on the HLS tool, other information is also displayed; for example, the number of assertion errors triggered is summarized for LegUp HLS.
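A minimal sketch of this checker is given below; the result-file name and the \texttt{MISMATCH}/\texttt{NO\_RESULT} tags are illustrative, since the exact format differs between the three tool flows.

\begin{verbatim}
/* Sketch of the end-of-run checker (illustrative file format).
 * It scans the complete result file and tallies the outcomes. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    int total = 0, wrong = 0, none = 0;

    FILE *f = fopen("complete_results.txt", "r");
    if (!f)
        return 1;
    while (fgets(line, sizeof line, f)) {
        total++;
        if (strstr(line, "MISMATCH"))  wrong++;  /* wrong RTL result */
        if (strstr(line, "NO_RESULT")) none++;   /* no result / crash */
    }
    fclose(f);

    printf("tests: %d, wrong result: %d, no result: %d\n",
           total, wrong, none);
    return 0;
}
\end{verbatim}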