\section{Method}

% \NR{I think this section is very detailed. I think we can start with a figure of our tool-flow. Then, we can break down each item in the figure and discuss the necessary details (Fig.~\ref{fig:method:toolflow}).}

This section describes how we conducted our testing campaign, the overall flow of which is shown in Figure~\ref{fig:method:toolflow}.
\input{tool-figure}
\begin{itemize}
\item In~\S\ref{sec:method:csmith}, we describe how we configure Csmith to generate HLS-friendly random programs for our testing campaign.
\item In~\S\ref{sec:method:annotate}, we discuss how we augment those random programs with directives and the necessary configuration files for HLS compilation.
\item In~\S\ref{sec:method:testing}, we discuss how we set up compilation and co-simulation checking for the three HLS tools under test.
\item Finally, in~\S\ref{sec:method:reduce}, we discuss how we reduce problematic programs in order to obtain minimal examples of bugs.
\end{itemize}

% How we configure Csmith so that it only generates HLS-friendly programs.
% \item How we process the programs generated by Csmith to add in labels on loops etc.
% \item How we generate a TCL script for each program.
% \item How we run each HLS tool, using timeouts as appropriate.
% \end{itemize}

\subsection{Generating programs via Csmith}
\label{sec:method:csmith}

% \NR{
% Questions:
% \begin{itemize}
% \item Do we use Csmith's hashing or our own hashing now? Is that what \texttt{transparent\_crc} is for?
% \end{itemize}
% }
% \YH{I believe that we are now maybe using our own hashing, even though one of the bugs in Vivado was found in the Csmith hashing algorithm.. So I don't know how best to mention that.}

For our testing campaign, we require a random program generator that produces C programs that are both semantically valid and feature-diverse; Csmith~\cite{yang11_findin_under_bugs_c_compil} meets both these criteria.
%Csmith is randomised code generator of C programs for compiler testing, that has found more than 400 bugs in software compilers.
%Csmith provides several properties to ensure generation of valid C programs.
Csmith is designed to ensure that all the programs it generates are syntactically valid (i.e. there are no syntax errors), semantically valid (for instance, all variables are defined before use), and free from undefined behaviour (undefined behaviour indicates a programmer error, which means that the compiler is free to produce any output it likes). Csmith programs are also deterministic, which means that their output is fixed at compile-time; this property is valuable for compiler testing because it means that if two different compilers produce programs that produce different results, we can deduce that one of the compilers must be wrong.
%Validity is critical for us since these random programs are treated as our ground truth in our testing setup, as shown in Figure~\ref{fig:method:toolflow}.

Additionally, Csmith allows users control over how it generates programs. For instance, the probabilities of choosing various C constructs can be tuned.
%By default, Csmith assigns a probability value for each of these features.
%Csmith also allows users to customise these values to tune program generation for their own purposes.
This is vital for our work since we want to generate programs that are HLS-friendly.
%Configuring Csmith to generate HLS-friendly programs is mostly an exercise of reducing and restricting feature probabilities.
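
As a rough illustration of how the parameter settings summarised in Table~\ref{tab:properties} might be passed to Csmith, the following Python sketch assembles a command line. The flag names and values in \code{HLS\_FRIENDLY\_FLAGS} come from the table; the helper \code{csmith\_command}, the file names, and the use of \code{-{}-seed}, \code{-{}-output} and \code{-{}-probability-configuration} to supply a seed, an output path and a file of tuned probabilities are assumptions made for illustration rather than a description of our scripts.

```python
import shlex

# Flags and values taken from the table of Csmith settings in this section.
HLS_FRIENDLY_FLAGS = [
    "--no-packed-struct",
    "--no-embedded-assigns",
    "--no-argc",
    "--max-funcs", "5",
    "--max-block-depth", "2",
    "--max-array-dim", "3",
    "--max-expr-complexity", "2",
]


def csmith_command(seed, output, prob_file=None):
    """Build an argv list for one Csmith run (hypothetical helper)."""
    # Assumption: the seed and output path are given on the command line.
    cmd = ["csmith", "--seed", str(seed), "--output", output]
    cmd += HLS_FRIENDLY_FLAGS
    if prob_file is not None:
        # Assumption: the tuned feature probabilities live in a separate file.
        cmd += ["--probability-configuration", prob_file]
    return cmd


if __name__ == "__main__":
    # Print the command for one hypothetical test case.
    print(shlex.join(csmith_command(1, "prog_1.c", "hls.prob")))
```

Keeping the flags in one list makes it easy to log the exact generator configuration alongside each test case.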
\begin{table}
  \centering
  \begin{tabular}{ll}
    \toprule
    \textbf{Properties/Parameters} & \textbf{Change} \\
    \midrule
    \code{statement\_ifelse\_prob} & Increased \\
    \code{statement\_for\_prob} & Reduced \\
    \code{statement\_arrayop\_prob} & Reduced \\
    \code{statement\_break/goto/continue\_prob} & Reduced \\
    \code{float\_as\_ltype\_prob} & Disabled \\
    \code{pointer\_as\_ltype\_prob} & Disabled \\
    \code{void\_prob} & Disabled \\
    \code{union\_as\_ltype\_prob} & Disabled \\
    \code{more\_struct\_union\_type\_prob} & Disabled \\
    \code{safe\_ops\_signed\_prob} & Disabled \\
    \code{binary\_bit\_and/or\_prob} & Disabled \\
    \code{-{}-no-packed-struct} & Enabled \\
    \code{-{}-no-embedded-assigns} & Enabled \\
    \code{-{}-no-argc} & Enabled \\
    \code{-{}-max-funcs} & 5 \\
    \code{-{}-max-block-depth} & 2 \\
    \code{-{}-max-array-dim} & 3 \\
    \code{-{}-max-expr-complexity} & 2 \\
    \bottomrule
  \end{tabular}
  \caption{Summary of important changes to Csmith's feature probabilities and parameters to generate HLS-friendly programs for our testing campaign.
    % \JW{Shouldn't `no-safe-math' be disabled, rather than enabled? I mean, we want safe-math enabled, right? So we should disable no-safe-math, no?}\YH{Yes, that shouldn't have been there.}
    % \NR{Discussion point: Do we enable unions or not? If not, why do have a union bug in Fig. 4? Reviewers may question this.}
    % \YH{I've now removed the union bug, it's too similar to another bug I think.}
  }\label{tab:properties}
\end{table}

%\paragraph{Significant probability changes}
Table~\ref{tab:properties} lists the main changes that we put in place to ensure that HLS tools are able to synthesise all of our generated programs. Our overarching aim is to make the programs tricky for the tools to handle correctly (in order to maximise our chances of exposing bugs), while keeping the synthesis and simulation times low (in order to maximise the rate at which tests can be run).
To this end, we increase the probability of generating \code{if} statements in order to increase the number of control paths, but we reduce the probability of generating \code{for} loops and array operations since they generally increase run times but not hardware complexity. Relatedly, we reduce the probability of generating \code{break}, \code{goto}, \code{continue} and \code{return} statements, because with fewer \code{for} loops being generated, these statements tend to lead to uninteresting programs that simply exit prematurely.

%\paragraph{Important restrictions}
More importantly, we disable the generation of several language features to enable HLS testing.
% \NR{Few sentences on different features whose probabilities were changed significantly.}
% Table~\ref{tab:properties} lists the important restrictions that we put in place to ensure that HLS tools are able to synthesise programs generated by Csmith.
First, we ensure that all mathematical expressions are safe and unsigned, to ensure no undefined behaviour. We also disallow assignments being embedded within expressions, since HLS tools generally do not support them. We eliminate any floating-point numbers since they typically involve external libraries or use of hard IPs on FPGAs, which in turn make it hard to reduce bugs to their minimal form. We also disable the generation of pointers for HLS testing, since pointer support in HLS tools is either absent or immature~\cite{xilinx20_vivad_high_synth}. We also disable void functions, since we are not supporting pointers. We disable the generation of unions as these were not well supported by some of the tools such as LegUp 4.0.
\JW{Obvious reader question at this point: if a feature is badly-supported by some HLS tool(s), how do we decide between disabling it in Csmith vs. keeping it in and filing lots of bug reports?
For instance, we say that we disable pointers because lots of HLS tools don't cope well with them, but we keep in `volatile' which also leads to problems. Why the different treatments?}
We enforce that the main function of each generated program must not have any input arguments to allow for HLS synthesis. We disable structure packing within Csmith since the ``\code{\#pragma pack(1)}'' directive involved causes conflicts in HLS tools because it is interpreted as an unsupported pragma. We also disable bitwise AND and OR operations because when applied to constant operands, some versions of Vivado HLS errored out with `Wrong pragma usage.'
%\NR{Did we report this problem to the developers? If so, can we claim that here?}\YH{We haven't I think, we only reported one bug.}

%\paragraph{Parameter settings}
Finally, we tweak several integer parameters that influence program generation. We limit the maximum number of functions (five) and array dimensions (three) in our random C programs, in order to reduce the design complexity and size.
%\JW{Is this bigger or smaller than the default? Either way: why did we make this change?}
We also limit the depth of statements and expressions, to reduce the synthesis and simulation times.
% \JW{Why? Presumably this is to do with keeping synthesis/simulation times as low as possible?}

%\NR{I remove Zewei's writing but please feel free to re-instate them}

\subsection{Augmenting programs for HLS testing}
\label{sec:method:annotate}

We augment the programs generated by Csmith to prepare them for HLS testing. We do this in two ways: program instrumentation and directive injection. This involves either modifying the C program or accompanying the C program with a configuration file, typically a \code{tcl} file. Finally, we must also generate a tool-specific build script per program, which instructs the HLS tool to create a design project and perform the necessary steps to build and simulate the design.
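
To make the idea of a per-program build script concrete, the following Python sketch renders a Tcl script of the kind that Vivado HLS accepts. It is only an illustration under stated assumptions: the helper \code{vivado\_hls\_script}, the project and solution names, the file names, the part identifier and the clock period are all hypothetical, and the scripts used for the other tools differ.

```python
def vivado_hls_script(top, source, testbench, part, clock_ns=10):
    """Render a Vivado HLS Tcl build script for one program (hypothetical helper)."""
    lines = [
        f"open_project -reset proj_{top}",   # create a fresh design project
        f"set_top {top}",                    # function to be synthesised
        f"add_files {source}",
        f"add_files -tb {testbench}",
        "open_solution -reset solution1",
        f"set_part {{{part}}}",              # placeholder target device
        f"create_clock -period {clock_ns}",
        "csim_design",                       # C simulation
        "csynth_design",                     # C-to-RTL synthesis
        "cosim_design",                      # C/RTL co-simulation
        "exit",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All names below are placeholders chosen for this example.
    print(vivado_hls_script("result", "prog_1.c", "prog_1_tb.c", "xc7z020clg484-1"))
```

Generating the script from a template keeps the build steps identical across test cases, so that any difference in behaviour can be attributed to the program and its directives.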
\paragraph{Instrumenting the original C program}
%To ensure that the C programs can be successfully synthesised and simulated, we must adhere to a few rules that are common to HLS tools.
We generate a synthesisable testbench that executes the main function of the original C program.
%Csmith does not generate meaningful testbenches for HLS synthesis.
%So we invoke the original C program's main function from another top-level function.
This top-level testbench contains a custom XOR-based hash function that takes hashes of the program state at several points during execution, combines all these hashes together, and then returns this value. By making the program's output sensitive to the program state in this way, we maximise the likelihood of detecting bugs when they occur. Csmith-generated programs do already include their own hashing function, but we replaced this with a simple XOR-based hashing function because we found that the Csmith one led to infeasibly long synthesis times.
%We inject hashing on program states at several stages of the C program.
%By doing so, we keep track of program state, increasing the likelihood of encountering a bug.
%The final hash value is returned as a part of the main function to assist in determining the presence of a bug.

\paragraph{Injecting HLS directives}
Directives are used to instruct HLS tools to optimise the resultant hardware to meet specific performance, power and area targets. Typically, an HLS tool identifies these directives and subjects the C code to customised optimisation passes. In order to test the robustness of these parts of the HLS tools, we randomly generate directives for each C program generated by Csmith. Some directives can be applied via a separate configuration file, others require us to add labels in the C program (e.g. to identify loops), and a few directives require placing pragmas at particular locations in a C program.
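
The configuration-file route can be sketched as follows for Vivado HLS, where a directive is attached to a labelled loop by naming the enclosing function and the label. This Python fragment is a simplified illustration, not our generator: the helper \code{random\_loop\_directives}, the function name and the loop labels are hypothetical, only four loop directives are included, and directive options are omitted.

```python
import random

# A small subset of Vivado HLS Tcl directive commands that apply to loops.
LOOP_DIRECTIVES = [
    "set_directive_pipeline",
    "set_directive_unroll",
    "set_directive_loop_flatten",
    "set_directive_loop_merge",
]


def random_loop_directives(function, loop_labels, rng):
    """Pick one random directive per labelled loop (hypothetical helper)."""
    tcl = []
    for label in loop_labels:
        directive = rng.choice(LOOP_DIRECTIVES)
        # The location "function/label" relies on the label added to the C code.
        tcl.append(f'{directive} "{function}/{label}"')
    return tcl


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that a test case can be regenerated
    for line in random_loop_directives("func_1", ["loop_1", "loop_2"], rng):
        print(line)
```

Seeding the random number generator per test case means that the same directives can be reproduced later, for instance when a failing program is being reduced.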
We generate three classes of directives: those for loops, those for functions, and those for variables. For loops, we randomly generate directives including loop pipelining (with rewinding and flushing), loop unrolling, loop flattening, loop merging, loop tripcount, loop inlining, and expression balancing.
%\NR{Not so clear about nested loops. Discuss.}\YH{I believe they are just handled like simple loops?}
For functions, we randomly generate directives including function pipelining, function-level loop merging, function inlining, and expression balancing. For variables, we randomly generate directives including array mapping, array partitioning and array reshaping.
%\NR{Would a table of directives help? }\YH{I think there might be too many if we go over all the tools.}

%\paragraph{Generating build script}
%In addition to the annotated C program and configuration file with directives, we also generate a tool-specific build script.
%This script instructs the HLS tool to create a design project and perform the necessary steps to build and simulate the design.

%\NR{Have I missed any LegUp directives? What is these lines referring to. Supported Makefile directives include partial loop unrolling with a threshold, disable inline, and disable all optimizations. Available Config TCL directives include partial loop pipeline, all loop pipeline, disable loop pipeline, resource-sharing loop pipeline, and accelerating functions. }\YH{I think that covers most of it.}

\subsection{Testing various HLS tools}
\label{sec:method:testing}

% \NR{Some notes on key points from this section:
% \begin{itemize}
% \item We needed time outs that several stages of the HLS flow to ensure scalable testing.
% \item LegUp 7.0 was not user-friendly enough to perform scripting.
% \item Intel HLS needs more interventions and time outs since its slow execution times are part of our critical path during testing and reduction.
% \item We need to think of a good way to highlight the tooling difference and how it affects our changes. This is good information for anyone who is attempting to port any work across various tools.
% \end{itemize}
% }
% \NR{So Vivado HLS does co-simulation automatically, whereas we had to invoke them for other tools?}

Having generated HLS-friendly programs and automatically augmented them with directives and meaningful testbenches, we are now ready to provide them to HLS tools for testing.
%Figure~\ref{fig:method:toolflow} shows the three stages of testing, depicted as the testing environment in the dashed area.
For each HLS tool in turn, we compile the C program to RTL and then simulate the RTL. Independently, we also compile the C program using GCC and execute it. To ensure that our testing is scalable for a large number of large, random programs, we also enforce several time-outs: we set a 5-minute time-out for C execution and a 2-hour time-out for RTL simulation. We do not count time-outs as bugs, but we record them.

%% JW: This paragraph is not really needed because we repeat the sentiment in the next subsection anyway.
%There two types of bugs that we can encounter in this testing setup: programs that cause the HLS tool to crash during compilation (e.g. an unhandled assertion violation or a segmentation fault), and programs where the software execution and the RTL simulation do not return the same value. Programs that cause either type of error are given to the reduction stage, which aims to minimise the programs and (hopefully) identify the root cause(s).

%\paragraph{Tool-specific trivia}
%Vivado HLS implements C and RTL simulation by default.
%For all tools, we set a 5-minute and 2-hour for C simulation and RTL simulation respectively.
%In cases where simulation takes longer, we do not consider them as failures or crashes and we record them.
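
The comparison at the heart of this flow can be sketched in a few lines of Python. The two time-out values are those stated above; everything else is an assumption made for illustration, in particular the helper names \code{run\_with\_timeout} and \code{classify}, and the convention that each step prints its result on standard output and exits with status zero on success.

```python
import subprocess
import sys

C_TIMEOUT_S = 5 * 60          # 5-minute time-out for C execution
RTL_TIMEOUT_S = 2 * 60 * 60   # 2-hour time-out for RTL simulation


def run_with_timeout(cmd, timeout_s):
    """Run one step and return a (status, output) pair (hypothetical helper)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return ("timeout", None)
    # Assumption: a non-zero exit status means that the step crashed.
    if proc.returncode != 0:
        return ("crash", None)
    return ("ok", proc.stdout.strip())


def classify(reference, rtl):
    """Compare the result of the GCC-compiled program with the RTL simulation."""
    statuses = (reference[0], rtl[0])
    if "timeout" in statuses:
        return "timeout"      # recorded, but not counted as a bug
    if "crash" in statuses:
        return "crash"
    return "pass" if reference[1] == rtl[1] else "mismatch"


if __name__ == "__main__":
    # Stand-in commands; a real run would invoke the compiled program and the simulator.
    reference = run_with_timeout([sys.executable, "-c", "print(42)"], C_TIMEOUT_S)
    rtl = run_with_timeout([sys.executable, "-c", "print(42)"], RTL_TIMEOUT_S)
    print(classify(reference, rtl))
```

Returning a status alongside the output keeps time-outs distinguishable from genuine crashes and mismatches when the results are tallied.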
% When testing Intel HLS, we place three time-outs since its execution is generally slower than other HLS tools: a \ref{XX}-minute time-out for C compilation, a \ref{XX}-hour time out for HLS compilation and a \ref{XX}-hour time-out for co-simulation.
% And the number of timeouts placed has increased to 4. The first timeout sets when compiling the C++ program to CPU and returning an executable once finished. The second timeout is placed when running the executable to get the C++ result. The third timeout, which been given the most extended period, is at synthesizing the design and generating the co-simulation executable. Finally, running the co-simulation executable requires the fourth timeout. The test case can be dumped at any timeout period if the task is not finished within the limited time.

\subsection{Reducing buggy programs}
\label{sec:method:reduce}

Once we discover a program that crashes the HLS tool or whose C/RTL simulations do not match, we further scrutinise the program to identify the root cause(s) of the undesirable behaviour. As the programs generated by Csmith can be fairly large, we must systematically reduce these programs to identify the source of a bug. Reduction is performed by iteratively removing some part of the original program and then providing the reduced program to the HLS tool for re-synthesis and co-simulation. The goal is to find the smallest program that still triggers the bug. We apply two consecutive methods of reduction in this work.

We first perform a custom reduction in which we iteratively remove lines in strict order from the top-level function. This method of reduction first reduces the pragmas and labels that were added before synthesis of the C program, after which it proceeds by iteratively commenting out lines in the C program one after another until any further reduction would eliminate the bug.
% \NR{We can add one or two more sentences summarising how we reduce the programs.
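
The line-by-line strategy of the custom reduction can be sketched as below. This is a simplified model rather than our implementation: the function \code{reduce\_lines} and the predicate \code{still\_fails} are hypothetical names, lines are deleted instead of being commented out, and the predicate stands in for a full re-synthesis and co-simulation run.

```python
def reduce_lines(lines, still_fails):
    """Drop lines in strict order while the bug persists (hypothetical helper)."""
    kept = list(lines)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if still_fails(candidate):
            kept = candidate   # the bug survives without this line, so drop it
        else:
            i += 1             # the line is needed to trigger the bug, so keep it
    return kept


if __name__ == "__main__":
    # Toy predicate: the "bug" needs both lines "b" and "d" to be present.
    print(reduce_lines(list("abcde"), lambda ls: "b" in ls and "d" in ls))
```

Each call to the predicate corresponds to one run of the HLS tool, which is why this form of reduction is slow when done one line at a time.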
% Zewei is probably the best person to add these sentences.}\YH{Added some more lines, we can ask Zewei if she is OK with that.}

Although our custom reduction gives us freedom and control over how buggy programs are reduced, it is arduous and requires a lot of manual effort. We therefore also integrated Creduce~\cite{creduce} into the workflow to automatically reduce the C programs that fail synthesis or whose outputs mismatch. This greatly speeds up the reduction of failing test cases, as Creduce can reduce the input in a semantically correct way while avoiding undefined behaviour. In addition, it can run in parallel and has various reduction strategies that help it converge to a smaller test case faster, such as delta debugging passes or even inlining of function bodies, which our custom reduction pass does not do. This means that the final product that Creduce obtains is often small enough to understand and step through.

However, the downside of using Creduce with HLS tools is that we are not in control of which lines and features are prioritised for removal. As a consequence, we can easily end up with Creduce producing smaller programs that are not HLS-friendly and are unsynthesisable, despite being valid C. This is because the semantics of C as accepted by HLS tools can differ from the semantics assumed by the standard C compilers that Creduce was built for. Although Creduce does not normally introduce undefined behaviour, it can introduce behaviour that is undefined for HLS tools, because the C that they accept is more restricted than standard C. An example of this could be the reduction of a function call, where the reducer realises that a mismatch is still observed when the arguments to the function call are removed, and the function pointer is assigned to the integer instead.
This is, however, often undefined behaviour in HLS tools, as a function pointer does not have a concrete meaning in hardware, where functions are not associated with a memory location because there are no instructions. Once undefined behaviour is introduced at any point in the reduction, the test case will often reduce to that undefined behaviour, since it does create a mismatch between the HLS tool and the C compiler but does not actually represent a bug. To prevent this, the script that guides Creduce towards the correct bug must avoid introducing such undefined behaviour as far as possible. The following measures were taken to prevent the insertion of undefined behaviour:
\begin{itemize}
\item Add \texttt{-fsanitize=undefined} to GCC options to discover and error out on undefined behaviour at runtime of the executable.
\item Turn on all warnings in GCC and error out on any warning, while ignoring common warnings that are known to be harmless.
\end{itemize}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End:
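
The two measures listed above could be combined into a check along the following lines, which a reduction script would run on each candidate program before accepting it. This Python sketch rests on several assumptions: the helper names, the file names and the time-out are hypothetical, \code{-Wno-unused-variable} is only a placeholder for the warnings that are ignored, and undefined behaviour is detected by searching the program's standard error for the ``runtime error'' messages printed by the sanitiser.

```python
import subprocess

# Placeholder: the warnings that are actually ignored are not listed in the text.
IGNORED_WARNINGS = ["-Wno-unused-variable"]


def gcc_command(source, exe):
    """Build a GCC command line with warnings as errors and the UB sanitiser enabled."""
    flags = ["-Wall", "-Werror", "-fsanitize=undefined"]
    return ["gcc"] + flags + IGNORED_WARNINGS + [source, "-o", exe]


def free_of_undefined_behaviour(source, exe="./reduced", timeout_s=300):
    """Return True if the program compiles cleanly and runs without sanitiser reports."""
    if subprocess.run(gcc_command(source, exe)).returncode != 0:
        return False   # a warning or an error was raised at compile time
    try:
        run = subprocess.run([exe], capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    # The sanitiser reports undefined behaviour on stderr as "... runtime error: ...".
    return b"runtime error" not in run.stderr


if __name__ == "__main__":
    print(" ".join(gcc_command("reduced.c", "./reduced")))
```

A candidate that fails this check is rejected, so that the reducer is steered away from variants whose mismatch stems from undefined behaviour rather than from a bug in the HLS tool.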