summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorJames Pollard <james@pollard-net.co.uk>2020-07-03 14:39:26 +0000
committeroverleaf <overleaf@localhost>2020-07-04 14:48:30 +0000
commit276279c5a676c4a46361c97456da76498e972fea (patch)
treee29630e9517dd1c07aed7260cc1f1cb82ca8904c
parent97fececa9f4a9b84f26e057d87615d2914455812 (diff)
downloadoopsla21_fvhls-276279c5a676c4a46361c97456da76498e972fea.tar.gz
oopsla21_fvhls-276279c5a676c4a46361c97456da76498e972fea.zip
Update on Overleaf.
-rw-r--r--algorithm.tex2
-rw-r--r--evaluation.tex35
-rw-r--r--references.bib27
-rw-r--r--verilog.tex2
4 files changed, 63 insertions, 3 deletions
diff --git a/algorithm.tex b/algorithm.tex
index 90e7644..727e067 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -80,7 +80,7 @@ RTL is the existing intermediate language in CompCert that was chosen to transfo
%% Describe RTL
RTL function definitions are a sequence of instructions encoded in a control-flow graph, with each instruction linking to the next instruction that should be executed. RTL consists of ten separate instructions in total, which makes this a good candidate for high-level synthesis translation, as these instructions translate quite directly to a hardware equivalent. This structure makes it ideal for performing optimisations on, as performing static analysis and transformations on this language is much simpler than doing the same optimisation for the C abstract syntax tree instead. However, even though RTL is quite architecture independent and good for software optimisations, it does not model hardware well and is therefore not suitable for hardware optimisations. We therefore need a new intermediate language that translates directly into hardware and which can also be used as an intermediate language for more hardware specific optimisations.
-However, before this translation can be performed, all functions need to be inlined. This means that recursive function calls cannot be supported, however, these are difficult to translate to hardware as they require dynamic allocation of the stack. Dynamic allocation cannot be performed in hardware, because the memory size has to be set statically and then translated to a physical memory in the hardware. Inlining is already an optimisation that is performed at the RTL stage in CompCert, however, to get RTL that is ready to be translated to hardware, this pass was modified so that is forced all functions to be inlined. This eliminates all the function calls from the RTL, as long as there are no recursive function calls, which means that do not have to be handled by the translation to hardware.
+However, before this translation can be performed, all functions need to be inlined. This means that recursive function calls cannot be supported, however, these are difficult to translate to hardware as they require dynamic allocation of the stack. Dynamic allocation cannot be performed in hardware, because the memory size has to be set statically and then translated to a physical memory in the hardware. \JP{I think we should explain this a little bit more; HW allocation is possible in principle.} Inlining is already an optimisation that is performed at the RTL stage in CompCert, however, to get RTL that is ready to be translated to hardware, this pass was modified so that is forced all functions to be inlined. This eliminates all the function calls from the RTL, as long as there are no recursive function calls, which means that do not have to be handled by the translation to hardware.
\subsection{HTL}
diff --git a/evaluation.tex b/evaluation.tex
index b5206f6..39262e0 100644
--- a/evaluation.tex
+++ b/evaluation.tex
@@ -1,6 +1,39 @@
\section{Evaluation}
-\NR{Do we want to collect C-to-Verilog compile time?} \JW{Yes that would be great.}
+Our evaluation is designed to answer the following three research questions.
+\begin{description}
+\item[RQ1] How fast is the hardware generated by CoqUp, and how does this compare to existing HLS tools?
+\item[RQ2] How area-efficient is the hardware generated by CoqUp, and how does this compare to existing HLS tools?
+\item[RQ3] How long does CoqUp take to produce hardware?
+\end{description}
+
+\subsection{Experimental Setup}
+
+\paragraph{Choice of HLS tool for comparison.} We compare CoqUp against LegUp 4.0 because it is open-source and hence easily accessible, but still produces hardware ``of comparable quality to a commercial high-level synthesis tool''~\cite{canis+11}.
+
+\paragraph{Choice of benchmarks.} We evaluate CoqUp using the PolyBench/C benchmark suite\footnote{\url{http://web.cs.ucla.edu/~pouchet/software/polybench/}}. PolyBench/C is a modern benchmark suite that has been previously used to evaluate HLS tools~\cite{choi+18}. For completeness, we use the full set of 24 benchmarks. We set the benchmark parameters so that all datatypes are integers (since CoqUp only supports integers) and all datasets are `small' (to fit into the small on-chip memories). A current limitation of CoqUp, as discussed in Section~\ref{?}, is that it does not support addition and subtraction operations involving integer literals not divisible by 4. To work around this, we lightly modified each benchmark program so that literals other than multiples of 4 are stored into variables before being added or subtracted. \JW{Any other notable changes to the benchmarks?}
+
+\paragraph{Experimental setup.} In order to generate a hardware implementation, the Verilog produced by the HLS tool-under-test must be synthesised to a netlist using a tool such as Yosys~\cite{yosys} or Intel Quartus~\cite{quartus}. The resultant netlist can then be placed-and-routed for a particular FPGA device. In the ideal experimental setup, we would use the same netlist synthesis tool for both CoqUp and LegUp.
+However, we found that neither Yosys nor Quartus worked well with \emph{both} HLS tools. Quartus could synthesise efficient hardware from LegUp-generated Verilog in part because it detects opportunities to replace large numbers of registers with small RAM blocks, yet on CoqUp-generated Verilog, this RAM inference failed, leading to a design too large to fit onto the FPGA. Yosys had the same problem, but with the HLS tools reversed. So, in an effort to avoid disadvantaging either HLS tool, we use LegUp with Quartus and CoqUp with Yosys. In both cases, we use Quartus to place-and-route the synthesised netlists for a \ref{?} FPGA.
+
+\subsection{RQ1: How fast is CoqUp-generated hardware?}
+
+\begin{itemize}
+ \item Draw a scatter graph and talk about it. Note: advantage of scatter graph is that it summarises a large number of benchmarks quite succinctly. However, barcharts is more traditional and would allow data for individual benchmarks to be more easily identified.
+\end{itemize}
+
+\subsection{RQ2: How area-efficient is CoqUp-generated hardware?}
+
+\begin{itemize}
+ \item Draw a scatter graph and talk about it.
+\end{itemize}
+
+\subsection{RQ3: How long does CoqUp take to produce hardware?}
+
+\begin{itemize}
+ \item Draw a scatter graph and talk about it.
+\end{itemize}
+
\begin{table}
\begin{tabular}{lcccccc}
\toprule
diff --git a/references.bib b/references.bib
index 81fe2c4..ec8304d 100644
--- a/references.bib
+++ b/references.bib
@@ -369,3 +369,30 @@
publisher = {ACM},
series = {PLDI 2019},
}
+
+@inproceedings{canis+11,
+ author = {Andrew Canis and
+ Jongsok Choi and
+ Mark Aldham and
+ Victor Zhang and
+ Ahmed Kammoona and
+ Jason Helge Anderson and
+ Stephen Dean Brown and
+ Tomasz S. Czajkowski},
+ title = {{LegUp}: high-level synthesis for {FPGA}-based processor/accelerator systems},
+ booktitle = {{FPGA}},
+ pages = {33--36},
+ publisher = {{ACM}},
+ year = {2011}
+}
+
+@inproceedings{choi+18,
+ author = {Young{-}kyu Choi and
+ Jason Cong},
+ title = {{HLS}-based optimization and design space exploration for applications
+ with variable loop bounds},
+ booktitle = {{ICCAD}},
+ publisher = {{ACM}},
+ year = {2018}
+}
+
diff --git a/verilog.tex b/verilog.tex
index 0d3a071..7b951c7 100644
--- a/verilog.tex
+++ b/verilog.tex
@@ -66,7 +66,7 @@ The two main evaluation functions are then \textit{erun}, which takes in the cur
% \inferrule[Nonblocking Array]{\yhkeyword{name}\ \textit{lhs} = \yhkeyword{OK}\ n \\ \textit{erun}\ \Gamma\ \textit{rhs}\ v_{\textit{rhs}}}{\textit{srun}\ (\Gamma_{r}, \Gamma_{a}, \Delta_{r}, \Delta_{a})\ (\textit{lhs} \Leftarrow \textit{rhs})\ (\Gamma_{r}, \Gamma_{a}, \Delta_{r} // \{n \rightarrow v_{\textit{rhs}}\}, \Delta_{a})}
\end{gather*}
-\YH{TODO: Add rules for blocking and nonblocking assignment to arrays.} \JW{In CondTrue and CondFalse, the relationship between $\Gamma_0$ and $\sigma_0$ needs clarifying.}\YH{Clarified now in the previous paragraph.} \JW{Hm, why not just make `erun' take the entire $\sigma$ rather than just two of its fields? That would avoid this confusion altogether.}\YH{I guess I just wanted to make it explicit that erun never looks at the nonblocking assignments, but yes, it could also just take the whole $\sigma$} \JW{In CaseNoMatch, it feels weird to me that you keep evaluating $e$ for each case of the switch, rather than just once at the start of the switch statement. I guess it's ok because a failed match doesn't change the state. Just a bit quirky, I guess.}\YH{Yes that is a bit annoying actually, however, I couldn't really figure out the best way to only evaluate it once, as there isn't really a start to the case statement, we just describe that you could start anywhere in the case statement and evaluate it. One solution would be to define a separate inductive rule that finds a matching case based on an evaluated value, which may be cleaner actually.} \JW{In CaseDefault, I was slightly surprised to see `Some' -- is that necessary?}\YH{That's true, currently it's a \texttt{Some} because the default case is optional, however, we actually always use it, so we could change it to be mandatory.} \JW{In BlockingReg, erun is taking more parameters than it does elsewhere.}\YH{Yes thank you, fixed that now.} \JW{When using subscripted variables like $\Gamma_r$, I prefer $\Gamma_{\rm r}$ because $r$ is a fixed name (short for `register'), not a variable.}\YH{That is much nicer actually!}
+\YH{TODO: Add rules for blocking and nonblocking assignment to arrays.} \JW{In CondTrue and CondFalse, the relationship between $\Gamma_0$ and $\sigma_0$ needs clarifying.}\YH{Clarified now in the previous paragraph.} \JW{Hm, why not just make `erun' take the entire $\sigma$ rather than just two of its fields? That would avoid this confusion altogether.}\YH{I guess I just wanted to make it explicit that erun never looks at the nonblocking assignments, but yes, it could also just take the whole $\sigma$} \JW{In CaseNoMatch, it feels weird to me that you keep evaluating $e$ for each case of the switch, rather than just once at the start of the switch statement. I guess it's ok because a failed match doesn't change the state. Just a bit quirky, I guess.}\YH{Yes that is a bit annoying actually, however, I couldn't really figure out the best way to only evaluate it once, as there isn't really a start to the case statement, we just describe that you could start anywhere in the case statement and evaluate it. One solution would be to define a separate inductive rule that finds a matching case based on an evaluated value, which may be cleaner actually.} \JW{In CaseDefault, I was slightly surprised to see `Some' -- is that necessary?}\YH{That's true, currently it's a \texttt{Some} because the default case is optional, however, we actually always use it, so we could change it to be mandatory.} \JW{When using subscripted variables like $\Gamma_r$, I prefer $\Gamma_{\rm r}$ because $r$ is a fixed name (short for `register'), not a variable.}\YH{That is much nicer actually!}
Taking the \textsc{CondTrue} rule as an example, this rule will only apply if the Boolean result of running the expression results in a \texttt{true} value. It then also states that the statement in the true branch of the conditional statement \textit{stt} runs from state $\sigma_{0}$ to state $\sigma_{1}$. If both of these conditions hold, we then get that the conditional statement will also run from state $\sigma_{0}$ to state $\sigma_{1}$. The \textsc{Blocking} and \textsc{Nonblocking} rules are a bit more interesting, as these modify the blocking and nonblocking association maps respectively.