aboutsummaryrefslogtreecommitdiffstats
path: root/papers/cfrontend_new/paper.tex
diff options
context:
space:
mode:
Diffstat (limited to 'papers/cfrontend_new/paper.tex')
-rwxr-xr-xpapers/cfrontend_new/paper.tex382
1 files changed, 382 insertions, 0 deletions
diff --git a/papers/cfrontend_new/paper.tex b/papers/cfrontend_new/paper.tex
new file mode 100755
index 00000000..4293554f
--- /dev/null
+++ b/papers/cfrontend_new/paper.tex
@@ -0,0 +1,382 @@
+\documentclass[namedreferences]{kluwer}
+
+\usepackage{latexsym}
+%\usepackage{amsmath} pb latex
+\usepackage{amssymb}
+\usepackage{amsfonts}
+\usepackage{mymacros}
+\usepackage{infrules}
+\usepackage{url}
+\usepackage{mymacros}
+\usepackage[usenames]{color}
+
+\input{macros}
+
+\begin{document}
+\begin{article}
+
+\begin{opening}
+\title{Formal verification of C front-end}
+\subtitle{Yet another formal verification in Coq of a C front-end}
+\runningtitle{C front-end}
+
+\author{Sandrine \surname{Blazy} \email{{S}andrine.Blazy@ensiie.fr}}
+\institute{ENSIIE and INRIA}
+
+\author{Xavier \surname{Leroy} \email{{X}avier.Leroy@inria.fr}}
+\runningauthor{S.Blazy, X.Leroy}
+\institute{INRIA}
+
+\maketitle
+
+\begin{abstract}
+This paper presents
+\end{abstract}
+
+\keywords{formal semantics, formal proof, C language, C compiler}
+\end{opening}
+
+\section{Introduction}
+
+\section{A formal semantics for \Clight{}}
+\subsection{CIL}
+
+\subsection{Abstract syntax}
+
+The abstract syntax of \Clight{} is given in figures~\ref{fig:syntax}
+and~\ref{fig:syntax2}.
+In the Coq formalization, this abstract syntax is presented as
+inductive data types, therefore achieving a deep embedding of \Clight{}
+into Coq.
+
+At the level of types, \Clight{} features all the integral types of C,
+along with array, pointer (including function pointers), function
+types, {\tt struct} and {\tt union} types. Bit-fields in {\tt struct}
+or {\tt union} are ignored.
+The integral types fully specify the bit size of integers and floats,
+unlike the half-specified C types {\tt int}, {\tt long}, etc.
+
+\sbcm{TO DO: representation of struct and union}
+
+Within expressions, all C operators are supported.
+Side-effects stem from function calls.
+All expressions and their sub-expressions are
+annotated by their static types. We write $a^\tau$ for the expression
+$a$ carrying type $\tau$. These types are necessary to determine the
+semantics of type-dependent operators such as the overloaded
+arithmetic operators. The following expressions may be in left-value
+position: $\id, *a, a_1[a_2]$ and $a \dot \id$.
+
+\input syntax.tex
+
+At the level of statements, all structured control statements of C
+(conditional, loops, simple {\tt switch}, {\tt break}, {\tt continue} and
+{\tt return}) are supported, but not unstructured statements
+(unstructured {\tt switch} and {\tt goto}).
+Contrary to C, the default case is mandatory in a \Clight{} {\tt switch} statement
+and it is the last branch of the {\tt switch}.
+An assignment is treated as a statement (instead of an expression).
+This transformation from expression into statements is performed by CIL.
+Two kinds of variables are allowed: global variables and local {\tt auto}
+variables declared at the beginning of a function. Block-scoped local
+variables and {\tt static} variables are omitted because they are emulated by
+pulling their declarations to function scope or global scope,
+respectively. Consequently, there is no block statement in \Clight{}.
+
+A function $f$ defined in \Clight{} consists of a header (\textit{i.e.}
+{\tt fn\_params(f)} the parameters of $f$ and {\tt fn\_return(f)}
+the return type of $f$), a list of local variables ({\tt fn\_vars(f)})
+and a body ({\tt fn\_body(f)}) that is a statement. The functions composing
+the program are either internal functions, defined within \Clight{}, or
+external functions (a.k.a. system calls). An external function consists
+of the internal identifier and the signature of the function.
+
+A program $p$ consists of a list of global variable declarations
+({\tt prog\_vars(p)}), a list of function definitions ({\tt prog\_funct(p)}),
+and an identifier naming the entry point of the program ({\tt prog\_main(p)}).
+In a global variable declaration, the list of initial values
+represents all values that need to be stored (\textit{e.g.} the
+values of the cells of an array).
+
+\input syntax2.tex
+
+\subsection{Static semantics}
+ The type system is very coarse: we check only the typing properties
+ that matter for the translation to be correct. Essentially,
+ we need to check that the types attached to variable references
+ match the declaration types for those variables.
+
+\subsection{Dynamic semantics}
+
+The dynamic semantics of \Clight{} is specified using natural semantics,
+also known as big-step operational semantics.
+ These natural semantics capture the final result of program execution, as
+well as traces of calls to external functions that emit an event when applied.
+The big-step semantics described in~\cite{blazy06:fm} did not observe traces.
+This addition of traces leads to a significantly
+stronger observational equivalence between source and machine code.
+
+Execution traces represent the interactions between the program and
+the rest of the world (\textit{i.e.} the input-output activities of
+the program); they are defined in figure~\ref{fig:trace}.
+A trace is a finite list of input-output events.
+Each call to an external function produces an event.
+An event consists of an identifier and the actual values of the parameters of
+the called function. Usually, the identifier represents the called function,
+but more generally it could also represent any observable identifier.
+Starting from an initial empty event (called $\E0$), events are propagated and
+concatenated during program execution.
+
+\input trace.tex
+
+While the semantics of
+C is not deterministic (the evaluation order for expressions is not
+completely specified and compilers are free to choose between several
+orders), the semantics of \Clight{} is completely deterministic and
+imposes a left-to-right evaluation order, consistent with the order
+implemented by our compiler. This choice simplifies greatly the
+semantics compared with, for instance, Norrish's semantics for C
+\cite{Norrish:phd}, which captures (at great expense) the
+non-determinism allowed by the ISO C specification. Our semantics can
+therefore be viewed as a refinement of (a subset of) the ISO C
+semantics, or that of Norrish.
+
+The semantics is defined by 7 judgements (relations):
+$$\begin{array}{rclr}
+G, E & |- & a, M \evallvalue \loc, \tr, M' & \mbox{(expressions in l-value position)} \\
+G, E & |- & a , M \evalexpr v, \tr, M' & \mbox{(expressions in r-value position)} \\
+G, E & |- & \seq a, M \evalexpr \seq v, \tr, M' & \mbox{(list of expressions)}\\
+G, E & |- & s, M \evalstmt \out, \tr, M' & \mbox{(statements)} \\
+G, E & |- & \ls, M \evalstmt \out, \tr, M' & \mbox{(labeled statements of a switch)} \\
+G & |- & f(\seq v), M \evalfunc v, \tr, M' & \mbox{(function invocations)} \\
+ & \vdash & p \evalprog v, \tr & \mbox{(programs)}
+\end{array}$$
+Each judgement relates a syntactic element (expression, statement,
+etc.) and an initial memory state to the result of executing this
+syntactic element, as well as the final memory state at the end of
+execution and also the trace of input-output events (system calls).
+ The various kinds of results, as well as the evaluation
+environments, are defined in figure~\ref{fig:values}.
+
+\input values.tex
+
+For instance, executing an expression $a$ in l-value position results in a
+memory location $\loc$ (a memory block reference and an offset within that
+block), while executing an expression $a$ in r-value position results in
+the value $v$ of the expression. Values range over 32-bit integers,
+64-bit floats, memory locations (pointers), and an undefined value
+that represents for instance the value of uninitialized variables.
+The result associated with the execution of a statement $s$ is an
+``outcome'' indicating how the execution terminated: either normally
+by running to completion or prematurely via a {\tt break}, {\tt continue} or
+{\tt return} statement.
+The invocation of a function $f$ yields a value $v$, and so does the execution
+of a program.
+
+Two evaluation environments, defined in figure~\ref{fig:values},
+appear as parameters to the judgements. The local environment $E$ maps
+local variables to references of memory blocks containing the values
+of these variables. These blocks are allocated at function entry and
+freed at function return. The global environment $G$ maps global
+variables and function names to memory references. It also maps
+some references (those corresponding to function pointers) to function
+definitions.
+
+The memory model of the semantics is described in~\cite{icfem:sb}.
+A memory $M$ maps block references to block contents.
+Each block represents the result of a C \texttt{malloc},
+or a stack frame, a global static variable, or a function
+code-pointer.
+A block content consists of the dimensions of the block (low and high bounds)
+plus a mapping from byte offsets to byte-sized memory cells (\textit{i.e.}
+values).
+
+\sbcm{TO DO: a detailler (faire une section a part?): il faut expliquer
+loadval et storeval qui sont utilisees dans la semantique.}
+
+In the Coq specification, the 7 judgements of the dynamic semantics
+are encoded as mutually-inductive predicates. Each defining case of
+each predicate corresponds exactly to an inference rule in the
+conventional, on-paper presentation of natural semantics. We have one
+inference rule for each kind of expression and statement described in
+figure~\ref{fig:syntax}.
+ We show most of the inference rules in figures~\ref{fig:dynsem1}
+to~\ref{fig:dynsem4}.
+
+\input dynsem1.tex
+
+The first five rules of figure~\ref{fig:dynsem1} illustrate the
+evaluation of an expression in l-value position.
+A variable $\id$ evaluates to the location $(E(\id), 0)$.
+If an expression $a$ evaluates to a pointer value $\loc$, then
+the location of the dereferencing expression $(\hbox{{\tt *}}a)^\tau$ is $\loc$.
+
+Rule~\ref{rule:3} shows the evaluation of an array cell.
+The location of an array cell $\unannot{a_1}{\tau_1} [\unannot{a_2}{\tau_2}]$
+is computed from the location of the array $\unannot{a_1}{\tau_1}$
+({\it i.e.} the location of the first cell) and the value of the index
+$\unannot{a_2}{\tau_2}$ thanks to the function {\tt add}.
+{\tt add} is a partial function:
+it can be undefined if the types and the shapes of argument values are
+incompatible (e.g. an integer addition of two pointer values).
+In the Coq specification, it is a total
+function returning optional values: either {\tt None} in case of failure, or
+${\tt Some}(v)$, abbreviated as $\some v$, in case of success.
+%More precisely, the computation succeeds only if $v_1$ and
+%$v_2$ are either 2 integers, or 2 floats or a pointer value and an integer.
+%The corresponding types $\tau_1$ and $\tau_2$ must also be
+%compatible with these values. For instance, if $\tau_1$ is a {\tt Tarray},
+%then $\tau_2$ must be a {\tt Tint}.
+The trace resulting from the evaluation of
+$\unannot{a_1}{\tau_1} [\unannot{a_2}{\tau_2}]$ is the concatenation of the
+traces resulting from the evaluation of $\unannot{a_1}{\tau_1}$ and
+$\unannot{a_2}{\tau_2}$.
+
+
+In a similar way, the location of an access to a {\tt struct} field
+$\unannot{a}{\tau} \dot f$ is computed from the location of the {\tt struct}
+$\unannot{a}{\tau}$ and an offset with that {\tt struct} (see
+rule~\ref{rule:4}).
+\sbcm{un peu rapide, cf. repr ?}
+The location of an access to a {\tt union} field is the location of
+the {\tt union} (see rule~\ref{rule:5}).
+
+The last four rule of figure~\ref{fig:dynsem1} illustrate the
+evaluation of an expression in r-value position.
+Rule~\ref{rule:6} rule shows the evaluation of a l-value expression in an
+r-value context. The expression is evaluated to its location $l$,
+with final memory state $M'$. The value at location $l$ in $M'$ is
+fetched using the {\tt loadval} function (see
+\sbcm{(faire une section a part?)}
+%section~\ref{sec:mem}
+) and returned.
+
+If the location of an expression $a$ is $\loc$, then
+the value of the expression $(\hbox{{\tt \&}}a)^\tau$ is also $\loc$
+(rule~\ref{rule:7}).
+Rule~\ref{rule:8} evaluates an application of a binary operator $\op$ to
+expressions $\unannot{a_1}{\tau_1}$ and $\unannot{a_2}{\tau_2}$.
+Both sub-expressions are evaluated in
+sequence, and their values are combined with the
+{\tt eval{\char95}binary{\char95}operation} function, which takes as additional
+arguments the types $\tau_1$ and $\tau_2$ of the arguments, in order to resolve
+overloaded and type-dependent operators.
+
+Rule~\ref{rule:9} evaluates a cast expression $(\tau)\unannot{a}{\tau_a}$.
+The expression $\unannot{a}{\tau_a}$ is evaluated, and its
+value is cast from its natural type $\tau_a$ to the expected type
+$\tau$ using the partial function {\tt cast}. This function
+performs appropriate conversions, truncations and sign-extensions over
+integers and floats, and may fail for undefined casts. The result $v$
+of the cast is then the final result.
+
+The rules in figure~\ref{fig:dynsem2} are
+examples of statements execution.
+The execution of a {\tt skip} statement yields the {\tt Out{\char95}normal}
+outcome and the empty trace.
+Similarly, the execution of a {\tt break} (resp. {\tt continue}) statement
+yields the {\tt Out{\char95}break} (resp. {\tt Out{\char95}continue})
+outcome and the empty trace.
+
+Rule~\ref{rule:12} evaluates an assignment statement.
+An assignment statement $a_1^\tau \hbox{{\tt \ =\ }} a_2^{\tau'}$ evaluates
+the l-value $a_1$ to a location $\loc$, then the r-value $a_2$ to a value $v$.
+As implicit casts are inserted by CIL, there is no need to cast this $v$.
+$v$ is stored in memory at location $\loc$, resulting in the final memory
+ state $M_3$.
+
+The execution of a sequence of two statements starts with the
+execution of the first statement, thus yielding an outcome that
+determines if the second statement must be executed or not
+(rules~\ref{rule:15}~and~\ref{rule:16}).
+Rules~\ref{rule:17}--\ref{rule:19} describe the execution of a
+{\tt while} loop. Once the condition of a while loop is evaluated to a
+value $v$, the execution of the loop terminates normally if $v$ is
+false. If $v$ is true, then the loop body is executed, thus yielding
+an outcome $\out$. If $\out$ is {\tt Out{\char95}break}, the loop terminates
+normally. If $\out$ is {\tt Out{\char95}normal} or {\tt Out{\char95}continue},
+the whole loop is reexecuted in the memory state modified by the execution of the
+body.
+Finally, rules~\ref{rule:20}--\ref{rule:21} describe the execution of a
+{\tt return} statement.
+
+\input dynsem2.tex
+
+The rules of figure~\ref{fig:dynsem3} define the execution of a call statement.
+When the expression ${a_{fun}}^{\tau}$ is evaluated to a pointer value that is the
+location (in the global environment $G$) of a function $f$, and when $f$ has the
+same signature as ${a_{fun}}^{\tau}$, the control flow is passed to the called
+function (rule~\ref{rule:25}).
+
+The execution of a \Clight{} function allocates the memory required
+for storing the formal parameters and the local variables of the function
+(rule~\ref{rule:26}). The allocation initializes local variables to the
+${\tt Vundef}$ undefined value. Formal parameters are initialized to their
+corresponding actual parameters. The body of $f$ is then executed, thus yielding
+an outcome. The return value of the function is computed from this outcome and
+from the return type of $f$. The memory allocated for executing $f$ is freed
+before returning to the caller.
+
+A call $<\id \, \sigma>$ to an external function triggers a new event
+that is the trace resulting from that call.
+
+\input dynsem3.tex
+
+The rules of figure~\ref{fig:dynsem4} define the execution of a
+${\tt switch}(a) \ls$ statement. If the expression of the {\tt switch} evaluates
+to $n$, then some statements of the {\tt switch} are selected and executed
+(rule~\ref{rule:28}), thus yielding an outcome $\out$. If $\out$ is
+{\tt Out{\char95}break}, then the {\tt switch} terminates normally
+(\textit{i.e.} the {\tt outcome{\char95}switch} function updates $\out$ into
+{\tt Out{\char95}normal} and leave other outcomes unchanged).
+
+The first selected statements that are executed are those of either the branch
+labeled by $n$ or the default branch (rules~\ref{rule:30} and \ref{rule:29}),
+thus yielding an outcome that determines if the following statements must be
+executed or not (rules~\ref{rule:30} and \ref{rule:31}). If needed, the execution falls through to the next statements $\ls$.
+
+\input dynsem4.tex
+
+\section{Translation from \Clight{} to Csharpminor}
+\subsection{Overview of Csharpminor}
+The Csharpminor language is the first intermediate language of the compiler.
+It is a low-level imperative language, structured like our subset of C into
+expressions, statements and functions.
+
+As in \Clight{}, Csharpminor local variables reside in memory, and their address
+can be taken. Csharpminor is entirely processor-neutral. In particular,
+Csharpminor uses a standard set of operations.
+
+\subsection{Overview of the translation}
+
+\section{Proof of correctness}
+
+The theorem guarantees that if the source program terminates and does
+not go wrong, the compiled code terminates and does not go wrong,
+performs exactly the same system calls, and returns the same exit code
+as the source program.
+Currently, it says nothing about source programs that do not terminate (work
+in progress).
+
+\section{Translation from \Clight{} to Csharpminor}
+\subsection{Overview of Csharpminor}
+The Cminor language is the second intermediate language of the compiler
+and the target language of our front-end compiler.
+ We now give a short overview of Cminor; see \cite{xl:popl} for a more detailed
+description, and \cite{xl:compcert-backend} for a complete formal
+specification.
+
+Unlike in Csharpminor, Cminor local variables do not reside in memory,
+and their address can not be taken.
+
+\subsection{Overview of the translation}
+
+\section{Quantitative results}
+
+\section{Conclusion}
+
+\bibliographystyle{unsrt}
+\bibliography{blazy}
+
+\end{article}
+\end{document} \ No newline at end of file