diff options
Diffstat (limited to 'papers/cfrontend_new/paper.tex')
-rwxr-xr-x | papers/cfrontend_new/paper.tex | 382 |
1 files changed, 382 insertions, 0 deletions
diff --git a/papers/cfrontend_new/paper.tex b/papers/cfrontend_new/paper.tex new file mode 100755 index 00000000..4293554f --- /dev/null +++ b/papers/cfrontend_new/paper.tex @@ -0,0 +1,382 @@ +\documentclass[namedreferences]{kluwer} + +\usepackage{latexsym} +%\usepackage{amsmath} pb latex +\usepackage{amssymb} +\usepackage{amsfonts} +\usepackage{mymacros} +\usepackage{infrules} +\usepackage{url} +\usepackage{mymacros} +\usepackage[usenames]{color} + +\input{macros} + +\begin{document} +\begin{article} + +\begin{opening} +\title{Formal verification of C front-end} +\subtitle{Yet another formal verification in Coq of a C front-end} +\runningtitle{C front-end} + +\author{Sandrine \surname{Blazy} \email{{S}andrine.Blazy@ensiie.fr}} +\institute{ENSIIE and INRIA} + +\author{Xavier \surname{Leroy} \email{{X}avier.Leroy@inria.fr}} +\runningauthor{S.Blazy, X.Leroy} +\institute{INRIA} + +\maketitle + +\begin{abstract} +This paper presents +\end{abstract} + +\keywords{formal semantics, formal proof, C language, C compiler} +\end{opening} + +\section{Introduction} + +\section{A formal semantics for \Clight{}} +\subsection{CIL} + +\subsection{Abstract syntax} + +The abstract syntax of \Clight{} is given in figures~\ref{fig:syntax} +and~\ref{fig:syntax2}. +In the Coq formalization, this abstract syntax is presented as +inductive data types, therefore achieving a deep embedding of \Clight{} +into Coq. + +At the level of types, \Clight{} features all the integral types of C, +along with array, pointer (including function pointers), function +types, {\tt struct} and {\tt union} types. Bit-fields in {\tt struct} +or {\tt union} are ignored. +The integral types fully specify the bit size of integers and floats, +unlike the half-specified C types {\tt int}, {\tt long}, etc. + +\sbcm{TO DO: representation of struct and union} + +Within expressions, all C operators are supported. +Side-effects stem from function calls. +All expressions and their sub-expressions are +annotated by their static types. We write $a^\tau$ for the expression +$a$ carrying type $\tau$. These types are necessary to determine the +semantics of type-dependent operators such as the overloaded +arithmetic operators. The following expressions may be in left-value +position: $\id, *a, a_1[a_2]$ and $a \dot \id$. + +\input syntax.tex + +At the level of statements, all structured control statements of C +(conditional, loops, simple {\tt switch}, {\tt break}, {\tt continue} and +{\tt return}) are supported, but not unstructured statements +(unstructured {\tt switch} and {\tt goto}). +Contrary to C, the default case is mandatory in a \Clight{} {\tt switch} statement +and it is the last branch of the {\tt switch}. +An assignment is treated as a statement (instead of an expression). +This transformation from expression into statements is performed by CIL. +Two kinds of variables are allowed: global variables and local {\tt auto} +variables declared at the beginning of a function. Block-scoped local +variables and {\tt static} variables are omitted because they are emulated by +pulling their declarations to function scope or global scope, +respectively. Consequently, there is no block statement in \Clight{}. + +A function $f$ defined in \Clight{} consists of a header (\textit{i.e.} +{\tt fn\_params(f)} the parameters of $f$ and {\tt fn\_return(f)} +the return type of $f$), a list of local variables ({\tt fn\_vars(f)}) +and a body ({\tt fn\_body(f)}) that is a statement. The functions composing +the program are either internal functions, defined within \Clight{}, or +external functions (a.k.a. system calls). An external function consists +of the internal identifier and the signature of the function. + +A program $p$ consists of a list of global variable declarations +({\tt prog\_vars(p)}), a list of function definitions ({\tt prog\_funct(p)}), +and an identifier naming the entry point of the program ({\tt prog\_main(p)}). +In a global variable declaration, the list of initial values +represents all values that need to be stored (\textit{e.g.} the +values of the cells of an array). + +\input syntax2.tex + +\subsection{Static semantics} + The type system is very coarse: we check only the typing properties + that matter for the translation to be correct. Essentially, + we need to check that the types attached to variable references + match the declaration types for those variables. + +\subsection{Dynamic semantics} + +The dynamic semantics of \Clight{} is specified using natural semantics, +also known as big-step operational semantics. + These natural semantics capture the final result of program execution, as +well as traces of calls to external functions that emit an event when applied. +The big-step semantics described in~\cite{blazy06:fm} did not observe traces. +This addition of traces leads to a significantly +stronger observational equivalence between source and machine code. + +Execution traces represent the interactions between the program and +the rest of the world (\textit{i.e.} the input-output activities of +the program); they are defined in figure~\ref{fig:trace}. +A trace is a finite list of input-output events. +Each call to an external function produces an event. +An event consists of an identifier and the actual values of the parameters of +the called function. Usually, the identifier represents the called function, +but more generally it could also represent any observable identifier. +Starting from an initial empty event (called $\E0$), events are propagated and +concatenated during program execution. + +\input trace.tex + +While the semantics of +C is not deterministic (the evaluation order for expressions is not +completely specified and compilers are free to choose between several +orders), the semantics of \Clight{} is completely deterministic and +imposes a left-to-right evaluation order, consistent with the order +implemented by our compiler. This choice simplifies greatly the +semantics compared with, for instance, Norrish's semantics for C +\cite{Norrish:phd}, which captures (at great expense) the +non-determinism allowed by the ISO C specification. Our semantics can +therefore be viewed as a refinement of (a subset of) the ISO C +semantics, or that of Norrish. + +The semantics is defined by 7 judgements (relations): +$$\begin{array}{rclr} +G, E & |- & a, M \evallvalue \loc, \tr, M' & \mbox{(expressions in l-value position)} \\ +G, E & |- & a , M \evalexpr v, \tr, M' & \mbox{(expressions in r-value position)} \\ +G, E & |- & \seq a, M \evalexpr \seq v, \tr, M' & \mbox{(list of expressions)}\\ +G, E & |- & s, M \evalstmt \out, \tr, M' & \mbox{(statements)} \\ +G, E & |- & \ls, M \evalstmt \out, \tr, M' & \mbox{(labeled statements of a switch)} \\ +G & |- & f(\seq v), M \evalfunc v, \tr, M' & \mbox{(function invocations)} \\ + & \vdash & p \evalprog v, \tr & \mbox{(programs)} +\end{array}$$ +Each judgement relates a syntactic element (expression, statement, +etc.) and an initial memory state to the result of executing this +syntactic element, as well as the final memory state at the end of +execution and also the trace of input-output events (system calls). + The various kinds of results, as well as the evaluation +environments, are defined in figure~\ref{fig:values}. + +\input values.tex + +For instance, executing an expression $a$ in l-value position results in a +memory location $\loc$ (a memory block reference and an offset within that +block), while executing an expression $a$ in r-value position results in +the value $v$ of the expression. Values range over 32-bit integers, +64-bit floats, memory locations (pointers), and an undefined value +that represents for instance the value of uninitialized variables. +The result associated with the execution of a statement $s$ is an +``outcome'' indicating how the execution terminated: either normally +by running to completion or prematurely via a {\tt break}, {\tt continue} or +{\tt return} statement. +The invocation of a function $f$ yields a value $v$, and so does the execution +of a program. + +Two evaluation environments, defined in figure~\ref{fig:values}, +appear as parameters to the judgements. The local environment $E$ maps +local variables to references of memory blocks containing the values +of these variables. These blocks are allocated at function entry and +freed at function return. The global environment $G$ maps global +variables and function names to memory references. It also maps +some references (those corresponding to function pointers) to function +definitions. + +The memory model of the semantics is described in~\cite{icfem:sb}. +A memory $M$ maps block references to block contents. +Each block represents the result of a C \texttt{malloc}, +or a stack frame, a global static variable, or a function +code-pointer. +A block content consists of the dimensions of the block (low and high bounds) +plus a mapping from byte offsets to byte-sized memory cells (\textit{i.e.} +values). + +\sbcm{TO DO: a detailler (faire une section a part?): il faut expliquer +loadval et storeval qui sont utilisees dans la semantique.} + +In the Coq specification, the 7 judgements of the dynamic semantics +are encoded as mutually-inductive predicates. Each defining case of +each predicate corresponds exactly to an inference rule in the +conventional, on-paper presentation of natural semantics. We have one +inference rule for each kind of expression and statement described in +figure~\ref{fig:syntax}. + We show most of the inference rules in figures~\ref{fig:dynsem1} +to~\ref{fig:dynsem4}. + +\input dynsem1.tex + +The first five rules of figure~\ref{fig:dynsem1} illustrate the +evaluation of an expression in l-value position. +A variable $\id$ evaluates to the location $(E(\id), 0)$. +If an expression $a$ evaluates to a pointer value $\loc$, then +the location of the dereferencing expression $(\hbox{{\tt *}}a)^\tau$ is $\loc$. + +Rule~\ref{rule:3} shows the evaluation of an array cell. +The location of an array cell $\unannot{a_1}{\tau_1} [\unannot{a_2}{\tau_2}]$ +is computed from the location of the array $\unannot{a_1}{\tau_1}$ +({\it i.e.} the location of the first cell) and the value of the index +$\unannot{a_2}{\tau_2}$ thanks to the function {\tt add}. +{\tt add} is a partial function: +it can be undefined if the types and the shapes of argument values are +incompatible (e.g. an integer addition of two pointer values). +In the Coq specification, it is a total +function returning optional values: either {\tt None} in case of failure, or +${\tt Some}(v)$, abbreviated as $\some v$, in case of success. +%More precisely, the computation succeeds only if $v_1$ and +%$v_2$ are either 2 integers, or 2 floats or a pointer value and an integer. +%The corresponding types $\tau_1$ and $\tau_2$ must also be +%compatible with these values. For instance, if $\tau_1$ is a {\tt Tarray}, +%then $\tau_2$ must be a {\tt Tint}. +The trace resulting from the evaluation of +$\unannot{a_1}{\tau_1} [\unannot{a_2}{\tau_2}]$ is the concatenation of the +traces resulting from the evaluation of $\unannot{a_1}{\tau_1}$ and +$\unannot{a_2}{\tau_2}$. + + +In a similar way, the location of an access to a {\tt struct} field +$\unannot{a}{\tau} \dot f$ is computed from the location of the {\tt struct} +$\unannot{a}{\tau}$ and an offset with that {\tt struct} (see +rule~\ref{rule:4}). +\sbcm{un peu rapide, cf. repr ?} +The location of an access to a {\tt union} field is the location of +the {\tt union} (see rule~\ref{rule:5}). + +The last four rule of figure~\ref{fig:dynsem1} illustrate the +evaluation of an expression in r-value position. +Rule~\ref{rule:6} rule shows the evaluation of a l-value expression in an +r-value context. The expression is evaluated to its location $l$, +with final memory state $M'$. The value at location $l$ in $M'$ is +fetched using the {\tt loadval} function (see +\sbcm{(faire une section a part?)} +%section~\ref{sec:mem} +) and returned. + +If the location of an expression $a$ is $\loc$, then +the value of the expression $(\hbox{{\tt \&}}a)^\tau$ is also $\loc$ +(rule~\ref{rule:7}). +Rule~\ref{rule:8} evaluates an application of a binary operator $\op$ to +expressions $\unannot{a_1}{\tau_1}$ and $\unannot{a_2}{\tau_2}$. +Both sub-expressions are evaluated in +sequence, and their values are combined with the +{\tt eval{\char95}binary{\char95}operation} function, which takes as additional +arguments the types $\tau_1$ and $\tau_2$ of the arguments, in order to resolve +overloaded and type-dependent operators. + +Rule~\ref{rule:9} evaluates a cast expression $(\tau)\unannot{a}{\tau_a}$. +The expression $\unannot{a}{\tau_a}$ is evaluated, and its +value is cast from its natural type $\tau_a$ to the expected type +$\tau$ using the partial function {\tt cast}. This function +performs appropriate conversions, truncations and sign-extensions over +integers and floats, and may fail for undefined casts. The result $v$ +of the cast is then the final result. + +The rules in figure~\ref{fig:dynsem2} are +examples of statements execution. +The execution of a {\tt skip} statement yields the {\tt Out{\char95}normal} +outcome and the empty trace. +Similarly, the execution of a {\tt break} (resp. {\tt continue}) statement +yields the {\tt Out{\char95}break} (resp. {\tt Out{\char95}continue}) +outcome and the empty trace. + +Rule~\ref{rule:12} evaluates an assignment statement. +An assignment statement $a_1^\tau \hbox{{\tt \ =\ }} a_2^{\tau'}$ evaluates +the l-value $a_1$ to a location $\loc$, then the r-value $a_2$ to a value $v$. +As implicit casts are inserted by CIL, there is no need to cast this $v$. +$v$ is stored in memory at location $\loc$, resulting in the final memory + state $M_3$. + +The execution of a sequence of two statements starts with the +execution of the first statement, thus yielding an outcome that +determines if the second statement must be executed or not +(rules~\ref{rule:15}~and~\ref{rule:16}). +Rules~\ref{rule:17}--\ref{rule:19} describe the execution of a +{\tt while} loop. Once the condition of a while loop is evaluated to a +value $v$, the execution of the loop terminates normally if $v$ is +false. If $v$ is true, then the loop body is executed, thus yielding +an outcome $\out$. If $\out$ is {\tt Out{\char95}break}, the loop terminates +normally. If $\out$ is {\tt Out{\char95}normal} or {\tt Out{\char95}continue}, +the whole loop is reexecuted in the memory state modified by the execution of the +body. +Finally, rules~\ref{rule:20}--\ref{rule:21} describe the execution of a +{\tt return} statement. + +\input dynsem2.tex + +The rules of figure~\ref{fig:dynsem3} define the execution of a call statement. +When the expression ${a_{fun}}^{\tau}$ is evaluated to a pointer value that is the +location (in the global environment $G$) of a function $f$, and when $f$ has the +same signature as ${a_{fun}}^{\tau}$, the control flow is passed to the called +function (rule~\ref{rule:25}). + +The execution of a \Clight{} function allocates the memory required +for storing the formal parameters and the local variables of the function +(rule~\ref{rule:26}). The allocation initializes local variables to the +${\tt Vundef}$ undefined value. Formal parameters are initialized to their +corresponding actual parameters. The body of $f$ is then executed, thus yielding +an outcome. The return value of the function is computed from this outcome and +from the return type of $f$. The memory allocated for executing $f$ is freed +before returning to the caller. + +A call $<\id \, \sigma>$ to an external function triggers a new event +that is the trace resulting from that call. + +\input dynsem3.tex + +The rules of figure~\ref{fig:dynsem4} define the execution of a +${\tt switch}(a) \ls$ statement. If the expression of the {\tt switch} evaluates +to $n$, then some statements of the {\tt switch} are selected and executed +(rule~\ref{rule:28}), thus yielding an outcome $\out$. If $\out$ is +{\tt Out{\char95}break}, then the {\tt switch} terminates normally +(\textit{i.e.} the {\tt outcome{\char95}switch} function updates $\out$ into +{\tt Out{\char95}normal} and leave other outcomes unchanged). + +The first selected statements that are executed are those of either the branch +labeled by $n$ or the default branch (rules~\ref{rule:30} and \ref{rule:29}), +thus yielding an outcome that determines if the following statements must be +executed or not (rules~\ref{rule:30} and \ref{rule:31}). If needed, the execution falls through to the next statements $\ls$. + +\input dynsem4.tex + +\section{Translation from \Clight{} to Csharpminor} +\subsection{Overview of Csharpminor} +The Csharpminor language is the first intermediate language of the compiler. +It is a low-level imperative language, structured like our subset of C into +expressions, statements and functions. + +As in \Clight{}, Csharpminor local variables reside in memory, and their address +can be taken. Csharpminor is entirely processor-neutral. In particular, +Csharpminor uses a standard set of operations. + +\subsection{Overview of the translation} + +\section{Proof of correctness} + +The theorem guarantees that if the source program terminates and does +not go wrong, the compiled code terminates and does not go wrong, +performs exactly the same system calls, and returns the same exit code +as the source program. +Currently, it says nothing about source programs that do not terminate (work +in progress). + +\section{Translation from \Clight{} to Csharpminor} +\subsection{Overview of Csharpminor} +The Cminor language is the second intermediate language of the compiler +and the target language of our front-end compiler. + We now give a short overview of Cminor; see \cite{xl:popl} for a more detailed +description, and \cite{xl:compcert-backend} for a complete formal +specification. + +Unlike in Csharpminor, Cminor local variables do not reside in memory, +and their address can not be taken. + +\subsection{Overview of the translation} + +\section{Quantitative results} + +\section{Conclusion} + +\bibliographystyle{unsrt} +\bibliography{blazy} + +\end{article} +\end{document}
\ No newline at end of file |