8 Library of CIL Modules
We are developing a suite of modules that use CIL for program analyses and
transformations that we have found useful. You can use these modules directly
on your code, or generally as inspiration for writing similar modules. A
particularly big and complex application written on top of CIL is CCured
(../ccured/index.html).
8.1 Control-Flow Graphs
The Cil.stmt datatype includes fields for intraprocedural
control-flow information: the predecessor and successor statements of
the current statement. This information is not computed by default.
If you want to use the control-flow graph, or any of the extensions in
this section that require it, you have to explicitly ask CIL to
compute the CFG.
8.1.1 The CFG module (new in CIL 1.3.5)
The best way to compute the CFG is with the CFG module. Just invoke
Cfg.computeFileCFG on your file. The Cfg API
describes the rest of actions you can take with this module, including
computing the CFG for one function at a time, or printing the CFG in
dot form.
8.1.2 Simplified control flow
CIL can reduce high-level C control-flow constructs like switch and
continue to lower-level gotos. This completely eliminates some
possible classes of statements from the program and may make the result
easier to analyze (e.g., it simplifies data-flow analysis).
You can invoke this transformation on the command line with
--domakeCFG or programatically with Cil.prepareCFG.
After calling Cil.prepareCFG, you can use Cil.computeCFGInfo
to compute the CFG information and find the successor and predecessor
of each statement.
For a concrete example, you can see how cilly --domakeCFG
transforms the following code (note the fall-through in case 1):
int foo (int predicate) {
int x = 0;
switch (predicate) {
case 0: return 111;
case 1: x = x + 1;
case 2: return (x+3);
case 3: break;
default: return 222;
}
return 333;
}
See the CIL output for this
code fragment
8.2 Data flow analysis framework
The Dataflow module (click for the ocamldoc) contains a
parameterized framework for forward and backward data flow
analyses. You provide the transfer functions and this module does the
analysis. You must compute control-flow information (Section 8.1)
before invoking the Dataflow module.
8.3 Dominators
The module Dominators contains the computation of immediate
dominators. It uses the Dataflow module.
8.4 Points-to Analysis
The module ptranal.ml contains two interprocedural points-to
analyses for CIL: Olf and Golf. Olf is the default.
(Switching from olf.ml to golf.ml requires a change in
Ptranal and a recompiling cilly.)
The analyses have the following characteristics:
-
Not based on C types (inferred pointer relationships are sound
despite most kinds of C casts)
- One level of subtyping
- One level of context sensitivity (Golf only)
- Monomorphic type structures
- Field insensitive (fields of structs are conflated)
- Demand-driven (points-to queries are solved on demand)
- Handle function pointers
The analysis itself is factored into two components: Ptranal,
which walks over the CIL file and generates constraints, and Olf
or Golf, which solve the constraints. The analysis is invoked
with the function Ptranal.analyze_file: Cil.file ->
unit. This function builds the points-to graph for the CIL file
and stores it internally. There is currently no facility for clearing
internal state, so Ptranal.analyze_file should only be called
once.
The constructed points-to graph supports several kinds of queries,
including alias queries (may two expressions be aliased?) and
points-to queries (to what set of locations may an expression point?).
The main interface with the alias analysis is as follows:
-
Ptranal.may_alias: Cil.exp -> Cil.exp -> bool. If
true, the two expressions may have the same value.
- Ptranal.resolve_lval: Cil.lval -> (Cil.varinfo
list). Returns the list of variables to which the given
left-hand value may point.
- Ptranal.resolve_exp: Cil.exp -> (Cil.varinfo list).
Returns the list of variables to which the given expression may
point.
- Ptranal.resolve_funptr: Cil.exp -> (Cil.fundec
list). Returns the list of functions to which the given
expression may point.
The precision of the analysis can be customized by changing the values
of several flags:
-
Ptranal.no_sub: bool ref.
If true, subtyping is disabled. Associated commandline option:
--ptr_unify.
- Ptranal.analyze_mono: bool ref.
(Golf only) If true, context sensitivity is disabled and the
analysis is effectively monomorphic. Commandline option:
--ptr_mono.
- Ptranal.smart_aliases: bool ref.
(Golf only) If true, “smart” disambiguation of aliases is
enabled. Otherwise, aliases are computed by intersecting points-to
sets. This is an experimental feature.
- Ptranal.model_strings: bool ref.
Make the alias analysis model string constants by treating them as
pointers to chars. Commandline option: --ptr_model_strings
- Ptranal.conservative_undefineds: bool ref.
Make the most pessimistic assumptions about globals if an undefined
function is present. Such a function can write to every global
variable. Commandline option: --ptr_conservative
In practice, the best precision/efficiency tradeoff is achieved by
setting Ptranal.no_sub to false, Ptranal.analyze_mono to
true, and Ptranal.smart_aliases to false. These are the
default values of the flags.
There are also a few flags that can be used to inspect or serialize
the results of the analysis.
-
Ptranal.debug_may_aliases.
Print the may-alias relationship of each pair of expressions in the
program. Commandline option: --ptr_may_aliases.
- Ptranal.print_constraints: bool ref.
If true, the analysis will print each constraint as it is
generated.
- Ptranal.print_types: bool ref.
If true, the analysis will print the inferred type of each
variable in the program.
If Ptranal.analyze_mono and Ptranal.no_sub are both
true, this output is sufficient to reconstruct the points-to
graph. One nice feature is that there is a pretty printer for
recursive types, so the print routine does not loop.
- Ptranal.compute_results: bool ref.
If true, the analysis will print out the points-to set of each
variable in the program. This will essentially serialize the
points-to graph.
8.5 StackGuard
The module heapify.ml contains a transformation similar to the one
described in “StackGuard: Automatic Adaptive Detection and Prevention of
Buffer-Overflow Attacks”, Proceedings of the 7th USENIX Security
Conference. In essence it modifies the program to maintain a separate
stack for return addresses. Even if a buffer overrun attack occurs the
actual correct return address will be taken from the special stack.
Although it does work, this CIL module is provided mainly as an example of
how to perform a simple source-to-source program analysis and
transformation. As an optimization only functions that contain a dangerous
local array make use of the special return address stack.
For a concrete example, you can see how cilly --dostackGuard
transforms the following dangerous code:
int dangerous() {
char array[10];
scanf("%s",array); // possible buffer overrun!
}
int main () {
return dangerous();
}
See the CIL output for this
code fragment
8.6 Heapify
The module heapify.ml also contains a transformation that moves all
dangerous local arrays to the heap. This also prevents a number of buffer
overruns.
For a concrete example, you can see how cilly --doheapify
transforms the following dangerous code:
int dangerous() {
char array[10];
scanf("%s",array); // possible buffer overrun!
}
int main () {
return dangerous();
}
See the CIL output for this
code fragment
8.7 One Return
The module oneret.ml contains a transformation the ensures that all
function bodies have at most one return statement. This simplifies a number
of analyses by providing a canonical exit-point.
For a concrete example, you can see how cilly --dooneRet
transforms the following code:
int foo (int predicate) {
if (predicate <= 0) {
return 1;
} else {
if (predicate > 5)
return 2;
return 3;
}
}
See the CIL output for this
code fragment
8.8 Partial Evaluation and Constant Folding
The partial.ml module provides a simple interprocedural partial
evaluation and constant folding data-flow analysis and transformation. This
transformation requires the --domakeCFG option.
For a concrete example, you can see how cilly --domakeCFG --dopartial
transforms the following code (note the eliminated if branch and the
partial optimization of foo):
int foo(int x, int y) {
int unknown;
if (unknown)
return y+2;
return x+3;
}
int main () {
int a,b,c;
a = foo(5,7) + foo(6,7);
b = 4;
c = b * b;
if (b > c)
return b-c;
else
return b+c;
}
See the CIL output for this
code fragment
8.9 Reaching Definitions
The reachingdefs.ml module uses the dataflow framework and CFG
information to calculate the definitions that reach each
statement. After computing the CFG (Section 8.1) and calling
computeRDs on a
function declaration, ReachingDef.stmtStartData will contain a
mapping from statement IDs to data about which definitions reach each
statement. In particular, it is a mapping from statement IDs to a
triple the first two members of which are used internally. The third
member is a mapping from variable IDs to Sets of integer options. If
the set contains Some(i), then the definition of that variable
with ID i reaches that statement. If the set contains None,
then there is a path to that statement on which there is no definition
of that variable. Also, if the variable ID is unmapped at a
statement, then no definition of that variable reaches that statement.
To summarize, reachingdefs.ml has the following interface:
-
computeRDs – Computes reaching definitions. Requires that
CFG information has already been computed for each statement.
- ReachingDef.stmtStartData – contains reaching
definition data after computeRDs is called.
- ReachingDef.defIdStmtHash – Contains a mapping
from definition IDs to the ID of the statement in which
the definition occurs.
- getRDs – Takes a statement ID and returns
reaching definition data for that statement.
- instrRDs – Takes a list of instructions and the
definitions that reach the first instruction, and for
each instruction calculates the definitions that reach
either into or out of that instruction.
- rdVisitorClass – A subclass of nopCilVisitor that
can be extended such that the current reaching definition
data is available when expressions are visited through
the get_cur_iosh method of the class.
8.10 Available Expressions
The availexps.ml module uses the dataflow framework and CFG
information to calculate something similar to a traditional available
expressions analysis. After computeAEs is called following a CFG
calculation (Section 8.1), AvailableExps.stmtStartData will
contain a mapping
from statement IDs to data about what expressions are available at
that statement. The data for each statement is a mapping for each
variable ID to the whole expression available at that point(in the
traditional sense) which the variable was last defined to be. So,
this differs from a traditional available expressions analysis in that
only whole expressions from a variable definition are considered rather
than all expressions.
The interface is as follows:
-
computeAEs – Computes available expressions. Requires
that CFG information has already been comptued for each statement.
- AvailableExps.stmtStartData – Contains available
expressions data for each statement after computeAEs has been
called.
- getAEs – Takes a statement ID and returns
available expression data for that statement.
- instrAEs – Takes a list of instructions and
the availalbe expressions at the first instruction, and
for each instruction calculates the expressions available
on entering or exiting each instruction.
- aeVisitorClass – A subclass of nopCilVisitor that
can be extended such that the current available expressions
data is available when expressions are visited through the
get_cur_eh method of the class.
8.11 Liveness Analysis
The liveness.ml module uses the dataflow framework and
CFG information to calculate which variables are live at
each program point. After computeLiveness is called
following a CFG calculation (Section 8.1), LiveFlow.stmtStartData will
contain a mapping for each statement ID to a set of varinfos
for varialbes live at that program point.
The interface is as follows:
-
computeLiveness – Computes live variables. Requires
that CFG information has already been computed for each statement.
- LiveFlow.stmtStartData – Contains live variable data
for each statement after computeLiveness has been called.
Also included in this module is a command line interface that
will cause liveness data to be printed to standard out for
a particular function or label.
-
–doliveness – Instructs cilly to comptue liveness
information and to print on standard out the variables live
at the points specified by –live_func and live_label.
If both are ommitted, then nothing is printed.
- –live_func – The name of the function whose
liveness data is of interest. If –live_label is ommitted,
then data for each statement is printed.
- –live_label – The name of the label at which
the liveness data will be printed.
8.12 Dead Code Elimination
The module deadcodeelim.ml uses the reaching definitions
analysis to eliminate assignment instructions whose results
are not used. The interface is as follows:
-
elim_dead_code – Performs dead code elimination
on a function. Requires that CFG information has already
been computed (Section 8.1).
- dce – Performs dead code elimination on an
entire file. Requires that CFG information has already
been computed.
8.13 Simple Memory Operations
The simplemem.ml module allows CIL lvalues that contain memory
accesses to be even futher simplified via the introduction of
well-typed temporaries. After this transformation all lvalues involve
at most one memory reference.
For a concrete example, you can see how cilly --dosimpleMem
transforms the following code:
int main () {
int ***three;
int **two;
***three = **two;
}
See the CIL output for this
code fragment
8.14 Simple Three-Address Code
The simplify.ml module further reduces the complexity of program
expressions and gives you a form of three-address code. After this
transformation all expressions will adhere to the following grammar:
basic::=
Const _
Addrof(Var v, NoOffset)
StartOf(Var v, NoOffset)
Lval(Var v, off), where v is a variable whose address is not taken
and off contains only "basic"
exp::=
basic
Lval(Mem basic, NoOffset)
BinOp(bop, basic, basic)
UnOp(uop, basic)
CastE(t, basic)
lval ::=
Mem basic, NoOffset
Var v, off, where v is a variable whose address is not taken and off
contains only "basic"
In addition, all sizeof and alignof forms are turned into
constants. Accesses to arrays and variables whose address is taken are
turned into "Mem" accesses. All field and index computations are turned
into address arithmetic.
For a concrete example, you can see how cilly --dosimplify
transforms the following code:
int main() {
struct mystruct {
int a;
int b;
} m;
int local;
int arr[3];
int *ptr;
ptr = &local;
m.a = local + sizeof(m) + arr[2];
return m.a;
}
See the CIL output for this
code fragment
8.15 Converting C to C++
The module canonicalize.ml performs several transformations to correct
differences between C and C++, so that the output is (hopefully) valid
C++ code. This may be incomplete — certain fixes which are necessary
for some programs are not yet implemented.
Using the --doCanonicalize option with CIL will perform the
following changes to your program:
-
Any variables that use C++ keywords as identifiers are renamed.
- C allows global variables to have multiple declarations and
multiple (equivalent) definitions. This transformation removes
all but one declaration and all but one definition.
- __inline is #defined to inline, and __restrict
is #defined to nothing.
- C allows function pointers with no specified arguments to be used on
any argument list. To make C++ accept this code, we insert a cast
from the function pointer to a type that matches the arguments. Of
course, this does nothing to guarantee that the pointer actually has
that type.
- Makes casts from int to enum types explicit. (CIL changes enum
constants to int constants, but doesn't use a cast.)