diff options
author | David Thomas <m8pple@github.com> | 2017-01-23 10:56:54 +0000 |
---|---|---|
committer | David Thomas <m8pple@github.com> | 2017-01-26 10:57:25 +0000 |
commit | b850a6b64115940a0e3df50b2aa923dd3e99b13e (patch) | |
tree | 2b394c18824ed89cc305d719d2c771f155615ba8 /2-parser.md | |
download | Compiler-b850a6b64115940a0e3df50b2aa923dd3e99b13e.tar.gz Compiler-b850a6b64115940a0e3df50b2aa923dd3e99b13e.zip |
Initial transfer
Diffstat (limited to '2-parser.md')
-rw-r--r-- | 2-parser.md | 268 |
1 files changed, 268 insertions, 0 deletions
diff --git a/2-parser.md b/2-parser.md new file mode 100644 index 0000000..5010e2f --- /dev/null +++ b/2-parser.md @@ -0,0 +1,268 @@ +Create a parser for the C language +================================== + +Your program should accept C source code on +`stdin` and write a heirarchical representation on `stdout`. + +Input Format +------------ + +The input format and constraints are the same as for the tokeniser, with +the following restrictions: + +- the only variable/parameter data-type that will occur is `int` (so no arrays, structs, floats) + +- the only control-flow statements are `if`, `while`, `for`, and sequencing. + +- all functions are definitions (i.e. there are no declarations) + +These restrictions do _not_ apply to the eventual compiler, they are +simply to make this stage more manageable as a milestone. You are +encouraged to support other things as well, but the intent here is +to get a core set of functionality fully working. + +The input format is still C90, and it is worth noting +a difference between the C90 ("classic) and C99 ("modern") +languages, which is that declarations can only appear +at the top of a scope. So all variables declarations will +appear before any statements. See the later section for more discussion. + +Output Format +------------- + +The output format will be a heirarchical format written in +[XML](https://en.wikipedia.org/wiki/XML), which identifies the functions, scopes, and variables. + +Your output should consist of the following types of XML element: + +- `<Program>` : The top-level element, containing all others. + +- `<Function id="id">` : This appears within a Program element, and + indicates the existence of a function called id. The ordering of + Function elements relative to other globals should + follow the original source. + +- `<Parameter id="id">` : This appears as a child of a Function element, + and states that the enclosing function has a parameter with this name. + The relative ordering of Parameter elements is important, and should follow the + order given in the function signature. + +- `<Scope>` : Appears within either a Function or another Scope, and + identifies a scope within which variables have a lifetime. Empty + scopes are not considered meaningful and may be omitted or inserted + as convenient. Extra levels of scope nesting are also allowed, as long + as the lifetime of all variables is correct (equivalent to `((x))` == `(x)`). + +- `<Variable>` : Appears within a Program or a Scope, and indicates a + variable that is being introduced. + +The output should not contain any other elements except for these +five. Each element can contain arbitrary amounts of white-space. + +Examples +-------- + +As a simple example, the following function: + + int f( int x) + { + int y=x; + return y; + } + +should result in: + + + <?xml version="1.0"?> + <Program> + <Function id="f"> + <Parameter id="x" /> + <Scope> + <Variable id="y" /> + </Scope> + </Function> + </Program> + +or also correct would be: + + <?xml version="1.0"?> + <Program> <Function id="f"> + <Parameter id="x" /> + <Scope /> + <Scope><Scope /></Scope> + <Scope> + <Scope> + <Variable id="y" /> + </Scope></Scope> + </Function> </Program> + +and infinite variations thereof. + +This code: + + int x() + {} + + int g; + + int zz(int a, int b, int c) + { + if(a==b) + int a; + return a; + else{ + int fsdfsdfs; + return c; + } + } + +would result in: + + + <?xml version="1.0"?> + <Program> + <Function id="x" /> + <Variable id="g" /> + <Function id="zz"> + <Parameter id="a" /><Parameter id="b" /> + <Parameter id="c" /> + <Scope> + <Scope> + <Variable id="a" /> + </Scope> + <Scope> + <Variable id="fsdfsdfs" /> + </Scope> + </Scope> + </Function> + </Program> + + +Given that additional scope nesting levels and extra empty scopes are +allowed, you may wonder how it is possible to tell whether the +scoping is correct. This comes down to the lifetime of variables +within the original C program. For example, given this: + + int x; + if(...){ + int y; + } + +It would be possible to say: + + <Scope> + <Variable id="x" /> + <Scope> + <Variable id="y" /> + </Scope> + </Scope> + +as this indicates that: + +- a variable x is defined. + +- a variable y is defined, and x is still available within the same scope + +- the variable y goes out of scope before x. + +It would not be possible to say: + + <Scope> + <Variable id="x" /> + <Variable id="y" /> + </Scope> + +as while it matches the first two properties above, it indicates that +x and y go out of scope at the same time, rather than y leaving scope +before x does. + +Program build +------------- + +From your top-level directory, doing: + + make bin/c_parser + +should create a programme called... `bin/c_parser`. + + + +C90 Declarations +---------------- + +In C99 (like C++) you are allowed to interleave declarations and statements: + + int f(int i) + { + int x=i+5; + x=x*x; + int y=x+4; + return y; + } + +In C90, you must have all declarations before any statements, +so that would need to be re-written as: + + int f(int i) + { + int x=i+5; + int y; + x=x*x; + y=x+4; + return y; + } + +Note that declarations can happen at the start of any scope, so things like: + + int f(int i) + { + int x=i+5; + x=x*x; + if(1){ + int y=x+4; + return y; + } + } + +Thinking about it with your knowledge of parsers and compilers, +you might be able to guess why the older language is more restricted, +while the newer one relaxes it. You also might guess why I +deliberately chose the older syntax for you to implement. + +It's worth noting that beyond compiler implementation, there +are many arguments for and against both styles. An argument +for requiring C90 style declarations is that you have +immediate visibility of the total number of variables in a scope. +This gives you a sense of the complexity of a function, +as you can see how many moving parts there are. For example, +Linus Torvalds [likes that aspect](https://lkml.org/lkml/2012/4/12/18), +as the kernel is an extremely large and complex code-base. +There are also rules in the [MISRA-C](https://en.wikipedia.org/wiki/MISRA_C) +guidelines for automotive and embedded software that state +that variables should appear at function scope. + +One argument for the newer style is convenience, but also +the idea that that it is better to declare a variable at +the point that it is initialised. A common error in C is +to have a variable which is not initialised on all code +paths: + + { + int i; + + // Much code + + if(getc()){ + i=1; + } + + // So codey + + putc(i+1); + } + +Modern compilers will put out a warning about `i` potentially +being used before being initialised, but people may just +ignore that. The move towards more functional and declarative +styles has also encouraged this process - for example, the +`auto` keyword in C++11. |