aboutsummaryrefslogtreecommitdiffstats
path: root/2-parser.md
diff options
context:
space:
mode:
authorDavid Thomas <m8pple@github.com>2017-01-23 10:56:54 +0000
committerDavid Thomas <m8pple@github.com>2017-01-26 10:57:25 +0000
commitb850a6b64115940a0e3df50b2aa923dd3e99b13e (patch)
tree2b394c18824ed89cc305d719d2c771f155615ba8 /2-parser.md
downloadCompiler-b850a6b64115940a0e3df50b2aa923dd3e99b13e.tar.gz
Compiler-b850a6b64115940a0e3df50b2aa923dd3e99b13e.zip
Initial transfer
Diffstat (limited to '2-parser.md')
-rw-r--r--2-parser.md268
1 files changed, 268 insertions, 0 deletions
diff --git a/2-parser.md b/2-parser.md
new file mode 100644
index 0000000..5010e2f
--- /dev/null
+++ b/2-parser.md
@@ -0,0 +1,268 @@
+Create a parser for the C language
+==================================
+
+Your program should accept C source code on
+`stdin` and write a heirarchical representation on `stdout`.
+
+Input Format
+------------
+
+The input format and constraints are the same as for the tokeniser, with
+the following restrictions:
+
+- the only variable/parameter data-type that will occur is `int` (so no arrays, structs, floats)
+
+- the only control-flow statements are `if`, `while`, `for`, and sequencing.
+
+- all functions are definitions (i.e. there are no declarations)
+
+These restrictions do _not_ apply to the eventual compiler, they are
+simply to make this stage more manageable as a milestone. You are
+encouraged to support other things as well, but the intent here is
+to get a core set of functionality fully working.
+
+The input format is still C90, and it is worth noting
+a difference between the C90 ("classic) and C99 ("modern")
+languages, which is that declarations can only appear
+at the top of a scope. So all variables declarations will
+appear before any statements. See the later section for more discussion.
+
+Output Format
+-------------
+
+The output format will be a heirarchical format written in
+[XML](https://en.wikipedia.org/wiki/XML), which identifies the functions, scopes, and variables.
+
+Your output should consist of the following types of XML element:
+
+- `<Program>` : The top-level element, containing all others.
+
+- `<Function id="id">` : This appears within a Program element, and
+ indicates the existence of a function called id. The ordering of
+ Function elements relative to other globals should
+ follow the original source.
+
+- `<Parameter id="id">` : This appears as a child of a Function element,
+ and states that the enclosing function has a parameter with this name.
+ The relative ordering of Parameter elements is important, and should follow the
+ order given in the function signature.
+
+- `<Scope>` : Appears within either a Function or another Scope, and
+ identifies a scope within which variables have a lifetime. Empty
+ scopes are not considered meaningful and may be omitted or inserted
+ as convenient. Extra levels of scope nesting are also allowed, as long
+ as the lifetime of all variables is correct (equivalent to `((x))` == `(x)`).
+
+- `<Variable>` : Appears within a Program or a Scope, and indicates a
+ variable that is being introduced.
+
+The output should not contain any other elements except for these
+five. Each element can contain arbitrary amounts of white-space.
+
+Examples
+--------
+
+As a simple example, the following function:
+
+ int f( int x)
+ {
+ int y=x;
+ return y;
+ }
+
+should result in:
+
+
+ <?xml version="1.0"?>
+ <Program>
+ <Function id="f">
+ <Parameter id="x" />
+ <Scope>
+ <Variable id="y" />
+ </Scope>
+ </Function>
+ </Program>
+
+or also correct would be:
+
+ <?xml version="1.0"?>
+ <Program> <Function id="f">
+ <Parameter id="x" />
+ <Scope />
+ <Scope><Scope /></Scope>
+ <Scope>
+ <Scope>
+ <Variable id="y" />
+ </Scope></Scope>
+ </Function> </Program>
+
+and infinite variations thereof.
+
+This code:
+
+ int x()
+ {}
+
+ int g;
+
+ int zz(int a, int b, int c)
+ {
+ if(a==b)
+ int a;
+ return a;
+ else{
+ int fsdfsdfs;
+ return c;
+ }
+ }
+
+would result in:
+
+
+ <?xml version="1.0"?>
+ <Program>
+ <Function id="x" />
+ <Variable id="g" />
+ <Function id="zz">
+ <Parameter id="a" /><Parameter id="b" />
+ <Parameter id="c" />
+ <Scope>
+ <Scope>
+ <Variable id="a" />
+ </Scope>
+ <Scope>
+ <Variable id="fsdfsdfs" />
+ </Scope>
+ </Scope>
+ </Function>
+ </Program>
+
+
+Given that additional scope nesting levels and extra empty scopes are
+allowed, you may wonder how it is possible to tell whether the
+scoping is correct. This comes down to the lifetime of variables
+within the original C program. For example, given this:
+
+ int x;
+ if(...){
+ int y;
+ }
+
+It would be possible to say:
+
+ <Scope>
+ <Variable id="x" />
+ <Scope>
+ <Variable id="y" />
+ </Scope>
+ </Scope>
+
+as this indicates that:
+
+- a variable x is defined.
+
+- a variable y is defined, and x is still available within the same scope
+
+- the variable y goes out of scope before x.
+
+It would not be possible to say:
+
+ <Scope>
+ <Variable id="x" />
+ <Variable id="y" />
+ </Scope>
+
+as while it matches the first two properties above, it indicates that
+x and y go out of scope at the same time, rather than y leaving scope
+before x does.
+
+Program build
+-------------
+
+From your top-level directory, doing:
+
+ make bin/c_parser
+
+should create a programme called... `bin/c_parser`.
+
+
+
+C90 Declarations
+----------------
+
+In C99 (like C++) you are allowed to interleave declarations and statements:
+
+ int f(int i)
+ {
+ int x=i+5;
+ x=x*x;
+ int y=x+4;
+ return y;
+ }
+
+In C90, you must have all declarations before any statements,
+so that would need to be re-written as:
+
+ int f(int i)
+ {
+ int x=i+5;
+ int y;
+ x=x*x;
+ y=x+4;
+ return y;
+ }
+
+Note that declarations can happen at the start of any scope, so things like:
+
+ int f(int i)
+ {
+ int x=i+5;
+ x=x*x;
+ if(1){
+ int y=x+4;
+ return y;
+ }
+ }
+
+Thinking about it with your knowledge of parsers and compilers,
+you might be able to guess why the older language is more restricted,
+while the newer one relaxes it. You also might guess why I
+deliberately chose the older syntax for you to implement.
+
+It's worth noting that beyond compiler implementation, there
+are many arguments for and against both styles. An argument
+for requiring C90 style declarations is that you have
+immediate visibility of the total number of variables in a scope.
+This gives you a sense of the complexity of a function,
+as you can see how many moving parts there are. For example,
+Linus Torvalds [likes that aspect](https://lkml.org/lkml/2012/4/12/18),
+as the kernel is an extremely large and complex code-base.
+There are also rules in the [MISRA-C](https://en.wikipedia.org/wiki/MISRA_C)
+guidelines for automotive and embedded software that state
+that variables should appear at function scope.
+
+One argument for the newer style is convenience, but also
+the idea that that it is better to declare a variable at
+the point that it is initialised. A common error in C is
+to have a variable which is not initialised on all code
+paths:
+
+ {
+ int i;
+
+ // Much code
+
+ if(getc()){
+ i=1;
+ }
+
+ // So codey
+
+ putc(i+1);
+ }
+
+Modern compilers will put out a warning about `i` potentially
+being used before being initialised, but people may just
+ignore that. The move towards more functional and declarative
+styles has also encouraged this process - for example, the
+`auto` keyword in C++11.