diff options
author | David Thomas <m8pple@github.com> | 2017-01-23 10:56:54 +0000 |
---|---|---|
committer | David Thomas <m8pple@github.com> | 2017-01-26 10:57:25 +0000 |
commit | b850a6b64115940a0e3df50b2aa923dd3e99b13e (patch) | |
tree | 2b394c18824ed89cc305d719d2c771f155615ba8 | |
download | Compiler-b850a6b64115940a0e3df50b2aa923dd3e99b13e.tar.gz Compiler-b850a6b64115940a0e3df50b2aa923dd3e99b13e.zip |
Initial transfer
-rw-r--r-- | 1-lexer.md | 228 | ||||
-rw-r--r-- | 2-parser.md | 268 | ||||
-rw-r--r-- | 3-compiler.md | 45 | ||||
-rw-r--r-- | Vagrantfile | 55 | ||||
-rw-r--r-- | bin/.gitignore | 0 | ||||
-rw-r--r-- | doc/.gitignore | 0 | ||||
-rw-r--r-- | makefile | 13 | ||||
-rw-r--r-- | readme.md | 50 | ||||
-rw-r--r-- | src/.gitignore | 0 |
9 files changed, 659 insertions, 0 deletions
diff --git a/1-lexer.md b/1-lexer.md new file mode 100644 index 0000000..9395704 --- /dev/null +++ b/1-lexer.md @@ -0,0 +1,228 @@ +Create a tokeniser for the C language. +====================================== + +Your program should accept C source code on +`stdin` and write the tokens out to `stdout` +in [JSON](http://www.json.org). + +Input format +------------ + +The input file will be pre-processed [ANSI C](https://en.wikipedia.org/wiki/ANSI_C), +also called C90 or C89. It's what's generally thought of as "classic" or "normal" C, +but not the _really_ old one without function prototypes (you may never have come +across that). C90 is still often used in embedded systems, and pretty much the +entire linux kernel is in C90. + +We've mainly taught you C++, but you're probably aware of C as a subset of C++ without +classes, which is a good mental model. Your programs (lexer, parser and compiler) will +never be given code that has different parsing or execution semantics under C and C++ (so for +example I won't give you code that uses "class" as an identifier). + +The source code will not contain any compiler-specific or platform-specific +extensions. If you pre-process a typical program (see later), you'll see many +things such as `__attribute__` or `__declspec` coming from the system headers. You will +not need to deal with any of these[^1]. + +The test inputs will be a set of files of increasing complexity and +variety. The initial files will contain a small number of basic tokens, +then the lexicon will be increased, and finally Pre-Processor directives +(see next section) will be included. All files will be [well-formed](https://en.wikipedia.org/wiki/Well-formedness). + +Output Format +------------- + +You will be outputting the tokens as [JSON](http://www.json.org/), +which is a lightweight way of representing structured data. A JSON +value can be either: + +- An array of JSON values + +- A dictionary that maps names to JSON values + +In your output format we will use a dictionary to represent each +token, and the top-level output will be an array of those dictionaries. + +Each dictionary should have the following properties: + +- "Text" : The original text of the token, not including any + surrounding white-space. + +- "Class" : Describes the class of token, with one of the following classes: + + - "Keyword" : `while`, `if`, ... + + - "Identifier" : `x`, `y`, `x6`, ... + + - "Operator" : `(`, `[`, `+` + + - "Constant" : `42`, `1.0`, ... + + - "StringLiteral" : `"Is it too late now to say sorry?"`, `"Cause I'm missing more than just"`, ... + + - "Invalid" : An unknown token type. + + *Note*: This does not make any distinction between "operators" and + "punctuators" which you will see in some C grammars. We will ignore + the distinction and only have an operator class. + +- "StreamLine" : The line number the token appears on, in terms of the input + stream. The first line number is 1. This number increases regardless of any + remapping indicated by pre-processor directives. + +- "SourceFile" : The source file the token came from. See [Pre-Processor section](#Pre-Processor) + for details on file-names - if you do not support file-names, then do not + include this dictionary property. + +- "SourceLine" : The source line the token came from. See [comments on the pre-processor](#Pre-Processor). + +- "SourceCol" : The source column the token started at. If a token starts at the + first character on a line, it would have `"SourceCol":1`. Column offsets + should follow the [GNU error message guidelines](https://www.gnu.org/prep/standards/standards.html#Errors). + +Any entry of the token dictionary not supported by your lexer +can be omitted, and the results for the rest will still be assessed. + +You may optionally include an empty final '{}' entry in the array +of tokens, which makes it a little easier to print out. + +Program build +------------- + +From your top-level directory, doing: + + make bin/c_lexer + +should create a programe called... `bin/c_lexer`. +How the makefile makes it is up to you. + +You are free to use C or C++ (or something else if it +is installed in the target environment), and may make use +of other tools such as flex as part of your +build process. The target environment is the lab +Ubuntu distribution (Ubuntu 16.04), or equivalently an +Ubuntu 16.04 VM as configured in the attached Vagrantfile. + +If you particularly want to use a tool/language, then +submit a pull request againt the Vagrantfile - as long +as they are reasonable, they can be added in (though they +will not then magically appear in the lab...) + +Example +------- + +Assuming you are in a console in the current directory, +this would be an example session: +```` +$ make bin/c_lexer +$ echo "int x;" > wibble.c +$ cat wibble.c +int x; +$ cat wibble.c | bin/c_lexer +[ +{"Text":"int","Class":"Keyword"} +, +{"Text":"x","Class":"Identifier"} +, +{"Text":";","Class":"Operator"} +] +```` + +Another session for a different lexer could be: +```` +$ make bin/c_lexer +$ echo "float x=" > wibble.c +$ echo " 4.5 ;" >> wibble.c +$ cat wibble.c +float x= + 4.5 ; +$ +$ cat wibble.c | bin/c_lexer +[ { "Class" : "Keyword", "Text": "float", "StreamLine" : 1, "SourceCol" : 1 }, + { "Class" : "Identifier", "Text": "x", "StreamLine" : 1, "SourceCol" : 7 }, + { "Class" : "Operator", "Text": "=", "StreamLine" : 1, "SourceCol" : 8 }, + { "Class" : "Constant", "Text": "4.5", "StreamLine" : 2, "SourceCol" : 3 }, + { "Class" : "Operator", "Text": ";", "StreamLine" : 2, "SourceCol" : 7 }, + {} +] +```` + +Note that both examples do not have the full complement of fields, and +the first example does not use a final empty token, while the second does. + +The outputs also look different because we are using JSON as +a container, so the exact spacing of things doesn't matter +too much. Also within the dictionary (the parts within `{}`) +the ordering of `key:value` pairs does not matter. However, +the ordering of tokens within the toplevel array `[]` _does_ +matter. + + +Pre-Processor +------------- + +TLDR : if you don't want to support pre-processor information, then +if you encounter a `#` character that is _not_ in a string literal, +consume all characters to the end of the line. + +If needed, the input will already have been pre-processed, which means +that all instances `#define` and `#include` will already have been processed. However, +you may still see some input lines from the pre-processor. Create a simple +file called `tmp.c`: + + echo -e "#include <stdio.h>\nint main(){\n printf(\"Wibble\"); \n}\n" > tmp.c + less tmp.c + +We can now pre-process it with `cpp`, the [GNU C Pre-Processor](https://gcc.gnu.org/onlinedocs/cpp/): + + cpp tmp.c + +You should should see a load of stuff from the standard includes, then +your tiny program at the bottom. You can get a better look with `less`: + + cpp tmp.c | less + +At the top of the file you'll see something like `# 1 tmp.c`, then +references to various files like `/usr/include/stdio.h` and `/usr/include/sys/config.h`. +If you scroll down, eventually you'll see the first non-empty line that +_doesn't_ start with `#` - in my case it is `typedef signed char __int8_t;`, +but it could be something else. This represents the first true token, +and will be the first line that should cause a token to be output. + +The format of the `#` lines is explained in the [Pre-Processor Manual](https://gcc.gnu.org/onlinedocs/cpp/Preprocessor-Output.html), +though you can get a good sense of what is going on by comparing +the original headers and the output of `cpp`. `less` has a mode +the displays line numbers (the `-N` option), so do: + + cpp tmp.c | less -N + +in one terminal, and in another terminal do: + + less -N /usr/include/stdio.h + +It gives quite a good insight into how the transformation +happens, as you can match up the lines in the original +versus the lines in the `#` output, and see where comments +and `#define`s have been removed. + +Note: there is no requirement that the input has been pre-processed, +so there may be no `#` lines at all. Use the default filename +"<stdin>" as the SourceFile if there is no information provided +by a pre-processor. + +Footnotes +========= + +[^1] - Some thought indicates _why_ you won't see extensions. Things + like `__attribute__` are a side-channel between the implementer + of the C run-time and the compiler, or the OS interface and the + compiler. It is necessary that both sides agree on what those + extensions mean, so for example the writers of `libc` have to talk + to the writers of `gcc` about what they need and why. The C + standard recognises that this is necessary, and so explicitly reserves + identifiers beginning with two under-scores as "implementation defined", + where "implementation" means the compiler. + + So in this instance _you_ are the implementation. Given you have + not defined any extensions, logically there are no possible extensions + that could occur. diff --git a/2-parser.md b/2-parser.md new file mode 100644 index 0000000..5010e2f --- /dev/null +++ b/2-parser.md @@ -0,0 +1,268 @@ +Create a parser for the C language +================================== + +Your program should accept C source code on +`stdin` and write a heirarchical representation on `stdout`. + +Input Format +------------ + +The input format and constraints are the same as for the tokeniser, with +the following restrictions: + +- the only variable/parameter data-type that will occur is `int` (so no arrays, structs, floats) + +- the only control-flow statements are `if`, `while`, `for`, and sequencing. + +- all functions are definitions (i.e. there are no declarations) + +These restrictions do _not_ apply to the eventual compiler, they are +simply to make this stage more manageable as a milestone. You are +encouraged to support other things as well, but the intent here is +to get a core set of functionality fully working. + +The input format is still C90, and it is worth noting +a difference between the C90 ("classic) and C99 ("modern") +languages, which is that declarations can only appear +at the top of a scope. So all variables declarations will +appear before any statements. See the later section for more discussion. + +Output Format +------------- + +The output format will be a heirarchical format written in +[XML](https://en.wikipedia.org/wiki/XML), which identifies the functions, scopes, and variables. + +Your output should consist of the following types of XML element: + +- `<Program>` : The top-level element, containing all others. + +- `<Function id="id">` : This appears within a Program element, and + indicates the existence of a function called id. The ordering of + Function elements relative to other globals should + follow the original source. + +- `<Parameter id="id">` : This appears as a child of a Function element, + and states that the enclosing function has a parameter with this name. + The relative ordering of Parameter elements is important, and should follow the + order given in the function signature. + +- `<Scope>` : Appears within either a Function or another Scope, and + identifies a scope within which variables have a lifetime. Empty + scopes are not considered meaningful and may be omitted or inserted + as convenient. Extra levels of scope nesting are also allowed, as long + as the lifetime of all variables is correct (equivalent to `((x))` == `(x)`). + +- `<Variable>` : Appears within a Program or a Scope, and indicates a + variable that is being introduced. + +The output should not contain any other elements except for these +five. Each element can contain arbitrary amounts of white-space. + +Examples +-------- + +As a simple example, the following function: + + int f( int x) + { + int y=x; + return y; + } + +should result in: + + + <?xml version="1.0"?> + <Program> + <Function id="f"> + <Parameter id="x" /> + <Scope> + <Variable id="y" /> + </Scope> + </Function> + </Program> + +or also correct would be: + + <?xml version="1.0"?> + <Program> <Function id="f"> + <Parameter id="x" /> + <Scope /> + <Scope><Scope /></Scope> + <Scope> + <Scope> + <Variable id="y" /> + </Scope></Scope> + </Function> </Program> + +and infinite variations thereof. + +This code: + + int x() + {} + + int g; + + int zz(int a, int b, int c) + { + if(a==b) + int a; + return a; + else{ + int fsdfsdfs; + return c; + } + } + +would result in: + + + <?xml version="1.0"?> + <Program> + <Function id="x" /> + <Variable id="g" /> + <Function id="zz"> + <Parameter id="a" /><Parameter id="b" /> + <Parameter id="c" /> + <Scope> + <Scope> + <Variable id="a" /> + </Scope> + <Scope> + <Variable id="fsdfsdfs" /> + </Scope> + </Scope> + </Function> + </Program> + + +Given that additional scope nesting levels and extra empty scopes are +allowed, you may wonder how it is possible to tell whether the +scoping is correct. This comes down to the lifetime of variables +within the original C program. For example, given this: + + int x; + if(...){ + int y; + } + +It would be possible to say: + + <Scope> + <Variable id="x" /> + <Scope> + <Variable id="y" /> + </Scope> + </Scope> + +as this indicates that: + +- a variable x is defined. + +- a variable y is defined, and x is still available within the same scope + +- the variable y goes out of scope before x. + +It would not be possible to say: + + <Scope> + <Variable id="x" /> + <Variable id="y" /> + </Scope> + +as while it matches the first two properties above, it indicates that +x and y go out of scope at the same time, rather than y leaving scope +before x does. + +Program build +------------- + +From your top-level directory, doing: + + make bin/c_parser + +should create a programme called... `bin/c_parser`. + + + +C90 Declarations +---------------- + +In C99 (like C++) you are allowed to interleave declarations and statements: + + int f(int i) + { + int x=i+5; + x=x*x; + int y=x+4; + return y; + } + +In C90, you must have all declarations before any statements, +so that would need to be re-written as: + + int f(int i) + { + int x=i+5; + int y; + x=x*x; + y=x+4; + return y; + } + +Note that declarations can happen at the start of any scope, so things like: + + int f(int i) + { + int x=i+5; + x=x*x; + if(1){ + int y=x+4; + return y; + } + } + +Thinking about it with your knowledge of parsers and compilers, +you might be able to guess why the older language is more restricted, +while the newer one relaxes it. You also might guess why I +deliberately chose the older syntax for you to implement. + +It's worth noting that beyond compiler implementation, there +are many arguments for and against both styles. An argument +for requiring C90 style declarations is that you have +immediate visibility of the total number of variables in a scope. +This gives you a sense of the complexity of a function, +as you can see how many moving parts there are. For example, +Linus Torvalds [likes that aspect](https://lkml.org/lkml/2012/4/12/18), +as the kernel is an extremely large and complex code-base. +There are also rules in the [MISRA-C](https://en.wikipedia.org/wiki/MISRA_C) +guidelines for automotive and embedded software that state +that variables should appear at function scope. + +One argument for the newer style is convenience, but also +the idea that that it is better to declare a variable at +the point that it is initialised. A common error in C is +to have a variable which is not initialised on all code +paths: + + { + int i; + + // Much code + + if(getc()){ + i=1; + } + + // So codey + + putc(i+1); + } + +Modern compilers will put out a warning about `i` potentially +being used before being initialised, but people may just +ignore that. The move towards more functional and declarative +styles has also encouraged this process - for example, the +`auto` keyword in C++11. diff --git a/3-compiler.md b/3-compiler.md new file mode 100644 index 0000000..7ac2fe7 --- /dev/null +++ b/3-compiler.md @@ -0,0 +1,45 @@ +Create a compiler for the C language +==================================== + +Your program should accept C source code on +`stdin` and write a heirarchical representation on `stdout`. + +Input Format +------------ + +The input format is C90. + +Output Format +------------- + +The output format should be MIPS1 assembly code. + +It should be possible to assemble and link this code +against a C run-time, and have it execute correctly +on a MIPS processor. + +Compilation +----------- + +Your compiler should be built using: +```` +make bin/c_compiler +```` +and the resulting program should be called `bin/c_compiler`. + +The target environment remains Ubuntu 16.04. + +Deliverables +------------ + +There are actually three deliverables here: + +1 - The compiler itself + +2 - A test framework + +3 - Documentation + +There are certain requirements on test and a format for the +documentation, but we'll elaborate on those when they are +encountered in the course. diff --git a/Vagrantfile b/Vagrantfile new file mode 100644 index 0000000..1f53d5d --- /dev/null +++ b/Vagrantfile @@ -0,0 +1,55 @@ +# -*- mode: ruby -*- +# vi: set ft=ruby : + +# All Vagrant configuration is done below. The "2" in Vagrant.configure +# configures the configuration version (we support older styles for +# backwards compatibility). Please don't change it unless you know what +# you're doing. +Vagrant.configure(2) do |config| + # The most common configuration options are documented and commented below. + # For a complete reference, please see the online documentation at + # https://docs.vagrantup.com. + + # Every Vagrant development environment requires a box. You can search for + # boxes at https://atlas.hashicorp.com/search. + config.vm.box = "ubuntu/xenial64" + + # Share an additional folder to the guest VM. The first argument is + # the path on the host to the actual folder. The second argument is + # the path on the guest to mount the folder. And the optional third + # argument is a set of non-required options. + # config.vm.synced_folder "../data", "/vagrant_data" + + # Provider-specific configuration so you can fine-tune various + # backing providers for Vagrant. These expose provider-specific options. + # Example for VirtualBox: + # + # config.vm.provider "virtualbox" do |vb| + # # Display the VirtualBox GUI when booting the machine + # vb.gui = true + # + # # Customize the amount of memory on the VM: + # vb.memory = "1024" + # end + + + # Enable provisioning with a shell script. Additional provisioners such as + # Puppet, Chef, Ansible, Salt, and Docker are also available. Please see the + # documentation for more information about their specific syntax and use. + config.vm.provision "shell", inline: <<-SHELL + sudo apt-get update + + # Standard build tools + sudo apt-get -y install g++ gdb make dos2unix git + + # Compiler build tools + sudo apt-get -y install bison flex + + # MIPS cross-compiler stuff + sudo apt-get -y install g++-mips-linux-gnu gdb-multiarch + + # QEMU run-time emulator + sudo apt-get -y install qemu + + SHELL +end diff --git a/bin/.gitignore b/bin/.gitignore new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/bin/.gitignore diff --git a/doc/.gitignore b/doc/.gitignore new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/doc/.gitignore diff --git a/makefile b/makefile new file mode 100644 index 0000000..5f938f3 --- /dev/null +++ b/makefile @@ -0,0 +1,13 @@ +
+
+bin/c_lexer :
+ echo "No current build commands for C lexer."
+ exit 1
+
+bin/c_parser :
+ echo "No current build commands for C parser."
+ exit 1
+
+bin/c_compiler :
+ echo "No current build commands for C compile."
+ exit 1
diff --git a/readme.md b/readme.md new file mode 100644 index 0000000..f1ed994 --- /dev/null +++ b/readme.md @@ -0,0 +1,50 @@ +The module code deliverables come in three parts:
+
+ Deliverable | Due | Weight | Notes
+-------------------------------|-------------|--------|------------------------------------
+[1-lexer.md](1-lexer.md) | Tue 7th Feb | 10% | A C lexer that produces JSON output
+[2-parser.md](2-lexer.md) | Tue 7th Mar | 10% | A (restricted) C parser that produces XML output
+[3-compiler.md](3-compiler.md) | Tue 7th Feb | 50% | The final C compiler
+
+You will all get a bare private repository. It is up to you
+if you want to clone the master specification, or to start from
+scratch. Both approaches just require the git operations already used
+in lab 1, and possibly looking at the documentation for the commands
+you have already used. Remember that you can always do `git ${CMD} --help`
+for any CMD such as `pull`, `push`, `commit`, `remote`.
+
+Starting from scratch
+---------------------
+
+1 - Create a local git repository using `git init` in a directory of your choice.
+
+2 - Commit files to your local repository (as you would normally commit)
+
+3 - Add the URL of your private repository as the remote `origin` (similar
+ to manually adding the `spec` remote in the lab).
+
+4 - You should now be able to push and pull commits to origin, which is
+ your private repo.
+
+Cloning the spec
+----------------
+
+1 - Clone the master specification from the public URL. This will produce
+ a directory called `langproc-2016-cw` by default.
+
+2 - Rename it to something more meaningful, e.g. `langproc-2016-cw-${LOGIN}`.
+
+3 - Use `git remote` to rename `origin` (which points at the public spec) to
+ `spec`.
+
+4 - Use `git remote` to add a new remote called `origin` pointing at
+ your private repository.
+
+5 - You should then be able to push and pull to `origin`, and pull from `spec`,
+ as in the lab.
+
+Submissions
+-----------
+
+Submission will be via github (code) + blackboard (commit hash),
+as in the lab.
diff --git a/src/.gitignore b/src/.gitignore new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/src/.gitignore |