aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorDavid Thomas <m8pple@github.com>2017-01-23 10:56:54 +0000
committerDavid Thomas <m8pple@github.com>2017-01-26 10:57:25 +0000
commitb850a6b64115940a0e3df50b2aa923dd3e99b13e (patch)
tree2b394c18824ed89cc305d719d2c771f155615ba8
downloadCompiler-b850a6b64115940a0e3df50b2aa923dd3e99b13e.tar.gz
Compiler-b850a6b64115940a0e3df50b2aa923dd3e99b13e.zip
Initial transfer
-rw-r--r--1-lexer.md228
-rw-r--r--2-parser.md268
-rw-r--r--3-compiler.md45
-rw-r--r--Vagrantfile55
-rw-r--r--bin/.gitignore0
-rw-r--r--doc/.gitignore0
-rw-r--r--makefile13
-rw-r--r--readme.md50
-rw-r--r--src/.gitignore0
9 files changed, 659 insertions, 0 deletions
diff --git a/1-lexer.md b/1-lexer.md
new file mode 100644
index 0000000..9395704
--- /dev/null
+++ b/1-lexer.md
@@ -0,0 +1,228 @@
+Create a tokeniser for the C language.
+======================================
+
+Your program should accept C source code on
+`stdin` and write the tokens out to `stdout`
+in [JSON](http://www.json.org).
+
+Input format
+------------
+
+The input file will be pre-processed [ANSI C](https://en.wikipedia.org/wiki/ANSI_C),
+also called C90 or C89. It's what's generally thought of as "classic" or "normal" C,
+but not the _really_ old one without function prototypes (you may never have come
+across that). C90 is still often used in embedded systems, and pretty much the
+entire linux kernel is in C90.
+
+We've mainly taught you C++, but you're probably aware of C as a subset of C++ without
+classes, which is a good mental model. Your programs (lexer, parser and compiler) will
+never be given code that has different parsing or execution semantics under C and C++ (so for
+example I won't give you code that uses "class" as an identifier).
+
+The source code will not contain any compiler-specific or platform-specific
+extensions. If you pre-process a typical program (see later), you'll see many
+things such as `__attribute__` or `__declspec` coming from the system headers. You will
+not need to deal with any of these[^1].
+
+The test inputs will be a set of files of increasing complexity and
+variety. The initial files will contain a small number of basic tokens,
+then the lexicon will be increased, and finally Pre-Processor directives
+(see next section) will be included. All files will be [well-formed](https://en.wikipedia.org/wiki/Well-formedness).
+
+Output Format
+-------------
+
+You will be outputting the tokens as [JSON](http://www.json.org/),
+which is a lightweight way of representing structured data. A JSON
+value can be either:
+
+- An array of JSON values
+
+- A dictionary that maps names to JSON values
+
+In your output format we will use a dictionary to represent each
+token, and the top-level output will be an array of those dictionaries.
+
+Each dictionary should have the following properties:
+
+- "Text" : The original text of the token, not including any
+ surrounding white-space.
+
+- "Class" : Describes the class of token, with one of the following classes:
+
+ - "Keyword" : `while`, `if`, ...
+
+ - "Identifier" : `x`, `y`, `x6`, ...
+
+ - "Operator" : `(`, `[`, `+`
+
+ - "Constant" : `42`, `1.0`, ...
+
+ - "StringLiteral" : `"Is it too late now to say sorry?"`, `"Cause I'm missing more than just"`, ...
+
+ - "Invalid" : An unknown token type.
+
+ *Note*: This does not make any distinction between "operators" and
+ "punctuators" which you will see in some C grammars. We will ignore
+ the distinction and only have an operator class.
+
+- "StreamLine" : The line number the token appears on, in terms of the input
+ stream. The first line number is 1. This number increases regardless of any
+ remapping indicated by pre-processor directives.
+
+- "SourceFile" : The source file the token came from. See [Pre-Processor section](#Pre-Processor)
+ for details on file-names - if you do not support file-names, then do not
+ include this dictionary property.
+
+- "SourceLine" : The source line the token came from. See [comments on the pre-processor](#Pre-Processor).
+
+- "SourceCol" : The source column the token started at. If a token starts at the
+ first character on a line, it would have `"SourceCol":1`. Column offsets
+ should follow the [GNU error message guidelines](https://www.gnu.org/prep/standards/standards.html#Errors).
+
+Any entry of the token dictionary not supported by your lexer
+can be omitted, and the results for the rest will still be assessed.
+
+You may optionally include an empty final '{}' entry in the array
+of tokens, which makes it a little easier to print out.
+
+Program build
+-------------
+
+From your top-level directory, doing:
+
+ make bin/c_lexer
+
+should create a programe called... `bin/c_lexer`.
+How the makefile makes it is up to you.
+
+You are free to use C or C++ (or something else if it
+is installed in the target environment), and may make use
+of other tools such as flex as part of your
+build process. The target environment is the lab
+Ubuntu distribution (Ubuntu 16.04), or equivalently an
+Ubuntu 16.04 VM as configured in the attached Vagrantfile.
+
+If you particularly want to use a tool/language, then
+submit a pull request againt the Vagrantfile - as long
+as they are reasonable, they can be added in (though they
+will not then magically appear in the lab...)
+
+Example
+-------
+
+Assuming you are in a console in the current directory,
+this would be an example session:
+````
+$ make bin/c_lexer
+$ echo "int x;" > wibble.c
+$ cat wibble.c
+int x;
+$ cat wibble.c | bin/c_lexer
+[
+{"Text":"int","Class":"Keyword"}
+,
+{"Text":"x","Class":"Identifier"}
+,
+{"Text":";","Class":"Operator"}
+]
+````
+
+Another session for a different lexer could be:
+````
+$ make bin/c_lexer
+$ echo "float x=" > wibble.c
+$ echo " 4.5 ;" >> wibble.c
+$ cat wibble.c
+float x=
+ 4.5 ;
+$
+$ cat wibble.c | bin/c_lexer
+[ { "Class" : "Keyword", "Text": "float", "StreamLine" : 1, "SourceCol" : 1 },
+ { "Class" : "Identifier", "Text": "x", "StreamLine" : 1, "SourceCol" : 7 },
+ { "Class" : "Operator", "Text": "=", "StreamLine" : 1, "SourceCol" : 8 },
+ { "Class" : "Constant", "Text": "4.5", "StreamLine" : 2, "SourceCol" : 3 },
+ { "Class" : "Operator", "Text": ";", "StreamLine" : 2, "SourceCol" : 7 },
+ {}
+]
+````
+
+Note that both examples do not have the full complement of fields, and
+the first example does not use a final empty token, while the second does.
+
+The outputs also look different because we are using JSON as
+a container, so the exact spacing of things doesn't matter
+too much. Also within the dictionary (the parts within `{}`)
+the ordering of `key:value` pairs does not matter. However,
+the ordering of tokens within the toplevel array `[]` _does_
+matter.
+
+
+Pre-Processor
+-------------
+
+TLDR : if you don't want to support pre-processor information, then
+if you encounter a `#` character that is _not_ in a string literal,
+consume all characters to the end of the line.
+
+If needed, the input will already have been pre-processed, which means
+that all instances `#define` and `#include` will already have been processed. However,
+you may still see some input lines from the pre-processor. Create a simple
+file called `tmp.c`:
+
+ echo -e "#include <stdio.h>\nint main(){\n printf(\"Wibble\"); \n}\n" > tmp.c
+ less tmp.c
+
+We can now pre-process it with `cpp`, the [GNU C Pre-Processor](https://gcc.gnu.org/onlinedocs/cpp/):
+
+ cpp tmp.c
+
+You should should see a load of stuff from the standard includes, then
+your tiny program at the bottom. You can get a better look with `less`:
+
+ cpp tmp.c | less
+
+At the top of the file you'll see something like `# 1 tmp.c`, then
+references to various files like `/usr/include/stdio.h` and `/usr/include/sys/config.h`.
+If you scroll down, eventually you'll see the first non-empty line that
+_doesn't_ start with `#` - in my case it is `typedef signed char __int8_t;`,
+but it could be something else. This represents the first true token,
+and will be the first line that should cause a token to be output.
+
+The format of the `#` lines is explained in the [Pre-Processor Manual](https://gcc.gnu.org/onlinedocs/cpp/Preprocessor-Output.html),
+though you can get a good sense of what is going on by comparing
+the original headers and the output of `cpp`. `less` has a mode
+the displays line numbers (the `-N` option), so do:
+
+ cpp tmp.c | less -N
+
+in one terminal, and in another terminal do:
+
+ less -N /usr/include/stdio.h
+
+It gives quite a good insight into how the transformation
+happens, as you can match up the lines in the original
+versus the lines in the `#` output, and see where comments
+and `#define`s have been removed.
+
+Note: there is no requirement that the input has been pre-processed,
+so there may be no `#` lines at all. Use the default filename
+"<stdin>" as the SourceFile if there is no information provided
+by a pre-processor.
+
+Footnotes
+=========
+
+[^1] - Some thought indicates _why_ you won't see extensions. Things
+ like `__attribute__` are a side-channel between the implementer
+ of the C run-time and the compiler, or the OS interface and the
+ compiler. It is necessary that both sides agree on what those
+ extensions mean, so for example the writers of `libc` have to talk
+ to the writers of `gcc` about what they need and why. The C
+ standard recognises that this is necessary, and so explicitly reserves
+ identifiers beginning with two under-scores as "implementation defined",
+ where "implementation" means the compiler.
+
+ So in this instance _you_ are the implementation. Given you have
+ not defined any extensions, logically there are no possible extensions
+ that could occur.
diff --git a/2-parser.md b/2-parser.md
new file mode 100644
index 0000000..5010e2f
--- /dev/null
+++ b/2-parser.md
@@ -0,0 +1,268 @@
+Create a parser for the C language
+==================================
+
+Your program should accept C source code on
+`stdin` and write a heirarchical representation on `stdout`.
+
+Input Format
+------------
+
+The input format and constraints are the same as for the tokeniser, with
+the following restrictions:
+
+- the only variable/parameter data-type that will occur is `int` (so no arrays, structs, floats)
+
+- the only control-flow statements are `if`, `while`, `for`, and sequencing.
+
+- all functions are definitions (i.e. there are no declarations)
+
+These restrictions do _not_ apply to the eventual compiler, they are
+simply to make this stage more manageable as a milestone. You are
+encouraged to support other things as well, but the intent here is
+to get a core set of functionality fully working.
+
+The input format is still C90, and it is worth noting
+a difference between the C90 ("classic) and C99 ("modern")
+languages, which is that declarations can only appear
+at the top of a scope. So all variables declarations will
+appear before any statements. See the later section for more discussion.
+
+Output Format
+-------------
+
+The output format will be a heirarchical format written in
+[XML](https://en.wikipedia.org/wiki/XML), which identifies the functions, scopes, and variables.
+
+Your output should consist of the following types of XML element:
+
+- `<Program>` : The top-level element, containing all others.
+
+- `<Function id="id">` : This appears within a Program element, and
+ indicates the existence of a function called id. The ordering of
+ Function elements relative to other globals should
+ follow the original source.
+
+- `<Parameter id="id">` : This appears as a child of a Function element,
+ and states that the enclosing function has a parameter with this name.
+ The relative ordering of Parameter elements is important, and should follow the
+ order given in the function signature.
+
+- `<Scope>` : Appears within either a Function or another Scope, and
+ identifies a scope within which variables have a lifetime. Empty
+ scopes are not considered meaningful and may be omitted or inserted
+ as convenient. Extra levels of scope nesting are also allowed, as long
+ as the lifetime of all variables is correct (equivalent to `((x))` == `(x)`).
+
+- `<Variable>` : Appears within a Program or a Scope, and indicates a
+ variable that is being introduced.
+
+The output should not contain any other elements except for these
+five. Each element can contain arbitrary amounts of white-space.
+
+Examples
+--------
+
+As a simple example, the following function:
+
+ int f( int x)
+ {
+ int y=x;
+ return y;
+ }
+
+should result in:
+
+
+ <?xml version="1.0"?>
+ <Program>
+ <Function id="f">
+ <Parameter id="x" />
+ <Scope>
+ <Variable id="y" />
+ </Scope>
+ </Function>
+ </Program>
+
+or also correct would be:
+
+ <?xml version="1.0"?>
+ <Program> <Function id="f">
+ <Parameter id="x" />
+ <Scope />
+ <Scope><Scope /></Scope>
+ <Scope>
+ <Scope>
+ <Variable id="y" />
+ </Scope></Scope>
+ </Function> </Program>
+
+and infinite variations thereof.
+
+This code:
+
+ int x()
+ {}
+
+ int g;
+
+ int zz(int a, int b, int c)
+ {
+ if(a==b)
+ int a;
+ return a;
+ else{
+ int fsdfsdfs;
+ return c;
+ }
+ }
+
+would result in:
+
+
+ <?xml version="1.0"?>
+ <Program>
+ <Function id="x" />
+ <Variable id="g" />
+ <Function id="zz">
+ <Parameter id="a" /><Parameter id="b" />
+ <Parameter id="c" />
+ <Scope>
+ <Scope>
+ <Variable id="a" />
+ </Scope>
+ <Scope>
+ <Variable id="fsdfsdfs" />
+ </Scope>
+ </Scope>
+ </Function>
+ </Program>
+
+
+Given that additional scope nesting levels and extra empty scopes are
+allowed, you may wonder how it is possible to tell whether the
+scoping is correct. This comes down to the lifetime of variables
+within the original C program. For example, given this:
+
+ int x;
+ if(...){
+ int y;
+ }
+
+It would be possible to say:
+
+ <Scope>
+ <Variable id="x" />
+ <Scope>
+ <Variable id="y" />
+ </Scope>
+ </Scope>
+
+as this indicates that:
+
+- a variable x is defined.
+
+- a variable y is defined, and x is still available within the same scope
+
+- the variable y goes out of scope before x.
+
+It would not be possible to say:
+
+ <Scope>
+ <Variable id="x" />
+ <Variable id="y" />
+ </Scope>
+
+as while it matches the first two properties above, it indicates that
+x and y go out of scope at the same time, rather than y leaving scope
+before x does.
+
+Program build
+-------------
+
+From your top-level directory, doing:
+
+ make bin/c_parser
+
+should create a programme called... `bin/c_parser`.
+
+
+
+C90 Declarations
+----------------
+
+In C99 (like C++) you are allowed to interleave declarations and statements:
+
+ int f(int i)
+ {
+ int x=i+5;
+ x=x*x;
+ int y=x+4;
+ return y;
+ }
+
+In C90, you must have all declarations before any statements,
+so that would need to be re-written as:
+
+ int f(int i)
+ {
+ int x=i+5;
+ int y;
+ x=x*x;
+ y=x+4;
+ return y;
+ }
+
+Note that declarations can happen at the start of any scope, so things like:
+
+ int f(int i)
+ {
+ int x=i+5;
+ x=x*x;
+ if(1){
+ int y=x+4;
+ return y;
+ }
+ }
+
+Thinking about it with your knowledge of parsers and compilers,
+you might be able to guess why the older language is more restricted,
+while the newer one relaxes it. You also might guess why I
+deliberately chose the older syntax for you to implement.
+
+It's worth noting that beyond compiler implementation, there
+are many arguments for and against both styles. An argument
+for requiring C90 style declarations is that you have
+immediate visibility of the total number of variables in a scope.
+This gives you a sense of the complexity of a function,
+as you can see how many moving parts there are. For example,
+Linus Torvalds [likes that aspect](https://lkml.org/lkml/2012/4/12/18),
+as the kernel is an extremely large and complex code-base.
+There are also rules in the [MISRA-C](https://en.wikipedia.org/wiki/MISRA_C)
+guidelines for automotive and embedded software that state
+that variables should appear at function scope.
+
+One argument for the newer style is convenience, but also
+the idea that that it is better to declare a variable at
+the point that it is initialised. A common error in C is
+to have a variable which is not initialised on all code
+paths:
+
+ {
+ int i;
+
+ // Much code
+
+ if(getc()){
+ i=1;
+ }
+
+ // So codey
+
+ putc(i+1);
+ }
+
+Modern compilers will put out a warning about `i` potentially
+being used before being initialised, but people may just
+ignore that. The move towards more functional and declarative
+styles has also encouraged this process - for example, the
+`auto` keyword in C++11.
diff --git a/3-compiler.md b/3-compiler.md
new file mode 100644
index 0000000..7ac2fe7
--- /dev/null
+++ b/3-compiler.md
@@ -0,0 +1,45 @@
+Create a compiler for the C language
+====================================
+
+Your program should accept C source code on
+`stdin` and write a heirarchical representation on `stdout`.
+
+Input Format
+------------
+
+The input format is C90.
+
+Output Format
+-------------
+
+The output format should be MIPS1 assembly code.
+
+It should be possible to assemble and link this code
+against a C run-time, and have it execute correctly
+on a MIPS processor.
+
+Compilation
+-----------
+
+Your compiler should be built using:
+````
+make bin/c_compiler
+````
+and the resulting program should be called `bin/c_compiler`.
+
+The target environment remains Ubuntu 16.04.
+
+Deliverables
+------------
+
+There are actually three deliverables here:
+
+1 - The compiler itself
+
+2 - A test framework
+
+3 - Documentation
+
+There are certain requirements on test and a format for the
+documentation, but we'll elaborate on those when they are
+encountered in the course.
diff --git a/Vagrantfile b/Vagrantfile
new file mode 100644
index 0000000..1f53d5d
--- /dev/null
+++ b/Vagrantfile
@@ -0,0 +1,55 @@
+# -*- mode: ruby -*-
+# vi: set ft=ruby :
+
+# All Vagrant configuration is done below. The "2" in Vagrant.configure
+# configures the configuration version (we support older styles for
+# backwards compatibility). Please don't change it unless you know what
+# you're doing.
+Vagrant.configure(2) do |config|
+ # The most common configuration options are documented and commented below.
+ # For a complete reference, please see the online documentation at
+ # https://docs.vagrantup.com.
+
+ # Every Vagrant development environment requires a box. You can search for
+ # boxes at https://atlas.hashicorp.com/search.
+ config.vm.box = "ubuntu/xenial64"
+
+ # Share an additional folder to the guest VM. The first argument is
+ # the path on the host to the actual folder. The second argument is
+ # the path on the guest to mount the folder. And the optional third
+ # argument is a set of non-required options.
+ # config.vm.synced_folder "../data", "/vagrant_data"
+
+ # Provider-specific configuration so you can fine-tune various
+ # backing providers for Vagrant. These expose provider-specific options.
+ # Example for VirtualBox:
+ #
+ # config.vm.provider "virtualbox" do |vb|
+ # # Display the VirtualBox GUI when booting the machine
+ # vb.gui = true
+ #
+ # # Customize the amount of memory on the VM:
+ # vb.memory = "1024"
+ # end
+
+
+ # Enable provisioning with a shell script. Additional provisioners such as
+ # Puppet, Chef, Ansible, Salt, and Docker are also available. Please see the
+ # documentation for more information about their specific syntax and use.
+ config.vm.provision "shell", inline: <<-SHELL
+ sudo apt-get update
+
+ # Standard build tools
+ sudo apt-get -y install g++ gdb make dos2unix git
+
+ # Compiler build tools
+ sudo apt-get -y install bison flex
+
+ # MIPS cross-compiler stuff
+ sudo apt-get -y install g++-mips-linux-gnu gdb-multiarch
+
+ # QEMU run-time emulator
+ sudo apt-get -y install qemu
+
+ SHELL
+end
diff --git a/bin/.gitignore b/bin/.gitignore
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/bin/.gitignore
diff --git a/doc/.gitignore b/doc/.gitignore
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/doc/.gitignore
diff --git a/makefile b/makefile
new file mode 100644
index 0000000..5f938f3
--- /dev/null
+++ b/makefile
@@ -0,0 +1,13 @@
+
+
+bin/c_lexer :
+ echo "No current build commands for C lexer."
+ exit 1
+
+bin/c_parser :
+ echo "No current build commands for C parser."
+ exit 1
+
+bin/c_compiler :
+ echo "No current build commands for C compile."
+ exit 1
diff --git a/readme.md b/readme.md
new file mode 100644
index 0000000..f1ed994
--- /dev/null
+++ b/readme.md
@@ -0,0 +1,50 @@
+The module code deliverables come in three parts:
+
+ Deliverable | Due | Weight | Notes
+-------------------------------|-------------|--------|------------------------------------
+[1-lexer.md](1-lexer.md) | Tue 7th Feb | 10% | A C lexer that produces JSON output
+[2-parser.md](2-lexer.md) | Tue 7th Mar | 10% | A (restricted) C parser that produces XML output
+[3-compiler.md](3-compiler.md) | Tue 7th Feb | 50% | The final C compiler
+
+You will all get a bare private repository. It is up to you
+if you want to clone the master specification, or to start from
+scratch. Both approaches just require the git operations already used
+in lab 1, and possibly looking at the documentation for the commands
+you have already used. Remember that you can always do `git ${CMD} --help`
+for any CMD such as `pull`, `push`, `commit`, `remote`.
+
+Starting from scratch
+---------------------
+
+1 - Create a local git repository using `git init` in a directory of your choice.
+
+2 - Commit files to your local repository (as you would normally commit)
+
+3 - Add the URL of your private repository as the remote `origin` (similar
+ to manually adding the `spec` remote in the lab).
+
+4 - You should now be able to push and pull commits to origin, which is
+ your private repo.
+
+Cloning the spec
+----------------
+
+1 - Clone the master specification from the public URL. This will produce
+ a directory called `langproc-2016-cw` by default.
+
+2 - Rename it to something more meaningful, e.g. `langproc-2016-cw-${LOGIN}`.
+
+3 - Use `git remote` to rename `origin` (which points at the public spec) to
+ `spec`.
+
+4 - Use `git remote` to add a new remote called `origin` pointing at
+ your private repository.
+
+5 - You should then be able to push and pull to `origin`, and pull from `spec`,
+ as in the lab.
+
+Submissions
+-----------
+
+Submission will be via github (code) + blackboard (commit hash),
+as in the lab.
diff --git a/src/.gitignore b/src/.gitignore
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/src/.gitignore