Initial transfer

author: David Thomas <m8pple@github.com> 2017-01-23 10:56:54 +0000
committer: David Thomas <m8pple@github.com> 2017-01-26 10:57:25 +0000
commit: b850a6b64115940a0e3df50b2aa923dd3e99b13e (patch)
tree: 2b394c18824ed89cc305d719d2c771f155615ba8
download: Compiler-b850a6b64115940a0e3df50b2aa923dd3e99b13e.tar.gz
Compiler-b850a6b64115940a0e3df50b2aa923dd3e99b13e.zip
9 files changed, 659 insertions, 0 deletions
diff --git a/1-lexer.md b/1-lexer.md
new file mode 100644
index 0000000..9395704
--- /dev/null
+++ b/1-lexer.md
@@ -0,0 +1,228 @@
+Create a tokeniser for the C language.
+======================================
+
+Your program should accept C source code on
+`stdin` and write the tokens out to `stdout`
+in [JSON](http://www.json.org).
+
+Input format
+------------
+
+The input file will be pre-processed [ANSI C](https://en.wikipedia.org/wiki/ANSI_C),
+also called C90 or C89. It's what's generally thought of as "classic" or "normal" C,
+but not the _really_ old one without function prototypes (you may never have come
+across that). C90 is still often used in embedded systems, and pretty much the
+entire linux kernel is in C90.
+
+We've mainly taught you C++, but you're probably aware of C as a subset of C++ without
+classes, which is a good mental model. Your programs (lexer, parser and compiler) will
+never be given code that has different parsing or execution semantics under C and C++ (so for
+example I won't give you code that uses "class" as an identifier).
+
+The source code will not contain any compiler-specific or platform-specific
+extensions. If you pre-process a typical program (see later), you'll see many
+things such as `__attribute__` or `__declspec` coming from the system headers. You will
+not need to deal with any of these[^1].
+
+The test inputs will be a set of files of increasing complexity and
+variety. The initial files will contain a small number of basic tokens,
+then the lexicon will be increased, and finally Pre-Processor directives
+(see next section) will be included. All files will be [well-formed](https://en.wikipedia.org/wiki/Well-formedness).
+
+Output Format
+-------------
+
+You will be outputting the tokens as [JSON](http://www.json.org/),
+which is a lightweight way of representing structured data. A JSON
+value can be either:
+
+- An array of JSON values
+
+- A dictionary that maps names to JSON values
+
+In your output format we will use a dictionary to represent each
+token, and the top-level output will be an array of those dictionaries.
+
+Each dictionary should have the following properties:
+
+- "Text" : The original text of the token, not including any
+  surrounding white-space.
+
+- "Class" : Describes the class of token, with one of the following classes:
+
+  - "Keyword" : `while`, `if`, ...
+
+  - "Identifier" : `x`, `y`, `x6`, ...
+
+  - "Operator" : `(`, `[`, `+`
+
+  - "Constant" : `42`, `1.0`, ...
+
+  - "StringLiteral" : `"Is it too late now to say sorry?"`, `"Cause I'm missing more than just"`, ...
+
+  - "Invalid" : An unknown token type.
+
+  *Note*: This does not make any distinction between "operators" and
+  "punctuators" which you will see in some C grammars. We will ignore
+  the distinction and only have an operator class.
+
+- "StreamLine" : The line number the token appears on, in terms of the input
+  stream. The first line number is 1. This number increases regardless of any
+  remapping indicated by pre-processor directives.
+
+- "SourceFile" : The source file the token came from. See [Pre-Processor section](#Pre-Processor)
+  for details on file-names - if you do not support file-names, then do not
+  include this dictionary property.
+
+- "SourceLine" : The source line the token came from. See [comments on the pre-processor](#Pre-Processor).
+
+- "SourceCol" : The source column the token started at. If a token starts at the
+  first character on a line, it would have `"SourceCol":1`. Column offsets
+  should follow the [GNU error message guidelines](https://www.gnu.org/prep/standards/standards.html#Errors).
+
+Any entry of the token dictionary not supported by your lexer
+can be omitted, and the results for the rest will still be assessed.
+
+You may optionally include an empty final '{}' entry in the array
+of tokens, which makes it a little easier to print out.
+
+Program build
+-------------
+
+From your top-level directory, doing:
+
+    make bin/c_lexer
+
+should create a programe called... `bin/c_lexer`.
+How the makefile makes it is up to you.
+
+You are free to use C or C++ (or something else if it
+is installed in the target environment), and may make use
+of other tools such as flex as part of your
+build process. The target environment is the lab
+Ubuntu distribution (Ubuntu 16.04), or equivalently an
+Ubuntu 16.04 VM as configured in the attached Vagrantfile.
+
+If you particularly want to use a tool/language, then
+submit a pull request againt the Vagrantfile - as long
+as they are reasonable, they can be added in (though they
+will not then magically appear in the lab...)
+
+Example
+-------
+
+Assuming you are in a console in the current directory,
+this would be an example session:
+````
+$ make bin/c_lexer
+$ echo "int x;" > wibble.c
+$ cat wibble.c
+int x;
+$ cat wibble.c | bin/c_lexer
+[
+{"Text":"int","Class":"Keyword"}
+,
+{"Text":"x","Class":"Identifier"}
+,
+{"Text":";","Class":"Operator"}
+]
+````
+
+Another session for a different lexer could be:
+````
+$ make bin/c_lexer
+$ echo "float x=" > wibble.c
+$ echo "  4.5 ;" >> wibble.c
+$ cat wibble.c
+float x=
+  4.5 ;
+$ 
+$ cat wibble.c | bin/c_lexer
+[ { "Class" : "Keyword",    "Text": "float", "StreamLine" : 1, "SourceCol" : 1 },
+  { "Class" : "Identifier", "Text": "x",     "StreamLine" : 1, "SourceCol" : 7 },
+  { "Class" : "Operator",   "Text": "=",     "StreamLine" : 1, "SourceCol" : 8 },
+  { "Class" : "Constant",   "Text": "4.5",   "StreamLine" : 2, "SourceCol" : 3 },
+  { "Class" : "Operator",   "Text": ";",     "StreamLine" : 2, "SourceCol" : 7 },
+  {}
+]
+````
+
+Note that both examples do not have the full complement of fields, and
+the first example does not use a final empty token, while the second does.
+
+The outputs also look different because we are using JSON as
+a container, so the exact spacing of things doesn't matter
+too much. Also within the dictionary (the parts within `{}`)
+the ordering of `key:value` pairs does not matter. However,
+the ordering of tokens within the toplevel array `[]` _does_
+matter.
+
+
+Pre-Processor
+-------------
+
+TLDR : if you don't want to support pre-processor information, then
+if you encounter a `#` character that is _not_ in a string literal,
+consume all characters to the end of the line.
+
+If needed, the input will already have been pre-processed, which means
+that all instances `#define` and `#include` will already have been processed. However,
+you may still see some input lines from the pre-processor. Create a simple
+file called `tmp.c`:
+
+    echo -e "#include <stdio.h>\nint main(){\n  printf(\"Wibble\"); \n}\n" > tmp.c
+    less tmp.c
+
+We can now pre-process it with `cpp`, the [GNU C Pre-Processor](https://gcc.gnu.org/onlinedocs/cpp/):
+
+    cpp tmp.c
+
+You should should see a load of stuff from the standard includes, then
+your tiny program at the bottom. You can get a better look with `less`:
+
+    cpp tmp.c | less
+
+At the top of the file you'll see something like `# 1 tmp.c`, then
+references to various files like `/usr/include/stdio.h` and `/usr/include/sys/config.h`.
+If you scroll down, eventually you'll see the first non-empty line that
+_doesn't_ start with `#` - in my case it is `typedef signed char __int8_t;`,
+but it could be something else. This represents the first true token,
+and will be the first line that should cause a token to be output.
+
+The format of the `#` lines is explained in the [Pre-Processor Manual](https://gcc.gnu.org/onlinedocs/cpp/Preprocessor-Output.html),
+though you can get a good sense of what is going on by comparing
+the original headers and the output of `cpp`. `less` has a mode
+the displays line numbers (the `-N` option), so do:
+
+    cpp tmp.c | less -N
+
+in one terminal, and in another terminal do:
+
+    less -N /usr/include/stdio.h
+
+It gives quite a good insight into how the transformation
+happens, as you can match up the lines in the original
+versus the lines in the `#` output, and see where comments
+and `#define`s have been removed.
+
+Note: there is no requirement that the input has been pre-processed,
+so there may be no `#` lines at all. Use the default filename
+"<stdin>" as the SourceFile if there is no information provided
+by a pre-processor.
+
+Footnotes
+=========
+
+[^1] - Some thought indicates _why_ you won't see extensions. Things
+       like `__attribute__` are a side-channel between the implementer
+       of the C run-time and the compiler, or the OS interface and the
+       compiler. It is necessary that both sides agree on what those
+       extensions mean, so for example the writers of `libc` have to talk
+       to the writers of `gcc` about what they need and why. The C
+       standard recognises that this is necessary, and so explicitly reserves
+       identifiers beginning with two under-scores as "implementation defined",
+       where "implementation" means the compiler.
+
+       So in this instance _you_ are the implementation. Given you have
+       not defined any extensions, logically there are no possible extensions
+       that could occur.
diff --git a/2-parser.md b/2-parser.md
new file mode 100644
index 0000000..5010e2f
--- /dev/null
+++ b/2-parser.md
@@ -0,0 +1,268 @@
+Create a parser for the C language
+==================================
+
+Your program should accept C source code on
+`stdin` and write a heirarchical representation on `stdout`.
+
+Input Format
+------------
+
+The input format and constraints are the same as for the tokeniser, with
+the following restrictions:
+
+- the only variable/parameter data-type that will occur is `int` (so no arrays, structs, floats)
+
+- the only control-flow statements are `if`, `while`, `for`, and sequencing.
+
+- all functions are definitions (i.e. there are no declarations)
+
+These restrictions do _not_ apply to the eventual compiler, they are
+simply to make this stage more manageable as a milestone. You are
+encouraged to support other things as well, but the intent here is
+to get a core set of functionality fully working.
+
+The input format is still C90, and it is worth noting
+a difference between the C90 ("classic) and C99 ("modern")
+languages, which is that declarations can only appear
+at the top of a scope. So all variables declarations will
+appear before any statements. See the later section for more discussion.
+
+Output Format
+-------------
+
+The output format will be a heirarchical format written in 
+[XML](https://en.wikipedia.org/wiki/XML), which identifies the functions, scopes, and variables.
+
+Your output should consist of the following types of XML element:
+
+- `<Program>` : The top-level element, containing all others.
+
+- `<Function id="id">` : This appears within a Program element, and
+    indicates the existence of a function called id. The ordering of
+    Function elements relative to other globals should
+    follow the original source.
+    
+- `<Parameter id="id">` : This appears as a child of a Function element,
+    and states that the enclosing function has a parameter with this name.
+    The relative ordering of Parameter elements is important, and should follow the
+    order given in the function signature.
+    
+- `<Scope>` : Appears within either a Function or another Scope, and
+    identifies a scope within which variables have a lifetime. Empty
+    scopes are not considered meaningful and may be omitted or inserted
+    as convenient. Extra levels of scope nesting are also allowed, as long
+    as the lifetime of all variables is correct (equivalent to `((x))` == `(x)`).
+    
+- `<Variable>` : Appears within a Program or a Scope, and indicates a
+    variable that is being introduced.
+    
+The output should not contain any other elements except for these
+five. Each element can contain arbitrary amounts of white-space.
+
+Examples
+--------
+
+As a simple example, the following function:
+
+    int f( int x)
+    {
+        int y=x;
+        return y;
+    }
+
+should result in:
+
+
+    <?xml version="1.0"?>
+    <Program>
+        <Function id="f">
+            <Parameter id="x" />
+            <Scope>
+                <Variable id="y" />
+            </Scope>
+        </Function>
+    </Program>
+    
+or also correct would be:
+
+    <?xml version="1.0"?>
+    <Program> <Function id="f">
+            <Parameter id="x" />
+            <Scope />
+       <Scope><Scope /></Scope>
+            <Scope>
+                <Scope>
+                <Variable id="y" />
+            </Scope></Scope>
+    </Function> </Program>
+
+and infinite variations thereof.
+
+This code:
+
+    int x()
+    {}
+
+    int g;
+
+    int zz(int a, int b, int c)
+    {
+      if(a==b)
+        int a;
+        return a;
+      else{
+        int fsdfsdfs;
+        return c;
+      }
+    }
+
+would result in:
+
+    
+    <?xml version="1.0"?>
+    <Program>
+        <Function id="x" />
+        <Variable id="g" />
+        <Function id="zz">
+            <Parameter id="a" /><Parameter id="b" />
+                    <Parameter id="c" />
+            <Scope>
+                <Scope>
+                    <Variable id="a" />
+                </Scope>
+                <Scope>
+                    <Variable id="fsdfsdfs" />
+                </Scope>
+            </Scope>
+        </Function>
+    </Program>
+
+
+Given that additional scope nesting levels and extra empty scopes are
+allowed, you may wonder how it is possible to tell whether the
+scoping is correct. This comes down to the lifetime of variables
+within the original C program. For example, given this:
+
+    int x;
+    if(...){
+        int y;
+    }
+    
+It would be possible to say:
+
+    <Scope>
+        <Variable id="x" />
+        <Scope>
+            <Variable id="y" />
+        </Scope>
+    </Scope>
+
+as this indicates that:
+
+- a variable x is defined.
+
+- a variable y is defined, and x is still available within the same scope
+
+- the variable y goes out of scope before x.
+
+It would not be possible to say:
+
+    <Scope>
+        <Variable id="x" />
+        <Variable id="y" />
+    </Scope>
+    
+as while it matches the first two properties above, it indicates that
+x and y go out of scope at the same time, rather than y leaving scope
+before x does.
+
+Program build
+-------------
+
+From your top-level directory, doing:
+
+    make bin/c_parser
+
+should create a programme called... `bin/c_parser`.
+
+
+
+C90 Declarations
+----------------
+
+In C99 (like C++) you are allowed to interleave declarations and statements:
+
+    int f(int i)
+    {
+        int x=i+5;
+        x=x*x;
+        int y=x+4;
+        return y;
+    }
+
+In C90, you must have all declarations before any statements,
+so that would need to be re-written as:
+
+    int f(int i)
+    {
+        int x=i+5;
+        int y;
+        x=x*x;
+        y=x+4;
+        return y;
+    }
+
+Note that declarations can happen at the start of any scope, so things like:
+
+    int f(int i)
+    {
+        int x=i+5;
+        x=x*x;
+        if(1){
+            int y=x+4;
+            return y;
+        }
+    }
+
+Thinking about it with your knowledge of parsers and compilers,
+you might be able to guess why the older language is more restricted,
+while the newer one relaxes it. You also might guess why I
+deliberately chose the older syntax for you to implement.
+
+It's worth noting that beyond compiler implementation, there
+are many arguments for and against both styles. An argument
+for requiring C90 style declarations is that you have
+immediate visibility of the total number of variables in a scope.
+This gives you a sense of the complexity of a function,
+as you can see how many moving parts there are. For example,
+Linus Torvalds [likes that aspect](https://lkml.org/lkml/2012/4/12/18),
+as the kernel is an extremely large and complex code-base.
+There are also rules in the [MISRA-C](https://en.wikipedia.org/wiki/MISRA_C)
+guidelines for automotive and embedded software that state
+that variables should appear at function scope.
+
+One argument for the newer style is convenience, but also
+the idea that that it is better to declare a variable at
+the point that it is initialised. A common error in C is
+to have a variable which is not initialised on all code
+paths:
+
+    {
+       int i;
+
+       // Much code
+
+       if(getc()){
+           i=1;
+       }
+
+       // So codey
+
+       putc(i+1);
+    }
+
+Modern compilers will put out a warning about `i` potentially
+being used before being initialised, but people may just
+ignore that. The move towards more functional and declarative
+styles has also encouraged this process - for example, the
+`auto` keyword in C++11.
diff --git a/3-compiler.md b/3-compiler.md
new file mode 100644
index 0000000..7ac2fe7
--- /dev/null
+++ b/3-compiler.md
@@ -0,0 +1,45 @@
+Create a compiler for the C language
+====================================
+
+Your program should accept C source code on
+`stdin` and write a heirarchical representation on `stdout`.
+
+Input Format
+------------
+
+The input format is C90.
+
+Output Format
+-------------
+
+The output format should be MIPS1 assembly code.
+
+It should be possible to assemble and link this code
+against a C run-time, and have it execute correctly
+on a MIPS processor.
+
+Compilation
+-----------
+
+Your compiler should be built using:
+````
+make bin/c_compiler
+````
+and the resulting program should be called `bin/c_compiler`.
+
+The target environment remains Ubuntu 16.04.
+
+Deliverables
+------------
+
+There are actually three deliverables here:
+
+1 - The compiler itself
+
+2 - A test framework
+
+3 - Documentation
+
+There are certain requirements on test and a format for the
+documentation, but we'll elaborate on those when they are
+encountered in the course.
diff --git a/Vagrantfile b/Vagrantfile
new file mode 100644
index 0000000..1f53d5d
--- /dev/null
+++ b/Vagrantfile
@@ -0,0 +1,55 @@
+# -*- mode: ruby -*-
+# vi: set ft=ruby :
+
+# All Vagrant configuration is done below. The "2" in Vagrant.configure
+# configures the configuration version (we support older styles for
+# backwards compatibility). Please don't change it unless you know what
+# you're doing.
+Vagrant.configure(2) do |config|
+  # The most common configuration options are documented and commented below.
+  # For a complete reference, please see the online documentation at
+  # https://docs.vagrantup.com.
+
+  # Every Vagrant development environment requires a box. You can search for
+  # boxes at https://atlas.hashicorp.com/search.
+  config.vm.box = "ubuntu/xenial64"
+
+  # Share an additional folder to the guest VM. The first argument is
+  # the path on the host to the actual folder. The second argument is
+  # the path on the guest to mount the folder. And the optional third
+  # argument is a set of non-required options.
+  # config.vm.synced_folder "../data", "/vagrant_data"
+
+  # Provider-specific configuration so you can fine-tune various
+  # backing providers for Vagrant. These expose provider-specific options.
+  # Example for VirtualBox:
+  #
+  # config.vm.provider "virtualbox" do |vb|
+  #   # Display the VirtualBox GUI when booting the machine
+  #   vb.gui = true
+  #
+  #   # Customize the amount of memory on the VM:
+  #   vb.memory = "1024"
+  # end
+
+
+  # Enable provisioning with a shell script. Additional provisioners such as
+  # Puppet, Chef, Ansible, Salt, and Docker are also available. Please see the
+  # documentation for more information about their specific syntax and use.
+   config.vm.provision "shell", inline: <<-SHELL
+      sudo apt-get update
+
+      # Standard build tools
+      sudo apt-get -y install g++ gdb make dos2unix git
+
+      # Compiler build tools
+      sudo apt-get -y install bison flex
+
+      # MIPS cross-compiler stuff
+      sudo apt-get -y install g++-mips-linux-gnu gdb-multiarch
+
+      # QEMU run-time emulator
+      sudo apt-get -y install qemu
+
+   SHELL
+end
diff --git a/bin/.gitignore b/bin/.gitignore
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/bin/.gitignore
diff --git a/doc/.gitignore b/doc/.gitignore
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/doc/.gitignore
diff --git a/makefile b/makefile
new file mode 100644
index 0000000..5f938f3
--- /dev/null
+++ b/makefile
@@ -0,0 +1,13 @@
+
+
+bin/c_lexer :
+    echo "No current build commands for C lexer."
+    exit 1
+    
+bin/c_parser :
+    echo "No current build commands for C parser."
+    exit 1
+
+bin/c_compiler :
+    echo "No current build commands for C compile."
+    exit 1
diff --git a/readme.md b/readme.md
new file mode 100644
index 0000000..f1ed994
--- /dev/null
+++ b/readme.md
@@ -0,0 +1,50 @@
+The module code deliverables come in three parts:
+
+  Deliverable                  | Due         | Weight | Notes
+-------------------------------|-------------|--------|------------------------------------
+[1-lexer.md](1-lexer.md)       | Tue 7th Feb | 10%    | A C lexer that produces JSON output 
+[2-parser.md](2-lexer.md)      | Tue 7th Mar | 10%    | A (restricted) C parser that produces XML output
+[3-compiler.md](3-compiler.md) | Tue 7th Feb | 50%    | The final C compiler
+
+You will all get a bare private repository. It is up to you
+if you want to clone the master specification, or to start from
+scratch. Both approaches just require the git operations already used
+in lab 1, and possibly looking at the documentation for the commands
+you have already used. Remember that you can always do `git ${CMD} --help`
+for any CMD such as `pull`, `push`, `commit`, `remote`.
+
+Starting from scratch
+---------------------
+
+1 - Create a local git repository using `git init` in a directory of your choice.
+
+2 - Commit files to your local repository (as you would normally commit)
+
+3 - Add the URL of your private repository as the remote `origin` (similar
+    to manually adding the `spec` remote in the lab).
+
+4 - You should now be able to push and pull commits to origin, which is
+    your private repo.
+    
+Cloning the spec
+----------------
+
+1 - Clone the master specification from the public URL. This will produce
+    a directory called `langproc-2016-cw` by default.
+    
+2 - Rename it to something more meaningful, e.g. `langproc-2016-cw-${LOGIN}`.
+
+3 - Use `git remote` to rename `origin` (which points at the public spec) to
+    `spec`.
+    
+4 - Use `git remote` to add a new remote called `origin` pointing at
+    your private repository.
+
+5 - You should then be able to push and pull to `origin`, and pull from `spec`,
+    as in the lab.
+
+Submissions
+-----------
+
+Submission will be via github (code) + blackboard (commit hash),
+as in the lab.
diff --git a/src/.gitignore b/src/.gitignore
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/src/.gitignore
author	David Thomas <m8pple@github.com>	2017-01-23 10:56:54 +0000
committer	David Thomas <m8pple@github.com>	2017-01-26 10:57:25 +0000
commit	b850a6b64115940a0e3df50b2aa923dd3e99b13e (patch)
tree	2b394c18824ed89cc305d719d2c771f155615ba8
download	Compiler-b850a6b64115940a0e3df50b2aa923dd3e99b13e.tar.gz Compiler-b850a6b64115940a0e3df50b2aa923dd3e99b13e.zip