# Project Contribution

The preprocessor and lexer are part of FMark, a markdown parser written in F#. This sub-project contains the
lexer and the preprocessor for the markdown parser. The preprocessor is a completely separate parser
that preprocesses the markdown before passing it to the lexer and finally the parser.

# Preprocessor

This project contains the Preprocessor for FMark. The preprocessor adds templating
capabilities to FMark, inspired by [Liquid](https://shopify.github.io/liquid/).

## Specification

### Supported Constructs

These are the supported constructs in the preprocessor.

|Supported|Syntax|Description|Tested|
|---|---|---|---|
|Simple Macro|`{% macro name value %}`| Sets the Macro `name` equal to the string `value`|Unit Test|
|Function Macro|`{% macro name(arg1; arg2) value %}`|Sets the Macro `name` equal to the string `value` with two parameters.|Unit Test|
|Simple Evaluation|`{{ macro_name }}`|Evaluates the macro `macro_name` and replaces the evaluation with the evaluated body of the macro.|Unit Test|
|Function Evaluation|`{{ macro_name(arg 1; arg 2) }}`|Evaluates the macro `macro_name` with the arguments `arg 1` and `arg 2` and replaces the evaluation with the evaluated body of the macro.|Unit Test|
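
For instance, a simple macro combined with a simple evaluation (the macro name `greeting` is chosen here purely for illustration):

```
{% macro greeting Hello World %}
{{ greeting }}
```

Per the constructs above, the second line is replaced by the macro body, `Hello World`.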

### Supported Features

These are the features that are currently supported by the preprocessor.

|Feature|Example|Description|Tested|
|---|---|---|---|
|Simple whitespace control|`{% macro x y %}` evaluates to `y` and not to `y` surrounded by spaces.|Removes whitespace and newlines in macros where one would not expect them to be added to the macro body.|Unit Test|
|Shadowing of macros through arguments|`{% macro x x %} {% macro y(x) {{ x }} %}` with `{{ y(z) }}` will evaluate to `z`, but `{{ x }}` outside of the macro will always evaluate to `x`.|Macros can be shadowed by arguments of other macros.|Unit Test|
|Nested macros|`{% macro x {% macro y %} %}`|Macro `y` is only defined inside macro `x` and cannot be seen outside of the scope of `x`.|Unit Test|
|Shadowing of macros through other macros|`{% macro x x %} {% macro y {% macro x z %} {{x}} %} y: {{ y }}, x: {{ x }}` will evaluate to `y: z, x: x`|Macros can be shadowed by other macros, which will then be used for evaluation instead.|Unit Test|
|Evaluation of large strings|`{{ x(This is the first argument; This is the second argument) }}`|One can pass large strings as arguments to the macros.|Unit Test|
|Escaping of characters inside arguments|`{{ x(arg 1 with a \); arg 2 with a \;) }}`|One can escape all the special characters inside macros and substitutions.|Unit Test|
|Escaping macros|`\{% macro x y %}`|This escapes the whole macro so it is not evaluated.|Unit Test|
|Escaping substitutions|`\{{ x }}`|Does not evaluate the substitution but outputs it literally instead.|Unit Test|
|Outputting unmatched substitutions|`{{ x }}` -> `{{ x }}` if not in scope|If a substitution cannot be matched, it is output exactly as it was received.|Unit Test|

### Usage

To use the preprocessor and the lexer, either a single string or a list of strings can be passed in, depending on
whether the input spans multiple lines.

For a single string, one can use the `preprocess` and `lex` functions.

``` f#
[<EntryPoint>]
let main argv =
    let inputString = (* Read the string *)

    inputString
    |> preprocess
    |> lex
    ...
```

For a list of strings, one can use the `preprocessList` and `lexList` functions.

``` f#
[<EntryPoint>]
let main argv =
    let inputStringList = (* Read the string list *)
    
    inputStringList
    |> preprocessList
    |> lexList
    ...
```

### Example

Using the preprocessor, one can then write the following markdown:

```
Text before macro
{% macro Hello(arg1; arg2)
This is text inside the macro, with semicolons;
{% macro local(arg1; arg2)
This is the second macro
%}
Now back in the first macro.
{{ local(arg1; arg2) }}
%}
Outside both macros
Should be printed as not in scope: {{ local(arg1; arg2) }}

{{ Hello(arg1; arg2) }}
```

which then evaluates to

```
Text before macro
Outside both macros
Should be printed as not in scope: {{ local(arg1; arg2) }}


This is text inside the macro, with semicolons;
Now back in the first macro.

This is the second macro 


```

### Future improvements

There are many features that will be introduced into the preprocessor in the future. Some of the planned
constructs can be seen below.

|Construct|Description|
|---|---|
|for loop|A for loop that will repeat whatever is put into the body|
|ifdef|Check if a macro is defined|
|Expressions|Introduce arithmetic expressions|
|if|Check if a condition is true, which will need the introduction of Expressions|

There are also some features that could be added.

|Feature|Description|
|---|---|
|`{%- -%}`|New delimiter that will completely remove the whitespace of the macro at that point|

# Lexer

## Interface to the Parser

The interface to the parser is the following `Token` type, which the parser takes in
and can parse.

``` f#
type Token =
    | CODEBLOCK of string * Language
    | LITERAL of string
    | WHITESPACE of size: int
    | NUMBER of string
    | HASH | PIPE | EQUAL | MINUS | PLUS | ASTERISK | DOT
    | DASTERISK | TASTERISK | UNDERSCORE | DUNDERSCORE | TUNDERSCORE | TILDE | DTILDE
    | TTILDE | LSBRA | RSBRA | LBRA | RBRA | BSLASH | SLASH | LABRA | RABRA | LCBRA
    | RCBRA | BACKTICK | TBACKTICK | EXCLAMATION | ENDLINE | COLON | CARET | PERCENT
```
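
As a rough, hypothetical illustration (not taken from the actual test suite, so the exact tokens are assumptions), lexing a single line of markdown might map to tokens along these lines:

``` f#
// Hypothetical sketch: lexing one line of markdown. The exact output depends
// on the implementation in Lexer.fs; only ENDLINE terminating the token list
// of a single string is documented behaviour.
"**bold** text"
|> lex
// might yield something like:
// [DASTERISK; LITERAL "bold"; DASTERISK; WHITESPACE 1; LITERAL "text"; ENDLINE]
```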

## Features

The lexer supports escaping of all the special characters defined in [Types](/FMark/Types.fs). This is done by adding
a `\` in front of the character that should be escaped.

Tokens that match multiple characters can also be escaped by putting a `\` in front of them. For example,
`***` can be escaped by writing `\***`.

## Extensibility

The lexer can easily be extended by adding a new case to the `Token` type above. The new string then
has to be linked to the token by adding a tuple of type `string * Token` to the list called
`charList` in the [Lexer](/FMark/Lexer.fs), as sketched below.
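
For example, a hypothetical new token matching the string `@@` could be added as follows. The token name `DAT`, the string `@@`, and the exact shape of `charList` are assumptions made purely for illustration:

``` f#
// Hypothetical extension: a new token DAT that matches the string "@@".
type Token =
    // ... existing cases ...
    | DAT

// Link the string to the new token in the lookup list used by the lexer.
let charList : (string * Token) list = [
    // ... existing mappings ...
    "@@", DAT
]
```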

## Missing Feature

Currently, the lexer does not lex code blocks into `CODEBLOCK` tokens, as that functionality will be moved into the
parser and `CODEBLOCK` removed from the `Token` type.

# Test Plan

The lexer and the preprocessor were built in a test-driven manner, by writing tests first and then writing the code to
make them pass. This means that the goal of the code is well defined beforehand and the code can be written more easily.
It is also much easier to test the whole code base by running all the unit tests, instead of testing it manually every
time and risking that previous functionality silently stops working.

Unit tests were used to write small tests that the code then had to pass. After the code was written,
property based tests made sure that the main functions were working as they were supposed to.

Once all the functionality was in place, more tests were added to thoroughly test the preprocessor and lexer. These tests
target the relatively large functions that are used directly in the workflow rather than the small helper functions they
are built from: if a helper is broken, the larger function that uses it fails, which is detected by the tests on that
larger function.

As many edge cases as possible were identified for the preprocessor and tokenizer and tested using unit tests as well,
which identified a few bugs, such as issues with whitespace in macros.

Finally, a property based test was added for the preprocessor which checks that the preprocessed output, when preprocessed
again, is unchanged. This is the only property test that seemed to work. While trying to create a `lex` property based
test that compares the input to the output, a lot of differences were found that I had not previously thought about.
This type of test did not work because the `Token` type does not restrict the values enough: `FsCheck` would generate
a `LITERAL ""`, which the lexer would never produce but which is still a valid value. The same goes for `NUMBER ""` and
`WHITESPACE 0`. The `ENDLINE` token will also always be at the end of the list when `lex` is run on a single string, whereas
`FsCheck` would place it anywhere.

## Summary

1. Test while writing the different functions and when implementing new features.

2. Add unit tests to test edge cases.

3. Add property based tests.


## Property based tests

### Preprocessor

This property based test takes a random string generated by `FsCheck` and runs it through
the preprocessor. It then checks that the output string is unchanged when passed through the
preprocessor a second time.
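
A minimal sketch of this property, assuming `preprocess : string -> string` is in scope and using FsCheck's `Check.Quick` (null input handling omitted for brevity):

``` f#
open FsCheck

// Property: preprocessing is idempotent, i.e. running the preprocessor on
// its own output does not change the string any further.
let preprocessIsIdempotent (input: string) =
    let once = preprocess input
    preprocess once = once

// Check the property against randomly generated strings.
Check.Quick preprocessIsIdempotent
```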

## Unit tests

### Preprocessor

#### Next Token

|Name|Status|
|---|---|
|Openeval|Pass|
|Closeeval|Pass|
|Opendef|Pass|
|Semicolon|Pass|
|Long random text|Pass|

#### Tokenize

|Name|Status|
|---|---|
|All Tokens|Pass|
|Macro|Pass|
|Substitution|Pass|
|Normal markdown|Pass|
|Escaped character in sentence|Pass|

#### Parse

|Name|Status|
|---|---|
|Macro with multiple arguments and inline body|Pass|
|Substitution|Pass|
|Substitution with argument|Pass|
|Substitution with multiple arguments|Pass|
|Substitution with argument and spaces|Pass|

#### Preprocess

|Name|Status|
|---|---|
|Simple text does not change|Pass|
|Simple text does not change with special chars|Pass|
|Simple macro with no arguments|Pass|
|Simple macro with empty brackets|Pass|
|Simple macro evaluation|Pass|
|Print out the input when substitution not in scope|Pass|
|Escaping macro bracket should make the original input appear|Pass|
|Shadowed macros and arguments|Pass|
|Shadowed macros|Pass|
|Macro with different arguments|Pass|
|Macro with long name|Pass|

#### Preprocess List

|Name|Status|
|---|---|
|Multiline macro evaluation with newline|Pass|
|Multiline macro without newline|Pass|
|Multiline macro with arguments|Pass|

### Lexer

#### Lex

|Name|Status|
|---|---|
|All Tokens|Pass|
|Literal|Pass|
|Number|Pass|
|WhiteSpace|Pass|
|Very simple markdown|Pass|
|With special characters|Pass|
|Escaping characters|Pass|

#### lexList

|Name|Status|
|---|---|
|Very simple multiline markdown|Pass|
|With special characters|Pass|
|Escaping characters|Pass|