Create Your First Programming Language That Runs on .NET
Write a mini programming language that outputs valid CIL in a notebook in under an hour
Note: as of this writing, C# 11 is the latest released version of C#. C# 12 includes a feature that would make a lot of this code more concise (collection expressions), but I don't want to figure out how to get preview versions of .NET working in a notebook, so I plan to update this when C# 12 releases.
Compilers can be a tricky topic to get into. There are plenty of resources available for learning compilers, but I've found nothing as small as a notebook.
The purpose of this notebook is to demonstrate an example of a tiny compiler.
Goals:
Runs on the .NET runtime as its own executable!
Variables, assignment, and integer expression evaluation
Runs without the need for parser generators
Plan
This compiler follows 4 main phases of a canonical compiler:
Lexer
Parser
Semantic Analysis
Codegen
Note: This language is extremely underwhelming. It supports no functions, has only a single class of the most basic operators, has a single type (`int`), gives zero helpful error messages (it either works or it doesn't), and despite running on .NET, you can't explicitly import or call any .NET methods.
Lexer
The first step in a compiler is a lexer. The idea is to take string text and convert it into a sequence of tokens, kind of like how you can take an English sentence and split it into words with `string.Split(" ")`.
We can't use `string.Split(" ")` for tokens, though, because sometimes tokens are not separated by spaces. For example, `x = 3` and `x=3` are both valid Python programs. Instead, we'll use something called regular expressions (`Regex`).
Regex
A regex is a pattern against which we can test whether another string conforms. For example, the string `name` matches the pattern `name`. In a regex, an unescaped `.` means "match any single character", so `name` also matches the pattern `n.me`, as do the strings `nome` and `neme`.
using System.Text.RegularExpressions;
string[] words = ["name", "neme", "nome", "neem"];
string pattern = "n.me";
words.Select(w => new { String = w, IsMatch = Regex.IsMatch(w, pattern) }).DisplayTable()
| String | IsMatch |
|---|---|
| name | True |
| neme | True |
| nome | True |
| neem | False |
Another reason we might want to use `Regex` to break strings into tokens ("sentences into words") is that it helps us classify our tokens into types or categories. In English sentences, words are classified into "parts of speech", like "verb" and "noun". In a naive model where each word falls into exactly one part of speech, we could look words up in a dictionary to find their parts of speech. For example, in `I ran to the store`, `I` is a noun, `ran` is a past-tense verb, etc. With regex, we actually have it a bit easier, as we can classify tokens based on their patterns instead of looking them up in a dictionary. For example, the pattern `-?[0-9]+` matches integers like `1` and `626`, and the pattern `^[a-zA-Z]+[0-9a-zA-Z]*` matches IDs like `index` and `age`.
First, we'll define our token types using an `enum`, and a `record struct` to store our token type along with an `int Start` and a `ReadOnlyMemory<char> Memory` that reference where the token lives in memory. (We could use this information in the future for error reporting, though we do not in this example.)
dotnet build -c Release
MSBuild version 17.8.3+195e7f5a3 for .NET
Determining projects to restore...
All projects are up-to-date for restore.
learn -> C:\Users\retru\source\repos\johnW-ret\Retruate.Languages\learn\bin\Release\net8.0\learn.dll
Build succeeded.
0 Warning(s)
0 Error(s)
Time Elapsed 00:00:01.45
#r "bin/Release/net8.0/learn.dll"
We also want to associate token types with their corresponding `Regex` rules.
We could use a `Dictionary` (hashmap) here, but as we'll see, we'll be iterating through our collection of rules, so it makes more sense to just define them as an array of tuples.
(TokenType Type, Regex Pattern)[] rules = [
(TokenType.Semicolon, new(@"^;", RegexOptions.Compiled)),
(TokenType.Assignment, new(@"^=", RegexOptions.Compiled)),
(TokenType.LeftParenthesis, new(@"^\(", RegexOptions.Compiled)),
(TokenType.RightParenthesis, new(@"^\)", RegexOptions.Compiled)),
(TokenType.Add, new(@"^\+", RegexOptions.Compiled)),
(TokenType.Subtract, new(@"^\-", RegexOptions.Compiled)),
(TokenType.Multiply, new(@"^\*", RegexOptions.Compiled)),
(TokenType.Divide, new(@"^/", RegexOptions.Compiled)),
(TokenType.Id, new(@"^[a-zA-Z]+[0-9a-zA-Z]*", RegexOptions.Compiled)),
(TokenType.Integer, new(@"^-?[0-9]+", RegexOptions.Compiled)),
(TokenType.Whitespace, new(@"^[ \n\t\r]", RegexOptions.Compiled))
];
Each rule's pattern is anchored with `^`, so a match (when one exists) always begins at the start of the remaining input. In our case, that means we can use `Regex` to "pull tokens off" the front of the string whenever we have at least one match.
With this knowledge, we can imagine a simple algorithm to break off tokens:
tokens = []
while input string s not empty
    for each (type, regex pattern) in rules
        search for first match m at the start of s
        if m exists
            add (m, type) to tokens
            slice m off the front of s
            continue with the next iteration of the while loop
    if no rule matched
        error
return tokens
which is exactly what we do below.
First we define our input string:
ReadOnlyMemory<char> input =
"""
a = 2;
b = 3;
c = (3 + a) * b;
c
""".AsMemory();
You'll note that instead of using a `string` directly, we call `AsMemory` and store it as a `ReadOnlyMemory<char>`.
`ReadOnlyMemory<char>` and `ReadOnlySpan<char>` allow us to create wrappers over our original input string and slice the front off each time we pull off a token, without passing around an `int` offset and without copying any of the input string.
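For example, slicing a `ReadOnlyMemory<char>` just produces a new view over the same underlying string; a small sketch:
ReadOnlyMemory<char> m = "a = 2;".AsMemory();
var rest = m[2..]; // no copy: rest still points into the original string
Console.WriteLine(rest); // = 2;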
#nullable enable
public List<Token>? Lex(ReadOnlyMemory<char> feed)
{
    int originalLength = feed.Length; // so Start offsets are relative to this input, not the notebook-global `input`
    List<Token> tokens = new();
    while (feed.Length > 0)
    {
        // try each rule in order; the ^ anchors guarantee any match starts at the front
        foreach (var rule in rules)
        {
            var match = rule.Pattern.EnumerateMatches(feed.Span);
            if (match.MoveNext())
            {
                tokens.Add(new Token(originalLength - feed.Length, feed[0..match.Current.Length], rule.Type));
                feed = feed[match.Current.Length..]; // pull the token off the front
                goto next;
            }
        }
        // no rule matched: report where lexing got stuck
        Console.WriteLine($"Error at {originalLength - feed.Length}: '{feed.Span[0]}'");
        return null;
        next:;
    }
    // remove whitespace
    return tokens.Where(t => t.Type != TokenType.Whitespace).ToList();
}
var tokens = Lex(input);
tokens
| index | value |
|---|---|
| 0 | Token { Start = 0, Memory = a, Type = Id } |
| 1 | Token { Start = 2, Memory = =, Type = Assignment } |
| 2 | Token { Start = 4, Memory = 2, Type = Integer } |
| 3 | Token { Start = 5, Memory = ;, Type = Semicolon } |
| 4 | Token { Start = 7, Memory = b, Type = Id } |
| 5 | Token { Start = 9, Memory = =, Type = Assignment } |
| 6 | Token { Start = 11, Memory = 3, Type = Integer } |
| 7 | Token { Start = 12, Memory = ;, Type = Semicolon } |
| 8 | Token { Start = 14, Memory = c, Type = Id } |
| 9 | Token { Start = 16, Memory = =, Type = Assignment } |
| 10 | Token { Start = 18, Memory = (, Type = LeftParenthesis } |
| 11 | Token { Start = 19, Memory = 3, Type = Integer } |
| 12 | Token { Start = 21, Memory = +, Type = Add } |
| 13 | Token { Start = 23, Memory = a, Type = Id } |
| 14 | Token { Start = 24, Memory = ), Type = RightParenthesis } |
| 15 | Token { Start = 26, Memory = *, Type = Multiply } |
| 16 | Token { Start = 28, Memory = b, Type = Id } |
| 17 | Token { Start = 29, Memory = ;, Type = Semicolon } |
| 18 | Token { Start = 32, Memory = c, Type = Id } |
Great! Our input was a `string`, and now we have a `List<Token> tokens`!
Parser
Once we have our string of tokens, the next step is the parser.
The input of the parser is our string of tokens, and the output is something called an Abstract Syntax Tree (AST). Our parser is going to be a method that looks something like this:
record DummyProgram(); // dummy AST
static DummyProgram Parse(ReadOnlySpan<Token> tokens) =>
// replace with parser logic
new();
In this case, `DummyProgram` is our AST type. Then we can get our AST like this:
DummyProgram dummyProgram = Parse(Lex("abcd".AsMemory()).ToArray().AsSpan());
The AST represents the source file as a tree. It is useful, because we can use it as input to another method to generate code:
static string Emit(DummyProgram program) =>
// replace with emit logic
string.Empty;
static void DummyWriteEmittedCodeToExecutable(string code) { }
DummyWriteEmittedCodeToExecutable(Emit(dummyProgram));
To understand how the AST is generated, we have to understand grammar.
Grammar
Imagine we have a string and we'd like to check whether it exists in a language:
string sentence = "i run to the store";
We can imagine a method that checks whether that given string exists in a language:
bool Validate(string sentence) =>
// replace with validation rules
true;
Validate(sentence);
A grammar simply defines the rules of a language, and we can use those rules to validate whether a given string is valid.
Every language has a grammar, for example, take a possible subset of English grammar:
Prepositional Phrase -> to the Noun
Noun -> store | post office
The capitalized names are called non-terminals, which means they have one or more production rules and can be expanded into the right-hand side of one of those rules. The lowercase words are called terminals; they're essentially the same thing as tokens, in that they cannot be expanded by production rules and therefore make up the final sentence of the grammar.
We can take the `Noun` rule and encode it as a method, which can be used directly to validate a sentence:
bool IsNoun(string subsentence) => subsentence is "store" or "post office";
IsNoun("store")
True
That was easy enough. We can do another one:
bool IsSpace(string subsentence) => subsentence is " ";
bool IsPrepositionalPhrase(string subsentence) => subsentence[0..6] is "to the" && IsSpace(subsentence[6..7]) && IsNoun(subsentence[7..]);
IsPrepositionalPhrase("to the store")
True
You may note that `subsentence` is a `string`, and using the range operator on a `string` in C# allocates another `string` (like `string.Split`),
"to the "[0..3].GetType()
so from now on, I'm going to use `ReadOnlySpan<char>` and `ReadOnlyMemory<char>`.
static bool IsPrepositionalPhrase(ReadOnlySpan<char> subsentence) => subsentence[0..7] is "to the " && IsNoun(subsentence[7..]);
static bool IsNoun(ReadOnlySpan<char> subsentence) => subsentence is "store" or "post office";
We can create more rules and encode them as methods:
Sentence -> Subject Verb Object
Subject -> i | he | they
Verb -> run | drive
static bool IsSentence(ReadOnlySpan<char> subsentence) => IsSubject(subsentence) /* ... ??? */;
static bool IsSubject(ReadOnlySpan<char> subsentence) => subsentence is "i" or "he" or "they";
Our general "algorithm" so far is "try the first of each rule until one works", but you might notice that we run into a bit of a problem. Our rules are unable to "peel off" matches from our input. Specifically, in the above code, if IsSubject
succeeds, IsSentence
doesn't know where to slice to pass the rest of the unmatched input.
There are several ways you could solve this problem. For simplicity, we're just going to return `ReadOnlySpan<char>` instead of `bool` for all productions other than `Sentence`, and simply return `"~"` for cases that fail. That will in turn cause all following matches to fail, because `~` is not valid in our grammar. This isn't the greatest for error handling, but it's the simplest way I could think of that doesn't introduce global state.
static bool IsSentence(this ReadOnlySpan<char> subsentence) => subsentence.PeelSubject().PeelSpace().PeelVerb().PeelSpace() is "store";
static ReadOnlySpan<char> PeelSpace(this ReadOnlySpan<char> subsentence) => subsentence.StartsWith(" ") ? subsentence[1..] : "~";
static ReadOnlySpan<char> PeelSubject(this ReadOnlySpan<char> subsentence) => subsentence switch
{
_ when subsentence.StartsWith("i") => subsentence[1..],
_ when subsentence.StartsWith("he") => subsentence[2..],
_ when subsentence.StartsWith("they") => subsentence[4..],
_ => "~"
};
static ReadOnlySpan<char> PeelVerb(this ReadOnlySpan<char> subsentence) => subsentence switch
{
_ when subsentence.StartsWith("run") => subsentence[3..],
_ when subsentence.StartsWith("drive") => subsentence[5..],
_ => "~"
};
new[] { new { Input = "i run store", Result = IsSentence("i run store") }, new { Input = "i switch ran", Result = IsSentence("i switch ran") } }.DisplayTable()
| Input | Result |
|---|---|
| i run store | True |
| i switch ran | False |
Finally, adding a rule for Object:
Object -> Prepositional Phrase | Noun
and completing all our rules:
static ReadOnlySpan<char> PeelNoun(this ReadOnlySpan<char> subsentence) => subsentence switch
{
_ when subsentence.StartsWith("store") => subsentence[5..],
_ when subsentence.StartsWith("post office") => subsentence[11..],
_ => "~"
};
static ReadOnlySpan<char> PeelPrepositionalPhrase(this ReadOnlySpan<char> subsentence) => subsentence switch
{
_ when subsentence.StartsWith("to the") => subsentence[6..],
_ => "~"
};
static ReadOnlySpan<char> PeelObject(this ReadOnlySpan<char> subsentence) => subsentence switch
{
_ when subsentence.PeelPrepositionalPhrase() is not "~" and { } rest
&& rest.PeelSpace().PeelNoun() is not "~" and { } _rest => _rest,
_ when subsentence.PeelNoun() is not "~" and { } rest => rest,
_ => "~"
};
static bool IsSentence(this ReadOnlySpan<char> subsentence) => subsentence.PeelSubject().PeelSpace().PeelVerb().PeelSpace().PeelObject() is { Length: 0 };
new[] { new { Input = sentence, Result = IsSentence(sentence) }, new { Input = "i switch ran", Result = IsSentence("i switch ran") } }.DisplayTable()
| Input | Result |
|---|---|
| i run to the store | True |
| i switch ran | False |
What we did here is actually create a mini parser for a subset of the English language. The parser is very limited: it doesn't give any useful error reporting, but it does validate whether a sentence conforms to our language.
Parse Tree
If you step through with a debugger, you will see that the path our string `i run to the store` took through our parser is as follows:
in `IsSentence`
in `PeelSubject`
out
in `PeelVerb`
out
in `PeelObject`
in `PeelPrepositionalPhrase`
out
in `PeelNoun`
out
out
out
This path can also be represented with a tree, as illustrated below:
(does not render on GitHub preview as of 2023-11-27 - click outside of the preview and press `.` to view in vscode.dev, or copy-paste the below cell into mermaid.dev)
This tree is called a parse tree, and it represents the derivation from the start non-terminal, through productions, to a final sentence.
The parser we wrote is called a recursive-descent parser. In a recursive-descent parser, the path taken (without backtracking) to validate a string is the parse tree for that string.
I say without backtracking because, in our algorithm, we always try the first rule for each production, and the string `i run to the store` is what you get by always expanding the first production. If our string were `i run store`, our code path would look like this:
in `IsSentence`
in `PeelSubject`
out
in `PeelVerb`
out
in `PeelObject`
// we try this rule of `Object`
in `PeelPrepositionalPhrase`
out
in `PeelNoun`
out
// but it fails so we try the next
in `PeelNoun`
out
out
out
but our parse tree would look like this:
Here is a draft grammar for our language:
Program -> Statements Expression
Statements -> Statements Statement
Statement -> AssignmentStatement
AssignmentStatement -> id = Expression;
Expression ->
(Expression)
| Expression * Expression
| Expression / Expression
| Expression + Expression
| Expression - Expression
| id
| int
A general description: a program is a list of statements followed by a single expression. All statements end with a `;`, and the program ends with an expression (which we will evaluate and print to the console, though this isn't part of the grammar).
If you recall our input string:
input
a = 2; b = 3; c = (3 + a) * b; c
We can see it conforms to our grammar:
Left Recursion
We have a big problem though... actually, two of them!
Let's take a look at Statements
Statements -> Statements Statement
and what it could look like in code:
static ReadOnlySpan<char> PeelStatements(this ReadOnlySpan<char> input) => input.PeelStatements().PeelStatement();
static ReadOnlySpan<char> PeelStatement(this ReadOnlySpan<char> input) =>
// replace with statement validation logic
input[1..];
This is infinite recursion!
We can fix this by modifying our grammar:
Statements -> Statement Statements
This changes the order of derivation and therefore our parse tree, but that's not a problem here.
In general, you never want recursion in the same direction as your derivation: a top-down parser that expands rules from the left (tries the left-most rules first) can't handle left recursion.
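In code, the flipped rule could look something like this sketch (reusing the dummy `PeelStatement` from above, and treating `Statements` as zero-or-more, i.e. `Statements -> Statement Statements | ε`):
static ReadOnlySpan<char> PeelStatements(this ReadOnlySpan<char> input)
{
    if (input.IsEmpty)
        return input;
    var rest = input.PeelStatement();
    // "~" marks a failed match, as before; the recursive call now happens only
    // after PeelStatement has consumed input, so the recursion bottoms out
    return rest is "~" ? input : rest.PeelStatements();
}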
Ambiguity
We still have one more problem though, and it's with the Expression
rule:
Expression ->
(Expression)
| Expression * Expression
| Expression / Expression
| Expression + Expression
| Expression - Expression
| id
| int
This grammar is also left recursive, but more than that, it's ambiguous, in that for a single sentence, there are multiple derivations / parse trees you could construct.
Both parse trees above are legal productions of the grammar, but it's easy to see that their differences are non-negligible (`(1 + 2) * 3 = 9` != `1 + (2 * 3) = 7`).
To illustrate this problem, I'd translate the grammar to code again, but we'd run into the immediate issue of infinite left recursion. There are several ways to solve left recursion, but the underlying problem here is that of operator precedence, which is why we have a problem with ambiguity as well. Sometimes we can solve both problems at once by introducing new non-terminals, which is what we'll do:
Expression ->
    Term + Expression
  | Term - Expression
  | Term
Term ->
    Factor * Term
  | Factor / Term
  | Factor
Factor ->
    (Expression) | id | int
When we split the grammar into multiple rules like this, we try to match operators of lower precedence first, which means matching farther from the leaves and closer to the root. The above Mermaid diagram hopefully demonstrates how the closer a subexpression is to the leaves, the higher its precedence. (One side effect of writing the rules this way is that `-` and `/` become right-associative, so `1 - 2 - 3` parses as `1 - (2 - 3)`; a real compiler would correct for this, but we'll accept it for simplicity.)
Note: I think it's important to note here that I am skipping over entire classes of parsers and grammars.
The grammar I am describing here is called "LL", which means the parser reads tokens from left to right and constructs a left-most derivation (it just tries the rules from the left first), and the parser I am writing is a "manual recursive descent" parser. It is also "top-down", meaning it starts from the first non-terminal and tries to produce a matching sentence.
A point of inefficiency in this parser is that it tries all the rules until it finds a match, meaning that if it does not find one, it tries every possible expansion of the productions before giving up. This isn't terrible, as production compilers often use a mixture of parsing techniques, with recursive descent frequently the primary one because of the flexibility it provides, but it is worth mentioning that there are other parsing techniques that eliminate some of these inefficiencies by requiring you to be stricter in how you define your grammar.
Abstract Syntax Trees
The next step would be to translate our grammar into code as we did with our English example, but you'll note that we're missing a key goal we set out for our parser: the AST, the representation of our code in tree form.
If you want to research further, the standard method of building an AST is with "semantic actions", but in our case, we can just build the AST as we parse our input by returning a subtree from each parsing method.
If this sounds complicated, it's my fault for being unable to explain it clearly, because it's really not. But if we try to implement it directly in code, we'll run into another issue.
Take our definition of PeelNoun from earlier:
static ReadOnlySpan<char> PeelNoun(this ReadOnlySpan<char> subsentence) => subsentence switch
{
_ when subsentence.StartsWith("store") => subsentence[5..],
_ when subsentence.StartsWith("post office") => subsentence[11..],
_ => "~"
};
If we return some type of `Noun` data structure instead of a `ReadOnlySpan<char>`, the caller won't know where in the `char` stream to resume reading input.
A logical step would be to return both values in a tuple, but tuples under the hood are of type `System.ValueTuple<T1, T2, ...>`, and `ref struct` types, of which `ReadOnlySpan<T>` is one, can't be generic type parameters, because that could potentially allow them to leak onto the heap. The scope of this issue is beyond that of this notebook, but the gist is that if we want to return multiple values along with a `ref struct`, we have to wrap them all in another `ref struct` and return that, or use `out` parameters... and `out` parameters are ugly.
But before I can show our example, what of our AST data structure? For this, I'm using `record`s. They're kind of like classes in that they're reference types, but they can be concisely defined in a single statement and have structural equality. The member definitions passed in the parentheses become init-only auto-properties, meaning they are immutable once constructed.
I was trying to get some interop with F# to work in this notebook, and unfortunately, sharing custom data types across .NET Interactive doesn't work too well, so I've decided to define the types in the Program.cs in the same directory as this notebook.
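Those record definitions aren't shown in this notebook, but judging from how they're used below, they presumably look roughly like this sketch:
using System.Collections.Immutable;

public abstract record Node;
public abstract record Expression : Node;
public abstract record Statement : Node;

public record IntLiteralExpression(int Value) : Expression;
public record NameExpression(Token Identifier) : Expression;
public record BinaryExpression(Expression Left, Token Operator, Expression Right) : Expression;

public record AssignmentStatement(Token Identifier, Expression Expression) : Statement;
public record Statements(ImmutableList<Statement> StatementList) : Node;
public record Program(Statements Statements, Expression Eval) : Node;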
Here's a simple example for parsing a literal value:
#nullable enable
ref struct ParseResult
{
public Node? Result { get; set; }
public ReadOnlySpan<Token> Rest { get; set; }
}
static ParseResult ParseLiteral(ReadOnlySpan<Token> tokens)
=> tokens switch
{
[{ Type: TokenType.Integer } token, ..var _rest] when int.TryParse(token.Memory.Span, out int value)
=> new() { Rest = _rest, Result = new IntLiteralExpression(value) },
_ => new() { Rest = tokens, Result = null }
};
ParseLiteral(Lex("3".AsMemory()).ToArray().AsSpan()).Result
IntLiteralExpression { Value = 3 }
Value | 3 |
The process is very similar to our English subset example, except that we, of course, need to return an AST, but we also use our ReadOnlySpan<Token> tokens
instead of a direct ReadOnlySpan<char>
character stream.
We can build on our above work to parse an Expression
:
#nullable enable
static ParseResult ParseExpression(ReadOnlySpan<Token> tokens)
    => ParseTerm(tokens) switch
    {
        { Result: Expression left, Rest: var rest } => rest switch
        {
            // Expression -> Term (+|-) Expression: recurse on ParseExpression for the right
            // operand, otherwise chained input like `1 + 2 + 3` would leave `+ 3` unparsed
            [{ Type: TokenType.Add or TokenType.Subtract } op, ..var _rest] => ParseExpression(_rest) switch
            {
                { Result: Expression right, Rest: var __rest } =>
                    new() { Result = new BinaryExpression(left, op, right), Rest = __rest },
                _ => new() { Result = null, Rest = _rest }
            },
            _ => new() { Rest = rest, Result = left }
        },
        _ => new() { Rest = tokens, Result = null }
    };
static ParseResult ParseTerm(ReadOnlySpan<Token> tokens)
    => ParseFactor(tokens) switch
    {
        { Result: Expression left, Rest: var rest } => rest switch
        {
            // Term -> Factor (*|/) Term
            [{ Type: TokenType.Multiply or TokenType.Divide } op, ..var _rest] => ParseTerm(_rest) switch
            {
                { Result: Expression right, Rest: var __rest } =>
                    new() { Result = new BinaryExpression(left, op, right), Rest = __rest },
                _ => new() { Result = null, Rest = _rest }
            },
            _ => new() { Result = left, Rest = rest }
        },
        _ => new() { Result = null, Rest = tokens }
    };
static ParseResult ParseFactor(ReadOnlySpan<Token> tokens)
    => tokens switch
    {
        // Factor -> (Expression) | int | id
        { } when ParseWrappedExpression(tokens) is { Result: { } expression, Rest: var rest } =>
            new() { Result = expression, Rest = rest },
        { } when ParseLiteral(tokens) is { Result: { } expression, Rest: var rest } =>
            new() { Result = expression, Rest = rest },
        [{ Type: TokenType.Id }, ..var _rest] =>
            new() { Result = new NameExpression(tokens[0]), Rest = _rest },
        // no factor matched: signal failure instead of throwing at runtime
        _ => new() { Result = null, Rest = tokens }
    };
static ParseResult ParseWrappedExpression(ReadOnlySpan<Token> tokens)
    => tokens switch
    {
        [{ Type: TokenType.LeftParenthesis }, ..var rest] =>
            ParseExpression(rest) switch
            {
                { Result: Expression expression, Rest: var _rest } => _rest switch
                {
                    [{ Type: TokenType.RightParenthesis }, ..var __rest]
                        => new() { Rest = __rest, Result = expression },
                    _ => new() { Rest = _rest, Result = null }
                },
                { Rest: var _rest, Result: null } => new() { Rest = _rest, Result = null }
            },
        _ => new() { Rest = tokens, Result = null }
    };
ParseExpression(Lex("(3 + 4)".AsMemory()).ToArray().AsSpan()).Result
BinaryExpression { Left = IntLiteralExpression { Value = 3 }, Operator = Token { Start = 3, Memory = +, Type = Add }, Right = IntLiteralExpression { Value = 4 } }
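As a quick sanity check on precedence (a hypothetical extra cell), we can parse an unparenthesized mixed expression; multiplication should bind tighter, so the `Add` node should end up at the root with the `Mul` subtree as its right child:
ParseExpression(Lex("1 + 2 * 3".AsMemory()).ToArray().AsSpan()).Result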
And statements:
using System.Collections.Immutable;
#nullable enable
static ParseResult ParseStatements(ReadOnlySpan<Token> tokens)
    => ParseStatement(tokens) switch
    {
        // no statement matched: an empty statement list (Statements -> ε)
        { Result: null } => new() { Result = new Statements(ImmutableList<Statement>.Empty), Rest = tokens },
        // Statements -> Statement Statements
        { Result: Statement statement, Rest: var rest } =>
            ParseStatements(rest) is { Result: Statements statements, Rest: var _rest }
                ? new() { Result = new Statements(statements.StatementList.Insert(0, statement)), Rest = _rest }
                : new() { Result = null, Rest = rest }
    };
static ParseResult ParseStatement(ReadOnlySpan<Token> tokens) => ParseAssignment(tokens);
static ParseResult ParseAssignment(ReadOnlySpan<Token> tokens)
    => tokens switch
    {
        // AssignmentStatement -> id = Expression ;
        [{ Type: TokenType.Id }, { Type: TokenType.Assignment }, ..var rest] => ParseExpression(rest) switch
        {
            { Result: Expression expression, Rest: var _rest } => _rest switch
            {
                [{ Type: TokenType.Semicolon }, ..var __rest] => new() { Result = new AssignmentStatement(tokens[0], expression), Rest = __rest },
                _ => new() { Result = null, Rest = _rest }
            },
            _ => new() { Result = null, Rest = rest }
        },
        _ => new() { Result = null, Rest = tokens }
    };
ParseStatement(Lex("c = (3 + a) * b;".AsMemory()).ToArray().AsSpan()).Result
AssignmentStatement { Identifier = Token { Start = 0, Memory = c, Type = Id }, Expression = BinaryExpression { Left = BinaryExpression { Left = IntLiteralExpression { Value = 3 }, Operator = Token { Start = 7, Memory = +, Type = Add }, Right = NameExpression { Identifier = Token { Start = 9, Memory = a, Type = Id } } }, Operator = Token { Start = 12, Memory = *, Type = Multiply }, Right = NameExpression { Identifier = Token { Start = 14, Memory = b, Type = Id } } } }
And finally our entire program:
#nullable enable
static ParseResult Parse(ReadOnlySpan<Token> tokens) => ParseStatements(tokens) switch
{
    // Program -> Statements Expression
    { Result: Statements statements, Rest: var rest } => ParseExpression(rest) switch
    {
        { Result: Expression expression, Rest: var _rest } => new() { Result = new Program(statements, expression), Rest = _rest },
        _ => new() { Result = null, Rest = rest }
    },
    _ => new() { Result = null, Rest = tokens }
};
public (Program? program, string? error) res => Parse(tokens.ToArray().AsSpan()) switch
{
    { Result: Program program, Rest: { IsEmpty: true } } => (program, null),
    { Result: Program program, Rest: { IsEmpty: false } rest } => (program, $"Error at {rest[0]}"),
    // the parse failed outright, so there is no program to return
    _ => (null, "Error: could not parse program")
};
if (res is (_, string error))
{
Console.WriteLine(error);
throw new Exception();
}
if (res is not (Program program, _))
throw new Exception();
program
Program {
  Statements = Statements { StatementList = [
    AssignmentStatement { Identifier = Token { Start = 0, Memory = a, Type = Id }, Expression = IntLiteralExpression { Value = 2 } },
    AssignmentStatement { Identifier = Token { Start = 7, Memory = b, Type = Id }, Expression = IntLiteralExpression { Value = 3 } },
    AssignmentStatement { Identifier = Token { Start = 14, Memory = c, Type = Id }, Expression = BinaryExpression { Left = BinaryExpression { Left = IntLiteralExpression { Value = 3 }, Operator = Token { Start = 21, Memory = +, Type = Add }, Right = NameExpression { Identifier = Token { Start = 23, Memory = a, Type = Id } } }, Operator = Token { Start = 26, Memory = *, Type = Multiply }, Right = NameExpression { Identifier = Token { Start = 28, Memory = b, Type = Id } } } }
  ] },
  Eval = NameExpression { Identifier = Token { Start = 32, Memory = c, Type = Id } }
}
Semantic Analysis
Now that we have our AST, the next phase of the compiler is semantic analysis.
The role of semantic analysis can change depending on how you define your language. Your language, like F#, might provide rich type inference and strict type checking, and here is where you'll do most of that checking. You can also do some high-level optimization during this phase, like constant folding, for example.
For our purposes, we're just going to do the important yet humble act of making sure we don't use a variable before it is defined and don't define one multiple times.
#nullable enable
static List<string>? GetSymbols(Program program)
{
    var symbols = new List<string>();
    foreach (Statement statement in program.Statements.StatementList)
    {
        if (statement is AssignmentStatement assignment)
        {
            // reject multiply-defined variables
            if (symbols.Contains(assignment.Identifier.Memory.ToString()))
                return null;
            // reject expressions that use undefined variables
            if (IsValidExpression(symbols, assignment.Expression) is false)
                return null;
            symbols.Add(assignment.Identifier.Memory.ToString());
        }
    }
    // the final expression must also only use defined variables
    if (IsValidExpression(symbols, program.Eval) is false)
        return null;
    return symbols;
    static bool IsValidExpression(List<string> symbols, Expression expression)
        => expression switch
        {
            NameExpression name => symbols.Contains(name.Identifier.Memory.ToString()),
            BinaryExpression binary => IsValidExpression(symbols, binary.Left)
                && IsValidExpression(symbols, binary.Right),
            IntLiteralExpression _ => true,
            _ => false
        };
}
var symbols = GetSymbols(program);
symbols
[ a, b, c ]
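And as a quick negative test (a hypothetical extra cell, following the same pattern as above), a program that uses a name before assigning it should come back as `null`:
// "b" is used before it is defined, so semantic analysis should return null
var bad = (Program)Parse(Lex("a = b; a".AsMemory()).ToArray().AsSpan()).Result!;
GetSymbols(bad)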
Codegen
For codegen, we're going to be using AsmResolver to emit CIL.
#r "nuget: AsmResolver.DotNet, 5.4.0"
using AsmResolver.DotNet;
using AsmResolver.DotNet.Code.Cil;
using AsmResolver.DotNet.Signatures;
using AsmResolver.DotNet.Signatures.Types;
using AsmResolver.PE.DotNet.Cil;
using AsmResolver.PE.DotNet.Metadata.Tables.Rows;
- AsmResolver.DotNet, 5.4.0
We're going to need to write some boilerplate to define an assembly and module, and define our entry point.
var assembly = new AssemblyDefinition(
name: "MyAssembly",
version: new(1, 0, 0, 0));
var runtime = DotNetRuntimeInfo.Parse(".NETCoreApp,Version=v7.0");
var module = new ModuleDefinition("MyAssembly.exe", runtime.GetDefaultCorLib());
var factory = module.CorLibTypeFactory;
assembly.Modules.Add(module);
var programType = new TypeDefinition(
ns: "",
name: "Program",
TypeAttributes.Class,
baseType: factory.Object.Type);
module.TopLevelTypes.Add(programType);
var mainMethod = new MethodDefinition(
name: "Main",
MethodAttributes.Public | MethodAttributes.Static,
signature: MethodSignature.CreateStatic(returnType: module.CorLibTypeFactory.Void));
var methodBody = new CilMethodBody(mainMethod);
mainMethod.CilMethodBody = methodBody;
var write =
module.DefaultImporter.ImportType(typeof(Console))
.CreateMemberReference("Write", MethodSignature.CreateStatic(
factory.Void, factory.Int32))
.ImportWith(module.DefaultImporter);
Symbol Table
CIL uses a stack-based evaluation model, meaning there's an evaluation stack to compute values on, plus a practically "unlimited" pool of local variables (the short-form `ldloc.s`/`stloc.s` instructions we'll emit can address the first 256 of them). The .NET runtime then JIT-compiles this IL to machine code at runtime, though it is also possible to compile it to native code ahead of time (AOT).
Inside our method body, we need to declare a local variable for each of our symbols, and then set up a `Dictionary<string, int>` to look them up by index when generating code.
What I am creating with this `Dictionary` is something called a symbol table, but a very simplified, watered-down version. In real programming languages, you can have any number of scopes, and a variable can share a name with another as long as it is uniquely defined within its own scope. A data structure for that model looks a lot more like a `Dictionary` where each value is a list of definitions, one per enclosing scope.
Note: In most languages, module-like scopes can be named, meaning that identifiers inside modules must be qualified with the name of their containing scope. Some languages, like F#, Java, and TypeScript, have a notion of "opening" a module, which means adding all the items inside it to the lexical environment it's opened in.
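That richer model is out of scope here, but purely as an illustration, it might be shaped like this (hypothetical helpers, not used by our compiler):
// each name maps to a stack of definitions, innermost scope on top
var scopedSymbols = new Dictionary<string, Stack<int>>();

void Define(string name, int localIndex)
{
    if (!scopedSymbols.TryGetValue(name, out var definitions))
        scopedSymbols[name] = definitions = new Stack<int>();
    definitions.Push(localIndex); // shadow any outer definition
}

int? Resolve(string name) =>
    scopedSymbols.TryGetValue(name, out var definitions) && definitions.Count > 0
        ? definitions.Peek()
        : null;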
var symbolTuples = symbols
    .Select((s, i) => (Var: new CilLocalVariable(factory.Int32), Id: s, Index: i))
    .ToArray(); // materialize: enumerating a lazy Select twice would create fresh CilLocalVariables
foreach (var tpl in symbolTuples)
{
    methodBody.LocalVariables.Add(tpl.Var);
}
var symbolTable = symbolTuples.ToDictionary(k => k.Id, v => v.Index);
Now we're ready to start generating code.
You might recall that our language always ends with an expression. This pattern is common when languages are used in a scripting environment, such as Python in an interactive session or C# Script. For our purposes, I didn't want to add the complexity of implementing calls to .NET methods, so in every program, `Console.Write` is called on this final expression. This gives us an easy way to test our programs, but it is obviously lacking, as we cannot call `Console.Write` at arbitrary points in our code.
We have `n` statements and one expression at the end, and since we don't have functions, it makes sense to generate code for our language in order.
So our general algorithm would look something like
for each statement s
codegen(s)
codegen(final expression)
codegen(call `Console.Write`)
The way a stack evaluation model works is that the "current" value is always at the top of the stack. Looking at the last two instructions above, we never "pass" the final expression to `Console.Write`; instead, we push the argument onto the stack and then call `Console.Write`. This is also how we can handle assignment statements: push the expression's value onto the stack, then pop it into the local variable being assigned.
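For example, under this model the statement `a = 2;` lowers to just two instructions (matching the first two lines of the disassembly at the end of this notebook):
ldc.i4 2     // push the constant 2 onto the evaluation stack
stloc.s V_0  // pop the top of the stack into local 0 (our 'a')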
First, we define a method for evaluating expressions and pushing them onto the stack:
List<CilInstruction> Codegen(Expression expression)
    => expression switch
    {
        // load the local variable bound to this name
        NameExpression s => new List<CilInstruction>
        {
            new(CilOpCodes.Ldloc_S, methodBody.LocalVariables[symbolTable[s.Identifier.Memory.ToString()]])
        },
        // evaluate left, then right, then apply the operator to the top two stack slots
        BinaryExpression s => Codegen(s.Left)
            .Concat(Codegen(s.Right))
            .Append(new CilInstruction(s.Operator.Type switch
            {
                TokenType.Add => CilOpCodes.Add,
                TokenType.Subtract => CilOpCodes.Sub,
                TokenType.Multiply => CilOpCodes.Mul,
                TokenType.Divide => CilOpCodes.Div,
                _ => throw new NotImplementedException()
            }))
            .ToList(),
        // push the integer constant
        IntLiteralExpression s => new List<CilInstruction>
        {
            new(CilOpCodes.Ldc_I4, s.Value)
        },
        _ => throw new NotImplementedException()
    };
Then to generate code for each statement, where every statement is an assignment statement, we generate code to push the expression onto the stack, then code to pop the stack into a local variable.
var il = methodBody.Instructions;
il.AddRange(
    program.Statements.StatementList
        .SelectMany(s => s switch
        {
            // push the expression's value, then pop it into the assigned local
            AssignmentStatement { Identifier: { Memory: var memory }, Expression: Expression expression } =>
                Codegen(expression)
                    .Append(new CilInstruction(CilOpCodes.Stloc_S, methodBody.LocalVariables[symbolTable[memory.ToString()]])),
            _ => throw new NotImplementedException()
        })
);
And finally, we evaluate the last expression and call `Console.Write`:
il.AddRange(Codegen(program.Eval));
il.Add(CilOpCodes.Call, write);
il.Add(new CilInstruction(CilOpCodes.Ret));
Now we can set our `mainMethod` as our entry point:
programType.Methods.Add(mainMethod);
module.ManagedEntryPointMethod = mainMethod;
And write our assembly (it is a framework-dependent assembly, so we also need to write a `*.runtimeconfig.json`):
assembly.Write("MyAssembly.exe");
System.IO.File.WriteAllText("MyAssembly.runtimeconfig.json",
"""
{
"runtimeOptions": {
"tfm": "net8.0",
"framework": {
"name": "Microsoft.NETCore.App",
"version": "8.0.0"
}
}
}
"""
);
Now, if we run our assembly with PowerShell:
dotnet MyAssembly.exe
15
To inspect our compiled code, I prefer ILSpy (you can install it on Windows with `winget install ILSpy`), but if you don't want to install another tool, we can pretty-print our instructions with AsmResolver like so (taken from here):
var formatter = new CilInstructionFormatter();
foreach (CilInstruction instruction in methodBody.Instructions)
Console.WriteLine(formatter.FormatInstruction(instruction));
IL_0000: ldc.i4 2
IL_0005: stloc.s V_0
IL_0007: ldc.i4 3
IL_000C: stloc.s V_1
IL_000E: ldc.i4 3
IL_0013: ldloc.s V_0
IL_0015: add
IL_0016: ldloc.s V_1
IL_0018: mul
IL_0019: stloc.s V_2
IL_001B: ldloc.s V_2
IL_001D: call System.Void System.Console::Write(System.Int32)
IL_0022: ret