Create Your First Programming Language That Runs on .NET

Write a mini programming language that outputs valid CIL in a notebook in under an hour

Note: as of this writing, C# 11 is the latest released version of C#. C# 12 includes a feature that would make a lot of this code more concise (collection expressions), but I don't want to figure out how to get preview versions of .NET to work in a notebook, so I plan to update this when C# 12 releases.

Compilers can be a tricky topic to get into. There are plenty of resources available for learning about compilers, but I've found nothing as small as a notebook.

The purpose of this notebook is to demonstrate an example of a tiny compiler.

Goals:


  • Runs on the .NET runtime as its own executable!

  • Variables, assignment, and integer expression evaluation

  • Runs without the need for parser generators

Plan


This compiler follows the 4 main phases of a canonical compiler:

  1. Lexer

  2. Parser

  3. Semantic Analysis

  4. Codegen

Note: This language is extremely underwhelming. It supports no functions, offers only a single class of the most basic operators, has a single type (int), gives zero helpful error messages (it either works or it doesn't), and despite running on .NET, you can't explicitly import or call any .NET methods.

Lexer

The first step in a compiler is a lexer. The idea is to take string text and convert it into a sequence of tokens, kind of like how you can take an English sentence and split it into words with string.Split(" ").

We can't use string.Split(" ") for tokens, though, because sometimes tokens are not separated by spaces. For example, x = 3 and x=3 are both valid Python programs. Instead, we'll use something called regular expressions (Regex).

Regex

A regex is a pattern against which we can test whether another string conforms. For example, the string name matches the pattern name. In a regex, an unescaped . means "match any single character", so name also matches the pattern n.me, as do the strings nome and neme.

using System.Text.RegularExpressions;
string[] words = ["name", "neme", "nome", "neem"];
string pattern = "n.me";
words.Select(w => new { String = w, IsMatch = Regex.IsMatch(w, pattern) }).DisplayTable()
| String | IsMatch |
|--------|---------|
| name   | True    |
| neme   | True    |
| nome   | True    |
| neem   | False   |

Another reason we might want to use Regex to break strings into tokens ("sentences into words") is that it helps us classify our tokens into types or categories. In English sentences, words are classified into "parts of speech", like "verb" and "noun". In a naive model where each word falls into exactly one part of speech, we can look up each word of a sentence in a dictionary to find its part of speech. For example, in I ran to the store, I is a noun, ran is a past-tense verb, etc. With regex, we actually have it a bit easier, as we can classify tokens based on their patterns instead of looking them up in a dictionary. For example, the pattern -?[0-9]+ matches integers like 1 and 626, and the pattern [a-zA-Z]+[0-9a-zA-Z]* matches IDs like index and age.
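To make this concrete, here is a quick check of those two patterns (a small illustration, not part of the compiler; note the added ^...$ anchors so each pattern must match the whole string):

string[] candidates = ["626", "index", "age2", "-1"];
candidates.Select(c => new
{
    String = c,
    IsInteger = Regex.IsMatch(c, "^-?[0-9]+$"),
    IsId = Regex.IsMatch(c, "^[a-zA-Z]+[0-9a-zA-Z]*$")
}).DisplayTable()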

First, we'll define our token types using an enum, and a record struct to store our token type, along with an int Start and ReadOnlyMemory<char> Memory to reference where it lives in memory.

(we could use this information in the future for error reporting, though we do not in this example)
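The actual definitions live in Program.cs (compiled into learn.dll below), but they look roughly like this sketch (the exact code may differ):

public enum TokenType
{
    Semicolon, Assignment, LeftParenthesis, RightParenthesis,
    Add, Subtract, Multiply, Divide, Id, Integer, Whitespace
}

// Start: offset of the token in the original input; Memory: the slice it covers
public readonly record struct Token(int Start, ReadOnlyMemory<char> Memory, TokenType Type);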

dotnet build -c Release
MSBuild version 17.8.3+195e7f5a3 for .NET
  Determining projects to restore...
  All projects are up-to-date for restore.
  learn -> C:\Users\retru\source\repos\johnW-ret\Retruate.Languages\learn\bin\Release\net8.0\learn.dll
Build succeeded.
    0 Warning(s)
    0 Error(s)
Time Elapsed 00:00:01.45

#r "bin/Release/net8.0/learn.dll"

We also want to associate token types with their corresponding Regex rules.

We could use a Dictionary (hashmap) here, but as we'll see, we'll be iterating through our collection of rules in order (the first rule that matches wins), so it makes more sense to just define them as a list of tuples.

(TokenType Type, Regex Pattern)[] rules = [
    (TokenType.Semicolon,           new(@"^;", RegexOptions.Compiled)),
    (TokenType.Assignment,          new(@"^=", RegexOptions.Compiled)),
    (TokenType.LeftParenthesis,     new(@"^\(", RegexOptions.Compiled)),
    (TokenType.RightParenthesis,    new(@"^\)", RegexOptions.Compiled)),
    (TokenType.Add,                 new(@"^\+", RegexOptions.Compiled)),
    (TokenType.Subtract,            new(@"^\-", RegexOptions.Compiled)),
    (TokenType.Multiply,            new(@"^\*", RegexOptions.Compiled)),
    (TokenType.Divide,              new(@"^/", RegexOptions.Compiled)),
    (TokenType.Id,                  new(@"^[a-zA-Z]+[0-9a-zA-Z]*", RegexOptions.Compiled)),
    (TokenType.Integer,             new(@"^-?[0-9]+", RegexOptions.Compiled)),
    (TokenType.Whitespace,          new(@"^[ \n\t\r]", RegexOptions.Compiled))
];

Notice that every pattern above is anchored with ^, which means it can only match at the very start of the input. This lets us "pull tokens off" the front: whenever a rule matches, the match is guaranteed to start at position 0, so we can slice it off and repeat.
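For example, here is a single "pull" done by hand (just an illustration):

// because the Id pattern is anchored with ^, a match can only start at position 0
var source = "x=3";
var match = Regex.Match(source, @"^[a-zA-Z]+[0-9a-zA-Z]*");
if (match.Success)
    Console.WriteLine($"token: {match.Value}, rest: {source[match.Length..]}"); // token: x, rest: =3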

With this knowledge, we can imagine a simple algorithm to break off tokens:


tokens = []
while input string s is not empty
    for each (type, regex pattern) in rules
        search for a match m of pattern at the start of s
        if m exists
            add (m, type) to tokens
            remove m from the front of s
            continue with the next iteration of the while loop
    if no rule matched
        error
return tokens

which is exactly what we do below

First we define our input string:

ReadOnlyMemory<char> input =
    """
    a = 2;
    b = 3;
    c = (3 + a) * b;
    c
    """.AsMemory();

You'll note that instead of using a string directly, we call AsMemory and store it as a ReadOnlyMemory<char>.

ReadOnlyMemory<char> and ReadOnlySpan<char> allow us to create wrappers over our original input string and slice the front off each time we pull off a token, without passing around an int offset and without copying any of the input string.
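For example (a quick illustration of slicing without copying):

ReadOnlyMemory<char> s = "hello world".AsMemory();
ReadOnlyMemory<char> rest = s[6..]; // a view over the same underlying string - no new string is allocated
Console.WriteLine(rest); // world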

#nullable enable
public List<Token>? Lex(ReadOnlyMemory<char> feed) 
{
    List<Token> tokens = new();
    while (feed.Length > 0)
    {
        // try each rule in order; the first one that matches wins
        foreach (var rule in rules)
        {
            var match = rule.Pattern.EnumerateMatches(feed.Span);
            if (match.MoveNext())
            {
                // Start is computed against the global `input`, so this assumes
                // `feed` is a tail slice of `input`
                tokens.Add(new Token(input.Length - feed.Length, feed[0..match.Current.Length], rule.Type));
                feed = feed[match.Current.Length..];
                goto next;
            }
        }
        // no rule matched: report where lexing stopped and give up
        Console.WriteLine(tokens is [.., var last]
            ? $"Error after: {last.Start} {last.Type} {last.Memory}"
            : "Error at start of input");
        return null;
        next:;
    }
    // remove whitespace
    return tokens.Where(t => t.Type != TokenType.Whitespace).ToList();
}
var tokens = Lex(input);
tokens
| index | Start | Memory | Type |
|-------|-------|--------|------|
| 0  | 0  | a | Id |
| 1  | 2  | = | Assignment |
| 2  | 4  | 2 | Integer |
| 3  | 5  | ; | Semicolon |
| 4  | 7  | b | Id |
| 5  | 9  | = | Assignment |
| 6  | 11 | 3 | Integer |
| 7  | 12 | ; | Semicolon |
| 8  | 14 | c | Id |
| 9  | 16 | = | Assignment |
| 10 | 18 | ( | LeftParenthesis |
| 11 | 19 | 3 | Integer |
| 12 | 21 | + | Add |
| 13 | 23 | a | Id |
| 14 | 24 | ) | RightParenthesis |
| 15 | 26 | * | Multiply |
| 16 | 28 | b | Id |
| 17 | 29 | ; | Semicolon |
| 18 | 32 | c | Id |

Great! Our input was a plain string, and now we have a List<Token> of tokens!

Parser

Once we have our string of tokens, the next step is the parser.

The input of the parser is our string of tokens, and the output is something called an Abstract Syntax Tree (AST). Our parser is going to be a method that looks something like this:

record DummyProgram(); // dummy AST
static DummyProgram Parse(ReadOnlySpan<Token> tokens) => 
    // replace with parser logic
    new();

In this case, DummyProgram is our AST type. Then we can get our AST like this:

DummyProgram dummyProgram = Parse(Lex("abcd".AsMemory()).ToArray().AsSpan());

The AST represents the source file as a tree. It is useful, because we can use it as input to another method to generate code:

static string Emit(DummyProgram program) => 
    // replace with emit logic
    string.Empty;
static void DummyWriteEmittedCodeToExecutable(string code) { }
DummyWriteEmittedCodeToExecutable(Emit(dummyProgram));

To understand how the AST is generated, we have to understand grammar.

Grammar

Imagine we have a string and we'd like to check whether it exists in a language:

string sentence = "i run to the store";

We can imagine a method that checks whether that given string exists in a language:

bool Validate(string sentence) =>
    // replace with validation rules
    true;
Validate(sentence);

A grammar simply defines the rules of a language, and we can use those rules to validate whether a given string is valid.

Every language has a grammar, for example, take a possible subset of English grammar:


Prepositional Phrase -> to the Noun
Noun                 -> store | post office

The capitalized names are called non-terminals, which means they have one or more production rules and can be expanded into the right-hand side of one of those rules. The lowercase words are called terminals - essentially the same thing as tokens - in that they cannot be expanded with production rules and therefore make up the final sentence of the grammar.

We can take the Noun rule and encode it as a method, which can directly be used to validate a sentence:

bool IsNoun(string subsentence) => subsentence is "store" or "post office";
IsNoun("store")
True

That was easy enough. We can do another one:

bool IsSpace(string subsentence) => subsentence is " ";
bool IsPrepositionalPhrase(string subsentence) => subsentence[0..6] is "to the" && IsSpace(subsentence[6..7]) && IsNoun(subsentence[7..]);
IsPrepositionalPhrase("to the store")
True

You may note that subsentence is a string, and using the range operator on a string in C# allocates another string (like string.Split),

"to the "[0..3].GetType()

System.String

so from now on, I'm going to use ReadOnlySpan<char> and ReadOnlyMemory<char>.

static bool IsPrepositionalPhrase(ReadOnlySpan<char> subsentence) => subsentence[0..7] is "to the " && IsNoun(subsentence[7..]);
static bool IsNoun(ReadOnlySpan<char> subsentence) => subsentence is "store" or "post office";

We can create more rules and encode them as methods:


Sentence    -> Subject Verb Object
Subject     -> i | he | they
Verb        -> run | drive

static bool IsSentence(ReadOnlySpan<char> subsentence) => IsSubject(subsentence) /* ... ??? */;
static bool IsSubject(ReadOnlySpan<char> subsentence) => subsentence is "i" or "he" or "they";

Our general "algorithm" so far is "try the first of each rule until one works", but you might notice that we run into a bit of a problem. Our rules are unable to "peel off" matches from our input. Specifically, in the above code, if IsSubject succeeds, IsSentence doesn't know where to slice to pass the rest of the unmatched input.

There are several ways you could solve this problem. For simplicity, we're just going to return ReadOnlySpan<char> instead of bool for every production other than Sentence, and simply return "~" for cases that fail, which will in turn cause all subsequent matches to fail (because ~ is not valid anywhere in our grammar). This isn't the greatest for error handling, but it's the simplest approach I could think of that doesn't introduce global state.

static bool IsSentence(this ReadOnlySpan<char> subsentence) => subsentence.PeelSubject().PeelSpace().PeelVerb().PeelSpace() is "store";
static ReadOnlySpan<char> PeelSpace(this ReadOnlySpan<char> subsentence) => subsentence.StartsWith(" ") ? subsentence[1..] : "~";
static ReadOnlySpan<char> PeelSubject(this ReadOnlySpan<char> subsentence) => subsentence switch
{
    _ when subsentence.StartsWith("i") => subsentence[1..],
    _ when subsentence.StartsWith("he") => subsentence[2..],
    _ when subsentence.StartsWith("they") => subsentence[4..],
    _  => "~"
};
static ReadOnlySpan<char> PeelVerb(this ReadOnlySpan<char> subsentence) => subsentence switch
{
    _ when subsentence.StartsWith("run") => subsentence[3..],
    _ when subsentence.StartsWith("drive") => subsentence[5..],
    _  => "~"
};
new[] { new { Input = "i run store", Result = IsSentence("i run store") }, new { Input = "i switch ran", Result = IsSentence("i switch ran") } }.DisplayTable()
| Input | Result |
|-------|--------|
| i run store | True |
| i switch ran | False |

Finally, adding a rule for Object:


Object      -> Prepositional Phrase | Noun

and completing all our rules:

static ReadOnlySpan<char> PeelNoun(this ReadOnlySpan<char> subsentence) => subsentence switch
{
    _ when subsentence.StartsWith("store") => subsentence[5..],
    _ when subsentence.StartsWith("post office") => subsentence[11..],
    _  => "~"
};
static ReadOnlySpan<char> PeelPrepositionalPhrase(this ReadOnlySpan<char> subsentence) => subsentence switch
{
    _ when subsentence.StartsWith("to the") => subsentence[6..],
    _  => "~"
};
static ReadOnlySpan<char> PeelObject(this ReadOnlySpan<char> subsentence) => subsentence switch
{
    _ when subsentence.PeelPrepositionalPhrase() is not "~" and { } rest
        && rest.PeelSpace().PeelNoun() is not "~" and { } _rest => _rest,
    _ when subsentence.PeelNoun() is not "~" and { } rest => rest,
    _  => "~"
};
static bool IsSentence(this ReadOnlySpan<char> subsentence) => subsentence.PeelSubject().PeelSpace().PeelVerb().PeelSpace().PeelObject() is { Length: 0 };
new[] { new { Input = sentence, Result = IsSentence(sentence) }, new { Input = "i switch ran", Result = IsSentence("i switch ran") } }.DisplayTable()
| Input | Result |
|-------|--------|
| i run to the store | True |
| i switch ran | False |

What we did here is actually create a mini parser for a subset of the English language. The parser is very bad (it gives no useful error reporting), but it does validate whether a sentence conforms to our language.

Parse Tree

If you step through a debugger, you will see the path our string i run to the store took to get through our parser is as follows:


in `IsSentence`
  in `PeelSubject`
  out
  in `PeelVerb`
  out
  in `PeelObject`
      in `PeelPrepositionalPhrase`
      out
      in `PeelNoun`
      out
  out
out

This path can also be represented with a tree, as illustrated below:

(does not render on GitHub preview as of 2023-11-27 - click outside of the preview and press . to view in vscode.dev, or copy-paste below cell into mermaid.dev)

graph TD
    A[Sentence] --> B[Subject]
    A --> C[Verb]
    A --> D[Object]
    B --> E[i]
    C --> F[run]
    D --> G[Prepositional Phrase]
    G --> H[to the]
    G --> I[Noun]
    I --> J[store]

This tree is called a parse tree, and it represents the derivation from the start non-terminal, through productions, to a final sentence.

The parser we wrote is called a recursive-descent parser. In a recursive-descent parser, the path taken (without walkbacks) to validate a string is the parse tree for that string.

I say without walkbacks, because in our algorithm, we always try the first rule for each production, and the string i run to the store is what you get by always expanding the first production. If our string was i run store, our code path would look like this


in `IsSentence`
  in `PeelSubject`
  out
  in `PeelVerb`
  out
  in `PeelObject`
      // we try the Prepositional Phrase rule of `Object`
      in `PeelPrepositionalPhrase`
      out
      // but it fails, so we try the next rule
      in `PeelNoun`
      out
  out
out

but our parse tree would look like this:

graph TD
    A[Sentence] --> B[Subject]
    A --> C[Verb]
    A --> D[Object]
    B --> E[i]
    C --> F[run]
    D --> G[Noun]
    G --> H[store]

Here is a draft grammar for our language:


Program                -> Statements Expression
Statements             -> Statements Statement
Statement              -> AssignmentStatement
AssignmentStatement    -> id = Expression;
Expression   ->
    (Expression)
  | Expression * Expression
  | Expression / Expression
  | Expression + Expression
  | Expression - Expression
  | id
  | int

A general description: a program is a list of statements followed by a single expression. Every statement ends with a ;, and the program ends with an expression (which we will evaluate and print to the console, though that behavior isn't part of the grammar).

If you recall our input string:

input

a = 2;
b = 3;
c = (3 + a) * b;
c

We can see it conforms to our grammar:

graph TD
    A[Program] --> B[Statements]
    A --> C[Expression]
    B --> D[Statements]
    B --> E[Statement]
    D --> F[...]
    E --> G[AssignmentStatement]
    G --> H[c]
    G --> I[=]
    G --> J[Expression]
    J --> K[...]
    G --> L[;]
    C --> Z[c]

Left Recursion

We have a big problem though... actually, two of them!

Let's take a look at Statements


Statements              -> Statements Statement

and what it could look like in code:

static ReadOnlySpan<char> PeelStatements(this ReadOnlySpan<char> input) => input.PeelStatements().PeelStatement();
static ReadOnlySpan<char> PeelStatement(this ReadOnlySpan<char> input) => 
    // replace with statement validation logic
    input[1..];

This is infinite recursion!

We can fix this by modifying our grammar:


Statements              -> Statement Statements

This changes the order of derivation and therefore our parse tree, but that's not a problem here.

In general, you never want recursion in the same direction as your derivation. If you have left derivation (you try the left-most rules first), you don't want left recursion. (A quick sketch of the fixed rule follows.)
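Reusing the dummy PeelStatement from above (which always consumes one character), the right-recursive rule could look like this:

// right recursion: one Statement is consumed *before* the recursive call,
// so each call sees a strictly smaller input and the recursion terminates
static ReadOnlySpan<char> PeelStatements(this ReadOnlySpan<char> input)
    => input.IsEmpty ? input : input.PeelStatement().PeelStatements();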

Ambiguity

We still have one more problem though, and it's with the Expression rule:


Expression   ->
    (Expression)
  | Expression * Expression
  | Expression / Expression
  | Expression + Expression
  | Expression - Expression
  | id
  | int

This grammar is also left recursive, but more than that, it's ambiguous, in that for a single sentence there are multiple derivations (and therefore multiple parse trees) you could construct.

graph TD
    A[Expression] --> B[Expression]
    A --> C[*]
    A --> D[Expression]
    E --> F[1]
    B --> E[Expression]
    B --> G[+]
    B --> H[Expression]
    H --> J[2]
    D --> I[3]
    _A[Expression] --> _B[Expression]
    _A --> _C[+]
    _A --> _D[Expression]
    _B --> _I[1]
    _D --> _E[Expression]
    _E --> _F[2]
    _D --> _G[*]
    _D --> _H[Expression]
    _H --> _J[3]

Both parse trees above are legal derivations under the grammar, but it's easy to see that their differences are non-negligible: (1 + 2) * 3 = 9, whereas 1 + (2 * 3) = 7.

To illustrate this problem, I'd translate the grammar to code again, but we'd run into the immediate issue of infinite left recursion. There are several ways to solve left recursion, but the underlying problem here is operator precedence, which is also why we have a problem with ambiguity. Sometimes we can solve both problems at once by introducing new non-terminals, which is what we'll do:


Expression   ->
    Term + Expression
  | Term - Expression
  | Term
Term   ->
    Factor * Term
  | Factor / Term
  | Factor
Factor  ->
    (Expression) | id | int

When we split the grammar into multiple rules like this, we try to match operators of lower precedence first, which means matching farther from the leaves and closer to the root. The Mermaid diagram above hopefully helps demonstrate that the closer a subexpression is to the leaves, the higher precedence it has; the derivation below makes the same point.
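For example, 1 + 2 * 3 derives as follows; the * is introduced deeper in the derivation, so it sits closer to the leaves and binds tighter:

Expression -> Term + Expression
           -> Factor + Expression      (Term -> Factor)
           -> 1 + Expression           (Factor -> int)
           -> 1 + Term                 (Expression -> Term)
           -> 1 + Factor * Term        (Term -> Factor * Term)
           -> 1 + 2 * Factor           (Factor -> int, Term -> Factor)
           -> 1 + 2 * 3                (Factor -> int)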

Note: I am skipping over entire classes of parsers and grammars here.

The grammar I am describing is called "LL", meaning it reads tokens from left to right and constructs a "left-most derivation" (it just tries the rules from the left first), and the parser I am writing is a "manual recursive descent" parser. It is also "top-down", meaning it starts from the start non-terminal and tries to produce a matching sentence.

A point of inefficiency for this parser is that it tries all the rules until it finds a match, meaning that if it does not find one, it tries every possible sequence of productions before giving up. This isn't terrible - production compilers often use a mixture of parsing techniques, with recursive descent frequently the primary one because of its flexibility - but it is worth mentioning that there are other parsing techniques that eliminate some of these inefficiencies by requiring you to be stricter about how you define your grammar.

Abstract Syntax Trees

The next step would be to translate our grammar into code as we did with our English example, but you'll note that we're missing a key goal we set for our parser from the outset: the AST - the representation of our code in tree form.

For your own research, the preferred method of building an AST is with "semantic actions", but in our case, we can just build the AST as we parse, returning a subtree from each parsing method.

If this sounds complicated, that's my fault for not explaining it clearly, because it really isn't; but if we try to implement it directly in code, we'll run into another issue.

Take our definition of PeelNoun from earlier:

static ReadOnlySpan<char> PeelNoun(this ReadOnlySpan<char> subsentence) => subsentence switch
{
    _ when subsentence.StartsWith("store") => subsentence[5..],
    _ when subsentence.StartsWith("post office") => subsentence[11..],
    _  => "~"
};

If we return some type of Noun data structure instead of a ReadOnlySpan<char>, the caller method won't know where in the char stream to resume reading input from.

A logical step would be to return both values in a tuple, but tuples under the hood are of type System.ValueTuple<T1, T2, ...>, and ref struct types (of which ReadOnlySpan<T> is one) can't be used as generic type arguments, because that could potentially allow them to leak onto the heap. The scope of this issue is beyond that of this notebook, but the gist is that if we want to return a ref struct along with other values, we have to wrap them all in another ref struct and return that, or use out parameters... and out parameters are ugly.

But before I can show our example, what of our AST data structure? For this, I'm using records. They're kind of like classes in that they're reference types, but they can be concisely defined in a single statement and have structural equality. The member definitions passed in the ()s become init-only auto-properties, meaning they are immutable after construction.

I was trying to get some interop with F# to work in this notebook, and unfortunately, sharing custom data types across .NET Interactive doesn't work too well, so I've decided to define the types in the Program.cs in the same directory as this notebook.
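I won't reproduce Program.cs here, but based on how the types are used below, the AST records look something like this sketch (treat the exact shapes as an assumption):

using System.Collections.Immutable;

public abstract record Node;
public abstract record Expression : Node;
public record IntLiteralExpression(int Value) : Expression;
public record NameExpression(Token Identifier) : Expression;
public record BinaryExpression(Expression Left, Token Operator, Expression Right) : Expression;
public abstract record Statement : Node;
public record AssignmentStatement(Token Identifier, Expression Expression) : Statement;
public record Statements(ImmutableList<Statement> StatementList) : Node;
public record Program(Statements Statements, Expression Eval) : Node;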

Here's a simple example for parsing a literal value:

#nullable enable
ref struct ParseResult
{
    public Node? Result { get; set; }
    public ReadOnlySpan<Token> Rest { get; set; }
}
static ParseResult ParseLiteral(ReadOnlySpan<Token> tokens)
    => tokens switch
    {
        [{ Type: TokenType.Integer } token, ..var _rest] when int.TryParse(token.Memory.Span, out int value)
            => new() { Rest = _rest, Result = new IntLiteralExpression(value) },
        _ => new() { Rest = tokens, Result = null }
    };
ParseLiteral(Lex("3".AsMemory()).ToArray().AsSpan()).Result
IntLiteralExpression { Value = 3 }

The process is very similar to our English-subset example, except that we now need to return an AST, and we consume a ReadOnlySpan<Token> instead of a raw ReadOnlySpan<char> character stream.

We can build on our above work to parse an Expression:

#nullable enable
static ParseResult ParseExpression(ReadOnlySpan<Token> tokens)
    => ParseTerm(tokens) switch
    {
        { Result: Expression left, Rest: var rest } => rest switch
            {
                [{ Type: TokenType.Add or TokenType.Subtract } op, ..var _rest] => ParseExpression(_rest) switch // Expression -> Term + Expression: recurse so chains like 1 + 2 + 3 parse
                {
                    { Result: Expression right, Rest: var __rest } =>
                        new() { Result = new BinaryExpression(left, op, right), Rest = __rest },
                    _ => new() { Result = null, Rest = _rest }
                },
                _ => new() { Rest = rest, Result = left }
            },
        _ => new() { Rest = tokens, Result = null }
    };
static ParseResult ParseTerm(ReadOnlySpan<Token> tokens)
    => ParseFactor(tokens) switch
    {
        { Result: Expression left, Rest: var rest } => rest switch
            {
                [{ Type: TokenType.Multiply or TokenType.Divide } op, ..var _rest] => ParseTerm(_rest) switch
                    {
                        { Result: Expression right, Rest: var __rest } => 
                            new() { Result = new BinaryExpression(left, op, right), Rest = __rest },
                        _ => new() { Result = null, Rest = _rest }
                    },
                _ => new() { Result = left, Rest = rest }
            },
        _ => new() { Result = null, Rest = tokens }
    };
static ParseResult ParseFactor(ReadOnlySpan<Token> tokens)
    => tokens switch
    {
        { } when ParseWrappedExpression(tokens) is { Result: { } expression, Rest: var rest } =>
            new() { Result = expression, Rest = rest },
        { } when ParseLiteral(tokens) is { Result: { } expression, Rest: var rest } =>
            new() { Result = expression, Rest = rest },
        [{ Type: TokenType.Id }, ..var _rest] =>
            new() { Result = new NameExpression(tokens[0]), Rest = _rest },
        // not a factor: report failure instead of throwing on an unmatched input
        _ => new() { Result = null, Rest = tokens }
    };
static ParseResult ParseWrappedExpression(ReadOnlySpan<Token> tokens)
    => tokens switch
    {
        [{ Type: TokenType.LeftParenthesis } left, ..var rest] => 
            ParseExpression(rest) switch
            {
                { Result: Expression expression, Rest: var _rest } => _rest switch
                    {
                        [{ Type: TokenType.RightParenthesis } right, ..var __rest]
                            => new() { Rest = __rest, Result = expression },
                        _ => new() { Rest = _rest, Result = null }
                    },
                { Rest: var _rest, Result: null } => new() { Rest = _rest, Result = null }
            },
        _ => new() { Rest = tokens, Result = null }
    };
ParseExpression(Lex("(3 + 4)".AsMemory()).ToArray().AsSpan()).Result
BinaryExpression { Left = IntLiteralExpression { Value = 3 }, Operator = Token { Start = 29, Memory = +, Type = Add }, Right = IntLiteralExpression { Value = 4 } }

And statements:

using System.Collections.Immutable;
#nullable enable
static ParseResult ParseStatements(ReadOnlySpan<Token> tokens)
    => ParseStatement(tokens) switch
        {
            { Result: null } => new () { Result = new Statements(ImmutableList<Statement>.Empty), Rest = tokens },
            { Result: Statement statement, Rest: var rest } => 
                ParseStatements(rest) is { Result: Statements statements, Rest: var _rest } ?
                new() { Result = new Statements(statements.StatementList.Insert(0, statement)), Rest = _rest }
                : new() { Result = null, Rest = rest }
        };
static ParseResult ParseStatement(ReadOnlySpan<Token> tokens) => ParseAssignment(tokens);
static ParseResult ParseAssignment(ReadOnlySpan<Token> tokens)
    =>  tokens switch
    {
        [{ Type: TokenType.Id }, { Type: TokenType.Assignment }, ..var rest] => ParseExpression(rest) switch
        {
            { Result: Expression expression, Rest: var _rest } => _rest switch
            {
                [{ Type: TokenType.Semicolon }, ..var __rest] => new() { Result = new AssignmentStatement(tokens[0], expression), Rest = __rest },
                _ => new() { Result = null, Rest = _rest }
            },
            _ => new() { Result = null, Rest = rest }
        },
        _ => new() { Result = null, Rest = tokens }
    };
ParseStatement(Lex("c = (3 + a) * b;".AsMemory()).ToArray().AsSpan()).Result
AssignmentStatement
├─ Identifier: Token { Start = 17, Memory = c, Type = Id }
└─ Expression: BinaryExpression
   ├─ Left: BinaryExpression
   │  ├─ Left: IntLiteralExpression { Value = 3 }
   │  ├─ Operator: Token { Start = 24, Memory = +, Type = Add }
   │  └─ Right: NameExpression { Identifier = Token { Start = 26, Memory = a, Type = Id } }
   ├─ Operator: Token { Start = 29, Memory = *, Type = Multiply }
   └─ Right: NameExpression { Identifier = Token { Start = 31, Memory = b, Type = Id } }

And finally our entire program:

#nullable enable
static ParseResult Parse(ReadOnlySpan<Token> tokens) => ParseStatements(tokens) switch
    {
        { Result: Statements statements, Rest: var rest } => ParseExpression(rest) switch
        {
            { Result: Expression expression, Rest: var _rest } => new() { Result = new Program(statements, expression), Rest = _rest },
            _ => new() { Result = null, Rest = rest }
        },
        _ => new() { Result = null, Rest = tokens }
    };
public (Program? program, string? error) res => Parse(tokens.ToArray().AsSpan()) switch
{
    { Result: Program program, Rest: { IsEmpty: true } } => (program, null),
    { Result: Program program, Rest: { IsEmpty: false } rest } => (program, $"Error at {rest[0]}"),
    // the parse failed outright
    _ => (null, "Error: could not parse a program")
};
if (res is (_, string error))
{
    Console.WriteLine(error);
    throw new Exception();
}
if (res is not (Program program, _))
    throw new Exception();
program
Program
├─ Statements
│  ├─ [0] AssignmentStatement
│  │      ├─ Identifier: Token { Start = 0, Memory = a, Type = Id }
│  │      └─ Expression: IntLiteralExpression { Value = 2 }
│  ├─ [1] AssignmentStatement
│  │      ├─ Identifier: Token { Start = 7, Memory = b, Type = Id }
│  │      └─ Expression: IntLiteralExpression { Value = 3 }
│  └─ [2] AssignmentStatement
│         ├─ Identifier: Token { Start = 14, Memory = c, Type = Id }
│         └─ Expression: BinaryExpression
│            ├─ Left: BinaryExpression { Left = IntLiteralExpression { Value = 3 }, Operator = Token { Start = 21, Memory = +, Type = Add }, Right = NameExpression { Identifier = Token { Start = 23, Memory = a, Type = Id } } }
│            ├─ Operator: Token { Start = 26, Memory = *, Type = Multiply }
│            └─ Right: NameExpression { Identifier = Token { Start = 28, Memory = b, Type = Id } }
└─ Eval: NameExpression { Identifier = Token { Start = 32, Memory = c, Type = Id } }

Semantic Analysis

Now that we have our AST, the next phase of the compiler is semantic analysis.

The role of semantic analysis can change depending on how you define your language. Your language might, like F#, provide rich type inference and strict type checking, and this phase is where you'd do most of that checking. You can also do some high-level optimization during this phase, like constant folding, sketched below.
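Here is a minimal constant-folding sketch over this notebook's AST. It is not part of the compiler we're building; it only illustrates the kind of rewrite this phase could perform:

static Expression Fold(Expression expression) => expression switch
{
    BinaryExpression b => (Fold(b.Left), Fold(b.Right)) switch
    {
        // both children folded to constants: evaluate the operator at compile time
        (IntLiteralExpression l, IntLiteralExpression r) => new IntLiteralExpression(b.Operator.Type switch
        {
            TokenType.Add => l.Value + r.Value,
            TokenType.Subtract => l.Value - r.Value,
            TokenType.Multiply => l.Value * r.Value,
            TokenType.Divide => l.Value / r.Value,
            _ => throw new NotImplementedException()
        }),
        // otherwise keep the node, but with folded children
        (var l, var r) => new BinaryExpression(l, b.Operator, r)
    },
    _ => expression
};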

For our purposes, we're just going to do the important yet humble act of making sure we don't use a variable before it is defined and don't define one multiple times.

#nullable enable
static List<string>? GetSymbols(Program program)
{
    var symbols = new List<string>();
    foreach (Statement statement in program.Statements.StatementList)
    {
        if (statement is AssignmentStatement assignment)
        {
            // multiply defined
            if (symbols.Contains(assignment.Identifier.Memory.ToString()))
                return null;
            // the right-hand side may only use already-defined variables
            if (IsValidExpression(symbols, assignment.Expression) is false)
                return null;
            symbols.Add(assignment.Identifier.Memory.ToString());
        }
    }
    // the final expression may also only use defined variables
    if (IsValidExpression(symbols, program.Eval) is false)
        return null;
    return symbols;
    static bool IsValidExpression(List<string> symbols, Expression expression)
        => expression switch
        {
            NameExpression name => symbols.Contains(name.Identifier.Memory.ToString()),
            BinaryExpression binary => IsValidExpression(symbols, binary.Left) 
                && IsValidExpression(symbols, binary.Right),
            IntLiteralExpression _ => true,
            _ => false
        };
}
var symbols = GetSymbols(program);
symbols
[ a, b, c ]

Codegen

For codegen, we're going to be using AsmResolver to emit CIL.

#r "nuget: AsmResolver.DotNet, 5.4.0"
using AsmResolver.DotNet;
using AsmResolver.DotNet.Code.Cil;
using AsmResolver.DotNet.Signatures;
using AsmResolver.DotNet.Signatures.Types;
using AsmResolver.PE.DotNet.Cil;
using AsmResolver.PE.DotNet.Metadata.Tables.Rows;
Installed Packages
  • AsmResolver.DotNet, 5.4.0

We're going to need to write some boilerplate to define an assembly and module, and define our entry point.

var assembly = new AssemblyDefinition(
    name: "MyAssembly",
    version: new(1, 0, 0, 0));
var runtime = DotNetRuntimeInfo.Parse(".NETCoreApp,Version=v7.0");
var module = new ModuleDefinition("MyAssembly.exe", runtime.GetDefaultCorLib());
var factory = module.CorLibTypeFactory;
assembly.Modules.Add(module);
var programType = new TypeDefinition(
    ns: "", 
    name: "Program", 
    TypeAttributes.Class,
    baseType: factory.Object.Type);
module.TopLevelTypes.Add(programType);
var mainMethod = new MethodDefinition(
    name: "Main",
    MethodAttributes.Public | MethodAttributes.Static,
    signature: MethodSignature.CreateStatic(returnType: module.CorLibTypeFactory.Void));
var methodBody = new CilMethodBody(mainMethod);
mainMethod.CilMethodBody = methodBody;
var write = 
    module.DefaultImporter.ImportType(typeof(Console))
    .CreateMemberReference("Write", MethodSignature.CreateStatic(
        factory.Void, factory.Int32))
    .ImportWith(module.DefaultImporter);

Symbol Table

CIL uses a stack evaluation model, meaning that we have an evaluation stack to push values onto, plus a set of local variables per method (effectively unlimited for our purposes; the short-form ldloc.s/stloc.s instructions we emit below can address 256 of them). The .NET runtime then JIT-compiles this IL to native code at runtime, though it is also possible to compile it to native code ahead of time (AOT).

Inside our method body, we need to declare our symbols as local variables, and then set up a Dictionary<string, int> mapping each name to its local-variable index to reference during codegen.

What I am creating with this Dictionary is called a symbol table, albeit a very simplified, watered-down version. In real programming languages, you can have any number of scopes, and a variable can share a name with another as long as each is uniquely defined within its own scope. A data structure for that model looks more like a Dictionary where each value is a list of definitions, one per enclosing scope - see the sketch after the note below.

Note: In most languages, module-like scopes can be named, meaning that identifiers inside modules must be qualified with the name of their containing scope. Some languages, like F#, Java, and TypeScript, have a notion of "opening" a module, which means adding all the items inside it to the lexical environment it's opened in.
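Here is a hypothetical sketch of such a scoped symbol table (Define/Lookup are invented names for illustration; the compiler below sticks with the flat Dictionary):

var scoped = new Dictionary<string, Stack<int>>();

void Define(string name, int slot)
{
    if (!scoped.TryGetValue(name, out var definitions))
        scoped[name] = definitions = new();
    definitions.Push(slot); // the innermost definition shadows the outer ones
}

int? Lookup(string name)
    => scoped.TryGetValue(name, out var definitions) && definitions.Count > 0
        ? definitions.Peek()
        : null;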

var symbolTuples = symbols
    .Select((s, i) => (Var: new CilLocalVariable(factory.Int32), Id: s, Index: i))
    .ToArray(); // materialize so re-enumerating doesn't create duplicate locals
foreach (var tpl in symbolTuples)
{
    methodBody.LocalVariables.Add(tpl.Var);
}
var symbolTable = symbolTuples.ToDictionary(k => k.Id, v => v.Index);

Now we're ready to start generating code.

You might recall that our language always ends with an expression. This type of pattern is common when using languages in a scripting environment, such as Python in an interactive session or C# Script. For our purposes, I didn't want to add the complexity of implementing calls to .NET methods, so in every program, Console.Write is called on this final expression. This gives us an easy way to test our programs, but it is obviously limiting, as we cannot call Console.Write at arbitrary points in our code.

We have n statements and one expression at the end, and it would make sense that - since we don't have functions - we generate code for our language in order.

So our general algorithm would look something like


for each statement s
    codegen(s)
codegen(final expression)
codegen(call `Console.Write`)

The way a stack evaluation model works is that the "current" value is always at the top of the stack. Looking at the last two steps above, we never "pass" the final expression to Console.Write; instead, we push the argument onto the stack and then call Console.Write, which pops it. This is also how we handle assignment statements: push the expression's value onto the stack, then pop it into the local variable being assigned.
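Concretely, for a = 2; the generated CIL looks like this (see the full listing at the end of the notebook):

ldc.i4 2      // push the constant 2
stloc.s V_0   // pop it into local 0 (a)

and the final Console.Write(c) becomes:

ldloc.s V_2   // push local 2 (c)
call System.Void System.Console::Write(System.Int32)  // pops its argument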

First, we define a method for evaluating expressions and pushing them onto the stack:

List<CilInstruction> Codegen(Expression expression)
    => expression switch
    {
        NameExpression s => new List<CilInstruction>()
        {
            new CilInstruction(CilOpCodes.Ldloc_S, methodBody.LocalVariables[symbolTable[s.Identifier.Memory.ToString()]]),
        },
        BinaryExpression s => ((Func<List<CilInstruction>>)(() =>
        {    
            var l = new List<CilInstruction>();
            l.AddRange(Codegen(s.Left));
            l.AddRange(Codegen(s.Right));
            l.Add(new CilInstruction(s.Operator.Type switch
            {
                TokenType.Add => CilOpCodes.Add,
                TokenType.Subtract => CilOpCodes.Sub,
                TokenType.Multiply => CilOpCodes.Mul,
                TokenType.Divide => CilOpCodes.Div,
                _ => throw new NotImplementedException()
            }));
            return l;
        }))(),
        IntLiteralExpression s => new List<CilInstruction>()
        {
            new CilInstruction(CilOpCodes.Ldc_I4, s.Value),
        },
        _ => throw new NotImplementedException()
    };

Then, to generate code for each statement (every statement is an assignment statement), we emit code to push the expression's value onto the stack, followed by code to pop the stack into the corresponding local variable.

var il = methodBody.Instructions;
il.AddRange(
    program.Statements.StatementList
        .SelectMany(s => s switch
            {
                AssignmentStatement { Identifier: { Memory: var memory }, Expression: Expression expression } => ((Func<List<CilInstruction>>)(() =>
                {    
                    var l = new List<CilInstruction>();
                    l.AddRange(Codegen(expression));
                    l.Add(new CilInstruction(CilOpCodes.Stloc_S, methodBody.LocalVariables[symbolTable[memory.ToString()]]));
                    return l;
                }))(),
                _ => throw new NotImplementedException()
            }    
        )
);

And finally, we evaluate the last expression and call Console.Write:

il.AddRange(Codegen(program.Eval));
il.Add(CilOpCodes.Call, write);
il.Add(new CilInstruction(CilOpCodes.Ret));

Now we can set our mainMethod as our entry point

programType.Methods.Add(mainMethod);
module.ManagedEntryPointMethod = mainMethod;

And write our assembly (it is a framework-dependent assembly, so we also need to write a *.runtimeconfig.json)

assembly.Write("MyAssembly.exe");
System.IO.File.WriteAllText("MyAssembly.runtimeconfig.json",
"""
{
   "runtimeOptions": {
     "tfm": "net8.0",
     "framework": {
       "name": "Microsoft.NETCore.App",
       "version": "8.0.0"
     }
   }
}
"""
);

Now, if we run our assembly with PowerShell:

dotnet MyAssembly.exe
15

To inspect our compiled code, I prefer using ILSpy (you can install it on Windows with winget install ILSpy), but if you don't want to install that tool, we can pretty-print our instructions with AsmResolver like so (taken from here):

var formatter = new CilInstructionFormatter();
foreach (CilInstruction instruction in methodBody.Instructions)
    Console.WriteLine(formatter.FormatInstruction(instruction));
IL_0000: ldc.i4 2
IL_0005: stloc.s V_0
IL_0007: ldc.i4 3
IL_000C: stloc.s V_1
IL_000E: ldc.i4 3
IL_0013: ldloc.s V_0
IL_0015: add
IL_0016: ldloc.s V_1
IL_0018: mul
IL_0019: stloc.s V_2
IL_001B: ldloc.s V_2
IL_001D: call System.Void System.Console::Write(System.Int32)
IL_0022: ret
