Bootstrapping the Lys compiler
Self-hosting a compiler may be the ultimate goal of every new language.
The process of writing the next version of a compiler and having it compile itself is called bootstrapping the compiler.
I decided not to continue developing new Lys features until it is self-hosted; from now on, every change should be either a bug fix or a new helper in the standard library.
The plan
I came up with a staged plan to do it:
- Write a parser library^parser
- Write Lys' grammar in it^new-lys-grammar
- Parse the whole standard library and tests correctly
- Iteratively implement every phase of the compiler until it is capable of running some tests
Compiler phases
- Parsing phase
- Semantic phase:
  - Syntax sugar expansion
  - Code generation
- Name initialization phase:
  - Detects implicit imports (like in fully qualified names: support::env::printf(..))
  - Creates scopes for every relevant node and then registers the name declarations
  - Injects standard library core imports
- Scope phase:
  - Finds and resolves references
  - Validates loop statements (loop, continue, break)
  - Validates assignments' mutability
- Type initialization phase:
  - Initializes types and traits directives
- Type checker phase (runs iteratively until the result of the phase equals the result of the previous run):
  - Infers the type of every node
  - If changes in a module are detected, the module is invalidated and added to a stack of pending modules
- Pre compilation phase:
  - Detects and annotates features like tail-call functions
- Compilation phase:
  - Resolves local numbers inside functions
  - Emits WAST code
  - Uses binaryen to parse and optimize the output
  - Emits the final code
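To give a rough idea of how these phases chain together, here is a hypothetical sketch; every function name and the string return type are illustrative only, not the actual compiler API:
// Rough sketch of the pipeline. All the phase functions here are hypothetical.
fun compile(source: string): string = {
  // tokens -> AST
  val ast = parsingPhase(source)
  // syntax sugar expansion
  val desugared = semanticPhase(ast)
  // inject core imports, detect implicit imports, create scopes, register declarations
  val named = nameInitializationPhase(desugared)
  // resolve references, validate loops and assignments
  val scoped = scopePhase(named)
  // initialize types and traits, then infer types until a fixed point is reached
  val typed = typeCheckerPhase(typeInitializationPhase(scoped))
  // annotate features like tail calls
  val annotated = preCompilationPhase(typed)
  // resolve locals, emit WAST, optimize with binaryen, emit the final code
  compilationPhase(annotated)
}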
The parser
Building parsers is a topic of study in itself; the first Lys parser was a PEG^PEG parser built with a library^node-ebnf made for that specific use case.
That previous parser had serious ambiguity and backtracking problems with keywords and name identifiers. Take a look at this rule:
NameIdentifier ::= !KEYWORD [a-zA-Z][a-zA-Z0-9_$]*
KEYWORD ::= 'fun' | 'private' | ...
The problem may not be obvious at first: to read a NameIdentifier, the parser first has to look through a list of keywords that are themselves valid name identifiers. That produces a costly look-ahead and a lot of backtracking.
That problem can be fixed if we use a tokenizer or lexer^lexer as a first stage of parsing. In that case, instead of using parsing expressions ([a-zA-Z][a-zA-Z0-9_$]*) to generate our grammar, we use named tokens. Our end grammar should look like this:
NameIdentifier ::= <IdentifierToken>
KEYWORD ::= <KeywordToken>
Which is obviously simpler.
That led me to the first parser design decision for Lys: a tokenizer is required.
Creating the tokenizer
Depending on the tokens, creating a tokenizer is a fairly simple task. This is the basic algorithm:
- Read a char
- Find a token that may start with that char
- Read the rest of the token, char by char
- Return the token, go to step 1 until we reach the end of the file
When we read a character, we also move the reading cursor forward, and we stop when we reach the end of the string.
This is a reduced version of a tokenizer that only reads word and whitespace tokens:
enum TokenType {
Word
Whitespace
EndOfFile
}
struct Token(tokenType: TokenType, start: u32, end: u32)
struct Lexer(source: string, pos: u32)
impl Lexer {
#[method]
fun eat(self: Lexer): Token = {
if (self.pos >= self.source.length)
Token(EndOfFile, self.source.length, self.source.length)
else {
// Match the [self.pos] character of self.source
val eatenToken = match self.source[self.pos] {
case 0x20 -> eatWhitespace(self) // 0x20 == whitespace
else -> eatToken(self)
}
// we got a valid token, move self.pos forward
self.pos = eatenToken.end
// return the eaten token
eatenToken
}
}
private fun eatWhitespace(self: Lexer): Token = {
var start = self.pos
var end = start
val len = self.source.length
loop {
if (end >= len)
break
// eat until we reach a non-whitespace character
if (self.source[end] != 0x20)
break
end = end + 1 as u32
continue
}
Token(Whitespace, start, end)
}
private fun eatToken(self: Lexer): Token = {
var start = self.pos
var end = start
val len = self.source.length
loop {
if (end >= len)
break
// eat until we reach a whitespace character
if (self.source[end] == 0x20)
break
end = end + 1 as u32
continue
}
Token(Word, start, end)
}
}
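As a quick usage sketch of the lexer above (assuming the #[method] attribute enables the lexer.eat() call syntax), we can keep calling eat until the EndOfFile token shows up:
// Sketch: count the tokens produced for a given source string.
fun countTokens(source: string): u32 = {
  val lexer = Lexer(source, 0 as u32)
  var tokens = 0 as u32
  loop {
    val token = lexer.eat()
    // the EndOfFile token starts at source.length, every other token starts before it
    if (token.start >= lexer.source.length)
      break
    tokens = tokens + 1 as u32
    continue
  }
  tokens
}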
Creating a parsing library
Back in my days at DataWeave^dw, we used a parser library called parboiled2^parboiled, a very good Scala library for creating PEG^PEG parsers using parser combinators^parser-combinators. It allowed us to create our grammars using Scala functions, in contrast to EBNF, which is another animal that needs to be parsed at runtime to generate a non-optimized, black-box parser, like we do now.
Using parboiled2 as inspiration, I decided to create a parsing library in Lys that lets me write an expressive and rich grammar, or at least one that is not constrained by the limitations of an old BNF.
The parser must consume tokens and produce an AST^ast. To do so, it should provide expressive tools to emit nodes of the AST itself.
Abstract Syntax Tree (AST)
The AST will be entirely composed of:
enum AstNode {
// Some rules will push empty values
Rule0
// Terminal values, represented by a token
Leaf(token: Token, value: string)
// A node of the AST, it is allowed to have a child
Node(name: string, child: AstNode, start: u32, end: u32)
// When representing lists, an AstCons is used
AstCons(head: AstNode, tail: AstNode)
// When we find a mismatch we produce an UnexpectedToken and continue parsing
UnexpectedToken(token: Token, value: string)
}
To explain what the AST looks like with an example, imagine we have a parser for equations capable of reading this:
(1 + 2) / 4
It can be represented as:
AstCons(
AstCons(
Leaf(Number, "1"),
AstCons(
Leaf(Operator, "+"),
Leaf(Number, "2")
)
),
AstCons(
Leaf(Operator, "/"),
Leaf(Number, "4")
)
)
Parsing rules
To produce the AST we need a set of parsing rules. The rules are trees, and the parser uses them to match against the input tokens.
The responsibilities of the rules are:
- Match the tokens in a tree structure
- Emit the nodes of the AST
enum ParserRule {
/*
* Matches a tokenType from the lexer.
* execution of this node returns Leaf | Nil
*/
Terminal(tokenType: TokenType)
/*
* Matches a tokenType from the lexer and also its value.
* execution of this node returns Leaf | Nil
*/
StrictTerminal(tokenType: TokenType, value: string)
/*
* Matches a grammatical construction by name
* execution of this node returns AstNode | Nil
*/
NonTerminal(name: string)
/*
* Matches the first rule,
* if it returns Nil, matches the second rule and returns its value
* execution of this node returns AstNode | Nil
*/
Or(lhs: ParserRule, rhs: ParserRule)
/*
* Matches the rule, one or more times.
* Returns an AstCons with the list of results.
* execution of this node returns AstCons | AstNode | Nil
*/
OneOrMore(rule: ParserRule)
/*
* Matches the rule, zero or more times.
* Returns an AstCons with the list of results.
* execution of this node returns AstCons | AstNode | Rule0
*/
ZeroOrMore(rule: ParserRule)
/*
* Matches the head rule, then it tries to match the tail rule.
* Returns an AstCons with the list of results.
* execution of this node returns AstCons | Nil
*/
Cons(head: ParserRule, tail: ParserRule)
/*
* Matches the head rule, then it tries to match the tail rule.
* If the tail rule fails, it will consume UnexpectedTokens until it matches the
* tail rule.
* Returns an AstCons with the list of results.
* execution of this node returns AstCons | Nil
*/
Cut(head: ParserRule, tail: ParserRule)
/*
* Matches the rule; if it doesn't match, returns Rule0.
* execution of this node returns AstNode | Rule0
*/
Optional(rule: ParserRule)
/*
* Execution of this rule eats a token and returns a SyntaxError
* execution of this node returns SyntaxError
*/
Fail(message: string)
/*
* Matches the rule, returns Rule0 if it succeeds, Nil if it fails.
* After matching the rule, it rewinds the cursor of the lexer.
* execution of this node returns Rule0 | Nil
*/
LookAhead(rule: ParserRule)
/*
* Matches the rule, returns Nil if it succeeds, Rule0 if it fails.
* After matching the rule, it rewinds the cursor of the lexer.
* execution of this node returns Rule0 | Nil
*/
NegativeLookAhead(rule: ParserRule)
/*
* Matches the rule, returns Rule0 if it succeeds, Nil if it fails.
* execution of this node returns Rule0 | Nil
*/
Discard(rule: ParserRule)
/*
* Matches the rule, if it succeeds, pushes a named Node to the AST
* execution of this node returns Node | Nil
*/
Push(name: string, rule: ParserRule)
/*
* Matches the rule, if it succeeds and it has many children,
* pushes a named Node to the AST
* execution of this node returns Node | Nil
*/
PushIfManyChildren(name: string, rule: ParserRule)
}
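As a small illustration of how these rules compose, here is a sketch that reuses the Number and Operator token types from the equation example above; the rule and node names are only illustrative:
// Sketch: a rule that matches <number> <operator> <number> and, when it
// matches, pushes a named "binaryOp" node to the AST.
fun binaryOpRule(): ParserRule =
  Push("binaryOp",
    Cons(Terminal(Number),
      Cons(Terminal(Operator), Terminal(Number))))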
And to organize the rules we need a grammatical construct:
enum Grammar {
// Named grammatical constructions, these are referenced from NonTerminal(name)
// rules
Nominal(name: string, rule: ParserRule)
// We need to mock a list, to do so we have a GrammarConj
GrammarConj(tail: Grammar, head: Nominal)
}
impl Grammar {
// Implement a simple DSL of Nominal(..) ++ Nominal(..) to produce GrammarConj
fun ++(tail: Grammar, head: Nominal): Grammar = GrammarConj(tail, head)
}
impl Nominal {
// Implement a simple DSL of Nominal(..) ++ Nominal(..) to produce GrammarConj
fun ++(tail: Grammar, head: Nominal): Grammar = GrammarConj(tail, head)
}
With those tools at hand we can come up with a simple parser to explain how it works:
// WS? rule WS?
fun spacedRule(rule: ParserRule): ParserRule =
Cons(Optional(NonTerminal("WS")), Cons(rule, Optional(NonTerminal("WS"))))
fun parseJson(str: string): AstNode = {
val grammar =
Nominal("WS", OneOrMore(Or(Terminal(Whitespace), Terminal(NewLine)))) ++
Nominal("value",
Or(NonTerminal("true"),
Or(NonTerminal("false"),
Or(NonTerminal("null"),
Or(NonTerminal("string"),
Or(NonTerminal("number"),
Or(NonTerminal("object"), NonTerminal("array")))))))
) ++
Nominal("false", StrictTerminal(Keyword, "false")) ++
Nominal("true", StrictTerminal(Keyword, "true")) ++
Nominal("null", StrictTerminal(Identifier, "null")) ++
Nominal("string", Terminal(StringLiteral)) ++
Nominal("number", Terminal(NumberLiteral)) ++
Nominal("member",
Cons(
NonTerminal("string"),
Discard(NonTerminal("COLON")),
NonTerminal("value")
)
) ++
Nominal("object",
Cons(
NonTerminal("OBJ_OPEN"),
Optional(
Cons(
NonTerminal("member"),
ZeroOrMore(Cons(NonTerminal("COMMA"), NonTerminal("member")))
)
),
NonTerminal("OBJ_CLOSE")
)
) ++
Nominal("array",
Cons(
NonTerminal("ARR_OPEN"),
Optional(
Cons(
NonTerminal("value"),
ZeroOrMore(Cons(NonTerminal("COMMA"), NonTerminal("value")))
)
),
NonTerminal("ARR_CLOSE")
)
) ++
Nominal("OBJ_OPEN", Discard(spacedRule(Terminal(CurlyBracesOpen)))) ++
Nominal("OBJ_CLOSE", Discard(spacedRule(Terminal(CurlyBracesClose)))) ++
Nominal("ARR_OPEN", Discard(spacedRule(Terminal(VectorOpen)))) ++
Nominal("ARR_CLOSE", Discard(spacedRule(Terminal(VectorClose)))) ++
Nominal("COLON", Discard(spacedRule(StrictTerminal(Operator, ":")))) ++
Nominal("COMMA", Discard(spacedRule(Terminal(Comma))))
val lexer = Lexer(str, 0 as u32)
val parser = Parser(lexer, grammar)
parse("value", parser, 0)
}
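As a usage sketch, and assuming the lexer provides all the token types referenced above, parsing a small document is a one-liner:
// Sketch: parse a small JSON document with the grammar defined above.
// The result is the AstNode produced by the "value" rule.
fun example(): AstNode = parseJson("[1, 2, null, false]")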
The full code of the parser would be too long for this post; you can find the full version here^parser-code.
Once the parser produces the AST, it can be fed to the next stage and we can continue with our compiler.
To be continued
This post will be updated with new chapters of the bootstrapping Lys tale.