Bootstrapping the Lys compiler

Recipe for yogurt: Add yogurt to milk.

Self hosting a compiler may be the ultimate goal of every new language.

The process of writing a the next version of a compiler and compiling itself is called bootstrap the compiler.

I decided not to continue developing Lys' new features until it is self hosted, everything from now on should be either a bug fix or adding some helpers to the standard library.

The plan

I came up with a staged plan to do it:

Write a parser library^parser
Write Lys' grammar in it^new-lys-grammar
Parse the whole standard library and tests correctly
Iteratively implement every phase of the compiler until it is capable of run some tests

Compiler phases

Parsing phase:
- Read source code and emit canonical representation of the source PR
- Detect and print syntax errors PR
Semantic phase:
- Syntax sugar expansion
- Code generation
Name initialization phase:
- Detects implicit imports (like in fully qualified names support::env::printf(..))
- Creates scopes for every relevant node and then register the name declarations
- Injects standard library core imports
Scope phase:
- Finds and resolves references
- Validates loop statements (loop, continue, break)
- Validates assignments' mutability
Type initialization phase:
- This phase initializes types and traits directives.
Type checker phase:
This phase runs iteratively until the result of the phase equals the result of the previous run.
- Infers the type of every node.
- If changes in a module are detected, the module is invalidated and added to a stack of pending modules.
Pre compilation phase:
- Detects and annotates features like tail-call functions
Compilation phase:
- Resolves local numbers inside functions
- Emits WAST code
- Uses binaryen to parse and optimize the output
- Emits the final code

The parser

Building parsers is a topic of study itself, the first Lys parser was a PEG^PEG parser using a library^node-ebnf made for the specific use case.

The previous parser had serious ambiguity and backtracking problems with keywords and name identifiers, take a look at this rule:

NameIdentifier ::= !KEYWORD [a-zA-Z][a-zA-Z0-9_$]*
KEYWORD        ::= 'fun' | 'private' | ...

The problem may not be clear enough, but to read a NameIdentifier it first looks trough a list of other things that are also a name identifier. It produces a costly look-ahead and a lot of backtracking.

That problem can be fixed if we use a tokenizer or lexer^lexer as a first stage of parsing. In that case, instead of using parsing expressions ([a-zA-Z][a-zA-Z0-9_$]*) to generate our grammar, we use named tokens. Our end grammar should look like this:

NameIdentifier ::= <IdentifierToken>
KEYWORD        ::= <KeywordToken>

Which is obviously simpler.

That led me to the first parser design decision for Lys: a tokenizer is required.

Creating the tokenizer

Depending on the tokens, creating tokenizers is a very simple task, this is the basic algorithm:

Read a char
Find a token that may start with that char
Read the rest of the token, char by char
Return the token, go to step 1 until we reach the end of the file

When you read a character, you also move the reading cursor forward and then stop when we reach the end of the string.

This is a reduced version of a tokenizer that only reads tokens and whitespaces:

enum TokenType {
  Word
  Whitespace
  EndOfFile
}

struct Token(tokenType: TokenType, start: u32, end: u32)

struct Lexer(source: string, pos: u32)

impl Lexer {
  #[method]
  fun eat(self: Lexer): Token = {
    if (self.pos >= self.source.length)
      Token(EndOfFile, self.source.length, self.source.length)
    else {
      // Match the [self.pos] character of self.source
      val eatenToken = match self.source[self.pos] {
        case 0x20 -> eatWhitespace(self) // 0x20 == whitespace
        else      -> eatToken(self)
      }
      // we got a valid token, move self.pos forward
      self.pos = eatenToken.end
      // return the eaten token
      eatenToken
    }
  }

  private fun eatWhitespace(self: Lexer): Token = {
    var start = self.pos
    var end = end
    val len = self.source.length
    loop {
      if (end >= len)
        break

      // eat until we reach a non-whitespace character
      if (self.source[end] != 0x20)
        break

      end = end + 1 as u32
      continue
    }
    Token(Whitespace, start, end)
  }

  private fun eatToken(self: Lexer): Token = {
    var start = self.pos
    var end = end
    val len = self.source.length
    loop {
      if (end >= len)
        break

      // eat until we reach a whitespace character
      if (self.source[end] == 0x20)
        break

      end = end + 1 as u32
      continue
    }
    Token(Word, start, end)
  }
}

Creating a parsing library

Back in my days at DataWeave^dw, we used a parser library called parboiled2^parboiled, it is a very good scala library to create PEG^PEG parsers using parsing combinators^parser-combinators. It allowed us to create our grammars using scala functions, in contrast to ebnf which is another animal that needs to be parsed in runtime to generate a non optimized, blackbox parser like we do now.

Using parboiled2 as inspiration, I decided to create a parsing library in Lys that allows me to create an expressive and rich grammar, or at least not limited by the limitations of an old BNF.

The parser must consume tokens and produce an AST^ast. To do so, it should provide expressive tools to emit nodes of the AST itself.

Abstract Syntax Tree (AST)

The AST will be entirely composed by:

enum AstNode {
  // Some rules will push empty values
  Rule0

  // Terminal values, represented by a token
  Leaf(token: Token, value: string)

  // A node of the AST, it is allowed to have a child
  Node(name: string, child: AstNode, start: u32, end: u32)

  // When representing lists, an AstCons is used
  AstCons(head: AstNode, tail: AstNode)

  // When we find a mismatch we produce an UnexpectedToken and continue parsing
  UnexpectedToken(token: Token, value: string)
}

To explain how the AST looks like with an example, imagine we have a parser for equations capable of reading this:

;(1 * 2) / 4

It can be represented as:

AstCons(
  AstCons(
    Leaf(Number, "1"),
    AstCons(
      Leaf(Operator, "+"),
      Leaf(Number, "2")
    )
  ),
  AstCons(
    Leaf(Operator, "/"),
    Leaf(Number, "4")
  )
)

Parsing rules

To produce the AST we need a set of parsing rules, the rules are trees and the parser uses the rules to match against the input tokens.

The responsibility of the rules are

Match the tokens in a tree structure
Emit the nodes of the AST

enum ParserRule {
  /*
   * Matches a tokenType from the lexer.
   * execution of this node returns Leaf | Nil
   */
  Terminal(tokenType: TokenType)

  /*
   * Matches a tokenType from the lexer and also its value.
   * execution of this node returns Leaf | Nil
   */
  StrictTerminal(tokenType: TokenType, value: string)

  /*
   * Matches a grammatical construction by name
   * execution of this node returns AstNode | Nil
   */
  NonTerminal(name: string)

  /*
   * Matches the first rule,
   * if it returns Nil, matches the second rule and returns its value
   * execution of this node returns AstNode | Nil
   */
  Or(lhs: ParserRule, rhs: ParserRule)

  /*
   * Matches the rule, one or more times.
   * Returns an AstCons with the list of results.
   * execution of this node returns AstCons | AstNode | Nil
   */
  OneOrMore(rule: ParserRule)

  /*
   * Matches the rule, zero or more times.
   * Returns an AstCons with the list of results.
   * execution of this node returns AstCons | AstNode | Rule0
   */
  ZeroOrMore(rule: ParserRule)

  /*
   * Matches the head rule, then it tries to match the tail rule.
   * Returns an AstCons with the list of results.
   * execution of this node returns AstCons | Nil
   */
  Cons(head: ParserRule, tail: ParserRule)

  /*
   * Matches the head rule, then it tries to match the tail rule.
   * If the tail rule fails, it will consume UnexpectedTokens until it matches the
   * tail rule.
   * Returns an AstCons with the list of results.
   * execution of this node returns AstCons | Nil
   */
  Cut(head: ParserRule, tail: ParserRule)

  /*
   * Matches the rule, if it doesn't match, returns Rule0.
   * execution of this node returns AstCons | Nil
   */
  Optional(rule: ParserRule)

  /*
   * Execution of this rule eats a token and returns a SyntaxError
   * execution of this node returns SyntaxError
   */
  Fail(message: string)

  /*
   * Matches the rule, returns Rule0 if it succeeds, Nil if it fails.
   * After matching the rule, it rewinds the cursor of the lexer.
   * execution of this node returns Rule0 | Nil
   */
  LookAhead(rule: ParserRule)

  /*
   * Matches the rule, returns Nil if it succeeds, Rule0 if it fails.
   * After matching the rule, it rewinds the cursor of the lexer.
   * execution of this node returns Rule0 | Nil
   */
  NegativeLookAhead(rule: ParserRule)

  /*
   * Matches the rule, returns Rule0 if it succeeds, Nil if it fails.
   * execution of this node returns Rule0 | Nil
   */
  Discard(rule: ParserRule)

  /*
   * Matches the rule, if it succeeds, pushes a named Node to the AST
   * execution of this node returns Node | Nil
   */
  Push(name: string, rule: ParserRule)

  /*
   * Matches the rule, if it succeeds and it has many children,
   * pushes a named Node to the AST
   * execution of this node returns Node | Nil
   */
  PushIfManyChildren(name: string, rule: ParserRule)
}

And to organize the rules we need a grammatic construction:

enum Grammar {
  // Named grammatical constructions, these are referenced from NonTerminal(name)
  // rules
  Nominal(name: string, rule: ParserRule)
  // We need to mock a list, to do so we have a GrammarConj
  GrammarConj(tail: Grammar, head: Nominal)
}

impl Grammar {
  // Implement a simple DSL of Nominal(..) ++ Nominal(..) to produce GrammarConj
  fun ++(tail: Grammar, head: Nominal): Grammar = GrammarConj(tail, head)
}

impl Nominal {
  // Implement a simple DSL of Nominal(..) ++ Nominal(..) to produce GrammarConj
  fun ++(tail: Grammar, head: Nominal): Grammar = GrammarConj(tail, head)
}

With those tools at hand we can come up with a simple parser to explain how it works:

// WS? rule WS?
fun spacedRule(rule: ParserRule): ParserRule =
  Cons(Optional(NonTerminal("WS")), Cons(rule, Optional(NonTerminal("WS"))))


fun parseJson(str: string): AstNode = {
  val grammar =
    Nominal("WS", OneOrMore(Or(Terminal(Whitespace), Terminal(NewLine)))) ++
    Nominal("value",
      Or(NonTerminal("true"),
      Or(NonTerminal("false"),
      Or(NonTerminal("null"),
      Or(NonTerminal("string"),
      Or(NonTerminal("number"),
      Or(NonTerminal("object"), NonTerminal("array")))))))
    ) ++
    Nominal("false", StrictTerminal(Keyword, "false")) ++
    Nominal("true", StrictTerminal(Keyword, "true")) ++
    Nominal("null", StrictTerminal(Identifier, "null")) ++
    Nominal("string", Terminal(StringLiteral)) ++
    Nominal("number", Terminal(NumberLiteral)) ++
    Nominal("member",
      Cons(
        NonTerminal("string"),
        Discard(NonTerminal("COLON")),
        NonTerminal("value")
      )
    ) ++
    Nominal("object",
      Cons(
        NonTerminal("OBJ_OPEN"),
        Optional(
          Cons(
            NonTerminal("member"),
            ZeroOrMore(Cons(NonTerminal("COMMA"), NonTerminal("member")))
          )
        ),
        NonTerminal("OBJ_CLOSE")
      )
    ) ++
    Nominal("array",
      Cons(
        NonTerminal("ARR_OPEN"),
        Optional(
          Cons(
            NonTerminal("value"),
            ZeroOrMore(Cons(NonTerminal("COMMA"), NonTerminal("value")))
          )
        ),
        NonTerminal("ARR_CLOSE")
      )
    ) ++
    Nominal("OBJ_OPEN", Discard(spacedRule(Terminal(CurlyBracesOpen)))) ++
    Nominal("OBJ_CLOSE", Discard(spacedRule(Terminal(CurlyBracesClose)))) ++
    Nominal("ARR_OPEN", Discard(spacedRule(Terminal(VectorOpen)))) ++
    Nominal("ARR_CLOSE", Discard(spacedRule(Terminal(VectorClose)))) ++
    Nominal("COLON", Discard(spacedRule(StrictTerminal(Operator, ":")))) ++
    Nominal("COMMA", Discard(spacedRule(Terminal(Comma))))

    val lexer = Lexer(source)
    val parser = Parser(lexer, grammar)
    parse("value", parser, 0)
}

The full code of the parser may be very long for this post, you can find the full version it here ^parser-code

Once the parser produces the AST, it can be feeded to the next stage and we can continue with our compiler.

To be continued

This post will be updated with new chapters of the bootstrapping Lys tale