How I Made It - Francis' Lisp Compiler

Project Highlights

Everything was written by me from scratch
Compiles to fast JVM bytecode, making Java-like performance possible
Tokenizer, parser and code generator written from scratch. No Lex/Yacc or other parser generators were used
Only external library is ASM, a popular Java library for generating Java Virtual Machine (JVM) bytecode

How It Works

All compilers primarily do three things; scanning, parsing and code generation. I will therefore begin by explaining the design of my compiler's scanner.

What Is a Scanner?

Scanners, also known as tokenisers, break a stream of text into objects known as tokens. Scanners are the first stage of any compiler.

A token is a piece of the input text after some additional context has been added, this additional context is required by the parser later on. I have included an example below to clarify things.

Consider the following input text, it arrives at the scanner as nothing more than a sequence of characters.

(+ 100 200 300)

The scanner consumes this input, one character at a time and produces tokens. The first token will be the opening paranthese, the fact this first token is a single character is just a coincidence, tokens are often longer than a single character.

The second token will be the + symbol, the third token will be the number 100, the fourth token will be the number 200 and so on. Whitespace is sometimes important to scanners and sometimes it isn't, that just depends on the language you're scanning. Whitespace is not significant in Lisp so it is ignored by my scanner.

The complete set of tokens for the previous input string will look similar to the following.

Token{type=LPAREN, value="("},
Token{type=SYMBOL, value="+"},
Token{type=NUMBER, value="100"},
Token{type=NUMBER, value="200"},
Token{type=NUMBER, value="300"},
Token{type=RPAREN, value=")"}

The Design of My Scanner

My scanner is not novel, its design is straightforward and follows a well known pattern. My scanner is a single Java class called, Scanner, with two methods, next() and peek().

next() consumes one token worth of the input string and returns the next token, this design is efficient because at no point do I store all of the tokens in memory, that is unnecessarily wasteful. The scanner's next() method is repeated called, as and when, the next token is required.

peek() is another important method, it's required by the parser to 'peek' ahead at what the next token will be without actually creating the token. To parse any non-regular language at least one token of lookahead is required, the peek method serves this purpose.

class Scanner {
    public Token next() {
        ...
    }

    public Token peek() {
        ...
    }
}

The outline of my Lisp's Scanner class

You can find the full code of my Scanner class on my GitHub here https://github.com/FrancisMcN/fsp/blob/main/src/main/java/ie/francis/lisp/scanner/Scanner.java

What Is a Parser?

A parser is the second stage of a compiler, it consumes the tokens produced by the scanner.

A parser consumes tokens and builds a parse tree by following the rules specified by the langauge's grammar. If a parse tree can be constructed successfully we know the original input string conforms to the grammar, if a parse tree cannot be constructed we know a syntax error has been detected.

One of the great properties of Lisp is the ease with which it can be parsed. As you might have noticed from my example code on the previous page, almost everything in Lisp is a list. Once we can parse that basic list structure we can basically parse any Lisp program.

Designing the Grammar

Every language has a grammar, both human languages and programming languages alike. There's a special meta-language for describing grammars called Extended Backus-Naur Form (EBNF), the grammar for my Lisp implementation is show below.

program = expr*;
expr = list | atom | "'" expr;
list = "(" expr* ")";
atom = ε | SYMBOL | STRING | BOOLEAN | NUMBER;

The grammar my parser recognises in EBNF notation

Some explaining is in order. Every lowercase word in the EBNF grammar is called a non-terminal, non-terminals are defined on the left-hand side of the equals sign and become the symbols on the right-hand side during parsing.

For example, in the above grammar an 'atom' can become a STRING, BOOLEAN, NUMBER, SYMBOL or the empty string. The uppercase words are called terminals, each terminal corresponds to one of the tokens produced by the scanner shown earlier.

A Parsing Example

Let's again look back to the earlier example string and watch the rules the parser will invoke to parse it.

(+ 100 200 300)

Original input string before parsing

Starting with the first token, the opening parenthese, by calling scanner.peek(). The parser begins with its first rule, the 'program' non-terminal and tries to consume the input tokens until there are no tokens remaining.

The 'program' rule, in English, says 'a program is zero or more exprs'.

'expr' is another non-terminal so the parser moves to the expr rule, the token under consideration remains the opening parenthese. The token under consideration won't change until it's been matched with a terminal symbol.

The 'expr' rule, in English, says 'an expr is either a list or an atom or a single-quote followed by an expr'.

The parser will try each rule in order, the parser will therefore begin with the 'list' non-terminal.

A 'list' rule, in English, says 'a list is an opening parenthese followed by zero or more exprs followed by a closing parenthese'.

Finally, the parser has found a terminal, better still, the terminal matches the token under consideration. The token is therefore consumed by calling scanner.next().

+ 100 200 300)

Input string after consuming the first opening parenthese

Now, the token under consideration is the '+' symbol and the rule the parser matches is 'expr' because expr* comes after "(" in the 'list' rule.

The parser goes back to the expr rule, again seeing an expr is a list or an atom or a single-quote followed by another expr. The parser again tries to use the 'list' rule however this time it finds the token under consideration doesn't match the first terminal in the 'list' rule. The '+' symbol doesn't match '('.

Therefore the parser can't continue, so it backtracks. It backs out of the expr rule, goes back up to the list rule and tries the next non-terminal which in this case the 'atom' rule.

An atom, in English, is either the empty string, a SYMBOL token, a STRING token, a BOOLEAN token or a NUMBER token.

Again, the parser tries each rule in order and eventually finds the token under consideration which is the '+' symbol matches the SYMBOL token in the atom rule.

The parser consumes the '+' symbol by calling scanner.next(), jumps out of the atom rule and then out of the expr rule.

100 200 300)

The input string after consuming the '+' symbol

Something interesting happens next, the parser is still parsing the original 'list' rule. It has parsed "(" and 'expr*', the parser doesn't move onto ")" yet though.

The star after expr means match zero or more occurrences, the parser therefore enters the expr rule again. The parser will only stop re-entering the expr rule whenever the token under consideration isn't a valid 'expr' or whenever the parser runs out of input.

The parser finds the next three tokens are all valid exprs and consumes them (because an expr can be an atom which can be a NUMBER token). The input string is now shown below.

The input string after consuming the remaining numbers

The parser now returns to the original 'list' rule and finds only one terminal needs to be matched. As luck would have it, the token under consideration is now the closing parenthese which is the remaining terminal the parser wanted to find.

The list rule has been successfully parsed, the parser backtracks again back to the original expr rule and from there backtracks again to the original 'program' rule. The program rule has been parsed successfully and the entire input has been consumed, therefore parsing was successful.

The Design of My Parser

I implemented a simple recursive descent parser by hand, it's common to use table-driven machine generated parsers but recursive descent parsers are easy to write and I didn't want to be tied to an external tool to build my code.

Recursive descent parsers are effective and easy to write, they aren't powerful enough to recognise all languages but they're more than sufficient to parse Lisp.

A recursive descent parser can be written by repeatedly applying some simple rules.

Each non-terminal is a method in the parser.
Match zero or more, expr* for example, becomes a while loop.
Match one or more becomes a do-while loop.
Match zero or one becomes an if-statement.

That's all there is to it. I've included an incomplete snippet of Java code for an example Lisp parser below.

class Parser {
    public Object parse() {
        return program();
    }
    private Object program() {
        ...
    }

    private Object expr() {
        ...
    }

    private Object list() {
        scanner.next(); // Consume first "("
        List<Object> exprs = new ArrayList<>();
        while (isExpr(scanner.peek())) {
            exprs.add(expr());
        }
        scanner.next(); // Consume last ")"
        return exprs;
    }
    private Object atom() {
        switch (scanner.peek().type()) {
            case SYMBOL:
                return new Symbol(scanner.next().value());
            ...
            default:
                throw new SyntaxErrorException("found a syntax error!");
        }
    }
}

A recursive descent parser can be written by following a few simple rules

You can find the full code of my Parser class on my GitHub here https://github.com/FrancisMcN/fsp/blob/main/src/main/java/ie/francis/lisp/parser/Parser.java

Once parsing is complete and a parse tree has been generated it's time to start generating machine code. The 'machine' my compiler targets doesn't exist in the physical world, it's a virtual machine called the Java Virtual Machine (JVM).

Why Target the Java Virtual Machine?

The JVM was originally intended to host Java itself but has since become a platform to host many different kinds of languages, such as Scala and Clojure.

Targeting the JVM provides languages with many advantages.

JVM bytecode is portable which means it will run anywhere a JVM will run. My Lisp can be compiled once, to JVM bytecode and will then run anywhere a JVM will run (which is basically everywhere).
My Lisp benefits from the JVM's optimisations for free. The JVM is a just-in-time compiler, it compiles bytecode into native code at runtime, continuously profiling and optimising to produce the most efficient native code at runtime.
My Lisp can leverage all pre-existing Java code. Once my Lisp is compiled to bytecode it looks like any other Java program to the JVM, therefore calling other Java code from my Lisp is trivial.

Code Generation

Code generation is a complicated topic but there's a library for Java which makes this easy. The library is called ASM and is already used by many Java projects to manipulate JVM class files.

Class files are the fundamental compilation unit for the JVM, their closest conterparts are object files in C and C++. In Lisp the unit of compilation is the function, therefore in my Lisp each function becomes its own Java class.

Each compiled function becomes a Java class implementing an interface called Lambda. The Lambda interface has many methods each named call(), which method is used depends on the number of arguments passed into the function.

The primary reason for having so many implementations of call() is performance, a secondary advantage of this design is it supports adding function overloading to my Lisp in future. A variadic function is one where the number of arguments is variable, in Java variadic arguments are passed into functions as arrays which creates additional overhead for each function call. My design therefore limits this overhead to function calls involving more than five arguments.

public interface Lambda {
    Object call();
    Object call(Object arg);
    Object call(Object arg, Object arg2);
    Object call(Object arg, Object arg2, Object arg3);
    Object call(Object arg, Object arg2, Object arg3, Object arg4);
    Object call(Object arg, Object arg2, Object arg3, Object arg4, Object arg5);
    Object call(Object[] args);
}

Every function in my Lisp becomes a Java class implementing the Lambda interface

How Functions Work in My Lisp

Consider the '+' function. In my Lisp this is implemented in the Plus class. The Plus class implements the Lambda interface and provides implementations for adding numbers together, an excerpt is shown below

public class Plus extends BaseLambda implements Lambda {
    ...

    @Override
    public Object call(Object arg, Object arg2) {
        return (Integer) arg + (Integer) arg2;
    }

    ...
}

The following example Lisp code

(+ 1 2 3)

An example function call in my Lisp

compiles down to this Java code

new Apply().call(Environment.get(new Symbol("+")), 1, 2, 3);

The Java code corresponding to my Lisp code after it's been compiled

which is represented by this JVM bytecode

public java.lang.Object call();
Code:
    0: new           #14  // class ie/francis/lisp/function/Apply
    3: dup
    4: invokespecial #15  // Method ie/francis/lisp/function/Apply."":()V
    7: new           #17  // class ie/francis/lisp/type/Symbol
    10: dup
    11: ldc           #19  // String +
    13: invokespecial #22  // Method ie/francis/lisp/type/Symbol."":(Ljava/lang/String;)V
    16: invokestatic  #28  // Method ie/francis/lisp/Environment.get:(Lie/francis/lisp/type/Symbol;)Ljava/lang/Object;
    19: ldc           #29  // int 1
    21: invokestatic  #35  // Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
    24: ldc           #36  // int 2
    26: invokestatic  #35  // Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
    29: ldc           #37  // int 3
    31: invokestatic  #35  // Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
    34: invokevirtual #40  // Method ie/francis/lisp/function/Apply.call:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
    37: areturn
}

Compiled JVM bytecode corresponding to the Lisp code (+ 1 2 3)

Something of note is how each function call is actually converted by the compiler into a call to the builtin function 'apply'. Apply is a traditional Lisp function used to call other Lisp functions. Using 'apply' delays name resolution of the function until runtime which is important when implementing recursive functions.

(+ 1 2 3) actually becomes (apply + 1 2 3) inside the compiler. You can see this in the JVM bytecode, the very first instruction creates an instance of the Apply class.

Since function name resolution occurs at runtime the following code actually compiles but will not run, without apply however, recursive functions would not be possible.

> (func hello () (non-existant-function 1 2 3))
nil
> (hello)
ie.francis.lisp.exception.UndefinedSymbolException: undefined symbol: non-existant-function
    at ie.francis.lisp.Environment.get(Environment.java:26)
    at Lambda288801326.call(Unknown Source)
    at ie.francis.lisp.function.Apply.call(Apply.java:16)
    at Lambda479264912.call(Unknown Source)
    at ie.francis.lisp.function.Eval.call(Eval.java:39)
    at ie.francis.lisp.Main.main(Main.java:77)

'non-existant-function' doesn't exist, so when 'hello' is called the code fails