All compilers primarily do three things: scanning, parsing and code generation. I will therefore begin by explaining the design of my compiler's scanner.
Scanners, also known as tokenisers, break a stream of text into objects known as tokens. Scanners are the first stage of any compiler.
A token is a piece of the input text after some additional context has been added; this additional context is required by the parser later on. I have included an example below to clarify things.
Consider the following input text, which arrives at the scanner as nothing more than a sequence of characters.
(+ 100 200 300)
The scanner consumes this input one character at a time and produces tokens. The first token will be the opening parenthesis. The fact that this first token is a single character is just a coincidence; tokens are often longer than a single character.
The second token will be the + symbol, the third token will be the number 100, the fourth token will be the number 200 and so on. Whitespace is sometimes important to scanners and sometimes it isn't; that depends on the language you're scanning. Whitespace is not significant in Lisp, so it is ignored by my scanner.
The complete set of tokens for the previous input string will look similar to the following.
Token{type=LPAREN, value="("},
Token{type=SYMBOL, value="+"},
Token{type=NUMBER, value="100"},
Token{type=NUMBER, value="200"},
Token{type=NUMBER, value="300"},
Token{type=RPAREN, value=")"}
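A token like the ones above can be modelled as a small immutable value: a type tag plus the matched text. This is only an illustrative sketch; the names below are assumptions, not my actual code.

```java
// A hypothetical token representation: a type tag plus the matched text.
enum TokenType { LPAREN, RPAREN, SYMBOL, NUMBER, STRING, BOOLEAN }

record Token(TokenType type, String value) {}
```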
My scanner is not novel; its design is straightforward and follows a well-known pattern. It is a single Java class called Scanner, with two methods, next() and peek().
next() consumes one token's worth of the input string and returns the next token. This design is efficient because at no point do I store all of the tokens in memory, which would be unnecessarily wasteful. The scanner's next() method is repeatedly called, as and when the next token is required.
peek() is the other important method; the parser requires it to 'peek' ahead at what the next token will be without actually consuming it. To parse any non-regular language, at least one token of lookahead is required, and the peek() method serves this purpose.
class Scanner {

    public Token next() {
        ...
    }

    public Token peek() {
        ...
    }
}
The outline of my Lisp's Scanner class
You can find the full code of my Scanner class on my GitHub here https://github.com/FrancisMcN/fsp/blob/main/src/main/java/ie/francis/lisp/scanner/Scanner.java
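The next()/peek() pattern can be sketched in a few dozen lines. The sketch below handles only the parenthesis/symbol/number subset from the earlier example; all of the names and details are my illustration here, not the real Scanner class.

```java
// A minimal sketch of a peek/next scanner for input like "(+ 100 200 300)".
// peek() buffers one token so the parser gets a single token of lookahead.
class SketchScanner {
    enum Type { LPAREN, RPAREN, SYMBOL, NUMBER }

    record Token(Type type, String value) {}

    private final String input;
    private int pos = 0;
    private Token buffered; // token produced by peek() but not yet consumed

    SketchScanner(String input) { this.input = input; }

    // Returns the next token without consuming it.
    Token peek() {
        if (buffered == null) {
            buffered = scan();
        }
        return buffered;
    }

    // Consumes and returns the next token.
    Token next() {
        Token t = peek();
        buffered = null;
        return t;
    }

    private Token scan() {
        // Whitespace is not significant, so skip it.
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
        if (pos >= input.length()) return null;
        char c = input.charAt(pos);
        if (c == '(') { pos++; return new Token(Type.LPAREN, "("); }
        if (c == ')') { pos++; return new Token(Type.RPAREN, ")"); }
        // Read characters until whitespace or a parenthesis.
        int start = pos;
        while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))
                && input.charAt(pos) != '(' && input.charAt(pos) != ')') {
            pos++;
        }
        String text = input.substring(start, pos);
        Type type = text.chars().allMatch(Character::isDigit) ? Type.NUMBER : Type.SYMBOL;
        return new Token(type, text);
    }
}
```

Calling next() in a loop over "(+ 100 200 300)" produces exactly the token sequence shown earlier, one token at a time, without ever holding the whole list in memory.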
A parser is the second stage of a compiler; it consumes the tokens produced by the scanner.
A parser consumes tokens and builds a parse tree by following the rules specified by the language's grammar. If a parse tree can be constructed successfully, we know the original input string conforms to the grammar; if a parse tree cannot be constructed, we know a syntax error has been detected.
One of the great properties of Lisp is the ease with which it can be parsed. As you might have noticed from my example code on the previous page, almost everything in Lisp is a list. Once we can parse that basic list structure we can basically parse any Lisp program.
Every language has a grammar, human languages and programming languages alike. There's a special meta-language for describing grammars called Extended Backus-Naur Form (EBNF); the grammar for my Lisp implementation is shown below.
program = expr*;
expr = list | atom | "'" expr;
list = "(" expr* ")";
atom = ε | SYMBOL | STRING | BOOLEAN | NUMBER;
The grammar my parser recognises in EBNF notation
Some explaining is in order. Every lowercase word in the EBNF grammar is called a non-terminal; non-terminals are defined on the left-hand side of the equals sign and expand into the symbols on the right-hand side during parsing.
For example, in the above grammar an 'atom' can become a STRING, BOOLEAN, NUMBER, SYMBOL or the empty string. The uppercase words are called terminals; each terminal corresponds to one of the token types produced by the scanner shown earlier.
Let's again look back to the earlier example string and watch the rules the parser will invoke to parse it.
(+ 100 200 300)
Original input string before parsing
The parser starts by calling scanner.peek() to look at the first token, the opening parenthesis. It begins with its first rule, the 'program' non-terminal, and tries to consume the input tokens until there are no tokens remaining.
The 'program' rule, in English, says 'a program is zero or more exprs'.
'expr' is another non-terminal, so the parser moves to the 'expr' rule; the token under consideration remains the opening parenthesis. The token under consideration won't change until it has been matched with a terminal symbol.
The 'expr' rule, in English, says 'an expr is either a list or an atom or a single-quote followed by an expr'.
The parser will try each alternative in order, so it begins with the 'list' non-terminal.
The 'list' rule, in English, says 'a list is an opening parenthesis followed by zero or more exprs followed by a closing parenthesis'.
Finally, the parser has found a terminal; better still, the terminal matches the token under consideration. The token is therefore consumed by calling scanner.next().
+ 100 200 300)
Input string after consuming the first opening parenthese
Now the token under consideration is the '+' symbol, and the rule the parser matches next is 'expr', because expr* comes after "(" in the 'list' rule.
The parser goes back to the 'expr' rule, again seeing that an expr is a list, an atom, or a single-quote followed by another expr. The parser again tries the 'list' rule, but this time it finds the token under consideration doesn't match the first terminal in the 'list' rule: the '+' symbol doesn't match '('.
The parser can't continue down this path, so it backtracks. It backs out of the 'list' rule, returns to the 'expr' rule and tries the next alternative, which in this case is the 'atom' rule.
An atom, in English, is either the empty string, a SYMBOL token, a STRING token, a BOOLEAN token or a NUMBER token.
Again, the parser tries each alternative in order and eventually finds that the token under consideration, the '+' symbol, matches the SYMBOL terminal in the 'atom' rule.
The parser consumes the '+' symbol by calling scanner.next(), jumps out of the atom rule and then out of the expr rule.
100 200 300)
The input string after consuming the '+' symbol
Something interesting happens next: the parser is still parsing the original 'list' rule. It has matched "(" and one 'expr', but the parser doesn't move on to ")" yet.
The star after expr means 'match zero or more occurrences', so the parser enters the 'expr' rule again. The parser only stops re-entering the 'expr' rule when the token under consideration isn't the start of a valid 'expr' or when the parser runs out of input.
The parser finds the next three tokens are all valid exprs and consumes them (because an expr can be an atom which can be a NUMBER token). The input string is now shown below.
)
The input string after consuming the remaining numbers
The parser now returns to the original 'list' rule and finds only one terminal remains to be matched. As luck would have it, the token under consideration is now the closing parenthesis, which is exactly the terminal the parser wanted to find.
The 'list' rule has been successfully parsed; the parser returns to the original 'expr' rule and from there to the original 'program' rule. The 'program' rule has been parsed successfully and the entire input has been consumed, so parsing was successful.
I implemented a simple recursive descent parser by hand. It's common to use table-driven, machine-generated parsers, but recursive descent parsers are easy to write and I didn't want to be tied to an external tool to build my code.
Recursive descent parsers are effective and easy to write. They aren't powerful enough to recognise all languages, but they're more than sufficient to parse Lisp.
A recursive descent parser can be written by repeatedly applying some simple rules: each non-terminal in the grammar becomes a method, each alternative becomes a branch chosen by peeking at the next token, and each starred repetition becomes a loop.
That's all there is to it. I've included an incomplete snippet of Java code for an example Lisp parser below.
class Parser {

    public Object parse() {
        return program();
    }

    private Object program() {
        ...
    }

    private Object expr() {
        ...
    }

    private Object list() {
        scanner.next(); // Consume first "("
        List<Object> exprs = new ArrayList<>();
        while (isExpr(scanner.peek())) {
            exprs.add(expr());
        }
        scanner.next(); // Consume last ")"
        return exprs;
    }

    private Object atom() {
        switch (scanner.peek().type()) {
            case SYMBOL:
                return new Symbol(scanner.next().value());
            ...
            default:
                throw new SyntaxErrorException("found a syntax error!");
        }
    }
}
A recursive descent parser can be written by following a few simple rules
You can find the full code of my Parser class on my GitHub here https://github.com/FrancisMcN/fsp/blob/main/src/main/java/ie/francis/lisp/parser/Parser.java
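To make the snippet above concrete, here is a self-contained, runnable sketch of the same recursive descent shape. To keep it short it works over a pre-scanned list of plain string tokens and skips the quote and atom-type handling; it is an illustration of the technique, not my real Parser class.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal recursive descent sketch over pre-scanned string tokens.
// Each grammar rule becomes a method; one token of lookahead (peek)
// decides which alternative to take.
class SketchParser {
    private final List<String> tokens;
    private int pos = 0;

    SketchParser(List<String> tokens) { this.tokens = tokens; }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }
    private String next() { return tokens.get(pos++); }

    // expr = list | atom
    Object expr() {
        if ("(".equals(peek())) return list();
        return next(); // atom: symbols and numbers stay as raw strings here
    }

    // list = "(" expr* ")"
    Object list() {
        next(); // consume "("
        List<Object> exprs = new ArrayList<>();
        while (peek() != null && !")".equals(peek())) {
            exprs.add(expr()); // the star in expr* becomes this loop
        }
        next(); // consume ")"
        return exprs;
    }
}
```

Parsing the tokens of (+ 1 (+ 2 3)) with this sketch yields the nested list [+, 1, [+, 2, 3]], which is exactly the parse tree shape the walkthrough above built by hand.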
Once parsing is complete and a parse tree has been generated, it's time to start generating machine code. The 'machine' my compiler targets doesn't exist in the physical world; it's a virtual machine called the Java Virtual Machine (JVM).
The JVM was originally intended to host Java itself but has since become a platform to host many different kinds of languages, such as Scala and Clojure.
Targeting the JVM provides languages with many advantages: garbage collection, a mature JIT compiler, portability across operating systems and access to the whole Java library ecosystem.
Code generation is a complicated topic, but there's a library for Java which makes this easy. The library is called ASM, and it's already used by many Java projects to manipulate JVM class files.
Class files are the fundamental compilation unit for the JVM; their closest counterparts are object files in C and C++. In Lisp the unit of compilation is the function, therefore in my Lisp each function becomes its own Java class.
Each compiled function becomes a Java class implementing an interface called Lambda. The Lambda interface has several overloaded methods named call(); which overload is used depends on the number of arguments passed into the function.
The primary reason for having so many implementations of call() is performance. A variadic function is one where the number of arguments is variable, and in Java variadic arguments are passed into methods as arrays, which creates additional overhead for each function call. My design therefore limits this overhead to function calls involving more than five arguments. A secondary advantage of this design is that it supports adding function overloading to my Lisp in future.
public interface Lambda {
    Object call();
    Object call(Object arg);
    Object call(Object arg, Object arg2);
    Object call(Object arg, Object arg2, Object arg3);
    Object call(Object arg, Object arg2, Object arg3, Object arg4);
    Object call(Object arg, Object arg2, Object arg3, Object arg4, Object arg5);
    Object call(Object[] args);
}
Every function in my Lisp becomes a Java class implementing the Lambda interface
Consider the '+' function. In my Lisp this is implemented in the Plus class, which implements the Lambda interface and provides implementations for adding numbers together. An excerpt is shown below.
public class Plus extends BaseLambda implements Lambda {

    ...

    @Override
    public Object call(Object arg, Object arg2) {
        return (Integer) arg + (Integer) arg2;
    }

    ...
}
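The arity-overload idea can be seen end to end in a cut-down, self-contained sketch. The interface here mirrors the shape of my Lambda interface but with only two overloads, and the bodies are illustrative, not the real Plus class.

```java
// A cut-down sketch of arity overloading: a fixed-arity call() avoids
// allocating an argument array, while the Object[] overload is the
// fallback for calls with many arguments.
interface SketchLambda {
    Object call(Object arg, Object arg2); // fast path: no array allocation
    Object call(Object[] args);           // variadic fallback
}

class SketchPlus implements SketchLambda {
    @Override
    public Object call(Object arg, Object arg2) {
        return (Integer) arg + (Integer) arg2;
    }

    @Override
    public Object call(Object[] args) {
        int sum = 0;
        for (Object a : args) sum += (Integer) a;
        return sum;
    }
}
```

A two-argument call like (+ 1 2) can compile straight to the fixed-arity overload, while a longer call falls back to building an Object[] first; that array construction is the per-call overhead the fixed-arity methods avoid.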
The following example Lisp code
(+ 1 2 3)
An example function call in my Lisp
compiles down to this Java code
new Apply().call(Environment.get(new Symbol("+")), 1, 2, 3);
The Java code corresponding to my Lisp code after it's been compiled
which is represented by this JVM bytecode
public java.lang.Object call();
  Code:
     0: new           #14 // class ie/francis/lisp/function/Apply
     3: dup
     4: invokespecial #15 // Method ie/francis/lisp/function/Apply."<init>":()V
     7: new           #17 // class ie/francis/lisp/type/Symbol
    10: dup
    11: ldc           #19 // String +
    13: invokespecial #22 // Method ie/francis/lisp/type/Symbol."<init>":(Ljava/lang/String;)V
    16: invokestatic  #28 // Method ie/francis/lisp/Environment.get:(Lie/francis/lisp/type/Symbol;)Ljava/lang/Object;
    19: ldc           #29 // int 1
    21: invokestatic  #35 // Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
    24: ldc           #36 // int 2
    26: invokestatic  #35 // Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
    29: ldc           #37 // int 3
    31: invokestatic  #35 // Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
    34: invokevirtual #40 // Method ie/francis/lisp/function/Apply.call:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
    37: areturn
Compiled JVM bytecode corresponding to the Lisp code (+ 1 2 3)
Something of note is how each function call is actually converted by the compiler into a call to the builtin function 'apply'. Apply is a traditional Lisp function used to call other Lisp functions. Using 'apply' delays name resolution of the function until runtime, which is important when implementing recursive functions.
(+ 1 2 3)
actually becomes (apply + 1 2 3)
inside the compiler. You can see this in the JVM bytecode, the very first instruction creates an instance of the Apply class.
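The late-binding behaviour of apply can be sketched with a tiny global environment. Everything here is a simplified stand-in, not my actual Environment or Apply classes.

```java
import java.util.HashMap;
import java.util.Map;

// A sketch of runtime name resolution via apply: the function name is
// looked up in a global environment only when the call actually happens,
// so a compiled function body may reference names (including its own)
// that are defined later.
class ApplySketch {
    interface Fn { Object call(Object... args); }

    static final Map<String, Fn> env = new HashMap<>();

    static Object apply(String name, Object... args) {
        Fn fn = env.get(name); // resolution deferred until call time
        if (fn == null) throw new RuntimeException("undefined symbol: " + name);
        return fn.call(args);
    }
}
```

Because the lookup happens inside apply at call time, compiling a body that mentions an undefined name succeeds; only calling it fails, which is exactly the behaviour shown in the transcript below.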
Since function name resolution occurs at runtime, the following code compiles but will not run. Without apply, however, recursive functions would not be possible.
> (func hello () (non-existant-function 1 2 3))
nil
> (hello)
ie.francis.lisp.exception.UndefinedSymbolException: undefined symbol: non-existant-function
at ie.francis.lisp.Environment.get(Environment.java:26)
at Lambda288801326.call(Unknown Source)
at ie.francis.lisp.function.Apply.call(Apply.java:16)
at Lambda479264912.call(Unknown Source)
at ie.francis.lisp.function.Eval.call(Eval.java:39)
at ie.francis.lisp.Main.main(Main.java:77)
'non-existant-function' doesn't exist, so when 'hello' is called the code fails