Compilers

Published on: Sun Sep 24 2023

Have you ever wondered how do the human-readable source code is transformed into something that a computer can execute? As I mentioned in the post on assembly - computers follow linear sequences of basic instructions, but source code defines classes, functions, loops, annotations and so much more!

The process of converting some text data into some data structure that is simpler to work with is called parsing. You may be familiar with JSON or HTML parsing and source code parsing is not much different. Except the result isn’t some hashtable or a list, but a syntax tree. A tree stems from the fact that the data structure is branching - for example a function definition has name, arguments and body elements, an if statement has a condition, true path and maybe a false path.

So let’s take some Common Lisp code as an example:

(defun nth_fibonacci (n)
    "Calculate the Nth Fibonacci number."
    (if (< n 2)
        n
        (+
            (nth_fibonacci (- n 1))
            (nth_fibonacci (- n 2))
        )
    )
)

(write (nth_fibonacci (read)))

If you-re not familiar with the language - don’t worry, neither are computers before we write a parser! It is equivalent to the following Python, although we will focus on Lisp, as its syntax maps better to what we will be talking about.

def nth_fibonacci(n):
    """Calculate the Nth Fibonnaci number."""
    if n < 2:
        return n
    return nth_fibonacci(n - 1) + nth_fibbonachi(n - 2)

print(nth_fibonacci(int(input())))

Let’s start with something simpler: a single expression (+ 1 2). As you might suspect it evaluates to 3. And that’s pretty much all an expression is - a sequence of instructions that evaluates to some value. Most languages allow nesting expressions together (+ (read) (read)). The summation is now invoking another expression (reading from input) for each of its arguments. And this nesting is why source code is usually represented as a syntax tree when working with it - sometimes it’s sufficient to identify that “oh, this is an expression, but our syntax doesn’t allow an expression here”. When in doubt - expressions are whatever the language allows you to put on the right hand side of assignment.

But speaking of assignment - is it an expression? What about a function definition? Is our first set of brackets an expression? Well they don’t evaluate to anything, so maybe not. They are just a statement - a block of code that tells what the program should do. So our first statement syntax tree looks something like:

statement (function definition) { name: "nth_fibonacci", parameters: ["n"], doc: "Calculate the Nth Fibonacci number." }
    expression (if)
        expression (<)
            variable "n"
            value 2
        variable "n"
        expression (+)
            expression (nth_fibonacci)
                expression (-)
                    variable "n"
                    value 1
            expression (nth_fibonacci)
                expression (-)
                    variable "n"
                    value 2

Arguably the statement does evaluate to something - modifying the global state to define a new function that can be called. And in fact that’s exactly how Lisp works! So every set of brackets is an expression!

Alright, we have a weird looking data structure that we have no idea how to create (and I will delve into parsing some other time), what next? How does having this tree help us? Imagine we’re writing an interpreter and we had the parser that would provide us the syntax tree. How would we go about evaluating an expression?

Firstly we may notice that all expressions have a name of the function they correspond to - alright, we can have a hashmap of names to a function. What should a function take? For arithmetic operations it should probably be numbers, but in our example we have variables… So we also need a mechanism of retreiving the values of variables by name. Let’s use another hashmap! Ok, we can now evaluate subtraction in our example. But what about compound expressions? Something like

expression (+)
    expression (*)
        variable "n"
        value 2
    value -1

Maybe we could replace the * expression with the result of the multiplication? Aha! Then the + expression could receive the resulting value and be replaced with the result! So to evaluate an expression tree - we recursively evaluate the nested expressions to get our result!

And when we see a code-defined function - our interpreter could remember the corresponding syntax tree and evaluate the expression once it sees it again!

Sweet, if I haven’t lost you - you may have an idea of how to implement a generic interpreter for any language that you can find a parser for! But the title was “compilers” right? Not interpreters! A compiler can’t just evaluate an expression - it need to convert the expression into machine code. So our evaluation function will need to change from returning the resulting value to returning the corresponding machine instructions. And also we need handle variables somehow. Instead of looking up the variable value - we need to replace them with loading the value from memory and placing it into a register.

Note that expression evaluation order (depth-first) is exactly what we want - before we translate an expression, we translate its nested expressions and remember which register the results were placed in. And finally we translate the expression with machine code, using the remembered registers.

Is that it!? Well yeah, kinda. Of course this is very general and I had to gloss over many details, but this is pretty much how compilers work! We parse our code into some recursive data structure that a compiler can work with more easily. And then dive deeper into said structure working up from the leaf nodes, replacing expressions (and statements) with machine code.