Crafting Interpreters: Notes
Hi! In this post I'll share the notes I took from the first chapter of the book Crafting Interpreters (link here).
Steps of the compiler
The compiler converts the source code into something the computer can understand through a series of steps. These steps are traditionally grouped into three phases (the front end, the middle end, and the back end):
1. Scanning (also called lexing or lexical analysis)
This step divides the source code (one big string) into a series of tokens.
It discards text that doesn't mean anything, such as whitespace and comments.
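To make this concrete, here's a minimal sketch of a scanner in Python. The token names and regex rules are my own invention for a tiny made-up language, not something from the book:

```python
import re

# Hypothetical token rules for a tiny language: (token name, regex pattern).
TOKEN_RULES = [
    ("SKIP",   r"\s+|//[^\n]*"),   # whitespace and line comments, discarded
    ("NUMBER", r"\d+(\.\d+)?"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=;()]"),
]

def scan(source):
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_RULES:
            match = re.match(pattern, source[pos:])
            if match:
                if name != "SKIP":  # blanks and comments don't become tokens
                    tokens.append((name, match.group()))
                pos += match.end()
                break
        else:
            raise SyntaxError(f"unexpected character: {source[pos]!r}")
    return tokens

print(scan("average = (min + max) / 2; // arithmetic mean"))
```

Note how the comment at the end of the line and all the spaces simply disappear from the output: the scanner matched them, but never emitted a token for them.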
2. Parsing
This step takes those tokens and builds a syntax tree.
These trees are called parse trees or abstract syntax trees. Language hackers usually just say syntax tree, AST, or simply tree.
At this step, the compiler reports syntax errors.
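The step can be sketched as a tiny recursive-descent parser. The grammar (just + and *, with the usual precedence) and the tuple-based tree nodes are my own choices for illustration:

```python
# Tokens in, tree out. Each tree node is a tuple: ("num", value) for leaves,
# or (operator, left, right) for binary operations.
def parse_expression(tokens, pos=0):
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        node = ("+", node, right)
    return node, pos

def parse_term(tokens, pos):
    node, pos = parse_primary(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        right, pos = parse_primary(tokens, pos + 1)
        node = ("*", node, right)
    return node, pos

def parse_primary(tokens, pos):
    if tokens[pos] == "(":
        node, pos = parse_expression(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")  # this is where syntax errors surface
        return node, pos + 1
    return ("num", float(tokens[pos])), pos + 1

tree, _ = parse_expression(["1", "+", "2", "*", "3"])
print(tree)  # ('+', ('num', 1.0), ('*', ('num', 2.0), ('num', 3.0)))
```

Notice that the tree's shape, not the token order, now encodes that 2 * 3 happens before the addition.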
3. Static analysis
Up to this step, the compiler hasn't analyzed anything about what the program actually means; it only knows its grammatical structure.
Most languages do something called binding or resolution.
For each identifier (a variable name, for instance), the compiler finds where that name was declared and wires the two pieces of data together (each use with its declaration).
This is where the compiler works out the scope of a variable, meaning the region of the code where its name can be used.
If the language is statically typed, we verify types here: if an operation isn't allowed on the types of its operands, we report a type error.
Now that we have all this information, we need to store it somewhere. There are three main options:
Attributes: We store this information in the syntax tree itself, in extra fields of the nodes that are left empty during parsing and filled in at this step.
Symbol table: We create a separate lookup table where we store all this information. The key for this table is the identifier.
New structure: We transform the tree into a whole new data structure, which we will explain next.
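Here's a small sketch of the symbol table option: a stack of dictionaries, one per scope, keyed by identifier. The class name and the stored fields are invented for illustration:

```python
# A symbol table as a stack of per-scope dictionaries, keyed by identifier.
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]  # start with the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, info):
        # info could hold the type, the declaration site, etc.
        self.scopes[-1][name] = info

    def resolve(self, name):
        # Walk from the innermost scope outward, like binding/resolution does.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"'{name}' was never declared")

table = SymbolTable()
table.declare("x", {"type": "int", "line": 1})
table.enter_scope()
table.declare("x", {"type": "string", "line": 5})
print(table.resolve("x"))  # the inner declaration shadows the outer one
table.exit_scope()
print(table.resolve("x"))  # back to the global declaration
```

The innermost-to-outermost walk in resolve is exactly the shadowing rule most languages use.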
4. Intermediate Representation (IR)
This representation is tied to neither the source nor the target language.
This means we could compile into this representation starting from several different source languages.
We can also compile from it to many different target languages, without writing a whole new compiler for each combination of source and target languages we aim for.
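As an illustration, here's a sketch that lowers a small syntax tree into a flat, three-address-style IR. The instruction format is invented here, not taken from any real compiler:

```python
def lower(node, instructions):
    """Lower a nested-tuple expression tree into flat IR instructions,
    returning the name of the temporary that holds the result."""
    kind = node[0]
    if kind == "num":
        temp = f"t{len(instructions)}"
        instructions.append((temp, "const", node[1]))
        return temp
    # Binary operation: lower both operands first, then combine them.
    left = lower(node[1], instructions)
    right = lower(node[2], instructions)
    temp = f"t{len(instructions)}"
    instructions.append((temp, kind, left, right))
    return temp

ir = []
lower(("+", ("num", 1.0), ("*", ("num", 2.0), ("num", 3.0))), ir)
for instruction in ir:
    print(instruction)
```

The flat list of instructions no longer looks like the source language at all, which is what lets different front ends and back ends meet in the middle.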
5. Optimization
At this step, we swap the original program for a different one that does the same thing more efficiently.
For instance, this is where something called constant folding happens: if an expression always evaluates to the same value, instead of making the processor compute it every time it's needed, the optimizer computes it once and hardcodes the result.
pennyArea = 3.14159 * (0.75 / 2) * (0.75 / 2);
will end up changed to:
pennyArea = 0.4417860938;
because we already calculated this operation in the optimizer.
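Here's a sketch of how an optimizer could fold constants over a syntax tree (the tuple-based node format is my own, not the book's):

```python
import operator

# Binary operators the folder knows how to evaluate at compile time.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def fold(node):
    if node[0] == "num":
        return node  # already a constant leaf
    op, left, right = node
    left, right = fold(left), fold(right)
    if left[0] == "num" and right[0] == "num":
        # Both operands are constants: evaluate now, at compile time.
        return ("num", OPS[op](left[1], right[1]))
    return (op, left, right)

# The pennyArea expression: 3.14159 * (0.75 / 2) * (0.75 / 2)
tree = ("*",
        ("*", ("num", 3.14159), ("/", ("num", 0.75), ("num", 2))),
        ("/", ("num", 0.75), ("num", 2)))
print(fold(tree))  # a single ('num', ...) node holding roughly 0.44178609
```

Because the folding is recursive, an expression that is only partly constant still gets its constant subtrees collapsed.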
6. Code Generation
Here we generate the final code that the CPU will execute.
It's generally assembly-like code.
We can generate code either for a real machine (native code) or for a virtual one.
Code for a virtual machine is called bytecode, because each instruction is generally one byte long.
Bytecode was pioneered by Martin Richards and Niklaus Wirth.
They designed a language for a hypothetical machine.
7. Virtual machine
If we compile to bytecode, we have two options:
We create a new compiler that compiles from bytecode to the target architecture (but we'll need many, one per architecture).
We create a Virtual Machine, a program that emulates a hypothetical chip, and run the program (in bytecode) on it at execution time. This approach is slower when executing, but simpler and more portable.
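A virtual machine of this kind can be sketched as a loop over bytecode instructions operating on a stack. The opcodes here (PUSH, ADD, MUL) are invented for illustration:

```python
# A tiny stack-based virtual machine: one dispatch loop, one value stack.
def run(bytecode):
    stack = []
    for instruction in bytecode:
        op = instruction[0]
        if op == "PUSH":
            stack.append(instruction[1])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack[-1]

# Computes 1 + 2 * 3 the way a compiler would emit it: operands first.
program = [("PUSH", 1), ("PUSH", 2), ("PUSH", 3), ("MUL",), ("ADD",)]
print(run(program))  # 7
```

The slowdown mentioned above comes from this dispatch loop: the VM inspects every instruction at run time instead of the CPU executing it directly.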
We need some services to run for each program created in our language. They could be:
Garbage collector: To reclaim the computer's memory that is no longer being used.
Instance of: If we support this feature, we need to be able to track the type of each object in memory during execution.
All these things happen during runtime.
Each program we generate will have a copy of the runtime.
If the program is executed in a Virtual Machine, then the runtime lives there.
This is all for today. I hope you liked the topic. I'll see you in the next post. Thanks!