Components of simC
There are three primary and two helper components in simC. The blocks in white represent the core components while the ones in yellow represent the helper components. These helper components are shared by the three core components as can be seen in the figure above.
The primary components are:-
- Lexical Analyzer
- Parser
- Compiler
The helper components are:-
- Symbol Table
- Type Inference Engine
Let us first understand the core components.
The lexical analyzer, also known as a scanner, tokenizer, or lexer, is the first step in the compilation of simC code. The lexical analyzer takes in the raw source code and goes over it character by character to identify tokens.
Before moving any further, it is important to know what a token is. A token is the smallest unit of a program; it is like the cell of a program. Tokens can be keywords, identifiers, operators, constants - anything which cannot be broken down into a smaller part.
The lexical analyzer identifies these different types of tokens and returns a list (a Python list) of them. These tokens then go to the second core component - the parser.
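As a rough illustration, here is a minimal tokenizer sketch in Python. The Token shape mirrors the (type, value, line) triples shown in the dump below, but the names and implementation are assumptions, not simC's actual code, and the real lexer emits tokens (like call_end) that this sketch omits:

```python
import re
from collections import namedtuple

# Hypothetical Token shape modeled on the (type, value, line) triples
# printed by `simc test.simc token`; simC's real Token class may differ.
Token = namedtuple("Token", ["type", "val", "line"])

def tokenize(source):
    """Walk each line character by character and emit tokens."""
    tokens = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        i = 0
        while i < len(line):
            ch = line[i]
            if ch.isspace():
                i += 1
            elif ch == "(":
                tokens.append(Token("left_paren", "", line_no))
                i += 1
            elif ch == ")":
                tokens.append(Token("right_paren", "", line_no))
                i += 1
            elif ch == '"':
                # scan to the closing quote to capture the whole string
                j = line.index('"', i + 1)
                tokens.append(Token("string", line[i:j + 1], line_no))
                i = j + 1
            elif ch.isalpha() or ch == "_":
                # keywords and identifiers: letters, digits, underscores
                m = re.match(r"[A-Za-z_]\w*", line[i:])
                tokens.append(Token(m.group(), "", line_no))
                i += m.end()
            else:
                i += 1
        # every source line ends with a newline token
        tokens.append(Token("newline", "", line_no))
    return tokens
```

Running this toy tokenizer over the hello world snippet produces roughly the same token sequence as the dump above (minus call_end and the symbol-table id for the string).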
Before we move on to the parser, let us take an example to understand the lexical analyzer concretely. Consider the following snippet of simC code:-
MAIN
print("Hello World")
END_MAIN
This is a simple hello world script in simC. Let us use the simC compiler to generate the tokens, which can be done with the following command:-
user@programmer~:$ simc test.simc token
This assumes that the file is named test.simc.
Token('MAIN', '', '1')
Token('newline', '', '1')
Token('print', '', '2')
Token('left_paren', '', '2')
Token('string', '1', '2')
Token('right_paren', '', '2')
Token('call_end', '', '2')
Token('newline', '', '2')
Token('END_MAIN', '', '3')
Token('newline', '', '3')
These are the tokens generated for the above program. Starting from line 1, which contains the main function, a MAIN token is generated, followed by a newline token that marks the end of line 1. Line 2 holds the print statement, which is broken into 5 tokens - print, left_paren (left parenthesis), string ("Hello World" in this case) carrying the integer value 1 (you will have to wait until the symbol table section to understand the significance of this integer), right_paren (right parenthesis), and call_end (you can ignore this for now). A newline token again ends this line. The final line, which marks the end of the main function, produces two tokens - END_MAIN, which closes the main function, and the final newline token of the program. So, this is how the simC compiler will look at the code in the next step (i.e. the parser).
The parser takes in as input the list of tokens returned by the lexical analyzer. It is responsible for two tasks:-
- Checking the syntax of a particular statement.
- Combining these tokens into logical groups called OpCodes (Operation Codes).
We will be considering the same code for this stage too. Let's think about what some syntax checks might look like. The only line where it makes sense to check syntax is the print statement. The syntax of simC's print statement is:-
print(<expr>)
The compiler thus checks for a left parenthesis, followed by an expression, and finally a right parenthesis for the print statement. Since we have the correct syntax, the parser won't complain. It then moves on to the second step - grouping tokens into opcodes.
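The syntax check just described might be sketched like this, with tokens represented as (type, value, line) tuples; the function name and error messages are illustrative, not simC's actual API:

```python
# Hedged sketch of the kind of check the parser performs for print;
# real simC validates a full expression, not just a single token.
def check_print_syntax(tokens, i):
    """tokens[i] is the 'print' token; verify the print(<expr>) shape."""
    if tokens[i + 1][0] != "left_paren":
        raise SyntaxError("Expected ( after print")
    # ... a full expression would be validated here; we assume one token ...
    if tokens[i + 3][0] != "right_paren":
        raise SyntaxError("Expected ) to close print")
    return True
```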
At this point, you might already have an idea of what the opcodes would look like. Since MAIN is a statement on its own, it gets its own opcode. The newlines are no longer necessary from this point on - farewell, newline tokens :). Print is a single logical statement, so all the tokens from print to call_end are grouped together to form the print opcode. Finally, the END_MAIN opcode is generated. Let us look at the opcodes to make sure we guessed them correctly:-
user@programmer~:$ simc test.simc opcode
This generates the following opcodes:-
OpCode('MAIN', '', '')
OpCode('print', '"Hello World"', 'None')
OpCode('END_MAIN', '', '')
These opcodes are put into a list (a Python list) and returned. Now that we are done with the parser, let's move on to the final core component - the compiler.
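The grouping step above can be sketched as follows, again with tokens as (type, value, line) tuples and opcodes as (type, value, dtype) tuples matching the dump above; simC's real parser is more involved, and the inlined string lookup is a simplification:

```python
# Illustrative sketch of grouping tokens into opcodes; names are assumed.
def parse(tokens):
    opcodes = []
    i = 0
    while i < len(tokens):
        ttype = tokens[i][0]
        if ttype == "MAIN":
            opcodes.append(("MAIN", "", ""))
            i += 1
        elif ttype == "END_MAIN":
            opcodes.append(("END_MAIN", "", ""))
            i += 1
        elif ttype == "print":
            # consume everything from print up to call_end as one opcode
            j = i
            while tokens[j][0] != "call_end":
                j += 1
            # real simC resolves the string token's id via the symbol table;
            # here we carry the literal value directly for simplicity
            value = tokens[i + 2][1]
            opcodes.append(("print", value, "None"))
            i = j + 1
        else:
            i += 1  # newlines and anything unrecognized are dropped
    return opcodes
```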
The compiler takes in the opcodes list returned from the parser as input and then performs the following tasks:-
- Checks if any C standard library needs to be imported based on the user's code.
- Compiles the opcode into equivalent C code.
The inclusion of C standard libraries is fairly straightforward: the compiler loops through the opcodes checking for statements which will require a standard library after compilation, and it makes sure that each standard library is included only once and isn't repeated. An example of this is the print statement again: when compiled it becomes the printf function, which requires stdio.h to be included.
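This scan could be sketched as follows; the opcode-to-header mapping is an assumption that covers only print:

```python
# Hedged sketch of the standard-library scan over the opcode list.
REQUIRED_HEADERS = {"print": "stdio.h"}  # illustrative mapping only

def collect_headers(opcodes):
    """Return the headers needed by these opcodes, each at most once."""
    headers = []
    for op in opcodes:
        header = REQUIRED_HEADERS.get(op[0])
        if header and header not in headers:  # include each header only once
            headers.append(header)
    return headers
```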
The compiler again goes through the opcodes, this time to convert them into equivalent C code. Let us look at the compiled code for our test.simc file:-
user@programmer~:$ simc test.simc
C code generated at test.c!
This is the generated C code:-
#include <stdio.h>
int main() {
printf("Hello World");
return 0;
}
The stdio.h standard library gets included for printf, the print statement becomes the printf function, and a return 0 is added explicitly to the compiled code. We can also see that MAIN compiled to "int main() {" and END_MAIN to "}".
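A minimal sketch of this code-generation pass, handling only the opcodes from our example; the function name and translation rules are illustrative, not simC's actual implementation:

```python
# Toy final pass: translate each opcode into its equivalent C code.
def compile_opcodes(opcodes):
    lines = ["#include <stdio.h>"]  # assume the header scan found stdio.h
    for op in opcodes:
        if op[0] == "MAIN":
            lines.append("int main() {")
        elif op[0] == "print":
            lines.append('\tprintf(%s);' % op[1])
        elif op[0] == "END_MAIN":
            lines.append("\treturn 0;")  # added explicitly before closing main
            lines.append("}")
    return "\n".join(lines)
```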
With this we are done with the core components, but we are not done yet. Let's move to the first helper component - Symbol Table.
The symbol table is a hash map (implemented using a Python dictionary) which holds information about identifiers and constants in the source code. It provides a unique id to each entry (this can be thought of as a PRIMARY KEY if you are familiar with RDBMS concepts) and holds the value, type, and some metadata which is used during compilation.
It is created at the beginning and persists throughout the compilation process. The lexical analyzer and the parser are responsible for filling relevant details into it; the compiler just fetches values from it to help with compilation.
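A toy version of such a table might look like this; the [value, type, metadata] entry layout follows the dump shown below, but the class and method names are assumptions:

```python
# Hedged sketch of a symbol table keyed by auto-incrementing unique ids.
class SymbolTable:
    def __init__(self):
        self.table = {}   # id -> [value, type, metadata]
        self.next_id = 1  # the "primary key" counter

    def entry(self, value, type_, metadata):
        """Insert a new entry and return its unique id."""
        entry_id = self.next_id
        self.table[entry_id] = [value, type_, metadata]
        self.next_id += 1
        return entry_id
```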
Let us look at what the symbol table looks like after lexical analysis:-
user@programmer~:$ simc test.simc table_after_lexing
{1: ['"Hello World"', 'string', 'constant']}
For this simple example we had only a single string constant, which appears as the sole entry in the symbol table. The id for this entry is 1 (this is the 1 from the string token - the unique id for the string), its value is "Hello World", its type is string, and the metadata field in this case just indicates that this is a constant value.
For this case the table will remain the same even after parsing, but let's make sure it does:-
user@programmer~:$ simc test.simc table_after_parsing
{1: ['"Hello World"', 'string', 'constant']}
We are now just one component away from the end of this long wiki page. Let us understand the type inference engine.
The type inference engine is responsible for figuring out the type of identifiers (variables, functions, structures). The way in which type inference is done is out of scope for this wiki page (you will have to go through the codebase to understand it). The engine figures out the type and updates the type field of the corresponding identifier in the symbol table.
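As a toy illustration of what inferring a constant's type could look like (these rules are assumptions; simC's real engine is more involved and also handles identifiers):

```python
# Illustrative-only constant type inference from a token's literal text.
def infer_constant_type(value):
    if value.startswith('"') and value.endswith('"'):
        return "string"  # quoted literal
    if "." in value:
        return "float"   # numeric literal with a decimal point
    return "int"         # bare numeric literal
```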
In the example above, the only type that has to be inferred is the string type of "Hello World" inside the print statement. With this we are finally done - you now have enough understanding to go through the codebase.