The source language is Mx*, a Java-like language. The target is RISC-V assembly. The compiler has three main stages: semantic check, code generation, and optimization.
ANTLR4 is a powerful tool that generates a lexer and parser from a single .g4 grammar file.
A CST (Concrete Syntax Tree) represents the grammar in tree form. An AST (Abstract Syntax Tree) can be seen as a simplified CST that still keeps the information needed for semantic checking. We traverse the CST using the BaseVisitor generated by ANTLR4 to build the AST.
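For example, a visitor method that folds a CST binary-expression rule into one AST node might look like this (a sketch only: `MxParser` and `MxBaseVisitor` are the ANTLR4-generated classes, while the rule name `binaryExpression`, its labels, and the AST node classes are hypothetical):

```java
// Sketch: MxParser and MxBaseVisitor are generated by ANTLR4 from the .g4
// file; the rule name, labels, and AST node classes here are hypothetical.
public class ASTBuilder extends MxBaseVisitor<ASTNode> {
    @Override
    public ASTNode visitBinaryExpression(MxParser.BinaryExpressionContext ctx) {
        // Build the children first, then fold them into a single AST node,
        // dropping CST-only details such as parentheses and punctuation.
        ASTNode lhs = visit(ctx.expression(0));
        ASTNode rhs = visit(ctx.expression(1));
        return new BinaryExprNode(ctx.op.getText(), lhs, rhs);
    }
}
```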
Build an implicit scope tree and a symbol table. There are lots of details and strange cases like `(new int[10])[0]`. The remaining work is straightforward data-oriented programming and checking.
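A minimal sketch of the scope tree, assuming a simple name-to-type symbol table (class and method names are illustrative, not my actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Each scope keeps its own symbols and a pointer to the enclosing scope,
// which together form the implicit scope tree.
class Scope {
    private final Scope parent;                                  // null for the global scope
    private final Map<String, String> symbols = new HashMap<>(); // name -> type

    Scope(Scope parent) { this.parent = parent; }

    void define(String name, String type) {
        if (symbols.containsKey(name))
            throw new RuntimeException("semantic error: redefinition of " + name);
        symbols.put(name, type);
    }

    // Walk up the scope tree until the symbol is found.
    String resolve(String name) {
        for (Scope s = this; s != null; s = s.parent)
            if (s.symbols.containsKey(name)) return s.symbols.get(name);
        throw new RuntimeException("semantic error: undefined symbol " + name);
    }
}
```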
LLVM IR is a great IR. If we follow LLVM's rules, code generation and optimization become easier and more efficient, and we don't need to spend much time designing a personal IR. Although this makes the design and implementation much easier, the homework becomes a little boring and less challenging. In my opinion, the most challenging work is designing an efficient IR that makes the optimization process easy.
See the LLVM Language Reference (LangRef).
Translate the IR to pseudo RISC-V assembly. We can do some optimizations in this process, such as:
- replace `a + a` or `a * 2` with `a << 1`;
- fold a `getelementptr` with a constant offset into the following load or store:
%ptr = getelementptr i32, i32* array, i32 1
%res = load i32, i32* %ptr
We can translate the above IR into `lw res, 4(array)` instead of `addi ptr, array, 4` followed by `lw res, 0(ptr)`.
The hard part of instruction selection is understanding the RISC-V calling convention and handling registers like `sp`, `fp`, and `ra` when calling functions. Here's a good reference: Understanding RISC-V Calling Convention.
See Chapter 11 of the Tiger Book. Although the algorithm is difficult for me to understand, the Tiger Book provides pseudocode.
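A rough sketch of the simplify/select skeleton from that chapter, with virtual registers as integers and the k physical registers as colors (no coalescing and no spill code generation; those simplifications are mine, not the book's full algorithm):

```java
import java.util.*;

// Graph-coloring register allocation, simplify/select phases only.
class GraphColoring {
    static Map<Integer, Integer> color(Map<Integer, Set<Integer>> graph, int k) {
        Map<Integer, Set<Integer>> g = deepCopy(graph);
        Deque<Integer> stack = new ArrayDeque<>();
        while (!g.isEmpty()) {
            // Simplify: remove a node of degree < k; if none exists, pick any
            // node as an optimistic (potential) spill candidate.
            Integer node = g.keySet().stream()
                    .filter(n -> g.get(n).size() < k)
                    .findFirst()
                    .orElse(g.keySet().iterator().next());
            stack.push(node);
            for (int nb : g.get(node)) g.get(nb).remove(node);
            g.remove(node);
        }
        // Select: pop nodes and assign the lowest color unused by neighbors.
        Map<Integer, Integer> colors = new HashMap<>();
        while (!stack.isEmpty()) {
            int node = stack.pop();
            Set<Integer> used = new HashSet<>();
            for (int nb : graph.get(node))
                if (colors.containsKey(nb)) used.add(colors.get(nb));
            int c = 0;
            while (used.contains(c)) c++;
            if (c >= k) throw new RuntimeException("actual spill, node " + node);
            colors.put(node, c);
        }
        return colors;
    }

    static Map<Integer, Set<Integer>> deepCopy(Map<Integer, Set<Integer>> g) {
        Map<Integer, Set<Integer>> copy = new HashMap<>();
        g.forEach((n, s) -> copy.put(n, new HashSet<>(s)));
        return copy;
    }
}
```

A real allocator must insert spill code and rerun when an actual spill occurs; this sketch just throws.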
The definition of SSA (Static Single Assignment) in the SSA book:
A program is defined to be in SSA form if each variable is a target of exactly one assignment statement in the program text.
But implementing LLVM IR in SSA form directly is a little hard, so we first emit pseudo-LLVM and then restructure it into real LLVM. In pseudo-LLVM, all definitions of variables go through `store` instructions and all uses go through `load` instructions. The following is an example.
int a = 1, b = 2;
int c = a + b;
The pesudo-LLVM of the above code is
%a_address = alloca i32
store i32 1, i32* %a_address
%b_address = alloca i32
store i32 2, i32* %b_address
%c_address = alloca i32
%a = load i32, i32* %a_address
%b = load i32, i32* %b_address
%c = add i32 %a, %b
store i32 %c, i32* %c_address
In other words, we achieve pseudo-SSA by sacrificing memory accesses. The mem2reg optimization then builds a dominator tree and eliminates these expensive memory accesses. The algorithm is in Chapter 2 of the SSA book.
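As a building block for mem2reg, here is a sketch of iterative dominator computation in the style of Cooper, Harvey and Kennedy's "A Simple, Fast Dominance Algorithm" (my assumptions: blocks are numbered 0..n-1 in reverse postorder and block 0 is the entry; the SSA book's Chapter 2 presentation differs in details):

```java
import java.util.*;

class Dominators {
    // preds.get(b) lists the predecessors of block b; returns idom[b] for
    // every reachable block (idom[0] == 0 for the entry).
    static int[] idom(List<List<Integer>> preds) {
        int n = preds.size();
        int[] idom = new int[n];
        Arrays.fill(idom, -1);
        idom[0] = 0;
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int b = 1; b < n; b++) {
                int newIdom = -1;
                for (int p : preds.get(b)) {
                    if (idom[p] == -1) continue;   // p not processed yet
                    newIdom = (newIdom == -1) ? p : intersect(idom, p, newIdom);
                }
                if (newIdom != -1 && idom[b] != newIdom) {
                    idom[b] = newIdom;
                    changed = true;
                }
            }
        }
        return idom;
    }

    // Walk the two dominator chains upward until they meet; with reverse
    // postorder numbering, dominators always have smaller numbers.
    static int intersect(int[] idom, int a, int b) {
        while (a != b) {
            while (a > b) a = idom[a];
            while (b > a) b = idom[b];
        }
        return a;
    }
}
```

With the dominator tree (and the dominance frontiers derived from it), mem2reg can place phi instructions and rewrite the `alloca`/`load`/`store` pattern into SSA values.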
In the CFG (control flow graph), vertices represent basic blocks and edges represent jump instructions in the control flow. CFG simplification includes removing unreachable blocks, simplifying constant branches, simplifying single-path phi instructions, etc.
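The first of those passes is a plain graph search; a sketch, with `Block` as an illustrative stand-in for a basic block:

```java
import java.util.*;

// A basic block reduced to its successor list, for illustration only.
class Block { List<Block> successors = new ArrayList<>(); }

class RemoveUnreachable {
    // Every block not reachable from the entry can be deleted.
    static Set<Block> reachable(Block entry) {
        Set<Block> visited = new HashSet<>();
        Deque<Block> worklist = new ArrayDeque<>();
        worklist.push(entry);
        visited.add(entry);
        while (!worklist.isEmpty()) {
            Block b = worklist.pop();
            for (Block s : b.successors)
                if (visited.add(s)) worklist.push(s);
        }
        return visited;
    }
}
```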
Two pointers may alias if they might point to the same memory location. The result of alias analysis is used in CSE and LICM. I use Andersen's points-to analysis, which is easy to implement.
Is alias analysis useful? In practice, in most cases only global variables can be proven not to alias; all other pointers must be treated as may-alias.
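A sketch of the Andersen-style fixpoint, reduced to address-of and copy constraints (a real IR also needs load/store constraints; all names here are illustrative):

```java
import java.util.*;

class Andersen {
    Map<String, Set<String>> pointsTo = new HashMap<>();
    Map<String, Set<String>> includes = new HashMap<>(); // q -> {p : p = q}

    // Constraint "p = &x": p points to x.
    void addressOf(String p, String x) {
        pointsTo.computeIfAbsent(p, k -> new HashSet<>()).add(x);
    }

    // Constraint "p = q": p includes whatever q points to.
    void copy(String p, String q) {
        includes.computeIfAbsent(q, k -> new HashSet<>()).add(p);
    }

    // Propagate points-to sets along copy edges until nothing changes.
    void solve() {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, Set<String>> e : includes.entrySet()) {
                Set<String> src = pointsTo.getOrDefault(e.getKey(), Set.of());
                for (String p : e.getValue()) {
                    Set<String> dst = pointsTo.computeIfAbsent(p, k -> new HashSet<>());
                    if (dst.addAll(src)) changed = true;
                }
            }
        }
    }

    // Two pointers may alias iff their points-to sets intersect.
    boolean mayAlias(String p, String q) {
        Set<String> a = pointsTo.getOrDefault(p, Set.of());
        Set<String> b = pointsTo.getOrDefault(q, Set.of());
        return !Collections.disjoint(a, b);
    }
}
```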
If `c = a + b` in `block1`, `d = a + b` in `block2`, and `block1` dominates `block2`, then we can replace `d = a + b` with `d = c`. But this case doesn't happen often. In most cases, we eliminate `load`s like `a[i]` to reduce memory accesses. The following is an example.
Before CSE:
for(i = 0; i < n; ++i) {
s1 = s1 + a[i];
s2 = s2 + a[i] * a[i];
s3 = s3 + a[i] * a[i] * a[i];
}
After CSE:
for(i = 0; i < n; ++i) {
t1 = a[i];
s1 = s1 + t1;
t2 = t1 * t1;
s2 = s2 + t2;
t3 = t2 * t1;
s3 = s3 + t3;
}
In the above example, there's no `store` instruction, so we don't need to check aliases; we just eliminate the redundant `load` instructions. But most cases are more complex. In my CSE pass, we DFS the dominator tree, find common subexpressions with a map, and check aliases before eliminating a `load`.
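A sketch of that dominator-tree walk, with `Inst.key` standing for a hashable encoding of the expression (opcode plus operands) and the alias check omitted for brevity (all classes are illustrative):

```java
import java.util.*;

// Stand-ins: an instruction with an expression key and a result name, and a
// dominator-tree node holding its block's instructions.
class Inst { String key; String result; String replacedBy; }
class DomNode { List<Inst> insts = new ArrayList<>(); List<DomNode> children = new ArrayList<>(); }

class CSE {
    static void run(DomNode root) {
        dfs(root, new HashMap<>());
    }

    static void dfs(DomNode node, Map<String, String> available) {
        List<String> added = new ArrayList<>();
        for (Inst inst : node.insts) {
            String prev = available.get(inst.key);
            if (prev != null) {
                inst.replacedBy = prev;          // reuse the dominating result
            } else {
                available.put(inst.key, inst.result);
                added.add(inst.key);
            }
        }
        for (DomNode child : node.children) dfs(child, available);
        for (String k : added) available.remove(k); // pop this block's scope
    }
}
```

Entries added in a block are visible in all dominator-tree children, exactly because a block dominates everything below it in the tree.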
A variable `a` is loop invariant if:

- `a` is defined outside the loop;
- `a = b op c` where `b` and `c` are loop invariant;
- `a = getelementptr ptr, x` where `ptr` is loop invariant;
- `a = load ptr` where `ptr` is loop invariant and there is no `store` instruction to `ptr` or to `ptr`'s aliases.
Without alias analysis, LICM would be almost useless, because in most cases we move `getelementptr` and `load` instructions, as in the following example.
Before LICM:
for(i = 0; i < n; ++i) {
for (j = 0; j < n; ++j) {
s = s + a[i][j];
}
}
After LICM:
for(i = 0; i < n; ++i) {
b = a[i];
for (j = 0; j < n; ++j) {
s = s + b[j];
}
}
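A sketch of the LICM fixpoint that collects hoistable instructions; the `hasSideEffect` flag stands in for the store/alias check on loads, and all names are illustrative:

```java
import java.util.*;

// Stand-in: an instruction defining `def` from the values in `operands`.
// hasSideEffect is set conservatively for stores, calls, and loads whose
// pointer may be clobbered inside the loop.
class LoopInst { String def; List<String> operands = new ArrayList<>(); boolean hasSideEffect; }

class LICM {
    // definedOutside: values defined before the loop, hence invariant.
    static List<LoopInst> invariants(List<LoopInst> loopBody, Set<String> definedOutside) {
        Set<String> invariant = new HashSet<>(definedOutside);
        List<LoopInst> hoistable = new ArrayList<>();
        boolean changed = true;
        while (changed) {
            changed = false;
            for (LoopInst inst : loopBody) {
                if (inst.hasSideEffect || invariant.contains(inst.def)) continue;
                if (invariant.containsAll(inst.operands)) {
                    invariant.add(inst.def);
                    hoistable.add(inst);   // safe to move to the preheader
                    changed = true;
                }
            }
        }
        // Discovery order respects dependencies: an instruction only becomes
        // invariant after its operands, so hoist in this order.
        return hoistable;
    }
}
```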
A very useful optimization. For non-recursive functions, we build a call tree rooted at the `main` function and start inlining from the leaves of the call tree. For recursive functions, we only inline a few levels.
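A sketch of choosing that order: repeatedly pick functions whose callees have all been processed, so leaves come first (the call-graph map is an assumed input; recursive functions never satisfy the condition and are handled separately):

```java
import java.util.*;

class InlineOrder {
    // calls: function name -> the set of functions it calls.
    static List<String> order(Map<String, Set<String>> calls) {
        List<String> result = new ArrayList<>();
        Set<String> done = new HashSet<>();
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, Set<String>> e : calls.entrySet()) {
                if (done.contains(e.getKey())) continue;
                if (done.containsAll(e.getValue())) {  // all callees processed
                    result.add(e.getKey());
                    done.add(e.getKey());
                    changed = true;
                }
            }
        }
        return result; // inline call sites in this order, leaves first
    }
}
```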
I simply move global variables into the `main` function if they appear only in `main`. In fact, most global variables can be moved into `main` after inlining, but some remain. Here's a good optimization: load all global variables into locals when entering a function and store them back when exiting. I don't implement it, because the same effect can be achieved by CSE. So perhaps my CSE mostly just eliminates global variable accesses?
A naive dead code elimination: if an instruction's result is not used by any other instruction (and the instruction has no side effects), remove it! DCE can also clean up redundant IR instructions if the IR emitted by codegen is a little sloppy.
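A sketch of the naive pass as a worklist algorithm, so that removing one instruction can expose its operands' definitions as dead too (`DCEInst` is an illustrative stand-in for an IR instruction with explicit use lists):

```java
import java.util.*;

// Stand-in IR instruction with def-use links.
class DCEInst {
    boolean hasSideEffect;                  // store, call, branch, return...
    List<DCEInst> operandDefs = new ArrayList<>();
    Set<DCEInst> users = new HashSet<>();
    boolean removed;
}

class DCE {
    static void run(List<DCEInst> insts) {
        Deque<DCEInst> worklist = new ArrayDeque<>(insts);
        while (!worklist.isEmpty()) {
            DCEInst inst = worklist.pop();
            if (inst.removed || inst.hasSideEffect || !inst.users.isEmpty()) continue;
            inst.removed = true;
            for (DCEInst def : inst.operandDefs) {
                def.users.remove(inst);
                worklist.push(def);          // re-check: it may be dead now
            }
        }
    }
}
```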
See Chapter 19.3 in the Tiger Book.
Are DCE and SCCP useful for programs written by a reasonably good programmer?
Replace `println(toString(x))` with `printlnInt(x)`. Maybe this is just a data-oriented (testcase-specific) optimization. I'm curious why `println(toString(...))` appears so often in some testcases; in my opinion it's meaningless. Is there really a programmer who writes such redundant code?
Replace `store a, ptr; b = load ptr` with `store a, ptr; b = move a`. For every `load`, we only need to check the nearby `store`s before it. The optimization is useful in the following case.
Before Peephole Optimization:
for (i = 0; i < n; ++i) {
a[i] = i * i;
printlnInt(a[i]);
}
After Peephole Optimization:
for (i = 0; i < n; ++i) {
t = i * i;
a[i] = t;
printlnInt(t);
}
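A minimal sketch of this forwarding within a single basic block; `MemInst` and its fields are illustrative stand-ins for my IR, and without alias information every new `store` or call conservatively clears the table:

```java
import java.util.*;

// Stand-in: a linear list of instructions with kind, address, and value.
class MemInst {
    enum Kind { STORE, LOAD, CALL, OTHER }
    Kind kind; String address; String value; String replacedBy;
}

class StoreLoadForwarding {
    static void run(List<MemInst> block) {
        Map<String, String> lastStore = new HashMap<>(); // address -> value
        for (MemInst inst : block) {
            switch (inst.kind) {
                case STORE:
                    // Without alias info, drop everything that may alias;
                    // keep only the exact address just written.
                    lastStore.clear();
                    lastStore.put(inst.address, inst.value);
                    break;
                case LOAD:
                    String v = lastStore.get(inst.address);
                    if (v != null) inst.replacedBy = v;   // b = move a
                    break;
                case CALL:
                    lastStore.clear();   // a call may write any memory
                    break;
                default:
                    break;
            }
        }
    }
}
```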
If a block has a single predecessor, we can also forward `store`s to `load`s across basic blocks. The optimization is useful in `if` statements like the following case.
Before Peephole Optimization:
for (i = 0; i < n; ++i) {
a[i] = i * i;
if (a[i] > c)
printlnInt(a[i]);
}
After Peephole Optimization:
for (i = 0; i < n; ++i) {
t = i * i;
a[i] = t;
if (t > c)
printlnInt(t);
}
For reference, gcc -O1 scores 4.00 on this benchmark.
When `maxInstNum = 2000` in inlining:
testcase | time (clock cycles of simulator) | score |
---|---|---|
binary_tree | 532814506 | 5.00 |
dijkstra | 166188863 | 4.22 |
humble | 1018041821 | 3.05 |
kruskal | 158220677 | 3.42 |
lca | 96375308 | 4.06 |
lunatic | 2656322338 | 3.51 |
maxflow | 23437143 | 5.00 |
pi | 32980003 | 3.49 |
segtree | 1027958443 | 3.96 |
sha_1 | 1243551947 | 4.78 |
overall | | 40.55 |
When `maxInstNum = 1000` in inlining:
testcase | time (clock cycles of simulator) | score |
---|---|---|
binary_tree | 595389094 | 4.82 |
dijkstra | 161618896 | 4.38 |
humble | 1018041821 | 3.05 |
kruskal | 158315403 | 3.41 |
lca | 97050821 | 4.02 |
lunatic | 2656524739 | 3.51 |
maxflow | 23437143 | 5.00 |
pi | 32980003 | 3.49 |
segtree | 1027826250 | 3.96 |
sha_1 | 1249243188 | 4.76 |
overall | | 40.44 |
When `maxInstNum = 500` in inlining:
testcase | time (clock cycles of simulator) | score |
---|---|---|
binary_tree | 679571883 | 4.09 |
dijkstra | 170628330 | 4.07 |
humble | 1018041821 | 3.05 |
kruskal | 158897962 | 3.39 |
lca | 96616199 | 4.05 |
lunatic | 2657771281 | 3.50 |
maxflow | 23437143 | 5.00 |
pi | 32980003 | 3.49 |
segtree | 1040642244 | 3.89 |
sha_1 | 1265151948 | 4.69 |
overall | | 38.96 |
- The Definitive ANTLR 4 Reference.
- Tiger Book: Modern Compiler Implementation in Java.
- 《自制编译器》 (Write Your Own Compiler), Minero Aoki.
- Engineering a Compiler.
- SSA Book.
- The RISC-V Instruction Set Manual.