Mx* Compiler

Overview

The source language is Mx*, a Java-like language; the target is RISC-V assembly. The compiler runs in three main stages: semantic analysis, code generation, and optimization.

Semantic Stage

Lexer and Parser

ANTLR4 is a powerful tool that generates the lexer and parser from a single grammar (.g4) file.

Abstract Syntax Tree

A CST (Concrete Syntax Tree) represents the full grammar derivation in tree form. An AST (Abstract Syntax Tree) can be seen as a simplified CST that still keeps all the information needed for semantic checking. We build the AST by traversing the CST with the BaseVisitor generated by ANTLR4.
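
A rough sketch of this traversal, assuming the grammar labels the binary-expression alternative BinaryExpr and names its operator token op (so that ANTLR4 generates MxBaseVisitor and MxParser.BinaryExprContext); the AST classes are this project's own, and all names here are hypothetical:

// Convert ANTLR parse-tree nodes into AST nodes, one visit method per rule.
public class ASTBuilder extends MxBaseVisitor<ASTNode> {
    @Override
    public ASTNode visitBinaryExpr(MxParser.BinaryExprContext ctx) {
        // Build the operand subtrees recursively, then keep only the
        // operator text: the CST's punctuation and rule nesting are dropped.
        ASTNode lhs = visit(ctx.expression(0));
        ASTNode rhs = visit(ctx.expression(1));
        return new BinaryExprNode(ctx.op.getText(), lhs, rhs);
    }
}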

Semantic Check

Build an implicit scope tree and symbol table. There are lots of details and strange corner cases, such as (new int[10])[0]; the work is mostly mechanical, case-by-case checking.
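
A minimal sketch of the scope chain (types are flattened to strings here; the real symbol table stores richer entries):

import java.util.HashMap;
import java.util.Map;

// Each scope owns its symbols and points to the enclosing scope;
// resolve() walks outward until the name is found or the chain ends.
public class Scope {
    private final Scope parent;                                  // null for the global scope
    private final Map<String, String> symbols = new HashMap<>(); // name -> type

    public Scope(Scope parent) { this.parent = parent; }

    public void define(String name, String type) {
        if (symbols.containsKey(name))
            throw new RuntimeException("redefinition of " + name);
        symbols.put(name, type);
    }

    public String resolve(String name) {
        if (symbols.containsKey(name)) return symbols.get(name);
        if (parent != null) return parent.resolve(name);
        throw new RuntimeException("undefined symbol " + name);
    }
}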

Codegen Stage

Intermediate Representation

LLVM IR is a great IR. If we follow LLVM's rules, code generation and optimization become easier and more efficient, and we don't need to spend much time designing a personal IR. Although that makes the design and implementation much easier, the homework becomes a little boring and less challenging. In my opinion, the most challenging work is designing an efficient IR that makes the optimization process easy.

See the LLVM Language Reference.

Instruction Selection

Translate the IR to pseudo RISC-V assembly. We can do some optimizations during this process, for example:

  • replace a + a or a * 2 with a << 1;
  • fold a constant getelementptr offset into a load's immediate offset:
%ptr = getelementptr i32, i32* array, i32 1
%res = load i32, i32* %ptr

We can translate the above IR into lw res, 4(array) instead of addi ptr, array, 4 followed by lw res, 0(ptr).

The hard part of instruction selection is understanding the RISC-V calling convention and handling registers such as sp, fp and ra around function calls. Here's a good reference: Understanding RISCV Calling Convention.

Register Allocation

See Chapter 11 of the Tiger Book. Although the algorithm was difficult for me to understand, the Tiger Book provides pseudocode.
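
As a rough illustration only (the Tiger Book algorithm adds coalescing, freezing and spill handling on top of this), the simplify/select core of graph coloring can be sketched like so, with virtual registers as plain integers:

import java.util.*;

// Simplify/select core of graph-coloring allocation: repeatedly remove a
// node with degree < K and push it on a stack, then pop nodes back and
// give each a color unused by its already-colored neighbors.
public class GraphColoring {
    static Map<Integer, Integer> color(Map<Integer, Set<Integer>> graph, int k) {
        Map<Integer, Set<Integer>> work = new HashMap<>();
        graph.forEach((n, adj) -> work.put(n, new HashSet<>(adj)));
        Deque<Integer> stack = new ArrayDeque<>();

        while (!work.isEmpty()) {
            Integer pick = work.keySet().stream()
                    .filter(n -> work.get(n).size() < k)
                    .findFirst()
                    .orElseThrow(() -> new RuntimeException("spill needed"));
            for (int m : work.get(pick)) work.get(m).remove(pick);
            work.remove(pick);
            stack.push(pick);
        }

        Map<Integer, Integer> colors = new HashMap<>();
        while (!stack.isEmpty()) {
            int n = stack.pop();
            Set<Integer> used = new HashSet<>();
            for (int m : graph.get(n))
                if (colors.containsKey(m)) used.add(colors.get(m));
            for (int c = 0; c < k; c++)
                if (!used.contains(c)) { colors.put(n, c); break; }
        }
        return colors; // node -> register index
    }
}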

Optimize Stage

Mem2reg

The definition of SSA (Static Single Assignment) in the SSA Book:

A program is defined to be in SSA form if each variable is a target of exactly one assignment statement in the program text. 

But implementing LLVM IR in SSA form directly is a little hard, so we first emit pseudo-LLVM and then restructure it into real LLVM. In pseudo-LLVM, every definition of a variable is a store instruction and every use is a load instruction. The following is an example.

int a = 1, b = 2;
int c = a + b;

The pseudo-LLVM for the above code is:

%a_address = alloca i32
store i32 1, i32* %a_address
%b_address = alloca i32
store i32 2, i32* %b_address
%c_address = alloca i32
%a = load i32, i32* %a_address
%b = load i32, i32* %b_address
%c = add i32 %a, %b
store i32 %c, i32* %c_address

In other words, we obtain pseudo-SSA at the cost of extra memory accesses. The mem2reg optimization builds a dominator tree and eliminates these expensive memory accesses. The algorithm is in Chapter 2 of the SSA Book.
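
A minimal sketch of the phi-insertion step, assuming the dominance frontier of each block has already been computed (blocks are plain integers here):

import java.util.*;

// For one variable: place a phi in every block of the iterated dominance
// frontier of the blocks that store to that variable.
public class PhiInsertion {
    static Set<Integer> phiBlocks(Set<Integer> defBlocks,
                                  Map<Integer, Set<Integer>> domFrontier) {
        Set<Integer> hasPhi = new HashSet<>();
        Deque<Integer> worklist = new ArrayDeque<>(defBlocks);
        while (!worklist.isEmpty()) {
            int b = worklist.poll();
            for (int f : domFrontier.getOrDefault(b, Set.of())) {
                // A phi is itself a new definition of the variable,
                // so f's own frontier must be processed as well.
                if (hasPhi.add(f)) worklist.add(f);
            }
        }
        return hasPhi;
    }
}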

CFG Simplification

In the CFG (control flow graph), vertices represent basic blocks and edges represent the jump instructions of the control flow. CFG simplification includes removing unreachable blocks, folding constant branches, simplifying phi instructions with a single incoming path, etc.
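
For instance, removing unreachable blocks is just a reachability search from the entry block (a sketch with blocks as integers):

import java.util.*;

// Keep only blocks reachable from the entry via successor edges;
// everything else can be deleted from the function.
public class CFGSimplify {
    static Set<Integer> reachable(int entry, Map<Integer, List<Integer>> succ) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(entry);
        while (!stack.isEmpty()) {
            int b = stack.pop();
            if (!seen.add(b)) continue;
            for (int s : succ.getOrDefault(b, List.of())) stack.push(s);
        }
        return seen;
    }
}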

Alias Analysis

Two pointers may alias if they can point to the same memory location. The results of alias analysis are used by CSE and LICM. I use Andersen's points-to analysis, which is easy to implement.

Is alias analysis useful? In practice, in most cases only global variables can be proven not to alias; every other pointer pair ends up classified as may-alias.
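
A minimal sketch of the core of an Andersen-style solver, handling only address-of seeds (p = &x) and copy constraints (p = q) and iterating to a fixpoint; the full analysis also needs load and store constraints:

import java.util.*;

// Points-to sets only ever grow, so iterating the copy constraints
// until nothing changes reaches a fixpoint.
public class Andersen {
    record Copy(String dst, String src) {} // dst = src

    static Map<String, Set<String>> solve(Map<String, Set<String>> pts, // seeded by p = &x
                                          List<Copy> copies) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Copy c : copies) {
                Set<String> dst = pts.computeIfAbsent(c.dst(), x -> new HashSet<>());
                if (dst.addAll(pts.getOrDefault(c.src(), Set.of()))) changed = true;
            }
        }
        return pts; // two pointers may alias if their sets intersect
    }
}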

CSE (Common Subexpression Elimination)

If c = a + b in block1 and d = a + b in block2, and block1 dominates block2, we can replace d = a + b with d = c. But this case doesn't happen often. In most cases we eliminate loads such as a[i] to reduce memory accesses. The following is an example.

Before CSE:

for(i = 0; i < n; ++i) {
    s1 = s1 + a[i];
    s2 = s2 + a[i] * a[i];
    s3 = s3 + a[i] * a[i] * a[i];
}

After CSE:

for(i = 0; i < n; ++i) {
    t1 = a[i];
    s1 = s1 + t1;
    t2 = t1 * t1;
    s2 = s2 + t2;
    t3 = t2 * t1;
    s3 = s3 + t3;
}

In the above example there is no store instruction, so we don't need an alias check and can simply eliminate the repeated loads; most cases are more complex. In the CSE pass, we DFS the dominator tree, find common subexpressions with a hash map, and run an alias check whenever we eliminate a load.
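
A minimal sketch of that dominator-tree walk (expressions are canonicalized to strings, and the alias check for loads is omitted):

import java.util.*;

// An expression computed in a dominating block is available in every
// dominated block, so one map maintained along the DFS path suffices;
// entries added inside a subtree are removed when we leave it.
public class CSE {
    // A block's code is reduced to {dest, canonical-expr} pairs for brevity.
    static void dfs(int block,
                    Map<Integer, List<String[]>> code,
                    Map<Integer, List<Integer>> domChildren,
                    Map<String, String> available) {
        List<String> added = new ArrayList<>();
        for (String[] inst : code.getOrDefault(block, List.of())) {
            String dest = inst[0], expr = inst[1];
            if (available.containsKey(expr)) {
                // Common subexpression: replace all uses of dest with
                // available.get(expr) and drop this instruction.
            } else {
                available.put(expr, dest);
                added.add(expr);
            }
        }
        for (int child : domChildren.getOrDefault(block, List.of()))
            dfs(child, code, domChildren, available);
        for (String expr : added) available.remove(expr); // leave the scope
    }
}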

LICM (Loop Invariant Code Motion)

A variable a is loop invariant if:

  • a is defined outside the loop;
  • a = b op c, where b and c are loop invariant;
  • a = getelementptr ptr, x, where ptr and x are loop invariant;
  • a = load ptr, where ptr is loop invariant and the loop contains no store to ptr or to any alias of ptr.

Without alias analysis, LICM would be nearly useless, because in most cases we are hoisting getelementptr and load instructions, as in the following example.

Before LICM:

for(i = 0; i < n; ++i) {
    for (j = 0; j < n; ++j) {
        s = s + a[i][j];
    }
}

After LICM:

for(i = 0; i < n; ++i) {
    b = a[i];
    for (j = 0; j < n; ++j) {
        s = s + b[j];
    }
}
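
A minimal sketch of the hoisting loop itself, assuming every instruction in the body is side-effect free, the set of values already defined outside the loop is given, and the alias check for loads is omitted:

import java.util.*;

// Repeatedly scan the loop body and move any instruction whose operands
// are all invariant into the preheader; one moved instruction can make
// another movable, so iterate until nothing changes.
public class LICM {
    record Inst(String dest, List<String> operands) {}

    static List<Inst> hoist(List<Inst> body, Set<String> invariant) {
        List<Inst> preheader = new ArrayList<>();
        boolean changed = true;
        while (changed) {
            changed = false;
            Iterator<Inst> it = body.iterator();
            while (it.hasNext()) {
                Inst i = it.next();
                if (invariant.containsAll(i.operands())) {
                    preheader.add(i);
                    invariant.add(i.dest()); // its result is now invariant too
                    it.remove();
                    changed = true;
                }
            }
        }
        return preheader;
    }
}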

Inlining

A very useful optimization. For non-recursive functions, we build a call tree rooted at the main function and inline starting from its leaves, as sketched below. For recursive functions, we only inline a few levels.
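
For the non-recursive case, a post-order walk of the call graph gives the required leaves-first order (a sketch; with recursion excluded the graph is acyclic):

import java.util.*;

// Post-order: every callee appears before its callers, so by the time a
// function is processed, everything it calls has already been inlined.
public class InlineOrder {
    static List<String> postOrder(String root, Map<String, List<String>> callees) {
        List<String> order = new ArrayList<>();
        visit(root, callees, new HashSet<>(), order);
        return order;
    }

    private static void visit(String f, Map<String, List<String>> callees,
                              Set<String> seen, List<String> order) {
        if (!seen.add(f)) return;
        for (String g : callees.getOrDefault(f, List.of()))
            visit(g, callees, seen, order);
        order.add(f);
    }
}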

Global Variables Localization

I simply move a global variable into the main function if it appears only in main. In fact, most global variables can be moved into main after inlining, but some remain. Here's a good further optimization: load all global variables into locals when entering a function and store them back when exiting. I don't implement it, but CSE achieves much the same effect. So is my CSE pass really just there to eliminate global-variable accesses?

DCE (Dead Code Elimination)

A naive dead code elimination: if an instruction's result is not used by any other instruction (and the instruction has no side effects), remove it!

The DCE pass can also eliminate redundant IR instructions when the IR generation is a little sloppy.
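
A minimal worklist-style sketch (stores, calls and other side-effecting instructions must never be removed, which is reduced to a flag here):

import java.util.*;

// Removing one dead instruction can make the definitions of its operands
// dead as well, so keep sweeping until a pass removes nothing.
public class DCE {
    record Inst(String dest, List<String> uses, boolean sideEffect) {}

    static List<Inst> run(List<Inst> insts) {
        List<Inst> live = new ArrayList<>(insts);
        boolean changed = true;
        while (changed) {
            changed = false;
            Set<String> used = new HashSet<>();
            for (Inst i : live) used.addAll(i.uses());
            Iterator<Inst> it = live.iterator();
            while (it.hasNext()) {
                Inst i = it.next();
                if (!i.sideEffect() && !used.contains(i.dest())) {
                    it.remove();
                    changed = true;
                }
            }
        }
        return live;
    }
}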

SCCP (Sparse Conditional Constant Propagation)

See Chapter 19.3 in the Tiger Book.

Are DCE and SCCP useful for programs written by a reasonably good programmer?

Print Restructuring

Replace println(toString(x)) with printlnInt(x). Maybe this is just a testcase-oriented optimization. I'm curious why some testcases contain so many println(toString(...)) calls; in my opinion they're meaningless. Is there really a programmer who writes such redundant code?

Peephole Optimization

Replace the pair store a, ptr; b = load ptr with store a, ptr; b = move a. For every load, we just need to check the nearby stores preceding it. The optimization is useful in the following case.

Before Peephole Optimization:

for (i = 0; i < n; ++i) {
    a[i] = i * i;
    printlnInt(a[i]);
}

After Peephole Optimization:

for (i = 0; i < n; ++i) {
    t = i * i;
    a[i] = t;
    printlnInt(t);
}

If a block has a single entry, we can also forward stores to loads across basic blocks. This is useful in if statements, as in the following case.

Before Peephole Optimization:

for (i = 0; i < n; ++i) {
    a[i] = i * i;
    if (a[i] > c)
        printlnInt(a[i]);
}

After Peephole Optimization:

for (i = 0; i < n; ++i) {
    t = i * i;
    a[i] = t;
    if (t > c)
        printlnInt(t);
}
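
A minimal sketch of the in-block store-to-load forwarding (instructions are simplified triples; a real pass must also invalidate the remembered value at calls and at stores through possibly aliasing pointers, which is reduced to exact pointer matching here):

import java.util.*;

// Remember the last value stored through each pointer; a later load of
// the same pointer becomes a plain move of that value.
public class StoreLoadForwarding {
    record Inst(String op, String a, String b) {} // (store, val, ptr) / (load, dest, ptr)

    static List<Inst> run(List<Inst> block) {
        Map<String, String> lastStored = new HashMap<>(); // ptr -> value
        List<Inst> out = new ArrayList<>();
        for (Inst i : block) {
            if (i.op().equals("store")) {
                lastStored.put(i.b(), i.a());
                out.add(i);
            } else if (i.op().equals("load") && lastStored.containsKey(i.b())) {
                out.add(new Inst("move", i.a(), lastStored.get(i.b())));
            } else {
                out.add(i);
            }
        }
        return out;
    }
}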

Performance

gcc -O1 benchmark: 4.00

When maxInstNum = 2000 in inlining:

testcase      time (simulator clock cycles)    score
binary_tree   532814506                        5.00
dijkstra      166188863                        4.22
humble        1018041821                       3.05
kruskal       158220677                        3.42
lca           96375308                         4.06
lunatic       2656322338                       3.51
maxflow       23437143                         5.00
pi            32980003                         3.49
segtree       1027958443                       3.96
sha_1         1243551947                       4.78
overall                                        40.55

When maxInstNum = 1000 in inlining:

testcase      time (simulator clock cycles)    score
binary_tree   595389094                        4.82
dijkstra      161618896                        4.38
humble        1018041821                       3.05
kruskal       158315403                        3.41
lca           97050821                         4.02
lunatic       2656524739                       3.51
maxflow       23437143                         5.00
pi            32980003                         3.49
segtree       1027826250                       3.96
sha_1         1249243188                       4.76
overall                                        40.44

When maxInstNum = 500 in inlining:

testcase      time (simulator clock cycles)    score
binary_tree   679571883                        4.09
dijkstra      170628330                        4.07
humble        1018041821                       3.05
kruskal       158897962                        3.39
lca           96616199                         4.05
lunatic       2657771281                       3.50
maxflow       23437143                         5.00
pi            32980003                         3.49
segtree       1040642244                       3.89
sha_1         1265151948                       4.69
overall                                        38.96

References

  • The Definitive ANTLR 4 Reference.
  • Tiger Book: Modern Compiler Implementation in Java.
  • 《自制编译器》 (Build Your Own Compiler), Minoru Aoki.
  • Engineering a Compiler.
  • SSA Book.
  • The RISC-V Instruction Set Manual.

About

Mx* Compiler Project for SJTU MS208
