= Why Twobit is Slow =
Twobit was designed to be a fairly fast compiler, but has been quite slow on some (but not all) large programs. (See TwobitIsSlow.) We have found and repaired (or limited) inefficiencies in Twobit's macro expander, common subexpression elimination, and representation inference. Yet Twobit remained slow in v0.963, especially when generating machine code for the x86, and especially when compiling very large expressions, definitions, or libraries.
In v0.963, the main problems are that both assemblers consume superlinear time when creating large code segments, R6RS libraries get compiled twice, and the Sassy assembler is fairly slow.
The superlinear times indicated an algorithmic inefficiency or a poor choice of data structures. The presence of this problem on both x86 and SPARC machines pointed toward a problem with the relatively machine-independent parts of the MacScheme machine assembler. Constant sharing and label lookups were both taking quadratic time, although constant sharing didn't matter much because most procedures have a fairly small number of pointer constants and external references. The more important problem was that some Larceny-specific code for interfacing to the Sassy assembler was taking quadratic time to perform an expensive runtime sanity check that always succeeds.
All known asymptotic inefficiencies have been fixed in the current development version. Compile times now appear to be linear in the size of the program being compiled.
What follows are Will's notes, written during his investigation and revised to eliminate misleading data or false conclusions.
Larceny v0.963 is instrumented so we can tell how much of the Twobit compiler's time is spent in its various passes. This instrumentation is not complete: we cannot easily measure how much time is spent in the reader, the R6RS library/macro expander, or the routine that writes machine code to a .fasl or .slfasl file. We also cannot distinguish between time spent in the relatively machine-independent part of the assembler (the peephole optimizer etc.) and the machine-dependent part. But we can easily tell how much time is spent in the following passes:
- pass0: source optimization of output produced by Andre van Tonder's expander
- pass1: Twobit's legacy macro/primop expander; prepares for pass2
- pass2: closure analysis, lambda lifting, and similar optimizations
- pass3: inlining, constant propagation, CSE, representation inference
- pass4: generation of relatively machine-independent MacScheme machine code
- assembly: the relatively machine-independent parts (local optimization of MacScheme assembly code, peephole optimization, label fixups) and the major, machine-dependent part: generation of machine code (native Larceny) or C code (Petit Larceny)
Timings in milliseconds for compile-file on a synthetic input that defines a single procedure f, with three arguments, that makes n calls to an unknown procedure g on a permutation of those three arguments.
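For concreteness, the synthetic input presumably looks something like this for n = 4 (our reconstruction from the description above, not the actual generator):

    ;; Hypothetical reconstruction of the synthetic input for n = 4;
    ;; the actual generator may differ.  Every call passes a permutation
    ;; of the three arguments to the unknown procedure g, so nothing can
    ;; be inlined or specialized.
    (define (f a b c)
      (g a b c)
      (g b c a)
      (g c a b)
      (g a c b))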
In v0.963, neither CSE nor representation inference is performed if the size of the A-normal form is greater than 40000. That threshold was determined empirically, and may be too low now that we are imposing a separate threshold on the number of available expressions.
On a 2.4 GHz Intel Core 2 Duo:
LOC    (ANF)     pass0  pass1  pass2  pass3  pass4  assembly     cpu  elapsed   MB
     2               1      1      0      1      0         3      12       18   13
     4               0      0      1      1      0         4      11       16   13
     8               0      1      1      1      0         8      15       21   13
    16               1      0      0      2      1        14      21       18   13
    32               0      1      1      4      1        25      35       44   13
    64               0      0      0      8      2        49      64       77   13
   128               1      2      1     18      3        97     125      151   13
   256               1      2      1     33      8       196     245      287   13
   512               0      3      1     72     17       405     507      593   13
  1024               1      6      1    191     32       837    1082     1268   22
  2048               0     12      2    363     66      1789    2257     2620   34
  4096 ( 16388)      1     56      3    833    132      3972    5046     5805   77
  8192 ( 32772)      0     49      8   1905    269      9617   11961    13534  145
 16384 (>40000)      0    238     16    253    221     25945   26873    29447  100
 32768 (>40000)      0    201     37    654    448     80559   82381    87619  206
 65536 (>40000)      0    406     75    563   1619    296891  300642   311429  355
On a 1.5 GHz Ultrasparc:
LOC    (ANF)     pass0  pass1  pass2  pass3  pass4  assembly     cpu  elapsed   MB
     2               0      0      0      0      0         0      10        9    1
     4               0      0      0      0      0         0      10       12    1
     8               0      0      0     10      0         0      10       13    1
    16               0      0      0     10      0         0      10       20    1
    32               0      0      0     20      0         0      20       28    1
    64               0      0      0     30     10         0      50       56    1
   128               0     10      0     60     20        10     100      113    1
   256               0     10      0    130     30        20     200      210   13
   512               0     10     10    270     60        40     420      443   13
  1024               0     20     10    620    120        80     890      927   15
  2048               0     50     10   1390    240       210    2020     2073   34
  4096 ( 16388)      0    100     20   2730    470       520    4040     4196   25
  8192 ( 32772)      0    190     40   5740   1100      1490    8970     9231   34
 16384 (>40000)      0    370    100   1100   1050      6830   10300    10382   52
 32768 (>40000)      0    750    200   1460   2300     45450   51880    52036   51
 65536 (>40000)      0   1640    430   2460   5140    259810  272870   278352  154
All measurements in this section are for a 2.4 GHz MacBook Pro and IAssassin.
There are several performance bugs in the assemblers. The fact that both the IAssassin and SPARC assemblers are superlinear made me look for asymptotic inefficiencies in the machine-independent parts of those assemblers, mainly pass5p1.sch. (It's hard to imagine how a peephole optimizer could have an asymptotic inefficiency, so I haven't checked that.)
The constant sharing implemented by emit-constant takes quadratic time, so I limited constant sharing to the first 50 distinct constants of a code segment. This made no measurable difference to the synthetic benchmark or to PLT Scheme's R6RS test program, as I should have determined before I made the change.
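A minimal sketch of that kind of cap, with illustrative names (the real emit-constant lives in the machine-independent assembler):

    ;; Sharing a constant requires a linear scan of the constants already
    ;; emitted for the code segment, which is O(n^2) overall; capping the
    ;; number of shareable constants bounds that cost.
    (define constant-sharing-limit 50)

    ;; Return the (possibly extended) constant list; share c only while
    ;; the segment still has fewer than 50 distinct constants.
    (define (emit-constant constants c)
      (if (and (< (length constants) constant-sharing-limit)
               (member c constants))
          constants                      ; reuse an existing, equal constant
          (cons c constants)))           ; otherwise record a new one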
Label lookups also take quadratic time. For (time-twobit 65536), I estimate that about 15 seconds are spent in find-label, which is significant but not the main problem. I haven't done anything about that yet.
Crude profiling showed that a lot of time was being spent in check-for-free-ids, which amounts to a runtime assertion that should never fail. This procedure is defined in Asm/IAssassin/pass5p2-sassy.sch and called only from sassy-assemble in that same file, so the call is easily commented out. Doing so establishes that 200 of the 300 cpu seconds for (time-twobit 65536) are spent in check-for-free-ids.
Subtracting the estimated 15 seconds for find-label and the measured 200 seconds for check-for-free-ids would get (time-twobit 65536) down to about 90 seconds, which is still superlinear, but not by much.
Unfortunately, the elimination of check-for-free-ids had no measurable effect on compile time for PLT Scheme's R6RS test program. That means we have some other asymptotic inefficiency that the synthetic benchmark does not trigger.
The elimination of check-for-free-ids did not measurably reduce the excessive storage requirements. Those data are noisy, but aren't noisy enough to hide the solution when we find it.
CPU time for synthetic benchmark:

LOC    (ANF)     v0.963  without check-for-free-ids
     2               12        9
     4               11       11
     8               15       14
    16               21       21
    32               35       35
    64               64       63
   128              125      119
   256              245      234
   512              507      482
  1024             1082      968
  2048             2257     1944
  4096 ( 16388)    5046     4104
  8192 ( 32772)   11961     8542
 16384 (>40000)   26873    14202
 32768 (>40000)   82381    38934
 65536 (>40000)  300642   105995
On a 1.5 GHz Ultrasparc, compiling PLT Scheme's R6RS test libraries:
((pass0 . 10)
(pass1 . 2770)
(pass2 . 730)
(pass3 . 42550)
(pass4 . 4130)
(assemble . 6810))
#!unspecified
Words allocated: 325061412
Words reclaimed: 0
Elapsed time...: 78728 ms (User: 75810 ms; System: 1010 ms)
Elapsed GC time: 5585 ms (CPU: 5630 in 1239 collections.)
(heap-area-info . #(#(0 nursery 1048576 fixed)
                    #(1 two-space 37879808 expandable)
                    #(2 static 6201344 expandable)))
That's about twice as fast as on the MacBook Pro, even though the MacBook Pro is a much faster machine. Furthermore, IAssassin spends almost all of its time in the assembler, whereas the SPARC version spends more than half of its time in pass3.
Conclusion: The machine-independent part of the assembler is probably fast enough, and the problem with IAssassin probably lies within Sassy or our interface to Sassy.
Another possibility, suggested by PnkFelix, is that one of the SRFIs being required by Sassy might be redefining some standard procedure in some way that has a drastic effect upon performance. At the very least, we should make sure all of those SRFIs run as fast as possible because Sassy uses them.
For (time-twobit 65536), the IAssassin assembler generates 2 code segments, 2 constants, 65536 labels, and 0 fixups. For (time-twobit 65536), the SPARC assembler generates 1 code segment, 2 constants, 65536 labels, and 196612 fixups.
The fixups take negligible time. The presumed 65536 calls to make-asm-label probably take no more than 25 seconds, because 65536 calls to assq on an alist of 65536 pairs that don't contain the key take 50 seconds.
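That estimate comes from a micro-benchmark along these lines (a sketch; the key structure is illustrative):

    ;; n failing assq lookups on an n-pair alist: each lookup scans all
    ;; n pairs, so the total work is O(n^2).
    (define (assq-quadratic n)
      (let loop ((i 0) (alist '()))
        (if (< i n)
            (loop (+ i 1) (cons (cons (list 'label i) i) alist))
            (do ((i 0 (+ i 1)))
                ((= i n))
              (assq 'missing alist)))))    ; key never present: full scan

    (assq-quadratic 65536)   ; cf. the ~50 seconds reported above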
Conclusion: Don't bother with the quadratic behavior of make-asm-label for now, although it will be worth looking into later, after the real problem has been identified and solved.
An hour later: so much for jumping to conclusions. After hacking a more convenient interface onto PnkFelix's profiler, it looks like (time-twobit 65536) is spending about 93% of its time in find-label within make-asm-label. That should be straightforward to fix.
After changing Asm/Shared/pass5p1.sch to use hashtables for labels, (time-twobit 65536) runs in 20 seconds on the SPARC, down from 275 in v0.963. Compiling the libraries of PLT Scheme's R6RS test program takes 80 seconds, which is about the same as in v0.963.
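The shape of that change, sketched with illustrative names (the actual code is in Asm/Shared/pass5p1.sch; R6RS hashtables assumed):

    ;; Replacing an alist scanned by assq with an eqv?-hashed table makes
    ;; each label lookup O(1) amortized instead of O(n).
    (define (make-label-table) (make-eqv-hashtable))

    (define (find-label table label)
      (hashtable-ref table label #f))        ; #f if the label is unknown

    (define (add-label! table label offset)
      (hashtable-set! table label offset))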
For IAssassin, with check-for-free-ids disabled, (time-twobit 65536) now runs in 70 seconds, down from 300 in v0.963 and from 100 in v0.963 with check-for-free-ids disabled. Compiling the libraries of PLT Scheme's R6RS test program takes 143 cpu seconds, which is about 8 seconds faster than in v0.963.
The IAssassin, IL, and IL-LCG assemblers call as-labels, usually in order to call assq on the resulting alist. This is almost certain to produce quadratic behavior.

Fixing that (for IAssassin only) had no measurable effect on (time-twobit 65536) or on compile time for PLT Scheme's R6RS test program.
Profiling during compilation of the R6RS test program gave these results:
% topmost named procedure
18 find-tail
9 sassy
7 equal?
7 hash
6 sassy-symbol-ensure
5 hash-table-ref
4 %hash-table-walk
3 hash-table-set!
3 pass3
3 sassy-assemble
2 find
2 append
1 representation-analysis
1 %hash-table-maybe-resize!
1 env-lookup
1 %hash-table-hash
1 loop
1 %bignum-length
1 sassy-symbol-exists?
1 sassy-symbol-set-scope!
1 every?
1 emit-direct2
1 hash-on-equal
1 get-datum-with-source-locations
1 available-intersect!
1 list-ref
1 %hash-table-add!
1 fixnum->bignum
1 big-divide-digits
[...]
% active procedures
100 r5rs-entry-point
100 repl
100 run-with-profiling
82 assemble
82 assemble1
82 assemble-finalize!
76 sassy
48 twobit
47 compile-file
29 sassy-symbol-ensure Lib/Sassy/intern.scm
29 sassy-symbol-set-scope! Lib/Sassy/intern.scm
23 find-tail SRFI 1
15 emit-direct2 Lib/Sassy/opcodes.scm
14 hash-table-ref SRFI 69
11 compile
10 pass3
8 hash-table-set! SRFI 69
8 %hash-table-maybe-resize! SRFI 69
8 loop
7 hash SRFI 69
7 equal?
7 u/s-dword Lib/Sassy/numbers.scm
7 lgand SRFI 60
6 %hash-table-walk SRFI 69
6 find SRFI 1
6 compile-expression
4 bignum-quotient
4 big-divide-digits
4 sassy-assemble Asm/IAssassin/pass5p2-sassy.sch
4 i32 Lib/Sassy/opcodes.scm
3 fixup-mem Lib/Sassy/opcodes.scm
3 scan-sequence lib/R6RS/r6rs-expander.sch
3 operator-binding lib/R6RS/r6rs-expander.sch
3 binding lib/R6RS/r6rs-expander.sch
3 env-lookup lib/R6RS/r6rs-expander.sch, src/Compiler/copy.sch
[...]
Undoing some of the performance damage done by the reference implementation of SRFI 1 reduced the time for (time-twobit 65536) from 70 to 65 seconds. Compiling the libraries of PLT Scheme's R6RS test program has dropped from 143 to 116 cpu seconds.
Rewriting the implementation of SRFI 69 to use R6RS hashtables made very little difference. Compile times for (time-twobit 65536) and the R6RS test program changed by no more than a cpu second, but the profiling output changed dramatically:
% topmost named procedure
24 sassy-symbol-ensure
14 hash-table-ref
12 sassy
4 equal?
4 sassy-assemble
3 pass3
3 append
2 %bignum-length
2 sassy-symbol-exists?
2 env-lookup
2 loop
1 representation-analysis
1 every?
1 big-multiply-through-by
1 big-divide-digits
1 hash-on-equal
1 big-compare
1 make-r6rs-hashtable
1 sassy-symbol-set-scope!
1 list-ref
1 big-limited-normalize!
1 available-intersect!
1 emit-direct2
1 get-datum-with-source-locations
[...]
% active procedures
100 r5rs-entry-point
100 repl
100 run-with-profiling
79 assemble
79 assemble1
79 assemble-finalize!
69 sassy
48 twobit
46 compile-file
34 sassy-symbol-ensure
34 sassy-symbol-set-scope!
18 emit-direct2
16 hash-table-ref
13 compile
12 pass3
11 loop
9 u/s-dword
9 lgand
7 compile-expression
6 sassy-assemble
5 big-divide-digits
5 bignum-quotient
5 i32
4 fixup-mem
4 equal?
4 binding
4 env-lookup
4 scan-sequence
4 operator-binding
3 append
[...]
The profiler's output indicates that Twobit is spending almost half its time in two small procedures:
- 12% is spent in sassy, just setting up to do some assembling
- 34% is spent in sassy-symbol-ensure, which is called by sassy-symbol-set-scope!
Adding temporary instrumentation revealed there were 36550 calls to sassy, about 21538865 calls to sassy-symbol-ensure, and almost as many (20938395) iterations of the named loop body in sassy-symbol-ensure:
(define (sassy-symbol-ensure sassy-output symbol-name)
  ;; Return the existing Sassy symbol for symbol-name, or create a
  ;; fresh one and install it in the innermost hashtable of the
  ;; symbol-table list.
  (or (sassy-symbol-exists-env? sassy-output symbol-name)
      (let ((new (make-sassy-symbol symbol-name 'local #f #f #f '() #f)))
        (let iter ((t (sassy-symbol-table sassy-output)))
          (if (hash-table? (car t))
              (begin (hash-table-set! (car t) symbol-name new)
                     new)
              (iter (cdr t)))))))
Experiments at the ERR5RS top level revealed 4 calls to sassy to compile the usual definition of doubly recursive fib, 99 calls to sassy-symbol-ensure, and 26 executions of the iter loop body above.
Apparently a sassy-output record contains a list of hashtables, through which sassy-symbol-ensure must iterate. The average length of the traversed lists is about 1, but there are an average of over 500 calls to sassy-symbol-ensure for each call to sassy. With the most trivial call to sassy-symbol-ensure I could devise, 21538865 calls took about 7.7 seconds. With the actual arguments, those 21538865 calls are taking about 39 seconds.
Apparently calls can be less trivial than the one I devised, since sassy-symbol-ensure calls sassy-symbol-exists?:
(define (sassy-symbol-exists? sassy-output name)
  ;; Walk the symbol-table list, whose elements may be hashtables or
  ;; bare Sassy symbols; return the entry for name, or #f.
  (let iter ((rst (sassy-symbol-table sassy-output)))
    (cond ((hash-table? rst)
           (hash-table-ref rst name (lambda () #f)))
          ((hash-table? (car rst))
           (hash-table-ref (car rst) name (lambda () #f)))
          ((eq? name (sassy-symbol-name (car rst))) (car rst))
          (else (iter (cdr rst))))))
This is a second loop over that list of hashtables, and it looks like that list might contain symbols instead of, or as well as, hashtables. Notice also the gratuitous inefficiency of creating a thunk instead of using hash-table-ref/default or hand-lifting the thunk to top level. (Compilers could optimize that away, but Twobit doesn't, because many programmers have a naive view of Scheme's location-tagging eq? semantics for procedures.)
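For reference, the two cheaper alternatives look like this (SRFI 69 API; table and name are placeholders):

    ;; (a) hash-table-ref/default takes a plain default value, so no
    ;;     closure is allocated on each lookup:
    (hash-table-ref/default table name #f)

    ;; (b) or hand-lift the thunk to top level so it is allocated once:
    (define (false-thunk) #f)
    (hash-table-ref table name false-thunk)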
Instrumentation revealed 22510940 executions of that loop body, which means at least 95% of the calls to this procedure execute the loop body only once.
The name passed to these procedures can be anything that satisfies the valid-label0 predicate:
- a symbol other than a keyword (note the use of member in valid-label0 where memq would do)
- a quoted-label of the form (quote sym), where sym is a symbol

On the average, it looks as though there are two hashtable operations (with accompanying hash-on-equal), one closure creation, and one creation of a new Sassy symbol for each call to sassy-symbol-ensure.
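A sketch of the predicate just described (illustrative; the real valid-label0 is part of Sassy, and the keyword list is elided):

    ;; memq suffices here because the keywords are symbols compared by
    ;; eq?; the real code uses member, which compares with equal?.
    (define (valid-label0? x keywords)
      (or (and (symbol? x) (not (memq x keywords)))
          (and (pair? x)                    ; quoted-label: (quote sym)
               (eq? (car x) 'quote)
               (pair? (cdr x))
               (symbol? (cadr x))
               (null? (cddr x)))))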
Inlining the call to sassy-symbol-exists? within sassy-symbol-ensure, followed by local simplification, saved 2 or 3 seconds of cpu time on the R6RS test program, bringing it down to 112 or 113 cpu seconds. At least 5.5 cpu seconds, and conceivably up to 40 cpu seconds, are spent in 82421463 calls to the hash procedure of SRFI 69.
Two or three full days of hard work have reduced the compile time for the R6RS test program by about 25%, from 150 cpu seconds to 112. As described below, there's another factor of 2 to be gained.
Andre van Tonder's implementation of the R6RS library/macro system calls eval on the expansion of each library definition that is read during evaluation or compilation. The purpose of this call is to register the library definition. Since eval calls the compiler in native Larceny, each library definition will be compiled twice:
- immediately after it has been expanded, to register the library
- when the expanded code is compiled to a .slfasl file, or when the expanded code is eval'ed as though it were an ordinary expression

Will implemented a hack to defer registering libraries that are defined by the only library definition in the file being compiled; in that case, the compiled file can be loaded immediately after it has been written, and that is faster than compiling a second time by calling eval. That hack doesn't work when more than one library is defined within the file being compiled.
That hack also doesn't work when R6RS programs are executed from source code. In that case, we'd like to recognize the expanded version of a library so we can avoid calling eval on it a second time, but that's hard to do given the way van Tonder's code is currently written.
Adding a cache to normal hashtables seems to have improved IAssassin compile times by one to two percent.
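The cache might look something like a one-entry memo in front of each table (a sketch, not the actual change; R6RS hashtables assumed):

    ;; Remember the last key looked up; repeated lookups of the same key
    ;; (common in the assembler) then skip hashing entirely.
    (define (make-cached-lookup table)
      (let ((last-key #f)
            (last-val #f))
        (lambda (key)
          (if (and last-key (eq? key last-key))
              last-val
              (let ((val (hashtable-ref table key #f)))
                (set! last-key key)
                (set! last-val val)
                val)))))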
Total CPU time to compile (without executing) some benchmarks:
                   IAssassin            Sparc
                 ==================   ==================
LOC    (ANF)     v0.963    v0.963+    v0.963    v0.963+
                           rev5714              rev5714
     2                9         10         0          0
     4               10         11        10          0
     8               15         14        10         10
    16               22         20        10         10
    32               36         33        20         20
    64               63         60        50         50
   128              121        114        80         90
   256              247        225       190        210
   512              507        446       430        430
  1024             1066        923       890        910
  2048             2281       1863      2010       2040
  4096 ( 16388)    5030       3884      4020       3980
  8192 ( 32772)   12101       8013      9090       8500
 16384 (>40000)   27242      12474      9180       5120
 32768 (>40000)   82599      25780     51890      10080
 65536 (>40000)  320019      54802    272470      19800

PLT Scheme's
R6RS test pgm   170744      68184     81760      51620
(rev11057)
The current profiling output looks like this:
% topmost named procedure
23 sassy-symbol-ensure
13 hash-table-ref
11 sassy
6 equal?
4 sassy-assemble
3 env-lookup
3 get-datum-with-source-locations
2 loop
2 pass3
2 %bignum-length
2 append
2 big-multiply-through-by
[...]
% active procedures
100 r5rs-entry-point
100 repl
100 run-with-stats
100 run-with-profiling
85 compile-file
73 assemble
73 assemble1
73 assemble-finalize!
66 sassy
33 sassy-symbol-set-scope!
33 sassy-symbol-ensure
17 emit-direct2
15 hash-table-ref
13 loop
11 compile
10 pass3
9 u/s-dword
9 lgand
7 binding
7 scan-sequence
7 env-lookup
7 operator-binding
6 big-divide-digits
6 bignum-quotient
6 equal?
6 i32
[...]
All known sources of quadratic compile times have been fixed. Short of rewriting Sassy or moving to a faster assembler, we probably have as much to gain by improving the performance of records, hashtables, and bignums as by anything else.
The factor of two for R6RS programs executed from source can be fixed. At worst, we can compile R6RS programs to a temporary file before executing them.
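A sketch of that worst-case fix (the helper name and temporary path are illustrative; compile-file and load are ordinary Larceny procedures):

    ;; Compile the R6RS source once to a compiled file, then load the
    ;; compiled code instead of eval'ing the expansion a second time.
    (define (run-from-source src)
      (let ((tmp "/tmp/r6rs-tmp.slfasl"))    ; illustrative temporary path
        (compile-file src tmp)
        (load tmp)))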
PnkFelix reduced the R6RS (tests r6rs base) library that seems to cause significant assembly overhead to an R5RS-compatible input (see the file attached to Ticket #575). He then did some experimentation by hitting Ctrl-C as the file loads and using SRO to inspect the state of the heap at that time. (It's important to hit Ctrl-C before the heap gets too large; otherwise SRO dies attempting to construct objects too large for Larceny to handle. PnkFelix found that it seems to suffice to watch the larceny.bin process with top and hit Ctrl-C soon after it reaches 100 MB.)
The vast majority of the objects are pairs (pointer tag 1) and vector-likes (pointer tag 3); a vast majority of the vector-likes are structures (header tag 5).
% ./larceny
Larceny v0.963 "Fluoridation" (Aug 4 2008 17:12:02, precise:Posix Unix:unified)
larceny.heap, built on Wed Aug 6 17:14:36 EDT 2008
> (load "boo3.sch")
ANF size: 15910
C-c C-c
Keyboard interrupt.
Entering debugger; type "?" for help.
debug> n
>> (define vecl-objects (sro 3 5 -1))
>> (define type-counts (make-oldstyle-hashtable))
>> (do ((i 0 (+ i 1)))
((>= i (vector-length vecl-objects)))
(let ((r (vector-ref vecl-objects i)))
(hashtable-update! type-counts
(and (record? r)
(record-type-name (record-rtd r)))
(lambda (n) (+ 1 n)) 0)))
>> (hashtable-map list type-counts)
((#f 4)
(enum-set 1)
(enumeration 1)
("sassy-symbol" 1167763)
("sassy-reloc" 12162)
(hashtable 1244)
("text-block" 6759)
("sassy-output" 420)
("record-type-descriptor-type" 40))
>> (define pair-objects (sro 1 -1 -1))
>> (vector-length pair-objects)
3794384
Note that there are about 1.2 million sassy-symbol objects, and about 3.8 million pairs. PnkFelix thinks that each sassy-symbol has a vector-like-length of 8. So the pairs and the sassy symbols seem like they would account for the majority of the space usage (assuming that the heap is indeed about 100 MB at the time of the above interaction).
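A back-of-the-envelope check of that accounting (assuming 32-bit words and one header word per vector-like; those assumptions are ours):

    (define bytes-per-word 4)
    (define sassy-symbol-bytes                 ; 1167763 symbols, 8 fields + header
      (* 1167763 (+ 8 1) bytes-per-word))      ; ≈ 42 MB
    (define pair-bytes                         ; 3794384 pairs, 2 words each
      (* 3794384 2 bytes-per-word))            ; ≈ 30 MB
    (+ sassy-symbol-bytes pair-bytes)          ; ≈ 72 MB of the ~100 MB heap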
As for why there are so many sassy-symbol objects being generated, the log message for changeset:5722 describes one reason this was (accidentally) happening; that changeset corrects the problem by generating code segments that export fewer labels (and thus need neither corresponding sassy-symbol objects nor entries in Sassy's hashtable for such objects).
When the size of trigtest.sps increased from about 4700 tests to about 20000, and all 20000 tests were put into a single library instead of splitting the 4700 tests into two libraries, the compile time went from about a minute to almost 45 minutes.
The increase appears to be caused by quadratic behavior during code generation, with at least two possible causes:
- computing set unions and differences takes quadratic time because sets are represented by lists of symbols (see the sketch below)
- frame environments are very large (over 15000 entries), are represented as lists, and cgframe-release! uses remq to remove entries even though the entry to be removed occurs only once in the list and usually occurs near the very beginning of the list
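For the first cause, the quadratic cost is easy to see in the usual list representation of sets (a sketch, not Twobit's actual set code):

    ;; Union of two symbol sets represented as lists: each element of a
    ;; triggers a memq scan of b, so a union of two size-n sets is O(n^2).
    (define (set-union a b)
      (cond ((null? a) b)
            ((memq (car a) b) (set-union (cdr a) b))    ; already present
            (else (cons (car a) (set-union (cdr a) b)))))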
The more puzzling question is why frame environments are so large. The cg-arguments procedure allocates temporaries, which get added to the frame environment by phantom store instructions, but should then be released after code for all arguments has been emitted.
Changing cgframe-release! to remove only the first entry helped a lot, with both time and consing (because the deep recursions were overflowing the stack cache into the heap). Compile time dropped from 45 minutes to about 9.
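The change presumably amounts to something like this sketch (illustrative; the real cgframe-release! operates on Twobit's frame-environment structures):

    ;; remq copies the whole list even after a match; removing only the
    ;; first occurrence stops at the match, which in practice sits near
    ;; the front of the frame environment.
    (define (remq-first x lst)
      (cond ((null? lst) '())
            ((eq? x (car lst)) (cdr lst))   ; found: keep the untouched tail
            (else (cons (car lst) (remq-first x (cdr lst))))))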
Nine minutes is still unacceptable. That's with double compilation, so solving that problem would get it down to about four and a half minutes. We should measure how much of that time is being spent in each pass.