Felix S. Klock II edited this page Jul 28, 2013 · 3 revisions

= Why Twobit is Slow =

Twobit was designed to be a fairly fast compiler, but has been quite slow on some (but not all) large programs. (See TwobitIsSlow.) We have found and repaired (or limited) inefficiencies in Twobit's macro expander, common subexpression elimination, and representation inference. Yet Twobit remained slow in v0.963, especially when generating machine code for the x86, and especially when compiling very large expressions, definitions, or libraries.

In v0.963, the main problems are that both assemblers consume superlinear time when creating large code segments, R6RS libraries get compiled twice, and the Sassy assembler is fairly slow.

The superlinear times indicated an algorithmic inefficiency or a poor choice of data structures. The presence of this problem on both x86 and SPARC machines pointed toward a problem with the relatively machine-independent parts of the MacScheme machine assembler. Constant sharing and label lookups were both taking quadratic time, although constant sharing didn't matter much because most procedures have a fairly small number of pointer constants and external references. The more important problem was that some Larceny-specific code for interfacing to the Sassy assembler was taking quadratic time to perform an expensive runtime sanity check that always succeeds.

All known asymptotic inefficiencies have been fixed in the current development version. Compile times now appear to be linear in the size of the program being compiled.

What follows are Will's notes, written during his investigation and revised to eliminate misleading data or false conclusions.

Instrumentation

Larceny v0.963 is instrumented so we can tell how much of the Twobit compiler's time is spent in its various passes. This instrumentation is not complete: we cannot easily measure how much time is spent in the reader, the R6RS library/macro expander, or the routine that writes machine code to a .fasl or .slfasl file. We also cannot distinguish between time spent in the relatively machine-independent part of the assembler (the peephole optimizer, etc.) and the machine-dependent part. But we can easily tell how much time is spent in the following passes:


    pass0     source optimization of output produced by Andre van Tonder's expander
    pass1     Twobit's legacy macro/primop expander; prepares for pass2
    pass2     closure analysis, lambda lifting, and similar optimizations
    pass3     inlining, constant propagation, CSE, representation inference
    pass4     generation of relatively machine-independent MacScheme machine code
    assembly  includes relatively machine-independent parts:
                  local optimization of MacScheme assembly code
                  peephole optimization
                  label fixups
              and the major, machine-dependent part:
                  generation of machine code (native Larceny) or C code (Petit Larceny)

Measurements

Timings in milliseconds for compile-file on a synthetic input that defines a single procedure f, with three arguments, that makes n calls to an unknown procedure g on a permutation of those three arguments.

In v0.963, neither CSE nor representation inference is performed if the size of the A-normal form is greater than 40000. That threshold was determined empirically, and may be too low now that we are imposing a separate threshold on the number of available expressions.

IAssassin

On a 2.4 GHz Intel Core 2 Duo:


LOC   (ANF)    pass0 pass1 pass2 pass3 pass4 assembly    cpu elapsed   MB

    2              1     1     0     1     0        3     12      18   13
    4              0     0     1     1     0        4     11      16   13
    8              0     1     1     1     0        8     15      21   13
   16              1     0     0     2     1       14     21      18   13
   32              0     1     1     4     1       25     35      44   13
   64              0     0     0     8     2       49     64      77   13
  128              1     2     1    18     3       97    125     151   13
  256              1     2     1    33     8      196    245     287   13
  512              0     3     1    72    17      405    507     593   13
 1024              1     6     1   191    32      837   1082    1268   22
 2048              0    12     2   363    66     1789   2257    2620   34
 4096 ( 16388)     1    56     3   833   132     3972   5046    5805   77
 8192 ( 32772)     0    49     8  1905   269     9617  11961   13534  145
16384 (>40000)     0   238    16   253   221    25945  26873   29447  100
32768 (>40000)     0   201    37   654   448    80559  82381   87619  206
65536 (>40000)     0   406    75   563  1619   296891 300642  311429  355

On a 1.5 GHz Ultrasparc:


LOC   (ANF)    pass0 pass1 pass2 pass3 pass4 assembly    cpu elapsed   MB

    2              0     0     0     0     0        0     10       9    1
    4              0     0     0     0     0        0     10      12    1
    8              0     0     0    10     0        0     10      13    1
   16              0     0     0    10     0        0     10      20    1
   32              0     0     0    20     0        0     20      28    1
   64              0     0     0    30    10        0     50      56    1
  128              0    10     0    60    20       10    100     113    1
  256              0    10     0   130    30       20    200     210   13
  512              0    10    10   270    60       40    420     443   13
 1024              0    20    10   620   120       80    890     927   15
 2048              0    50    10  1390   240      210   2020    2073   34
 4096 ( 16388)     0   100    20  2730   470      520   4040    4196   25
 8192 ( 32772)     0   190    40  5740  1100     1490   8970    9231   34
16384 (>40000)     0   370   100  1100  1050     6830  10300   10382   52
32768 (>40000)     0   750   200  1460  2300    45450  51880   52036   51
65536 (>40000)     0  1640   430  2460  5140   259810 272870  278352  154

Investigation

Minor Progress

All measurements in this section are for a 2.4 GHz MacBook Pro and IAssassin.

There are several performance bugs in the assemblers. The fact that both the IAssassin and SPARC assemblers are superlinear made me look for asymptotic inefficiencies in the machine-independent parts of those assemblers, mainly pass5p1.sch. (It's hard to imagine how a peephole optimizer could have an asymptotic inefficiency, so I haven't checked that.)

The constant-sharing as implemented by emit-constant takes quadratic time, so I limited constant-sharing to the first 50 distinct constants of a code segment. This made no measurable difference to the synthetic benchmark or to PLT Scheme's R6RS test program, as I should have determined before I made the change.
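The workaround can be sketched as follows (a Python model with hypothetical names, not Larceny's Scheme code): a constant is shared with an earlier identical one only if it appears among the first few table entries, so each emission does bounded work instead of scanning an ever-growing table.

```python
# Hypothetical model of the emit-constant workaround described above:
# search for a shareable duplicate only among the first `limit` entries,
# so per-constant work is O(limit) rather than O(table size).

SHARE_LIMIT = 50

def emit_constants(constants, limit=SHARE_LIMIT):
    table = []                       # constants emitted so far, in order
    for c in constants:
        if c in table[:limit]:       # shareable: reuse the earlier copy
            continue
        table.append(c)              # otherwise emit (possibly a duplicate)
    return table
```

Beyond the cap, duplicates may be emitted twice, trading a little space for linear-time assembly; as noted above, most code segments have few pointer constants anyway.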

Label lookups also take quadratic time. For (time-twobit 65536), I estimate that about 15 seconds are spent in find-label, which is significant but not the main problem. I haven't done anything about that yet.
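The quadratic behavior is easy to model. The following Python sketch (illustrative only, not Larceny's code) counts the comparisons needed to resolve n labels stored in an association list, as assq-style lookup does:

```python
# Illustrative model of alist-based label lookup: each of the n lookups
# scans the list linearly, so total comparisons grow quadratically.

def resolve_all(n):
    labels = [(i, 100 * i) for i in range(n)]   # (label, address) pairs
    probes = 0
    for i in range(n):                          # resolve each label once
        for k, _ in labels:                     # linear scan, like assq
            probes += 1
            if k == i:
                break
    return probes

# Total probes are n*(n+1)/2: doubling n roughly quadruples the work.
```

For n = 65536 labels this is over two billion comparisons, which is consistent with find-label dominating the profile.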

Crude profiling showed that a lot of time was being spent in check-for-free-ids, which amounts to a runtime assertion that should never fail. This procedure is defined in Asm/IAssassin/pass5p2-sassy.sch and called only from sassy-assemble in that same file, so the call is easily commented out. Doing so establishes that 200 of the 300 cpu seconds for (time-twobit 65536) are spent in check-for-free-ids.

Subtracting the estimated 15 seconds for find-label and the measured 200 seconds for check-for-free-ids would get (time-twobit 65536) down to about 90 seconds, which is still superlinear but not by much.

Unfortunately, the elimination of check-for-free-ids had no measurable effect on compile time for PLT Scheme's R6RS test program. That means we have some other asymptotic inefficiency that the synthetic benchmark does not trigger.

The elimination of check-for-free-ids did not measurably reduce the excessive storage requirements. Those data are noisy, but aren't noisy enough to hide the solution when we find it.


CPU time for synthetic benchmark

LOC   (ANF)      v0.963    without
                           check-for-free-ids

    2                12         9
    4                11        11
    8                15        14
   16                21        21
   32                35        35
   64                64        63
  128               125       119
  256               245       234
  512               507       482
 1024              1082       968
 2048              2257      1944
 4096 ( 16388)     5046      4104
 8192 ( 32772)    11961      8542
16384 (>40000)    26873     14202
32768 (>40000)    82381     38934
65536 (>40000)   300642    105995

The SPARC assembler is okay on base.sls

On a 1.5 GHz Ultrasparc, compiling PLT Scheme's R6RS test libraries:


((pass0 . 10)
 (pass1 . 2770)
 (pass2 . 730)
 (pass3 . 42550)
 (pass4 . 4130)
 (assemble . 6810))
#!unspecified
Words allocated: 325061412
Words reclaimed: 0
Elapsed time...: 78728 ms (User: 75810 ms; System: 1010 ms)
Elapsed GC time: 5585 ms (CPU: 5630 in 1239 collections.)
(heap-area-info
  .
  #(#(0 nursery 1048576 fixed)
    #(1 two-space 37879808 expandable)
    #(2 static 6201344 expandable)))

That's about twice as fast as on the MacBook Pro, which is a much faster machine. Furthermore, the IAssassin version spends almost all of its time in the assembler, while the SPARC version spends more than half of its time in pass3.

Conclusion: The machine-independent part of the assembler is probably fast enough, and the problem with IAssassin probably lies within Sassy or our interface to Sassy.

Another possibility, suggested by PnkFelix, is that one of the SRFIs being required by Sassy might be redefining some standard procedure in some way that has a drastic effect upon performance. At the very least, we should make sure all of those SRFIs run as fast as possible because Sassy uses them.

Why is the SPARC assembler quadratic for large synthetic benchmarks?

For (time-twobit 65536), the IAssassin assembler generates 2 code segments, 2 constants, 65536 labels, and 0 fixups.

For (time-twobit 65536), the SPARC assembler generates 1 code segment, 2 constants, 65536 labels, and 196612 fixups. The fixups take negligible time. The presumed 65536 calls to make-asm-label probably take no more than 25 seconds, because 65536 calls to assq on an alist of 65536 pairs that don't contain the key take 50 seconds. Conclusion: Don't bother with the quadratic behavior of make-asm-label for now, although it will be worth looking into later after the real problem has been identified and solved.

An hour later: so much for jumping to conclusions. After hacking a more convenient interface onto PnkFelix's profiler, it looks like (time-twobit 65536) is spending about 93% of its time in find-label within make-asm-label. That should be straightforward to fix.

Using a hashtable for labels

After changing Asm/Shared/pass5p1.sch to use hashtables for labels, (time-twobit 65536) runs in 20 seconds on the SPARC, down from 275 in v0.963. Compiling the libraries of PLT Scheme's R6RS test program takes 80 seconds, which is about the same as in v0.963.
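The effect of the change can be modeled the same way as the quadratic case (a Python sketch with hypothetical names; the actual fix uses Scheme hashtables in pass5p1.sch): with a hash table, each label lookup does constant expected work, so total work is linear in the number of labels.

```python
# Sketch of the hashtable fix: the same n lookups against a dict
# do constant expected work each, so total work is linear.

def resolve_all_hashed(n):
    labels = {i: 100 * i for i in range(n)}   # label -> address
    probes = 0
    for i in range(n):
        probes += 1                           # one expected probe per lookup
        _ = labels[i]
    return probes                             # exactly n: linear in n
```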

For IAssassin, with check-for-free-ids disabled, (time-twobit 65536) now runs in 70 seconds, down from 300 in v0.963 and from 100 in v0.963 with check-for-free-ids disabled. Compiling the libraries of PLT Scheme's R6RS test program takes 143 cpu seconds, which is about 8 seconds faster than in v0.963.

The IAssassin, IL, and IL-LCG assemblers are calling as-labels, usually to call assq on the resulting alist. This is almost certain to produce quadratic behavior.

Fixing that (for IAssassin only) had no measurable effect on (time-twobit 65536) or on compile time for PLT Scheme's R6RS test program.

Profiling during compilation of the R6RS test program gave these results:


 %  topmost named procedure
18  find-tail
9  sassy
7  equal?
7  hash
6  sassy-symbol-ensure
5  hash-table-ref
4  %hash-table-walk
3  hash-table-set!
3  pass3
3  sassy-assemble
2  find
2  append
1  representation-analysis
1  %hash-table-maybe-resize!
1  env-lookup
1  %hash-table-hash
1  loop
1  %bignum-length
1  sassy-symbol-exists?
1  sassy-symbol-set-scope!
1  every?
1  emit-direct2
1  hash-on-equal
1  get-datum-with-source-locations
1  available-intersect!
1  list-ref
1  %hash-table-add!
1  fixnum->bignum
1  big-divide-digits
[...]

 %  active procedures
100  r5rs-entry-point
100  repl
100  run-with-profiling
82  assemble
82  assemble1
82  assemble-finalize!
76  sassy
48  twobit
47  compile-file
29  sassy-symbol-ensure             Lib/Sassy/intern.scm
29  sassy-symbol-set-scope!         Lib/Sassy/intern.scm
23  find-tail                       SRFI 1
15  emit-direct2                    Lib/Sassy/opcodes.scm
14  hash-table-ref                  SRFI 69
11  compile
10  pass3
8  hash-table-set!                  SRFI 69
8  %hash-table-maybe-resize!        SRFI 69
8  loop
7  hash                             SRFI 69
7  equal?
7  u/s-dword                        Lib/Sassy/numbers.scm
7  lgand                            SRFI 60
6  %hash-table-walk                 SRFI 69
6  find                             SRFI 1
6  compile-expression
4  bignum-quotient
4  big-divide-digits
4  sassy-assemble                   Asm/IAssassin/pass5p2-sassy.sch
4  i32                              Lib/Sassy/opcodes.scm
3  fixup-mem                        Lib/Sassy/opcodes.scm
3  scan-sequence                    lib/R6RS/r6rs-expander.sch
3  operator-binding                 lib/R6RS/r6rs-expander.sch
3  binding                          lib/R6RS/r6rs-expander.sch
3  env-lookup                       lib/R6RS/r6rs-expander.sch, src/Compiler/copy.sch
[...]

Improving Larceny's implementations of SRFI 1 and SRFI 69

Undoing some of the performance damage done by the reference implementation for SRFI 1 reduced the time for (time-twobit 65536) from 70 to 65 seconds. Compiling the libraries of PLT Scheme's R6RS test program has dropped from 143 to 116 cpu seconds.

Rewriting the implementation of SRFI 69 to use R6RS hashtables made very little difference. Compile times for (time-twobit 65536) and the R6RS test program changed by no more than a cpu second, but the profiling output changed dramatically:


 %  topmost named procedure
24  sassy-symbol-ensure
14  hash-table-ref
12  sassy
4  equal?
4  sassy-assemble
3  pass3
3  append
2  %bignum-length
2  sassy-symbol-exists?
2  env-lookup
2  loop
1  representation-analysis
1  every?
1  big-multiply-through-by
1  big-divide-digits
1  hash-on-equal
1  big-compare
1  make-r6rs-hashtable
1  sassy-symbol-set-scope!
1  list-ref
1  big-limited-normalize!
1  available-intersect!
1  emit-direct2
1  get-datum-with-source-locations
[...]

%  active procedures
100  r5rs-entry-point
100  repl
100  run-with-profiling
79  assemble
79  assemble1
79  assemble-finalize!
69  sassy
48  twobit
46  compile-file
34  sassy-symbol-ensure
34  sassy-symbol-set-scope!
18  emit-direct2
16  hash-table-ref
13  compile
12  pass3
11  loop
9  u/s-dword
9  lgand
7  compile-expression
6  sassy-assemble
5  big-divide-digits
5  bignum-quotient
5  i32
4  fixup-mem
4  equal?
4  binding
4  env-lookup
4  scan-sequence
4  operator-binding
3  append
[...]

The profiler's output indicates that Twobit is spending almost half its time in two small procedures:

  • 12% is spent in sassy, just setting up to do some assembling
  • 34% is spent in sassy-symbol-ensure, which is called by sassy-symbol-set-scope!

Adding temporary instrumentation revealed there were 36550 calls to sassy, about 21538865 calls to sassy-symbol-ensure, and almost as many (20938395) iterations of the named loop body in sassy-symbol-ensure:


(define (sassy-symbol-ensure sassy-output symbol-name)
  (or (sassy-symbol-exists-env? sassy-output symbol-name)
      (let ((new (make-sassy-symbol symbol-name 'local #f #f #f '() #f)))
	(let iter ((t (sassy-symbol-table sassy-output)))
          (if (hash-table? (car t))
              (begin (hash-table-set! (car t) symbol-name new)
                     new)
              (iter (cdr t)))))))

Experiments at the ERR5RS top level revealed 4 calls to sassy to compile the usual definition of doubly recursive fib, 99 calls to sassy-symbol-ensure, and 26 executions of the iter loop body above.

Apparently a sassy-output record contains a list of hashtables, through which sassy-symbol-ensure must iterate. The average length of the traversed lists is about 1, but there are an average of over 500 calls to sassy-symbol-ensure for each call to sassy. With the most trivial call to sassy-symbol-ensure I could devise, 21538865 calls took about 7.7 seconds. With the actual arguments, those 21538865 calls are taking about 39 seconds.

Apparently calls can be less trivial than the one I devised, since sassy-symbol-ensure calls sassy-symbol-exists?:


(define (sassy-symbol-exists? sassy-output name)
  (let iter ((rst (sassy-symbol-table sassy-output)))
    (cond ((hash-table? rst)
           (hash-table-ref rst name (lambda () #f)))
          ((hash-table? (car rst))
           (hash-table-ref (car rst) name (lambda () #f)))
          ((eq? name (sassy-symbol-name (car rst))) (car rst))
          (else (iter (cdr rst))))))

This is a second loop over that list of hashtables, and it looks like that list might contain symbols instead of or as well as hashtables. Notice also the gratuitous inefficiency of creating a thunk instead of using hash-table-ref/default or hand-lifting the thunk to top level. (Compilers could optimize that away, but Twobit doesn't because many programmers have a naive view of Scheme's location-tagging eq? semantics for procedures.) Instrumentation revealed 22510940 executions of that loop body, which means at least 95% of the calls to this procedure execute the loop body only once.
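The thunk-versus-default point can be illustrated outside Scheme as well. Below is a Python analogue (hypothetical helper names) of the two lookup styles: passing a freshly created callback for the "missing" case allocates a closure on every call, whereas a default-value lookup, like hash-table-ref/default, does not.

```python
# Python analogue of the pattern criticized above.

def ref_with_thunk(table, key, on_missing):
    """Like (hash-table-ref table key thunk): call the thunk on a miss.
    Each call site that writes `lambda: ...` allocates a fresh closure."""
    return table[key] if key in table else on_missing()

def ref_with_default(table, key, default=None):
    """Like hash-table-ref/default: no per-call closure needed."""
    return table.get(key, default)

symbols = {"foo": 1}

# Per-lookup closure, as in (hash-table-ref rst name (lambda () #f)):
missing = ref_with_thunk(symbols, "bar", lambda: False)

# Equivalent, allocation-free form:
missing2 = ref_with_default(symbols, "bar", False)
```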

The name passed to these procedures can be anything that satisfies the valid-label0 predicate:

  • a symbol other than a keyword (note the use of member in valid-label0 where memq would do)
  • a quoted-label of the form (quote sym), where sym is a symbol

On the average, it looks as though there are two hashtable operations (with accompanying hash-on-equal), one closure creation, and one creation of a new Sassy symbol for each call to sassy-symbol-ensure.

Inlining the call to sassy-symbol-exists? within sassy-symbol-ensure, followed by local simplification, saved 2 or 3 seconds of cpu time on the R6RS test program, bringing it down to 112 or 113 cpu seconds. At least 5.5 cpu seconds, and conceivably up to 40 cpu seconds, are spent in 82421463 calls to the hash procedure of SRFI 69.

Two or three full days of hard work have reduced the compile time for the R6RS test program by about 25%, from 150 cpu seconds to 112. As described below, there's another factor of 2 to be gained.

Why ERR5RS/R6RS library code is compiled twice

Andre van Tonder's implementation of the R6RS library/macro system calls eval on the expansion of each library definition that is read during evaluation or compilation. The purpose of this call is to register the library definition. Since eval calls the compiler in native Larceny, each library definition will be compiled twice:

  • immediately after it has been expanded, to register the library
  • when the expanded code is compiled to a .slfasl file, or when the expanded code is eval'ed as though it were an ordinary expression

Will implemented a hack to defer registering libraries that are defined by the only library definition in the file being compiled; in that case, the compiled file can be loaded immediately after it has been written, and that is faster than compiling a second time by calling eval. That hack doesn't work when more than one library is defined within the file being compiled.

That hack also doesn't work when R6RS programs are executed from source code. In that case, we'd like to recognize the expanded version of a library so we can avoid calling eval on it a second time, but that's hard to do given the way van Tonder's code is currently written.

Adding a cache to normal hashtables seems to have improved IAssassin compile times by one to two percent.

Progress to date

Total CPU time to compile (without executing) some benchmarks:


                     IAssassin                 Sparc
                 ===================     ===================

LOC   (ANF)      v0.963      v0.963+     v0.963      v0.963+
                             rev5714                 rev5714

    2                 9          10           0           0
    4                10          11          10           0
    8                15          14          10          10
   16                22          20          10          10
   32                36          33          20          20
   64                63          60          50          50
  128               121         114          80          90
  256               247         225         190         210
  512               507         446         430         430
 1024              1066         923         890         910
 2048              2281        1863        2010        2040
 4096 ( 16388)     5030        3884        4020        3980
 8192 ( 32772)    12101        8013        9090        8500
16384 (>40000)    27242       12474        9180        5120
32768 (>40000)    82599       25780       51890       10080
65536 (>40000)   320019       54802      272470       19800

PLT Scheme's
R6RS test pgm    170744       68184       81760       51620
(rev11057)

The current profiling output looks like this:


 %  topmost named procedure
23  sassy-symbol-ensure
13  hash-table-ref
11  sassy
6  equal?
4  sassy-assemble
3  env-lookup
3  get-datum-with-source-locations
2  loop
2  pass3
2  %bignum-length
2  append
2  big-multiply-through-by
[...]

 %  active procedures
100  r5rs-entry-point
100  repl
100  run-with-stats
100  run-with-profiling
85  compile-file
73  assemble
73  assemble1
73  assemble-finalize!
66  sassy
33  sassy-symbol-set-scope!
33  sassy-symbol-ensure
17  emit-direct2
15  hash-table-ref
13  loop
11  compile
10  pass3
9  u/s-dword
9  lgand
7  binding
7  scan-sequence
7  env-lookup
7  operator-binding
6  big-divide-digits
6  bignum-quotient
6  equal?
6  i32
[...]

All known sources of quadratic compile times have been fixed. Short of rewriting Sassy or moving to a faster assembler, we probably have as much to gain by improving the performance of records, hashtables, and bignums as by anything else.

The factor of two for R6RS programs executed from source can be fixed. At worst, we can compile R6RS programs to a temporary file before executing them.

Some notes on object distributions

PnkFelix reduced the R6RS (tests r6rs base) library that seems to cause significant assembly overhead to an R5RS-compatible input (see the file attached to Ticket #575). He then did some experimentation by hitting Ctrl-C as the file loads and then using SRO to inspect the state of the heap at that time. (It's important to hit Ctrl-C before the heap gets too large; otherwise SRO dies attempting to construct objects too large for Larceny to handle. PnkFelix found that it seems to suffice to watch the larceny.bin process with top and hit Ctrl-C soon after it reaches 100 MB.)

The vast majority of the objects are pairs (pointer tag 1) and vector-likes (pointer tag 3); a vast majority of the vector-likes are structures (header tag 5).


% ./larceny
Larceny v0.963 "Fluoridation" (Aug  4 2008 17:12:02, precise:Posix Unix:unified)
larceny.heap, built on Wed Aug  6 17:14:36 EDT 2008

> (load "boo3.sch")
ANF size: 15910
  C-c C-cKeyboard interrupt.
Entering debugger; type "?" for help.
debug> n

>> (define vecl-objects (sro 3 5 -1))

>> (define type-counts (make-oldstyle-hashtable))

>> (do ((i 0 (+ i 1)))
       ((>= i (vector-length vecl-objects)))
     (let ((r (vector-ref vecl-objects i)))
       (hashtable-update! type-counts
                          (and (record? r)
                               (record-type-name (record-rtd r)))
                          (lambda (n) (+ 1 n)) 0)))

>> (hashtable-map list type-counts)
((#f 4)
 (enum-set 1)
 (enumeration 1)
 ("sassy-symbol" 1167763)
 ("sassy-reloc" 12162)
 (hashtable 1244)
 ("text-block" 6759)
 ("sassy-output" 420)
 ("record-type-descriptor-type" 40))

>> (define pair-objects (sro 1 -1 -1))

>> (vector-length pair-objects)
3794384

Note that there are about 1.2 million sassy-symbol objects, and about 3.8 million pairs. PnkFelix thinks that each sassy-symbol has a vector-like-length of 8. So the pairs and the sassy symbols seem like they would account for the majority of the space usage (assuming that the heap is indeed about 100 MB at the time of the above interaction).

As for why there are so many sassy-symbol objects being generated: one reason this was (accidentally) happening is described in the log message for changeset:5722, which corrects the problem by generating code segments that export fewer labels (and thus need neither corresponding sassy-symbol objects nor entries in Sassy's hashtable for such objects).

Quadratic behavior during code generation

When the size of trigtest.sps increased from about 4700 tests to about 20000, and all 20000 tests were put into a single library instead of splitting the 4700 tests into two libraries, the compile time went from about a minute to almost 45 minutes.

The increase appears to be caused by quadratic behavior during code generation, with at least two possible causes:

  • computing set unions and differences takes quadratic time because sets are represented by lists of symbols
  • frame environments are very large (over 15000 entries), are represented as lists, and cgframe-release! uses remq to remove entries even though the entry to be removed occurs only once in the list and usually occurs near the very beginning of the list
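The cgframe-release! issue can be sketched in Python (an illustrative model with hypothetical names, not Twobit's code): remq rebuilds the entire list even though the released temporary occurs exactly once, usually near the front, while stopping at the first occurrence does near-constant work in the common case.

```python
# Model of the two removal strategies. Scheme's remq compares with eq?,
# modeled here with Python's `is`.

def remq_all(x, lst):
    """Like remq: scans and copies the whole list, removing every x."""
    return [e for e in lst if e is not x]

def remove_first(x, lst):
    """Stop at the first occurrence; with x near the front, this
    touches only a few cells instead of all 15000+ frame entries."""
    for i, e in enumerate(lst):
        if e is x:
            return lst[:i] + lst[i + 1:]
    return lst
```

Since each phantom-stored temporary appears only once in the frame environment, the two strategies produce the same frame, but the first-occurrence version avoids the quadratic total cost.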

The more puzzling question is why frame environments are so large. The cg-arguments procedure allocates temporaries, which get added to the frame environment by phantom store instructions, but should then be released after code for all arguments has been emitted.

Changing cgframe-release! to remove only the first entry helped a lot, with both time and consing (because the deep recursions were overflowing the stack cache into the heap). Compile time dropped from 45 minutes to about 9.

Nine minutes is still unacceptable. That's with double compilation, so solving that problem would get the time down to about four and a half minutes. We should measure how much of that time is being spent in each pass.
