-
Notifications
You must be signed in to change notification settings - Fork 32
WhyBignumsAreSlow
Felix S Klock II edited this page Feb 13, 2014
·
2 revisions
Will wrote a set of micro-benchmarks to learn why Larceny's bignums are so slow.
Timings on a 1.5 GHz Sparc:
Larceny Larceny Chez Scheme 48 MzScheme
v0.963 v0.96a5a1 6.1 1.7 v4.1
rev 5764
fixnum * fixnum 160 nsec 160 nsec 375 nsec 240 nsec 3 usec
bignum + fixnum 12 usec 8 usec 560 nsec 2 usec 2 usec
bignum * fixnum 144 usec 10 usec 1 usec 4 usec 5 usec
bignum * 2^32 42 usec 9 usec 1 usec 5 usec 5 usec
bignum * 2^64 64 usec 10 usec 1 usec 7 usec 5 usec
bignum * 2^128 111 usec 10 usec 1 usec 10 usec 5 usec
bignum * 2^256 202 usec 10 usec 1 usec 15 usec 5 usec
bignum + bignum 13 usec 8 usec 680 nsec 2 usec 2 usec
(expt 10 10) 1 usec 1 usec 2 usec 9 usec 8 usec
(expt 10 100) 300 usec 50 usec 6 usec 21 usec 22 usec
(expt 10 1000) 38 msec 3 msec 177 usec 680 usec 380 usec
(expt 10 10000) 4 sec 110 msec 19 msec 58 msec 15 msec
(expt 10 100000) 363 sec 4 sec 2 sec 6 sec 390 msec
Timings on a 2.8 GHz Pentium:
Larceny Larceny Petite MzScheme Ikarus Gambit
v0.963 v0.965a1 Chez 7.3 v370 0.0.3 v4.1.0
rev 5764
fixnum * fixnum 121 nsec 112 nsec 413 nsec 873 nsec 96 nsec 2 usec
bignum + fixnum 4 usec 4 usec 444 nsec 536 nsec 76 nsec 2 usec
bignum * fixnum 60 usec 12 usec 560 nsec 920 nsec 96 nsec 3 usec
bignum * 2^32 16 usec 10 usec 568 nsec 964 nsec 159 nsec 2 usec
bignum * 2^64 25 usec 11 usec 572 nsec 944 nsec 190 nsec 2 usec
bignum * 2^128 41 usec 11 usec 600 nsec 1 usec 253 nsec 2 usec
bignum * 2^256 74 usec 11 usec 600 nsec 1 usec 352 nsec 3 usec
bignum + bignum 4 usec 4 usec 500 nsec 552 nsec 131 nsec 2 usec
(expt 10 10) 824 nsec 960 nsec 641 nsec 1 usec 712 nsec 2 usec
(expt 10 100) 130 usec 53 usec 2 usec 4 usec 2 usec 6 usec
(expt 10 1000) 16 msec 2 msec 40 usec 356 usec 18 usec 112 usec
(expt 10 10000) 2 sec 80 msec 3 msec 1 msec 840 usec 5 msec
(expt 10 100000) 153 sec 3 sec 344 msec 40 msec 24 msec 52 msec
Except for Larceny, Chez, Ikarus, and MzScheme (Pentium only), these timings are for interpreted benchmarks. Interpreted code often performs reasonably well on bignum benchmarks, so I didn't bother to compile for Gambit.
These numbers show that:
- Larceny takes a long time to get into the bignum code.
- Larceny traps first to millicode, then to C, then back to Scheme.
- To get into the sub-microsecond range, we need millicode that calls Scheme without going through C.
- Multiplying a bignum by a fixnum was way slow.
- Rewriting the inner loop made this case go 5 to 15 times as fast.
- Another factor of 2 to 4 could be obtained using 1616 or 3232 multiplication.
- Larceny wasn't recognizing shifts.
- Another factor of 2 to 4 could be obtained using 16-bit or 32-bit loads/stores.
- Larceny was using the classical algorithm for multiplication.
- Larceny now uses Karatsuba's algorithm for large bignums.
- Another factor of 2 to 4 could be obtained using 16-bit or 32-bit operations.
- For really large bignums, convolution-based multiplication is fastest.
The easiest and most generally applicable improvements would come from bypassing C and using more 16-bit operations.
See WhyMillicodeIsSlow.