The problem described here was reduced by changeset:6129, which eliminated two uses of longjmp on the Scheme-to-C-to-Scheme millicode path.
h1. Why Millicode is Slow
Larceny's millicode is much slower on the IA32 than on the SPARC, apparently because switching the context from Scheme to C and back to Scheme is much more expensive on the IA32.
This appears to be one of the main reasons for Larceny's slow bignum arithmetic.
It is also one of the main reasons why inexact->exact is very slow.
Measurements in v0.97a3:
;;; (inexact->exact x)
;;;
;;; If x is a fixnum:
;;;
;;; On a SPARC: 28 nsec
;;; On a Core II Duo: 13 nsec
;;;
;;; If x is an inexact integer:
;;;
;;; On a SPARC: 795 nsec
;;; On a Core II Duo: 6049 nsec = 6.05 usec
That's why every bignum operation, no matter how simple, seems to take at least 6 usec. See WhyBignumsAreSlow.
The main difference between the SPARC and IAssassin millicode is that scheme_start is implemented by rather simple assembly language on the SPARC, but by a setjmp on IAssassin and all other systems; see src/Rts/Sparc/glue.s and src/Rts/IAssassin/i386-driver.c.
- Felix notes that scheme_start is also doing some dumb things, like the problem documented in Ticket #611.
SPARC millicode calls Scheme directly without going through C; see internal_scheme_call in glue.s. IAssassin and PetitLarceny systems go through C and perform a longjmp; see src/Rts/IAssassin/millicode.c.
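For readers who have not seen that control-transfer pattern, here is a minimal, self-contained sketch of a setjmp/longjmp trampoline of the general shape described above. It is not Larceny's code; the names (enter_scheme, millicode_request, scheme_code) and the request encoding are invented for illustration only.

#include <setjmp.h>
#include <stdio.h>

/* Dispatch point established once per entry from C into "Scheme".
 * A real runtime would jump into generated code; a plain C function
 * stands in for compiled Scheme here. */
static jmp_buf dispatch_point;
static int pending_request;

/* Stand-in for a millicode routine written in C.  Instead of returning
 * normally, it transfers control back to the dispatch loop. */
static void millicode_request( int request ) {
  pending_request = request;
  longjmp( dispatch_point, 1 );     /* unwind back to enter_scheme */
}

/* Stand-in for compiled Scheme code.  It "calls millicode" once and
 * then finishes. */
static int scheme_code( int step ) {
  if (step == 0)
    millicode_request( 42 );        /* e.g. a generic arithmetic case */
  return step;                      /* on the second entry, just return */
}

/* Analogue of scheme_start on IAssassin/Petit: establish the setjmp
 * target, then (re)enter Scheme until it runs to completion. */
static int enter_scheme( void ) {
  volatile int step = 0;
  if (setjmp( dispatch_point ) != 0) {
    /* We arrived here via longjmp from millicode_request. */
    printf( "servicing millicode request %d\n", pending_request );
    step = 1;                       /* resume Scheme at the next step */
  }
  return scheme_code( step );
}

int main( void ) {
  printf( "scheme returned %d\n", enter_scheme() );
  return 0;
}

The point of the sketch is only the shape of the control transfer: every Scheme-to-C-to-Scheme round trip pays for a longjmp (plus the setjmp that set up its target), and that is the cost the measurements below try to isolate.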
Changing IAssassin and the other systems to avoid the longjmp might improve Larceny's bignum performance by a factor of 10 or more. That would be a lot of trouble, so we should first confirm that longjmp is the problem.
After hacking the Scheme millicode vector to special-case (inexact->exact 0.0):
                         1.5 GHz Sparc    2.2 GHz Core Duo
(inexact->exact 1)       33 nsec          26 nsec           millicode calls C
(inexact->exact 0.0)                      205 nsec          millicode calls C, which is hacked to stop short of the longjmp (*)
(inexact->exact 0.0)     353 nsec         5500 nsec         millicode calls C, which calls Scheme
                                          380 nsec          after changeset:6129
(inexact->exact 1.0)     855 nsec         5800 nsec         millicode calls C, which calls Scheme
                                          440 nsec          after changeset:6129
(inexact->exact 1.2)     143230 nsec      250000 nsec       millicode calls C, which calls Scheme, which calls millicode...
                                          63000 nsec        after changeset:6129
With (inexact->exact 1.2), each of dozens of bignum operations incurs the 5 usec overhead of a millicode call that goes through C into Scheme. Experiment (*) tells us the two uses of longjmp are where the time is going.
Finally, a small C program established that each longjmp takes about 4.7 usec on the Core Duo. (There are two invocations of longjmp per call to inexact->exact on a flonum, so the mystery now is why each call doesn't take at least 9.4 usec.)
/* How long does it take to perform m longjumps
 * when the recursion depth is n?
 *
 * Usage:
 *     a.out m n
 *
 * m defaults to 1000000
 * n defaults to 10
 */

#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

jmp_buf *jump_buffer;
int result;

/* Recurses to depth n, then longjmps back to the setjmp in bar. */
int foo (int n, int r) {
  if (n > 0)
    return 1 + foo (n - 1, r + 1);
  else {
    result = r;
    longjmp( *jump_buffer, r );
  }
}

int bar (int n) {
  volatile int first_time = 1;   /* volatile: modified between setjmp and longjmp */
  result = 0;
  jump_buffer = malloc( sizeof( jmp_buf ) );
  setjmp( *jump_buffer );
  if (first_time) {
    first_time = 0;
    result = foo (n, 0);
  }
  free( jump_buffer );           /* avoid leaking one jmp_buf per call */
  return result;
}

void usage_message () {
  fprintf( stderr, "Usage: longjump m n\n" );
}

int main( int argc, char **argv ) {
  int m = 1000000;
  int n = 10;
  int result;
  int i;
  char c;                        /* catches trailing junk after the number */
  if (argc == 3) {
    if (sscanf( argv[2], "%d%c", &n, &c ) != 1) {
      usage_message();
      return 1;
    }
  }
  if (argc >= 2) {
    if (sscanf( argv[1], "%d%c", &m, &c ) != 1) {
      usage_message();
      return 1;
    }
  }
  printf( "m = %d, n = %d\n", m, n );
  for (i = 0; i < m; i = i + 1) {
    result = bar (n);
  }
  printf( "result = %d\n\n", result );
  return 0;
}
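The page does not show how the 4.7 usec figure was extracted from that program; presumably the whole run was timed externally and divided by m. As a hedged variant written for this page (not taken from Larceny), the following self-contained program times raw setjmp/longjmp round trips in-process with clock_gettime. Since it jumps within a single shallow frame, it measures something closer to a best case than the recursion-depth setup above.

/* Times m setjmp/longjmp round trips and prints nanoseconds per trip. */

#include <setjmp.h>
#include <stdio.h>
#include <time.h>

static jmp_buf buf;

int main( void ) {
  const long m = 1000000;
  volatile long i = 0;            /* volatile so it survives the longjmps */
  struct timespec t0, t1;
  double elapsed_ns;

  clock_gettime( CLOCK_MONOTONIC, &t0 );
  setjmp( buf );                  /* control returns here after each longjmp */
  if (++i < m)
    longjmp( buf, 1 );
  clock_gettime( CLOCK_MONOTONIC, &t1 );

  /* m - 1 longjmps were taken before the loop fell through. */
  elapsed_ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  printf( "%.1f nsec per setjmp/longjmp pair\n", elapsed_ns / (m - 1) );
  return 0;
}

Comparing its per-pair figure with the 4.7 usec measured above gives a rough idea of how much of that cost is longjmp itself as opposed to the surrounding recursion and the malloc performed on every call to bar. (On older Linux systems clock_gettime may require linking with -lrt.)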