The problem described here was reduced by changeset:6129, which eliminated two uses of longjmp on the Scheme-to-C-to-Scheme millicode path.
h1. Why Millicode is Slow
Larceny's millicode is much slower on the IA32 than on the SPARC, apparently because switching the context from Scheme to C and back to Scheme is much more expensive on the IA32.
This appears to be one of the main reasons for Larceny's slow bignum arithmetic.
It is also one of the main reasons why inexact->exact is very slow.
Measurements in v0.97a3:
;;; (inexact->exact x)
;;;
;;; If x is a fixnum:
;;;
;;; On a SPARC: 28 nsec
;;; On a Core II Duo: 13 nsec
;;;
;;; If x is an inexact integer:
;;;
;;; On a SPARC: 795 nsec
;;; On a Core II Duo: 6049 nsec = 6.05 usec
That's why every bignum operation, no matter how simple, seems to take at least 6 usec. See WhyBignumsAreSlow.
The main difference between the SPARC and IAssassin millicode is that scheme_start is implemented by rather simple assembly language on the SPARC, but by a setjmp on IAssassin and all other systems; see src/Rts/Sparc/glue.s and src/Rts/IAssassin/i386-driver.c.
- Felix notes that scheme_start is also doing some dumb things, like the problem documented in Ticket #611.
SPARC millicode calls Scheme directly without going through C; see internal_scheme_call in glue.s. IAssassin and PetitLarceny systems go through C and perform a longjmp; see src/Rts/IAssassin/millicode.c.
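For readers who have not seen that control-transfer pattern, here is a minimal, self-contained sketch of a setjmp/longjmp trampoline of the general shape described above. It is not Larceny's code; the names (enter_scheme, millicode_request, scheme_code) and the request encoding are invented for illustration only.

#include <setjmp.h>
#include <stdio.h>

/* Dispatch point established once per entry from C into "Scheme".
 * A real runtime would jump into generated code; a plain C function
 * stands in for compiled Scheme here. */
static jmp_buf dispatch_point;
static int pending_request;

/* Stand-in for a millicode routine written in C.  Instead of returning
 * normally, it transfers control back to the dispatch loop. */
static void millicode_request( int request ) {
  pending_request = request;
  longjmp( dispatch_point, 1 );     /* unwind back to enter_scheme */
}

/* Stand-in for compiled Scheme code.  It "calls millicode" once and
 * then finishes. */
static int scheme_code( int step ) {
  if (step == 0)
    millicode_request( 42 );        /* e.g. a generic arithmetic case */
  return step;                      /* on the second entry, just return */
}

/* Analogue of scheme_start on IAssassin/Petit: establish the setjmp
 * target, then (re)enter Scheme until it runs to completion. */
static int enter_scheme( void ) {
  volatile int step = 0;
  if (setjmp( dispatch_point ) != 0) {
    /* We arrived here via longjmp from millicode_request. */
    printf( "servicing millicode request %d\n", pending_request );
    step = 1;                       /* resume Scheme at the next step */
  }
  return scheme_code( step );
}

int main( void ) {
  printf( "scheme returned %d\n", enter_scheme() );
  return 0;
}

The point of the sketch is only the shape of the control transfer: every Scheme-to-C-to-Scheme round trip pays for a longjmp (plus the setjmp that set up its target), and that is the cost the measurements below try to isolate.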
Changing IAssassin and the other systems to avoid the longjmp might improve Larceny's bignum performance by a factor of 10 or more. That would be a lot of trouble, so we should first confirm that longjmp is the problem.
After hacking the Scheme millicode vector to special-case (inexact->exact 0.0):
                         1.5 GHz Sparc    2.2 GHz Core Duo
(inexact->exact 1)       33 nsec          26 nsec           millicode calls C
(inexact->exact 0.0)                      205 nsec          millicode calls C, which is hacked to stop short of the longjmp (*)
(inexact->exact 0.0)     353 nsec         5500 nsec         millicode calls C, which calls Scheme
                                          380 nsec          after changeset:6129
(inexact->exact 1.0)     855 nsec         5800 nsec         millicode calls C, which calls Scheme
                                          440 nsec          after changeset:6129
(inexact->exact 1.2)     143230 nsec      250000 nsec       millicode calls C, which calls Scheme, which calls millicode...
                                          63000 nsec        after changeset:6129
With (inexact->exact 1.2), each of dozens of bignum operations incurs the 5 usec overhead of a millicode call that goes through C into Scheme. Experiment (*) tells us the two uses of longjmp are where the time is going.
Finally, a small C program established that each longjmp takes about 4.7 usec on the Core Duo. (There are two invocations of longjmp per call to inexact->exact on a flonum, so the mystery now is why each call doesn't take at least 9.4 usec.)
/* How long does it take to perform m longjumps
 * when the recursion depth is n?
 *
 * Usage:
 *     a.out m n
 *
 * m defaults to 1000000
 * n defaults to 10
 */

#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

jmp_buf *jump_buffer;
int result;

/* Recurses to depth n, then longjmps back to the setjmp in bar. */
int foo (int n, int r) {
  if (n > 0)
    return 1 + foo (n - 1, r + 1);
  else {
    result = r;
    longjmp( *jump_buffer, r );
  }
}

int bar (int n) {
  volatile int first_time = 1;   /* volatile: modified between setjmp and longjmp */
  result = 0;
  jump_buffer = malloc( sizeof( jmp_buf ) );
  setjmp( *jump_buffer );
  if (first_time) {
    first_time = 0;
    result = foo (n, 0);
  }
  free( jump_buffer );           /* avoid leaking one jmp_buf per call */
  return result;
}

void usage_message () {
  fprintf( stderr, "Usage: longjump m n\n" );
}

int main( int argc, char **argv ) {
  int m = 1000000;
  int n = 10;
  int result;
  int i;
  char c;                        /* catches trailing junk after the number */
  if (argc == 3) {
    if (sscanf( argv[2], "%d%c", &n, &c ) != 1) {
      usage_message();
      return 1;
    }
  }
  if (argc >= 2) {
    if (sscanf( argv[1], "%d%c", &m, &c ) != 1) {
      usage_message();
      return 1;
    }
  }
  printf( "m = %d, n = %d\n", m, n );
  for (i = 0; i < m; i = i + 1) {
    result = bar (n);
  }
  printf( "result = %d\n\n", result );
  return 0;
}
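The page does not show how the 4.7 usec figure was extracted from that program; presumably the whole run was timed externally and divided by m. As a hedged variant written for this page (not taken from Larceny), the following self-contained program times raw setjmp/longjmp round trips in-process with clock_gettime. Since it jumps within a single shallow frame, it measures something closer to a best case than the recursion-depth setup above.

/* Times m setjmp/longjmp round trips and prints nanoseconds per trip. */

#include <setjmp.h>
#include <stdio.h>
#include <time.h>

static jmp_buf buf;

int main( void ) {
  const long m = 1000000;
  volatile long i = 0;            /* volatile so it survives the longjmps */
  struct timespec t0, t1;
  double elapsed_ns;

  clock_gettime( CLOCK_MONOTONIC, &t0 );
  setjmp( buf );                  /* control returns here after each longjmp */
  if (++i < m)
    longjmp( buf, 1 );
  clock_gettime( CLOCK_MONOTONIC, &t1 );

  /* m - 1 longjmps were taken before the loop fell through. */
  elapsed_ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  printf( "%.1f nsec per setjmp/longjmp pair\n", elapsed_ns / (m - 1) );
  return 0;
}

Comparing its per-pair figure with the 4.7 usec measured above gives a rough idea of how much of that cost is longjmp itself as opposed to the surrounding recursion and the malloc performed on every call to bar. (On older Linux systems clock_gettime may require linking with -lrt.)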