SassyInvoke

The information and musings below are (probably) very out of date; as of changeset:4000, the esp register points to the stack cache (ContIsEsp), not the globals array (ebp now points to globals).

PnkFelix is musing about ways to improve the performance of IasnLarceny.

FYI, right now, the assembly sequence for SETRTN lbl is essentially:


  call $eip
L1:
  pop TEMP
  sub TEMP, L1
  add TEMP, lbl
  mov [CONT+STK_RETADDR], TEMP
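
The arithmetic in that sequence can be checked with a small model. This is a sketch in Python, not Larceny code: `setrtn_address` and its parameter names are hypothetical, and it assumes that `L1` and `lbl` assemble to offsets within the same block of code and that the call pushes the runtime address of `L1`.

```python
def setrtn_address(runtime_base, off_l1, off_lbl):
    """Model of the SETRTN sequence; returns the value stored at [CONT+STK_RETADDR]."""
    temp = runtime_base + off_l1   # call + pop TEMP: runtime address of L1
    temp -= off_l1                 # sub TEMP, L1: runtime base of the code
    temp += off_lbl                # add TEMP, lbl: runtime address of lbl
    return temp


# The same code placed at two different addresses (for example, before and
# after being moved) yields a return address that tracks the code's location.
before_move = setrtn_address(0x8000, off_l1=0x15, off_lbl=0x40)
after_move = setrtn_address(0xA000, off_l1=0x15, off_lbl=0x40)
```

Under these assumptions the offset of `L1` cancels out, so the stored address depends only on where the code currently lives and on the offset of `lbl`.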

One of our problems is that setrtn lbl; invoke n; lbl: is supposed to have an "obvious" fast implementation. PnkFelix does not think that is the case with our current setup. In particular, the only way on ia32 to combine a save of the program counter with a branch is with a call instruction, but that pushes the pc onto the stack rather than storing it in a link register as on more RISC-like architectures. (Remember, right now esp points to the start of globals, so a push does save the value, but esp has to be fixed up before computation can proceed.)

However, a small modification to our register mappings and the globals array could fix this.

Right now, to save the program counter, we use a call and then fix up esp to point back at the globals array. What if we instead always kept the return address in the first slot of globals, and then made setrtn become something like: add esp, 4; jmp eip? Then (Pnk thinks) a setrtn; invoke n could be optimized to something like:


mov temp, codepointer(result)
mov result, n
add esp, 4
call temp

PnkFelix worried at first about this approach being broken because of possible race conditions, with asynchronous interrupts coming in and clobbering the data in the first slot of globals in between the add esp, 4 and the call temp. But that is not a problem, because we're clearly already clobbering the data with the call itself, so it's okay if an interrupt clobbers it first, as long as the handler restores the stack to its old state (as it should).

It's not this simple; PnkFelix forgot that we aren't carrying the return address directly in a slot in the globals array; instead, it is kept in a slot in the continuation. PnkFelix believes the "obvious fast code" on the SPARC is feasible because of its delay slot semantics, so that code like:


(sparc.set   as (thefixnum n) $r.result)
(sparc.jmpli as $r.tmp0 $p.codeoffset $r.o7)
(sparc.sti   as $r.o7 4 $r.stkp)

still works.

There doesn't seem to be much of a way around the problem. We could change it so that esp is the cont register, but even then it would point at the header for the continuation, which is not where we want to stash the return address. Wherever we did end up stashing the return address, if cont=esp, we'd still need to fix the cont register to point at a valid object header afterwards; yuck.
We could move cont+stk_retaddr into esp before the call. But then esp would be reserved for this sole purpose, and we'd probably get into real trouble with interrupts.

LarsHansen says:

It was a goal for SPARC Larceny to cache the return address in a register, since that aids performance on leaf calls. The problem is that the millicode calls use the same register, and millicode calls are frequent, so overall this would bloat the code. (Not much of a concern any more, perhaps.) On the 386 the issues would be the same, and you would indeed use the slot below GLOBALS as a cache for the return address: a non-tail call would save the current return address through CONT, jump to a location in the target that adds 4 to GLOBALS, and we'd be done; a return would subtract 4 from GLOBALS, execute "ret", and code in the caller would restore the return address from CONT to GLOBALS-4.
(If the CPU has a call/return cache, like the PowerPC has, then using call/return idiomatically is also helpful.
Why so worried about interrupts? Signals should probably be entirely abstracted away in the C layer and handled differently in Scheme (IMO).)
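
A toy model of the return-address cache described above may make the esp bookkeeping concrete. This is a sketch in Python, not Larceny code: the `Machine` class, its method names, the address 0x1000, the `frame_a` dictionary standing in for a continuation frame, and the string labels standing in for code addresses are all made up for illustration, and it assumes esp holds GLOBALS between calls.

```python
class Machine:
    """Toy model: esp normally equals GLOBALS; the word below it caches the return address."""

    def __init__(self, globals_addr=0x1000):
        self.globals_addr = globals_addr
        self.esp = globals_addr
        self.mem = {}  # sparse memory: address -> value

    def call(self, return_addr):
        # x86 call: push the return address, which lands in the slot at GLOBALS-4
        self.esp -= 4
        self.mem[self.esp] = return_addr

    def callee_entry(self):
        # entry code in the target: add 4 so esp points at GLOBALS again
        self.esp += 4

    def ret(self):
        # return: subtract 4 from esp, then x86 ret pops the cached address
        self.esp -= 4
        target = self.mem[self.esp]
        self.esp += 4
        return target

    def restore_cache(self, saved_return_addr):
        # caller, after the callee returns: copy its own return address
        # from its continuation frame back into GLOBALS-4
        self.mem[self.globals_addr - 4] = saved_return_addr


m = Machine()
m.call("ret_into_main")                            # main calls A
m.callee_entry()
frame_a = {"retaddr": m.mem[m.globals_addr - 4]}   # A saves its return address through CONT
m.call("ret_into_A")                               # A makes a non-tail call to B
m.callee_entry()
b_returns_to = m.ret()                             # B returns into A
m.restore_cache(frame_a["retaddr"])                # A refills the cache from its frame
a_returns_to = m.ret()                             # A returns into main
```

In this model the non-tail call overwrites the cached address, which is why the caller has to save it through CONT beforehand and restore it once the callee has returned.
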
Alternatively, I think you can do SETRTN L1 by


   mov TMP, [GLOBALS+REG0]
   mov TMP, [TMP+CODEVECTOR]
   mov [CONT+STK_RETADDR], (L1 - $)

which looks like more work, except the x86 processors are usually really good about optimizing loads; L1-$ is of course constant.

Generally I've come to believe that moving the code with the GC, though a simplifying assumption, is probably wrong from a performance perspective.
