SassyInvoke
The information and musings below are (probably) very out of date; as of changeset:4000, the esp register points to the stack cache (ContIsEsp), not the globals array (ebp now points to globals).
PnkFelix is musing about ways to improve the performance of IasnLarceny.
FYI, right now, the assembly sequence for SETRTN lbl is essentially:
call $eip
L1:
pop TEMP
sub TEMP, L1
add TEMP, lbl
mov [CONT+STK_RETADDR], TEMP
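The address arithmetic in this sequence can be modelled in C. This is only a sketch under stated assumptions: the function name setrtn_address is hypothetical, and it assumes that L1 and lbl are assemble-time byte offsets from the start of the code, so that subtracting L1's offset from the popped program counter recovers the code's run-time base.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the SETRTN sequence above (not Larceny source).
 * popped_pc:  run-time address of L1, pushed by the call and popped into TEMP
 * l1_offset:  assumed assemble-time offset of L1 within the code
 * lbl_offset: assumed assemble-time offset of the return label within the code
 * Returns the value that would be stored at [CONT+STK_RETADDR]. */
uintptr_t setrtn_address(uintptr_t popped_pc, uintptr_t l1_offset,
                         uintptr_t lbl_offset)
{
    assert(popped_pc >= l1_offset);  /* L1 cannot lie before the code's base */
    uintptr_t temp = popped_pc;      /* pop TEMP */
    temp -= l1_offset;               /* sub TEMP, L1  : run-time base of the code */
    temp += lbl_offset;              /* add TEMP, lbl : run-time address of lbl */
    return temp;
}
```

Under these assumptions the result does not depend on where the code currently lives, which is why the sequence still works if the collector moves the code.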
One of our problems is that setrtn lbl; invoke n; lbl: is supposed to have an "obvious" fast implementation. PnkFelix does not think that is the case with our current setup. In particular, the only way on ia32 to combine a save of the program counter with a branch is with a call instruction -- but that pushes the pc onto the stack, rather than storing it in a link register as on more riscy architectures. (Remember, right now esp points to the start of globals, so pushes onto the stack get saved but have to be fixed up before computation can proceed.)
However, a small modification to our register mappings and the globals array could fix this.
Right now, to save the program counter, we use a call and then fix up esp to point back at the globals array. What if we instead always kept the return address in the first slot of globals, and then made setrtn become something like: add esp, 4; jmp eip. Then (Pnk thinks) a setrtn; invoke n could be optimized to something like:
mov temp, codepointer(result)
mov result, n
add esp, 4
call temp
PnkFelix worried at first about this approach being broken because of possible race conditions with asynchronous interrupts coming in and clobbering the data in the first slot of globals in between the add esp, 4 and the call temp. But that is not a problem, because we're clearly already clobbering the data with the call itself, so it's okay if an interrupt clobbers it first, as long as the handler restores the stack to its old state (as it should).
It's not this simple; PnkFelix forgot that we aren't carrying the return address directly in a slot in the globals array; instead, it is kept in a slot in the continuation. PnkFelix believes the "obvious fast code" on the SPARC is feasible because of its delay slot semantics, so that code like:
(sparc.set as (thefixnum n) $r.result)
(sparc.jmpli as $r.tmp0 $p.codeoffset $r.o7)
(sparc.sti as $r.o7 4 $r.stkp)
still works.
- It doesn't seem like there's much way around the problem. We could change it so that esp is the cont register, but even then it would point at the header for the continuation, which is not where we want to stash the return address. Wherever we did end up stashing the return address, if cont=esp, we'd still need to fix the cont register to point at a valid object header afterwards; yuck.
- We could move cont+stk_retaddr into esp before the call. But then esp would be reserved for this sole purpose, and we'd probably get into real trouble with interrupts.
LarsHansen says:
- It was a goal for SPARC Larceny to cache the return address in a register, since that aids performance on leaf calls. The problem is that the millicode calls use the same register, and millicode calls are frequent, so overall this would bloat the code. (Not much of a concern any more, perhaps.) On the 386 the issues would be the same, and you would indeed use the slot below GLOBALS as a cache for the return address: a non-tail call would save the current return address through CONT, jump to a location in the target that adds 4 to GLOBALS, and we'd be done; a return would subtract 4 from GLOBALS, execute "ret", and code in the caller would restore the return address from CONT to GLOBALS-4.
- (If the CPU has a call/return cache, like the PowerPC has, then using call/return idiomatically is also helpful.)
- Why so worried about interrupts? Signals should probably be entirely abstracted away in the C layer and handled differently in Scheme (IMO).
- Alternatively, I think you can do SETRTN L1 by:
mov TMP, [GLOBALS+REG0]
mov TMP, [TMP+CODEVECTOR]
mov [CONT+STK_RETADDR], (L1 - $)
which looks like more work, except the x86 processors are usually really good about optimizing loads; L1-$ is of course constant.
- Generally I've come to believe that moving the code with the GC, though a simplifying assumption, is probably wrong from a performance perspective.
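The return-address cache that LarsHansen describes (the slot just below GLOBALS) can be modelled in C to check that the stack pointer ends up where it started. This is a hedged sketch, not Larceny code: the Machine struct, the word-indexed memory, and the function names are all hypothetical, and the save and restore of the caller's own return address through CONT is left out.

```c
#include <assert.h>
#include <stdint.h>

enum { SLOTS = 8, GLOBALS = 4 };   /* hypothetical layout; GLOBALS is a word index */

typedef struct {
    uint32_t mem[SLOTS];   /* word-addressed memory; mem[GLOBALS - 1] is the cache slot */
    int esp;               /* word index; equals GLOBALS while Scheme code runs */
} Machine;

/* x86 "call": push the return address just below esp. */
void do_call(Machine *m, uint32_t return_address)
{
    assert(m->esp > 0);
    m->esp -= 1;
    m->mem[m->esp] = return_address;
}

/* Callee entry: "add esp, 4", so esp points at the globals again. */
void callee_entry(Machine *m)
{
    m->esp += 1;
}

/* Return: "sub esp, 4" followed by "ret", which pops the cached address. */
uint32_t do_return(Machine *m)
{
    m->esp -= 1;                    /* sub esp, 4 */
    uint32_t pc = m->mem[m->esp];   /* ret: read the return address ... */
    m->esp += 1;                    /* ... and pop it */
    return pc;
}
```

In this model a call followed by callee_entry leaves esp at GLOBALS with the return address cached one word below it, and do_return hands that address back while again leaving esp at GLOBALS.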