elf: replace .got.zig with a zig jump table #21065

Merged: 39 commits into master from elf-zig-got on Aug 16, 2024
Conversation

@kubkon (Member) commented Aug 13, 2024

Motivating factor: make this feature as transparent to the codegen as possible

Closes #20887

Previously, we would use .got.zig to indirect pointers to global data too, but as agreed on several occasions, we only really want to indirect function calls via an offset table or similar. In fact, as far as I understand, that was @andrewrk's original plan when he wrote the first PoC of the incremental ELF linker. Therefore, since we no longer want to indirect pointers to global data, it makes sense to replace the offset table with an equivalent jump (trampoline) table that is embedded directly within the machine code section. This improves code locality, and it should also shorten load times when dynamically linking since we no longer have to rebase any pointers.

The new jump table looks as follows (for x86_64):

<0>: jmp symbol_a // jmp rel32
<5>: jmp symbol_b
...
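
As a concrete illustration (a minimal sketch, not code from this PR; the function name and parameters are made up), each 5-byte x86_64 entry is just an opcode byte followed by a PC-relative displacement:

const std = @import("std");

/// Sketch: encode one jump-table entry as `jmp rel32` (opcode 0xE9 plus a
/// 32-bit PC-relative displacement, 5 bytes total) on x86_64.
fn writeJumpTableEntry(buf: *[5]u8, entry_addr: u64, target_addr: u64) void {
    buf[0] = 0xe9; // jmp rel32
    // The displacement is relative to the end of this 5-byte instruction.
    const rel: i32 = @intCast(@as(i64, @intCast(target_addr)) -
        (@as(i64, @intCast(entry_addr)) + 5));
    std.mem.writeInt(i32, buf[1..5], rel, .little);
}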

Compared to storing pointers, if we have to relocate the table because it outgrew its capacity, we have to recalculate the jump targets since the jump sequence is PC-relative. However, this reduces to applying a fixed offset to every entry when relocating.
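
A minimal sketch of that fixup, assuming the 5-byte x86_64 entries from above (hypothetical helper, not the PR's actual code): since rel32 = target - (entry_addr + 5) and only entry_addr changes when the table moves, each displacement simply shrinks by the move delta.

const std = @import("std");

/// Sketch: after moving the whole table by `delta` bytes, patch every entry's
/// rel32 in place; no per-symbol lookup is needed.
fn rebaseJumpTable(table: []u8, delta: i64) void {
    var off: usize = 0;
    while (off + 5 <= table.len) : (off += 5) {
        const imm = table[off + 1 ..][0..4]; // skip the 0xE9 opcode byte
        const old = std.mem.readInt(i32, imm, .little);
        std.mem.writeInt(i32, imm, @intCast(@as(i64, old) - delta), .little);
    }
}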

From the perspective of the codegen, when emitting a call with a relocation, the codegen no longer needs to care about the presence or absence of the jump table/offset table: it simply emits call rel32 with an R_X86_64_PLT32 relocation whose target is symbol_a. Then, when resolving R_X86_64_PLT32, the linker checks whether the jump table has been created and the symbol can be indirected via said table, and if so rewrites the target address to the jump table entry. Again, this is all transparent to the codegen. As an added bonus, codegen now generates identical code in build-exe and build-obj modes.
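
A sketch of the linker-side half of that contract (not the actual linker code; parameters are flattened to plain addresses and all names are made up): the usual S + A - P computation, with S swapped for the jump-table entry when one exists.

const std = @import("std");

/// Sketch: patch the rel32 immediate of a call/jmp carrying R_X86_64_PLT32.
/// If the callee is indirected through the jump table, the entry address is
/// used instead of the symbol's own address, so the callee can later move
/// without this call site being patched again.
fn resolvePlt32(
    code: []u8, // section contents being relocated
    r_offset: usize, // where the 4-byte immediate lives within `code`
    addend: i64, // typically -4 for call/jmp rel32
    p: u64, // runtime address of the relocation site
    symbol_addr: u64, // resolved address of the callee
    jump_table_entry: ?u64, // address of the callee's jump-table entry, if any
) void {
    const s = jump_table_entry orelse symbol_addr;
    const value: i32 = @intCast(@as(i64, @intCast(s)) + addend - @as(i64, @intCast(p)));
    std.mem.writeInt(i32, code[r_offset..][0..4], value, .little);
}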

One caveat of the new approach is that we only indirect function calls - if you request a function pointer, currently you will receive exactly that, with no indirection.

I am looking forward to feedback on whether we should proceed or not!

TODO

  • [x] x86_64 codegen
  • [x] x86_64-elf trampolines
  • [x] riscv codegen
    - [ ] riscv-elf trampolines (deferred until we have a working incremental linker)

kubkon added 25 commits on August 13, 2024
@kubkon requested review from andrewrk, mlugg and jacobly0 on August 13, 2024
@andrewrk (Member)

One caveat of the new approach is that we only indirect function calls - if you request a function pointer, currently you will receive exactly that, with no indirection.

Couldn't the function pointer point to the address in the jump table? Then the pointer would also be correct after that jump table entry has been updated.

@andrewrk (Member) commented Aug 15, 2024

I still think a jump table is better:

  • Avoid garbage in CPU cache that would otherwise occur when a function is moved leaving behind its trampoline followed by unused space
  • Avoid fragmentation in virtual memory space and the object file that would otherwise occur when a function is moved leaving behind its trampoline interrupting unused space
  • There is no function pointer alignment problem. Everything is fine.
  • Easier bookkeeping; simpler data structure for tracking the trampolines.
  • Simpler hot code swapping, simpler incremental linking

@kubkon (Member, Author) commented Aug 15, 2024

I still think a jump table is better:

* Avoid garbage in CPU cache that would otherwise occur when a function is moved leaving behind its trampoline followed by unused space

That one I have no experience or intuition with, so I will hand it over to @jacobly0 instead.

* Avoid fragmentation in virtual memory space and the object file that would otherwise occur when a function is moved leaving behind its trampoline interrupting unused space

We do increase fragmentation, that is true, but we still leave <evacuated_symbol>.size + padding - <trampoline>.size of unused space that can be reclaimed by a new atom or by the atom directly succeeding the trampoline. That is at least how I envisioned it.

* There is no function pointer alignment problem. Everything is fine.

How is the function pointer alignment problem solved with a jump table?

* Easier bookkeeping; simpler data structure for tracking the trampolines.

Actually, bookkeeping is largely unchanged whether the jump table is one big block or distributed. If a function has a trampoline, it gets an extra slot with an index to the symbol that acts as its trampoline. In the case of a jump table, it gets an index into the jump table.

* Simpler hot code swapping, simpler incremental linking

Incremental linking does not become more complex because of trampolines, since we simply create and add a new atom+symbol in place of the old one, with a reduced size of X, where X is the trampoline size. It all happens (will happen) using the same algorithm for allocating and freeing atoms that we already use, so I don't see any added complexity beyond actually creating a new atom/symbol for said trampoline. If I am not seeing something obvious though, please do let me know. Hot code swapping will indeed become more complicated because of trampolines, I believe, for the reasons mentioned by @mlugg.

tl;dr I would like us to reach a consensus on how to proceed so that I don't have to revert the changes immediately after committing them. Also, if you feel we should go back to an offset table, that is fine too, FWIW. I just think that ease of maintaining and developing codegen backends is of higher priority than linker complexity, IMHO; or to put it another way, I want to make codegen backends completely separated from the concept of incremental linking.

@andrewrk (Member)

Function pointers are always pointer-size aligned when using a jump table.

@kubkon (Member, Author) commented Aug 15, 2024

Function pointers are always pointer-size aligned when using a jump table.

I guess I should have asked this first: do we indirect pointers to functions via a jump table too? If so, this will not currently succeed:

test "align(N) on functions" {
    try expect((@intFromPtr(&overaligned_fn) & (0x1000 - 1)) == 0);
}

fn overaligned_fn() align(0x1000) i32 {
    return 42;
}

@kubkon (Member, Author) commented Aug 15, 2024

Also please note that every entry in a jump table is not pointer-size aligned, but instruction aligned. Perhaps you are referring to an offset table?

@andrewrk (Member)

garbage in CPU cache

This is easy to understand:

Here is a cache line

[ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] 

Here is a cache line full of jump table data:

[F] [F] [F] [F] [F] [F] [F] [F] 

All those F's are valid pointers to functions that might be used.

Here is a cache line after a function has been relocated with the other strategy:

[T] [T] [J] [T] [T] [T] [T] [T] 

J - jump instruction to the real function
T - trash. 100% guaranteed waste of space in the CPU cache

@kubkon (Member, Author) commented Aug 15, 2024

garbage in CPU cache [...]

OK, so it seems you mean that an offset table is a better solution than a jump table, be it one big block or distributed.

@andrewrk (Member)

Actually, bookkeeping is largely unchanged whether the jump table is one big block or distributed. If a function has a trampoline, it gets an extra slot with an index to the symbol that acts as its trampoline. In the case of a jump table, it gets an index into the jump table.

This means that with the function header trampoline strategy, you have to keep an old symbol data entry around for the now-deleted function, whereas with the jump table strategy, you don't.

@andrewrk (Member)

OK, so it seems you mean that an offset table is a better solution than a jump table, be it one big block or distributed.

No, my mistake for saying "function pointer" when I should have said "jump instruction", but otherwise the same point stands.

It looks like on x86_64 a jump with a 32-bit target (jmp rel32) is 5 bytes, which is annoying, but it's fine. Function pointers will have an alignment of 1 on that platform then.

This test case will regress:

test "align(N) on functions" {
    try expect((@intFromPtr(&overaligned_fn) & (0x1000 - 1)) == 0);
}

fn overaligned_fn() align(0x1000) i32 {
    return 42;
}

The language will be modified to say that when you take the address of a function, it does not necessarily gain the machine code alignment specified with the align keyword, because there may be a stub being used, or something to this effect.

Even if the resolution is to go with the function prologue strategy, I will still make this language change, because it is already evident that this flexibility is useful for compilers.

@kubkon (Member, Author) commented Aug 15, 2024

Actually, bookkeeping is largely unchanged whether the jump table is one big block or distributed. If a function has a trampoline, it gets an extra slot with an index to the symbol that acts as its trampoline. In the case of a jump table, it gets an index into the jump table.

This means that with the function header trampoline strategy, you have to keep an old symbol data entry around for the now-deleted function, whereas with the jump table strategy, you don't.

I am getting confused; let's agree on what is what, if that's OK. We have three options under consideration:

  1. Offset table: read-write section, each entry pointer-aligned and pointer-sized.

<0>: pointer to A
<8>: pointer to B
...

  2. Jump table: read-exec section (.text.zig), each entry instruction-aligned (1 on x86_64); entry size is arch-dependent, code-model-dependent, etc.

<0>: jmp rel32 // 5 bytes, PC-relative near jump
<5>: ...

  3. Distributed jump table: read-exec section (.text.zig), each entry at least instruction-aligned, or overaligned if the evacuated function was moved (see the sketch below).

<0>: code of A
...
<N>: jmp rel32 // jump to now-moved B
<N+5>: free space
...
<M>: code of C, etc.
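
To make option 3 concrete, here is a minimal sketch (hypothetical helper, using the same jmp rel32 encoding as in option 2) of what happens when a function outgrows its slot: a trampoline is written at the old address, so existing call rel32 sites that still target the old address keep working, and the bytes behind the trampoline can be handed back to the atom allocator.

const std = @import("std");

/// Sketch (x86_64): the function that lived at `old_addr` (occupying `old_size`
/// bytes, assumed >= 5) has been moved to `new_addr`. `old_bytes` is the slice
/// of the output section at `old_addr`. Returns the reclaimable byte count.
fn leaveTrampoline(old_bytes: []u8, old_addr: u64, old_size: u64, new_addr: u64) u64 {
    old_bytes[0] = 0xe9; // jmp rel32, relative to the end of the trampoline
    const rel: i32 = @intCast(@as(i64, @intCast(new_addr)) -
        (@as(i64, @intCast(old_addr)) + 5));
    std.mem.writeInt(i32, old_bytes[1..5], rel, .little);
    return old_size - 5; // free space following the trampoline
}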

FWIW I think there is a way to keep option 1, or at least reconsider it, while still keeping codegen backends largely agnostic of incremental linking, by utilising the idea of lazy symbol binding: we emit a section with immutable jump entries that always point at the same slot in the offset table. The offset table remains mutable, so we are free to rewrite pointers there, but since the machine code now points at an immutable jump table, the codegen stays largely unaware of incremental linking. It would look something like this:

// .text.zig
<0>: A
...
<N>: call B -> .plt.zig, .stub.zig, or whatever
...

// .plt.zig
<0>: mov r11, [.got.zig entry at #0]
<N>: call [r11]
...

// .got.zig
<0>: pointer to B
<8>: pointer to A
...

@kubkon (Member, Author) commented Aug 15, 2024

FWIW the obvious con of option 4* is having to (re-)introduce two more sections, .got.zig and .plt.zig, which is costly on MachO where we are severely limited in section count.

@andrewrk (Member) commented Aug 15, 2024

I think we should toss out option 1 purely on a performance basis:

  • Offset table: every function call is indirect (20-50 cpu cycles)
  • Jump table (distributed or not): every function call is direct (15-30 cpu cycles)

source
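
For reference, the two call sequences under discussion look roughly like this on x86_64 (a sketch in the style of the layouts above, not generated output); the offset-table form pays for an extra data load before the branch:

// jump table (or direct call): one direct near call
call rel32                          // E8 + rel32, 5 bytes
// offset table (.got.zig-style): load the pointer, then branch through memory
call qword ptr [rip + got_offset]   // FF 15 + disp32, 6 bytes, indirect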

The discussion right now is jump table vs distributed jump table. I don't think anyone is advocating for offset table. I'm not sure why you are bringing up Option 4?

In my earlier comments I used these terms:

  • "function prologue strategy", meaning "distributed jump table"
  • "jump table", meaning "jump table"

@kubkon (Member, Author) commented Aug 15, 2024

Oh I brought up option 4 to point out yet another mechanism at our disposal in case it wasn't clear how an offset table can be utilised while making codegen agnostic to the concept of indirection.

@jacobly0 (Member) commented Aug 15, 2024

If you pack all of the jumps together, you enforce that every function call requires a minimum of 2 cache references, or 3 if the jump instruction itself crosses a cache line boundary (since 5 is not a power of 2, 4 of the 64 possible start offsets in a 64-byte line straddle the boundary, which is pretty terrible since it would happen on a whopping 6% of functions). In order to fully utilize the first cache reference, you have to get lucky with all the adjacent jump entries being used at around the same time, which is not even slightly reasonable to expect.

If you prefix the majority of non-changing functions with their own jump, then you allow them to require only a single cache reference, one that is fully utilized for all but the most trivial of functions. Saying that this space goes to waste after moving the function is misleading, because it could be filled with any other function that fits in that space, in the same way that unused jump entries can be filled with any other jump. Even in the split case, you only have to get lucky with two functions being in use at the same time to fully utilize any given cache line.

Increasing the minimum cost of all functions is clearly inferior to only increasing the cost of some changing functions (those rapidly increasing in size). This is even more true when you realize that once a split function stops changing over a period of time, it can be reunited in a "garbage collection"-like manner, restoring the original performance after you are happy with the new function implementation and stop editing it (in a way that vastly increases its size). This trades the cost of updating all references every time a function increases in size for only doing it once the function has stabilized and stops increasing in size (and it can be deferred and batched over many functions when there is time to waste). This is never possible with a jump table because those jumps can never be adjacent to their implementation.

The bookkeeping seems completely equivalent: for each symbol you either track where the jump table entry is, or which (possibly not yet named) symbol contains the actual implementation (something which can also be trivially recovered by just reading the jump instruction).

I'm also not sure why we are intent on removing function pointer alignment from the language, since once a function is aligned to a cache line, there is little benefit to aligning it any more other than gaining bits in the function pointer to be used for other things. It's the alignment of the first jump that matters in the common non-split jump prologue case, aligning the "beginning" of the function would just move it to a different cache line negating any benefits.

@andrewrk (Member)

Thanks for chiming in @jacobly0. I'm convinced by your performance-related arguments.

I'm also not sure why we are intent on removing function pointer alignment from the language, since once a function is aligned to a cache line, there is little benefit to aligning it any more other than gaining bits in the function pointer to be used for other things. It's the alignment of the first jump that matters in the common non-split jump prologue case, aligning the "beginning" of the function would just move it to a different cache line negating any benefits.

I don't understand what you're saying here. It seems like you're making an argument against function pointer alignment being useful, which seems to comport with removing the alignment guarantees of function pointers. But your conclusion is that we should keep function pointer alignment guarantees?

@jacobly0 (Member)

I'm saying that function alignment doesn't have much use at all without function pointer alignment, so I don't understand the stance that only one should be removed. I'd also argue that its usage is niche enough that theorizing about stubs is not very relevant, given that stubs can also be aligned and almost no functions have an explicit alignment. I think Zig already made the correct choice by making function pointers default to align(1) instead of the target-specific default alignment that is actually used on functions, allowing this flexibility in implementation by default.

@andrewrk (Member)

I see, so, would you be for or against completely removing align(N) syntax from function declarations then?

@jacobly0 (Member)

I don't have a strong opinion on the outcome, I just have a strong opinion against using the discussion in this thread to justify removal. If other arguments are made for its removal, I could probably be easily convinced, but as of today, it still seems like an experiment worth keeping.

@andrewrk (Member)

Understood, thank you for the clarifications. I'm satisfied with the distributed jump table solution then.

src/arch/x86_64/CodeGen.zig (outdated review thread, resolved)

@@ -2230,7 +2171,7 @@ const riscv = struct {
     const riscv_util = @import("../riscv.zig");
 };
 
-const ResolveArgs = struct { i64, i64, i64, i64, i64, i64, i64, i64 };
+const ResolveArgs = struct { i64, i64, i64, i64, i64, i64, i64 };

Member:

why is this a tuple instead of a struct?

Member (Author):

It was like this before, but I ought to make it a struct instead of a tuple. Will do that in a follow-up.
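
For illustration only, a named-field version could look something like the following; the field names here are hypothetical (they mirror the usual ELF relocation inputs) and the actual follow-up may choose different ones:

const ResolveArgs = struct {
    p: i64, // address of the relocation site
    a: i64, // addend
    s: i64, // symbol address
    got: i64, // address of the GOT
    g: i64, // offset of the symbol's GOT entry
    tp: i64, // thread pointer value
    dtp: i64, // dynamic thread pointer value
};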

@kubkon merged commit 90989be into master on Aug 16, 2024
10 checks passed
@kubkon deleted the elf-zig-got branch on August 16, 2024