elf: replace .got.zig with a zig jump table #21065

Merged: 39 commits into master from elf-zig-got on Aug 16, 2024
Conversation

@kubkon (Member) commented Aug 13, 2024

Motivating factor: make this feature as transparent to the codegen as possible

Closes #20887

Previously, we would use .got.zig to indirect pointers to global data too, but as agreed on several occasions, we only really want to indirect function calls via an offset table or similar. In fact, as far as I understand, that was @andrewrk's original plan when he wrote the first PoC of the incremental ELF linker. Therefore, since we no longer want to indirect pointers to global data, it makes sense to replace the offset table with an equivalent jump (trampoline) table that is embedded directly within the machine code section. This improves code locality, and it should also shorten load times when dynamically linking since we no longer have to rebase any pointers.

The new jump table looks as follows (for x86_64):

<0>: jmp symbol_a // jmp rel32
<5>: jmp symbol_b
...
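
As a concrete illustration (a minimal sketch, not code from this PR; the function name and parameters are made up), each 5-byte x86_64 entry is just an opcode byte followed by a PC-relative displacement:

const std = @import("std");

/// Sketch: encode one jump-table entry as `jmp rel32` (opcode 0xE9 plus a
/// 32-bit PC-relative displacement, 5 bytes total) on x86_64.
fn writeJumpTableEntry(buf: *[5]u8, entry_addr: u64, target_addr: u64) void {
    buf[0] = 0xe9; // jmp rel32
    // The displacement is relative to the end of this 5-byte instruction.
    const rel: i32 = @intCast(@as(i64, @intCast(target_addr)) -
        (@as(i64, @intCast(entry_addr)) + 5));
    std.mem.writeInt(i32, buf[1..5], rel, .little);
}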

Compared to storing pointers, if we have to relocate the table because it outgrew its capacity, we have to recalculate the jump targets since the jump sequence is PC-relative. However, this reduces to applying a fixed offset to every entry when relocating.
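
A minimal sketch of that fixup, assuming the 5-byte x86_64 entries from above (hypothetical helper, not the PR's actual code): since rel32 = target - (entry_addr + 5) and only entry_addr changes when the table moves, each displacement simply shrinks by the move delta.

const std = @import("std");

/// Sketch: after moving the whole table by `delta` bytes, patch every entry's
/// rel32 in place; no per-symbol lookup is needed.
fn rebaseJumpTable(table: []u8, delta: i64) void {
    var off: usize = 0;
    while (off + 5 <= table.len) : (off += 5) {
        const imm = table[off + 1 ..][0..4]; // skip the 0xE9 opcode byte
        const old = std.mem.readInt(i32, imm, .little);
        std.mem.writeInt(i32, imm, @intCast(@as(i64, old) - delta), .little);
    }
}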

From the perspective of the codegen, when emitting a call with a relocation, the codegen no longer needs to care about the presence or absence of the jump table/offset table: it simply emits call rel32 with an R_X86_64_PLT32 relocation whose target is symbol_a. Then, when resolving R_X86_64_PLT32, the linker checks whether the jump table has been created and the symbol can be indirected via said table, and if so rewrites the target address to the jump table entry. Again, this is all transparent to the codegen. As an added bonus, codegen now generates identical code in build-exe and build-obj modes.
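
A sketch of the linker-side half of that contract (not the actual linker code; parameters are flattened to plain addresses and all names are made up): the usual S + A - P computation, with S swapped for the jump-table entry when one exists.

const std = @import("std");

/// Sketch: patch the rel32 immediate of a call/jmp carrying R_X86_64_PLT32.
/// If the callee is indirected through the jump table, the entry address is
/// used instead of the symbol's own address, so the callee can later move
/// without this call site being patched again.
fn resolvePlt32(
    code: []u8, // section contents being relocated
    r_offset: usize, // where the 4-byte immediate lives within `code`
    addend: i64, // typically -4 for call/jmp rel32
    p: u64, // runtime address of the relocation site
    symbol_addr: u64, // resolved address of the callee
    jump_table_entry: ?u64, // address of the callee's jump-table entry, if any
) void {
    const s = jump_table_entry orelse symbol_addr;
    const value: i32 = @intCast(@as(i64, @intCast(s)) + addend - @as(i64, @intCast(p)));
    std.mem.writeInt(i32, code[r_offset..][0..4], value, .little);
}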

One caveat of the new approach is that we only indirect function calls - if you request a function pointer, currently you will receive exactly that, with no indirection.

I am looking forward to feedback on whether we should proceed or not!

TODO

  • [x] x86_64 codegen
  • [x] x86_64-elf trampolines
  • [x] riscv codegen
    - [ ] riscv-elf trampolines (deferred until we have a working incremental linker)

kubkon added 25 commits on August 13, 2024
@kubkon requested review from andrewrk, mlugg and jacobly0 on August 13, 2024
@andrewrk (Member)

One caveat of the new approach is that we only indirect function calls - if you request a function pointer, currently you will receive exactly that, with no indirection.

Couldn't the function pointer point to the address in the jump table? Then the pointer would also be correct after that jump table entry has been updated.

@andrewrk (Member) commented Aug 15, 2024

I still think a jump table is better:

  • Avoid garbage in CPU cache that would otherwise occur when a function is moved leaving behind its trampoline followed by unused space
  • Avoid fragmentation in virtual memory space and the object file that would otherwise occur when a function is moved leaving behind its trampoline interrupting unused space
  • There is no function pointer alignment problem. Everything is fine.
  • Easier bookkeeping; simpler data structure for tracking the trampolines.
  • Simpler hot code swapping, simpler incremental linking

@kubkon (Member, Author) commented Aug 15, 2024

I still think a jump table is better:

* Avoid garbage in CPU cache that would otherwise occur when a function is moved leaving behind its trampoline followed by unused space

That one I have no experience or intuition with, so I will hand it over to @jacobly0 instead.

* Avoid fragmentation in virtual memory space and the object file that would otherwise occur when a function is moved leaving behind its trampoline interrupting unused space

We do increase fragmentation, that is true, but we still leave <evacuated_symbol>.size + padding - <trampoline>.size of unused space that can be reclaimed by a new atom or by the atom directly succeeding the trampoline. That is at least how I envisioned it.

* There is no function pointer alignment problem. Everything is fine.

How is the function pointer alignment problem solved with a jump table?

* Easier bookkeeping; simpler data structure for tracking the trampolines.

Actually, bookkeeping is largely unchanged whether the jump table is one big block or distributed. If a function has a trampoline, it gets an extra slot with an index to the symbol that acts as its trampoline. In the case of a jump table, it gets an index into the jump table.

* Simpler hot code swapping, simpler incremental linking

Incremental linking does not become more complex because of trampolines, since we simply create and add a new atom+symbol in place of the old one, with a reduced size of X, where X is the trampoline size. It all happens (will happen) using the same algorithm for allocating and freeing atoms that we already use, so I don't see any added complexity beyond actually creating a new atom/symbol for said trampoline. If I am not seeing something obvious though, please do let me know. Hot code swapping will indeed become more complicated because of trampolines, I believe, for the reasons mentioned by @mlugg.

tl;dr I would like us to reach a consensus on how to proceed so that I don't have to revert the changes immediately after committing them. Also, if you feel we should go back to an offset table, that is fine too, FWIW. I just think that ease of maintaining and developing codegen backends is of higher priority than linker complexity, IMHO; or to put it another way, I want to make codegen backends completely separated from the concept of incremental linking.

@andrewrk (Member)

Function pointers are always pointer-size aligned when using a jump table.

@kubkon (Member, Author) commented Aug 15, 2024

Function pointers are always pointer-size aligned when using a jump table.

I guess I should have asked this first: do we indirect pointers to functions via a jump table too? If so, this will not currently succeed:

test "align(N) on functions" {
    try expect((@intFromPtr(&overaligned_fn) & (0x1000 - 1)) == 0);
}

fn overaligned_fn() align(0x1000) i32 {
    return 42;
}

@kubkon (Member, Author) commented Aug 15, 2024

Also please note that every entry in a jump table is not pointer-size aligned, but instruction aligned. Perhaps you are referring to an offset table?

@andrewrk (Member)

garbage in CPU cache

This is easy to understand:

Here is a cache line

[ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] 

Here is a cache line full of jump table data:

[F] [F] [F] [F] [F] [F] [F] [F] 

All those F's are valid pointers to functions that might be used.

Here is a cache line after a function has been relocated with the other strategy:

[T] [T] [J] [T] [T] [T] [T] [T] 

J - jump instruction to the real function
T - trash. 100% guaranteed waste of space in the CPU cache

@kubkon (Member, Author) commented Aug 15, 2024

garbage in CPU cache [...]

OK, so it seems you mean that an offset table is a better solution than a jump table, be it one big block or distributed.

@andrewrk (Member)

Actually, bookkeeping is largely unchanged whether the jump table is one big block or distributed. If a function has a trampoline, it gets an extra slot with an index to the symbol that acts as its trampoline. In the case of a jump table, it gets an index into the jump table.

This means that with the function header trampoline strategy, you have to keep an old symbol data entry around for the now-deleted function, whereas with the jump table strategy, you don't.

@andrewrk (Member)

OK, so it seems you mean that an offset table is a better solution than a jump table, be it one big block or distributed.

No, my mistake for saying "function pointer" when I should have said "jump instruction", but otherwise the same point stands.

It looks like on x86_64 a jump with a 32-bit target (jmp rel32) is 5 bytes, which is annoying, but it's fine. Function pointers will have an alignment of 1 on that platform then.

This test case will regress:

test "align(N) on functions" {
    try expect((@intFromPtr(&overaligned_fn) & (0x1000 - 1)) == 0);
}

fn overaligned_fn() align(0x1000) i32 {
    return 42;
}

The language will be modified to say that when you take the address of a function, it does not necessarily gain the machine code alignment specified with the align keyword, because there may be a stub being used, or something to this effect.

Even if the resolution is to go with the function prologue strategy, I will still make this language change, because it is already evident that this flexibility is useful for compilers.

@kubkon (Member, Author) commented Aug 15, 2024

Actually, bookkeeping is largely unchanged whether the jump table is one big block or distributed. If a function has a trampoline, it gets an extra slot with an index to the symbol that acts as its trampoline. In the case of a jump table, it gets an index into the jump table.

This means that with the function header trampoline strategy, you have to keep an old symbol data entry around for the now-deleted function, whereas with the jump table strategy, you don't.

I am getting confused; let's agree on what is what, if that's OK. We have three options under consideration:

  1. Offset table: read-write section, each entry pointer-aligned and pointer-sized.

<0>: pointer to A
<8>: pointer to B
...

  2. Jump table: read-exec section (.text.zig), each entry instruction-aligned (1 on x86_64); entry size is arch-dependent, code-model-dependent, etc.

<0>: jmp rel32 // 5 bytes, PC-relative near jump
<5>: ...

  3. Distributed jump table: read-exec section (.text.zig), each entry at least instruction-aligned, or overaligned if the evacuated function was moved (see the sketch below).

<0>: code of A
...
<N>: jmp rel32 // jump to now-moved B
<N+5>: free space
...
<M>: code of C, etc.
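
To make option 3 concrete, here is a minimal sketch (hypothetical helper, using the same jmp rel32 encoding as in option 2) of what happens when a function outgrows its slot: a trampoline is written at the old address, so existing call rel32 sites that still target the old address keep working, and the bytes behind the trampoline can be handed back to the atom allocator.

const std = @import("std");

/// Sketch (x86_64): the function that lived at `old_addr` (occupying `old_size`
/// bytes, assumed >= 5) has been moved to `new_addr`. `old_bytes` is the slice
/// of the output section at `old_addr`. Returns the reclaimable byte count.
fn leaveTrampoline(old_bytes: []u8, old_addr: u64, old_size: u64, new_addr: u64) u64 {
    old_bytes[0] = 0xe9; // jmp rel32, relative to the end of the trampoline
    const rel: i32 = @intCast(@as(i64, @intCast(new_addr)) -
        (@as(i64, @intCast(old_addr)) + 5));
    std.mem.writeInt(i32, old_bytes[1..5], rel, .little);
    return old_size - 5; // free space following the trampoline
}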

FWIW I think there is a way to keep option 1, or at least reconsider it, while still keeping codegen backends largely agnostic of incremental linking, by utilising the idea of lazy symbol binding: we emit a section with immutable jump entries that always point at the same slot in the offset table. The offset table remains mutable, so we are free to rewrite pointers there, but since the machine code now points at an immutable jump table, the codegen stays largely unaware of incremental linking. It would look something like this:

// .text.zig
<0>: A
...
<N>: call B -> .plt.zig, .stub.zig, or whatever
...

// .plt.zig
<0>: mov r11, [.got.zig entry at #0]
<N>: call [r11]
...

// .got.zig
<0>: pointer to B
<8>: pointer to A
...

@kubkon (Member, Author) commented Aug 15, 2024

FWIW the obvious con of option 4* is having to (re-)introduce two more sections, .got.zig and .plt.zig, which is costly on MachO where we are severely limited in section count.

@andrewrk (Member) commented Aug 15, 2024

I think we should toss out option 1 purely on a performance basis:

  • Offset table: every function call is indirect (20-50 cpu cycles)
  • Jump table (distributed or not): every function call is direct (15-30 cpu cycles)

source
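
For reference, the two call sequences under discussion look roughly like this on x86_64 (a sketch in the style of the layouts above, not generated output); the offset-table form pays for an extra data load before the branch:

// jump table (or direct call): one direct near call
call rel32                          // E8 + rel32, 5 bytes
// offset table (.got.zig-style): load the pointer, then branch through memory
call qword ptr [rip + got_offset]   // FF 15 + disp32, 6 bytes, indirect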

The discussion right now is jump table vs distributed jump table. I don't think anyone is advocating for offset table. I'm not sure why you are bringing up Option 4?

In my earlier comments I used these terms:

  • "function prologue strategy", meaning "distributed jump table"
  • "jump table", meaning "jump table"

@kubkon (Member, Author) commented Aug 15, 2024

Oh I brought up option 4 to point out yet another mechanism at our disposal in case it wasn't clear how an offset table can be utilised while making codegen agnostic to the concept of indirection.

@jacobly0 (Member) commented Aug 15, 2024

If you pack all of the jumps together, you enforce that every function call requires a minimum of 2 cache references, or 3 if the jump instruction itself crosses a cache line boundary (since 5 is not a power of 2, 4 of the 64 possible start offsets in a 64-byte line straddle the boundary, which is pretty terrible since it would happen on a whopping 6% of functions). In order to fully utilize the first cache reference, you have to get lucky with all the adjacent jump entries being used at around the same time, which is not even slightly reasonable to expect.

If you prefix the majority of non-changing functions with their own jump, then you allow them to require only a single cache reference, one that is fully utilized for all but the most trivial of functions. Saying that this space goes to waste after moving the function is misleading, because it could be filled with any other function that fits in that space, in the same way that unused jump entries can be filled with any other jump. Even in the split case, you only have to get lucky with two functions being in use at the same time to fully utilize any given cache line.

Increasing the minimum cost of all functions is clearly inferior to only increasing the cost of some changing functions (those rapidly increasing in size). This is even more true when you realize that once a split function stops changing over a period of time, it can be reunited in a "garbage collection"-like manner, restoring the original performance after you are happy with the new function implementation and stop editing it (in a way that vastly increases its size). This trades the cost of updating all references every time a function increases in size for only doing it once the function has stabilized and stops increasing in size (and it can be deferred and batched over many functions when there is time to waste). This is never possible with a jump table because those jumps can never be adjacent to their implementation.

The bookkeeping seems completely equivalent: for each symbol you either track where the jump table entry is, or which (possibly not yet named) symbol contains the actual implementation (something which can also be trivially recovered by just reading the jump instruction).

I'm also not sure why we are intent on removing function pointer alignment from the language, since once a function is aligned to a cache line, there is little benefit to aligning it any more other than gaining bits in the function pointer to be used for other things. It's the alignment of the first jump that matters in the common non-split jump prologue case, aligning the "beginning" of the function would just move it to a different cache line negating any benefits.

@andrewrk (Member)

Thanks for chiming in @jacobly0. I'm convinced by your performance-related arguments.

I'm also not sure why we are intent on removing function pointer alignment from the language, since once a function is aligned to a cache line, there is little benefit to aligning it any more other than gaining bits in the function pointer to be used for other things. It's the alignment of the first jump that matters in the common non-split jump prologue case, aligning the "beginning" of the function would just move it to a different cache line negating any benefits.

I don't understand what you're saying here. It seems like you're making an argument against function pointer alignment being useful, which seems to comport with removing the alignment guarantees of function pointers. But your conclusion is that we should keep function pointer alignment guarantees?

@jacobly0 (Member)

I'm saying that function alignment doesn't have much use at all without function pointer alignment, so I don't understand the stance that only one should be removed. I'd also argue that its usage is niche enough that theorizing about stubs is not very relevant, given that stubs can also be aligned and almost no functions have an explicit alignment. I think Zig already made the correct choice by making function pointers default to align(1) instead of the target-specific default alignment that is actually used on functions, allowing this flexibility in implementation by default.

@andrewrk (Member)

I see, so, would you be for or against completely removing align(N) syntax from function declarations then?

@jacobly0 (Member)

I don't have a strong opinion on the outcome, I just have a strong opinion against using the discussion in this thread to justify removal. If other arguments are made for its removal, I could probably be easily convinced, but as of today, it still seems like an experiment worth keeping.

@andrewrk (Member)

Understood, thank you for the clarifications. I'm satisfied with the distributed jump table solution then.

src/arch/x86_64/CodeGen.zig (outdated review thread, resolved)

@@ -2230,7 +2171,7 @@ const riscv = struct {
     const riscv_util = @import("../riscv.zig");
 };
 
-const ResolveArgs = struct { i64, i64, i64, i64, i64, i64, i64, i64 };
+const ResolveArgs = struct { i64, i64, i64, i64, i64, i64, i64 };

Member:

why is this a tuple instead of a struct?

Member (Author):

It was like this before, but I ought to make it a struct instead of a tuple. Will do that in a follow-up.
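
For illustration only, a named-field version could look something like the following; the field names here are hypothetical (they mirror the usual ELF relocation inputs) and the actual follow-up may choose different ones:

const ResolveArgs = struct {
    p: i64, // address of the relocation site
    a: i64, // addend
    s: i64, // symbol address
    got: i64, // address of the GOT
    g: i64, // offset of the symbol's GOT entry
    tp: i64, // thread pointer value
    dtp: i64, // dynamic thread pointer value
};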

@kubkon merged commit 90989be into master on Aug 16, 2024
10 checks passed
@kubkon deleted the elf-zig-got branch on August 16, 2024