Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AVX-512] Moving from a GPR to a k-register just to spill to memory should just spill via mov, or just stay in a k-register #110626

Open
Validark opened this issue Oct 1, 2024 · 1 comment

Comments

@Validark
Copy link

Validark commented Oct 1, 2024

My real code compiles like so on Zen 4 and Zen 5: (Godbolt link, line 3158 in the source code, line 5165 in the assembly)

        kmovq   k1, rcx
        mov     rcx, rdi
        and     rcx, qword ptr [rsp + 56]
        kmovq   qword ptr [rsp + 16], k1
        kmovq   k2, qword ptr [rsp + 16]
        kmovq   k1, rcx
        mov     rcx, rdi
        and     rcx, qword ptr [rsp + 72]
        kmovq   qword ptr [rsp + 96], k1
        vmovdqu8        zmm19 {k2} {z}, zmmword ptr [rip + .LCPI5_37]
        kmovq   k2, qword ptr [rsp + 96]
        kmovq   k3, rcx

As you can see, we move rcx to k1, and then spill that to qword ptr [rsp + 16], which we then immediately read back into k2. Obviously kmovq k1, rcx + kmovq qword ptr [rsp + 16], k1 => mov qword ptr [rsp + 16], rcx, or, even better, kmovq k1, rcx + kmovq qword ptr [rsp + 16], k1 + kmovq k2, qword ptr [rsp + 16] => kmovq k2, rcx

Then we do it again with a newer version of rcx that we did an and with. 🤦

It should just be:

        kmovq   k2, rcx
        mov     rcx, rdi
        and     rcx, qword ptr [rsp + 56]
        vmovdqu8        zmm19 {k2} {z}, zmmword ptr [rip + .LCPI5_37]
        kmovq   k2, rcx
        mov     rcx, rdi
        and     rcx, qword ptr [rsp + 72]
        kmovq   k3, rcx

I also think the GPR's that spilled to qword ptr [rsp + 56] and qword ptr [rsp + 72] could probably have been spilled to a k-register instead.

        mov qword ptr [rsp + 72], r11
        ; ...
        mov qword ptr [rsp + 56], rax
        ; ...

Could be:

        kmovq k4, r11 ; formerly [rsp + 72]
        ; ...
        kmovq k3, rax ; formerly [rsp + 56]
        ; ...

Then we do:

        ; Then we could move `rdi` to a k-register, since we use it so much.
        kmovq k7, rdi

        ; Now the above code, transformed
        kmovq   k2, rcx
        vmovdqu8        zmm19 {k2} {z}, zmmword ptr [rip + .LCPI5_37]
        kandq  k2, k3, k7 ; obviously now we could do a different register allocation than what we had before
        kandq  k3, k4, k7

Unfortunately I don't think I can make a small reproduction, because register spilling does not happen in trivial code.

Here is the unoptimized LLVM IR dump: https://gist.github.com/Validark/a19d2babb7955a54a456d0683e95f7d4
Here is the optimized LLVM IR dump: https://gist.github.com/Validark/fd231af0b28cf1bea193d07a18b6d52c

Thank you to whoever helps fix the register allocator!

‒ Validark

@llvmbot
Copy link

llvmbot commented Oct 1, 2024

@llvm/issue-subscribers-backend-x86

Author: Niles Salter (Validark)

My real code compiles like so on Zen 4 and Zen 5: ([Godbolt link](https://zig.godbolt.org/z/fEzPnWjcE), line 3158 in the source code, line 5165 in the assembly)
        kmovq   k1, rcx
        mov     rcx, rdi
        and     rcx, qword ptr [rsp + 56]
        kmovq   qword ptr [rsp + 16], k1
        kmovq   k2, qword ptr [rsp + 16]
        kmovq   k1, rcx
        mov     rcx, rdi
        and     rcx, qword ptr [rsp + 72]
        kmovq   qword ptr [rsp + 96], k1
        vmovdqu8        zmm19 {k2} {z}, zmmword ptr [rip + .LCPI5_37]
        kmovq   k2, qword ptr [rsp + 96]
        kmovq   k3, rcx

As you can see, we move rcx to k1, and then spill that to qword ptr [rsp + 16], which we then immediately read back into k2. Obviously kmovq k1, rcx + kmovq qword ptr [rsp + 16], k1 => mov qword ptr [rsp + 16], rcx, or, even better, kmovq k1, rcx + kmovq qword ptr [rsp + 16], k1 + kmovq k2, qword ptr [rsp + 16] => kmovq k2, rcx

Then we do it again with a newer version of rcx that we did an and with. 🤦

It should just be:

        kmovq   k2, rcx
        mov     rcx, rdi
        and     rcx, qword ptr [rsp + 56]
        vmovdqu8        zmm19 {k2} {z}, zmmword ptr [rip + .LCPI5_37]
        kmovq   k2, rcx
        mov     rcx, rdi
        and     rcx, qword ptr [rsp + 72]
        kmovq   k3, rcx

I also think the GPR's that spilled to qword ptr [rsp + 56] and qword ptr [rsp + 72] could probably have been spilled to a k-register instead.

        mov qword ptr [rsp + 72], r11
        ; ...
        mov qword ptr [rsp + 56], rax
        ; ...

Could be:

        kmovq k4, r11 ; formerly [rsp + 72]
        ; ...
        kmovq k3, rax ; formerly [rsp + 56]
        ; ...

Then we do:

        ; Then we could move `rdi` to a k-register, since we use it so much.
        kmovq k7, rdi

        ; Now the above code, transformed
        kmovq   k2, rcx
        vmovdqu8        zmm19 {k2} {z}, zmmword ptr [rip + .LCPI5_37]
        kandq  k2, k3, k7 ; obviously now we could do a different register allocation than what we had before
        kandq  k3, k4, k7

Unfortunately I don't think I can make a small reproduction, because register spilling does not happen in trivial code.

Here is the unoptimized LLVM IR dump: https://gist.github.com/Validark/a19d2babb7955a54a456d0683e95f7d4
Here is the optimized LLVM IR dump: https://gist.github.com/Validark/fd231af0b28cf1bea193d07a18b6d52c

Thank you to whoever helps fix the register allocator!

‒ Validark

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants