[AVX-512] Moving from a GPR to a k-register just to spill to memory should just spill via `mov`, or just stay in a k-register #110626

Validark · 2024-10-01T03:28:43Z

My real code compiles like so on Zen 4 and Zen 5: ([Godbolt link](https://zig.godbolt.org/z/fEzPnWjcE), line 3158 in the source code, line 5165 in the assembly)

        kmovq   k1, rcx
        mov     rcx, rdi
        and     rcx, qword ptr [rsp + 56]
        kmovq   qword ptr [rsp + 16], k1
        kmovq   k2, qword ptr [rsp + 16]
        kmovq   k1, rcx
        mov     rcx, rdi
        and     rcx, qword ptr [rsp + 72]
        kmovq   qword ptr [rsp + 96], k1
        vmovdqu8        zmm19 {k2} {z}, zmmword ptr [rip + .LCPI5_37]
        kmovq   k2, qword ptr [rsp + 96]
        kmovq   k3, rcx

As you can see, we move `rcx` to `k1`, spill `k1` to `qword ptr [rsp + 16]`, and then immediately reload that slot into `k2`. If the spill were really needed, `kmovq k1, rcx` + `kmovq qword ptr [rsp + 16], k1` could simply be `mov qword ptr [rsp + 16], rcx`. Even better, the whole sequence `kmovq k1, rcx` + `kmovq qword ptr [rsp + 16], k1` + `kmovq k2, qword ptr [rsp + 16]` should collapse to `kmovq k2, rcx`.

Then we do the same thing again with the new value of `rcx` produced by the `and`, this time round-tripping through `qword ptr [rsp + 96]`. 🤦

It should just be:

        kmovq   k2, rcx
        mov     rcx, rdi
        and     rcx, qword ptr [rsp + 56]
        vmovdqu8        zmm19 {k2} {z}, zmmword ptr [rip + .LCPI5_37]
        kmovq   k2, rcx
        mov     rcx, rdi
        and     rcx, qword ptr [rsp + 72]
        kmovq   k3, rcx
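
To sanity-check that the shorter sequence is a drop-in replacement, here is a minimal Python sketch that models just the handful of instructions involved and compares the live-out state of both versions. The interpreter, the tuple encoding, and the names `run`, `live_out`, `ORIGINAL`, and `PROPOSED` are all made up for illustration. It also assumes that `k1` and the stack slots `[rsp + 16]` and `[rsp + 96]` are dead after this snippet, which I have not verified against the full function.

```python
MASK64 = (1 << 64) - 1

def run(program, regs, stack):
    """Execute a toy subset of x86 (kmovq, mov, and) on copies of the state.

    Operands are register names (str) or [rsp + N] stack offsets (int).
    Returns the final registers and the mask seen by each masked load.
    """
    regs, stack = dict(regs), dict(stack)
    masks_used = []
    for op, dst, src in program:
        val = stack[src] if isinstance(src, int) else regs[src]
        if op == "vmovdqu8":          # only the write-mask matters for this check
            masks_used.append(val)
            continue
        if op in ("kmovq", "mov"):
            new = val
        elif op == "and":
            new = regs[dst] & val
        else:
            raise ValueError(f"unsupported instruction: {op}")
        if isinstance(dst, int):
            stack[dst] = new & MASK64
        else:
            regs[dst] = new & MASK64
    return regs, masks_used

def live_out(program, regs, stack):
    """State assumed to be observable afterwards: k2, k3, rcx, and the load masks."""
    final, masks = run(program, regs, stack)
    return final["k2"], final["k3"], final["rcx"], masks

# What the compiler currently emits.
ORIGINAL = [
    ("kmovq", "k1", "rcx"), ("mov", "rcx", "rdi"), ("and", "rcx", 56),
    ("kmovq", 16, "k1"), ("kmovq", "k2", 16),
    ("kmovq", "k1", "rcx"), ("mov", "rcx", "rdi"), ("and", "rcx", 72),
    ("kmovq", 96, "k1"), ("vmovdqu8", "zmm19", "k2"),
    ("kmovq", "k2", 96), ("kmovq", "k3", "rcx"),
]

# The sequence proposed above, without the spill/reload round trips.
PROPOSED = [
    ("kmovq", "k2", "rcx"), ("mov", "rcx", "rdi"), ("and", "rcx", 56),
    ("vmovdqu8", "zmm19", "k2"),
    ("kmovq", "k2", "rcx"), ("mov", "rcx", "rdi"), ("and", "rcx", 72),
    ("kmovq", "k3", "rcx"),
]

if __name__ == "__main__":
    regs = {"rcx": 0x0F0F, "rdi": 0xFF00, "k1": 0, "k2": 0, "k3": 0}
    stack = {56: 0x3C3C, 72: 0xF00F}
    print(live_out(ORIGINAL, regs, stack) == live_out(PROPOSED, regs, stack))
```

The two versions differ only in `k1` and the two scratch stack slots, which is exactly the state the proposed sequence stops touching.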

I also think the GPRs that were spilled to `qword ptr [rsp + 56]` and `qword ptr [rsp + 72]` could probably have been kept in k-registers instead of going to the stack.

        mov qword ptr [rsp + 72], r11
        ; ...
        mov qword ptr [rsp + 56], rax
        ; ...

Could be:

        kmovq k4, r11 ; formerly [rsp + 72]
        ; ...
        kmovq k3, rax ; formerly [rsp + 56]
        ; ...

Then we do:

        ; Then we could move `rdi` to a k-register, since we use it so much.
        kmovq k7, rdi

        ; Now the above code, transformed
        kmovq   k2, rcx
        vmovdqu8        zmm19 {k2} {z}, zmmword ptr [rip + .LCPI5_37]
        kandq  k2, k3, k7 ; obviously now we could do a different register allocation than what we had before
        kandq  k3, k4, k7
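
As a rough check of that last transformation, the following Python sketch compares the mask values produced by the stack-based `mov`/`and`/`kmovq` path with the ones produced by `kandq` on k-registers. The function names are hypothetical, and the sketch only models the resulting `k2` and `k3`. It assumes `rcx` is not needed afterwards (the `kandq` version never writes it) and says nothing about whether the trade is profitable, which depends on k-register pressure and instruction costs.

```python
MASK64 = (1 << 64) - 1

def masks_via_stack(rdi, rax, r11):
    """Current shape: operands live on the stack, the AND happens in a GPR."""
    stack = {56: rax & MASK64, 72: r11 & MASK64}  # mov qword ptr [rsp + 56], rax / [rsp + 72], r11
    rcx = rdi & stack[56]   # mov rcx, rdi ; and rcx, qword ptr [rsp + 56]
    k2 = rcx                # kmovq k2, rcx
    rcx = rdi & stack[72]   # mov rcx, rdi ; and rcx, qword ptr [rsp + 72]
    k3 = rcx                # kmovq k3, rcx
    return k2, k3

def masks_via_kregs(rdi, rax, r11):
    """Suggested shape: operands parked in k-registers, the AND done with kandq."""
    k4 = r11 & MASK64       # kmovq k4, r11
    k3 = rax & MASK64       # kmovq k3, rax
    k7 = rdi & MASK64       # kmovq k7, rdi
    k2 = k3 & k7            # kandq k2, k3, k7
    k3 = k4 & k7            # kandq k3, k4, k7 (overwrites the old k3 only after it was read)
    return k2, k3

if __name__ == "__main__":
    args = (0xFF00, 0x3C3C, 0xF00F)
    print(masks_via_stack(*args) == masks_via_kregs(*args))
```

Note that the order of the two `kandq` instructions matters with this particular register assignment, since the second one clobbers `k3`.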

Unfortunately I don't think I can make a small reproduction, because register spilling does not happen in trivial code.

Here is the unoptimized LLVM IR dump: https://gist.github.com/Validark/a19d2babb7955a54a456d0683e95f7d4
Here is the optimized LLVM IR dump: https://gist.github.com/Validark/fd231af0b28cf1bea193d07a18b6d52c

Thank you to whoever helps fix the register allocator!

‒ Validark


llvmbot · 2024-10-01T04:18:39Z

@llvm/issue-subscribers-backend-x86

Author: Niles Salter (Validark)

[Godbolt link](https://zig.godbolt.org/z/fEzPnWjcE)

github-actions bot added the new issue label Oct 1, 2024

EugeneZelenko added backend:X86 and removed new issue labels Oct 1, 2024
