AMDGCN inefficient long add with constant #237

preda · 2024-12-20T21:21:04Z

Consider this OpenCL kernel:

kernel void testAdd(global long* io) {
  // C = 2^50, so its low 32 bits are all zero.
  long C = ((long) 1) << 50;
  io[get_global_id(0)] = C + io[get_global_id(0)];
}

This ISA is generated for the long add:

	v_mov_b32_e32 v4, 0x40000
	v_add_co_u32_e32 v2, vcc, 0, v2
	v_addc_co_u32_e32 v3, vcc, v3, v4, vcc

As you can see, part of the above code is unnecessary. In particular, v_add_co_u32_e32 v2, vcc, 0, v2 does not change the value of v2 and cannot produce a carry-out.
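To make the claim concrete, here is a small C model of the two sequences. It is only a sketch: the helper names (add_co_u32, addc_u32, add_generated, add_high_only, check_equivalence) are made up for illustration and are not compiler or ISA APIs, and the carry-out of the second instruction is ignored because nothing consumes it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of v_add_co_u32: 32-bit add that reports the carry-out. */
static uint32_t add_co_u32(uint32_t a, uint32_t b, uint32_t *carry_out) {
    uint32_t sum = a + b;
    *carry_out = sum < a;   /* unsigned wrap-around means a carry occurred */
    return sum;
}

/* Hypothetical model of v_addc_co_u32: 32-bit add with carry-in. */
static uint32_t addc_u32(uint32_t a, uint32_t b, uint32_t carry_in) {
    return a + b + carry_in;
}

/* The three-instruction sequence from the generated ISA, for C = 1 << 50. */
static uint64_t add_generated(uint64_t x) {
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t k = 0x40000;                /* v_mov_b32 v4, 0x40000 */
    uint32_t carry;
    lo = add_co_u32(0, lo, &carry);      /* v_add_co_u32 v2, vcc, 0, v2 */
    hi = addc_u32(hi, k, carry);         /* v_addc_co_u32 v3, vcc, v3, v4, vcc */
    return ((uint64_t)hi << 32) | lo;
}

/* The shorter alternative: only the high half is modified. */
static uint64_t add_high_only(uint64_t x) {
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    hi += 0x40000;
    return ((uint64_t)hi << 32) | lo;
}

/* Both sequences must agree with a plain 64-bit add of 2^50. */
static void check_equivalence(uint64_t x) {
    uint64_t expected = x + ((uint64_t)1 << 50);
    assert(add_generated(x) == expected);
    assert(add_high_only(x) == expected);
}
```

Calling check_equivalence with values such as 0, 0xFFFFFFFF, and UINT64_MAX exercises the cases where the low half is saturated. Adding 0 to the low half never sets the carry, so the high-half add never sees a carry-in.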


preda · 2024-12-20T21:24:32Z

I would expect something like:

v_mov_b32_e32 v4, 0x40000
v_add_co_u32_e32 v3, vcc, v3, v4

preda · 2024-12-20T21:33:36Z

And a small observation: the same code is generated and the problem is easier to see if "long" is replaced with "unsigned long" in the sample kernel.

ppanchad-amd · 2024-12-23T16:34:55Z

Hi @preda. Internal ticket has been created to investigate your issue. Thanks!

sohaibnd · 2025-01-03T21:12:59Z

Hi @preda, what options did you use to compile that code?

preda · 2025-01-04T06:35:17Z

@sohaibnd

-cl-finite-math-only -cl-std=CL2.0

sohaibnd · 2025-01-08T15:05:23Z

@preda This is an optimization bug, thanks for pointing it out. See llvm#122049 for the fix being put in.

If one of the inputs has all 0 bits in its low half, the low part cannot carry and we can just pass through the original value.

Add case: https://alive2.llvm.org/ce/z/TNc7hf
Sub case: https://alive2.llvm.org/ce/z/AjH2-J

We could do this in the general case with computeKnownBits, but add is so common that this could be expensive for something which will fire infrequently.

One potential concern is that this could break the 64-bit add we expect to see for addressing mode matching, but these constants shouldn't appear often in addressing expressions. One test for large offset expressions changes but isn't worse.

Fixes ROCm#237
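The narrowing described above can be modeled in plain C for both the add and the sub case. This is an illustrative sketch, not the code from the LLVM patch: the function names (low_half_is_zero, add64_narrow, sub64_narrow) are hypothetical, and the check is done on a runtime constant rather than inside a compiler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when the low 32 bits of the constant are all zero. */
static bool low_half_is_zero(uint64_t c) {
    return (uint32_t)c == 0;
}

/* 64-bit add that touches only the high half when the constant allows it. */
static uint64_t add64_narrow(uint64_t x, uint64_t c) {
    if (!low_half_is_zero(c))
        return x + c;                    /* fall back to the full-width add */
    uint32_t hi = (uint32_t)(x >> 32) + (uint32_t)(c >> 32);
    return ((uint64_t)hi << 32) | (uint32_t)x;   /* low half passes through */
}

/* Same idea for subtraction: a zero low half can never cause a borrow. */
static uint64_t sub64_narrow(uint64_t x, uint64_t c) {
    if (!low_half_is_zero(c))
        return x - c;                    /* fall back to the full-width sub */
    uint32_t hi = (uint32_t)(x >> 32) - (uint32_t)(c >> 32);
    return ((uint64_t)hi << 32) | (uint32_t)x;   /* low half passes through */
}
```

For any x and c, add64_narrow(x, c) should equal x + c and sub64_narrow(x, c) should equal x - c in wrap-around 64-bit arithmetic. The narrow path applies only when the low half of c is zero, which is the case the Alive2 links cover.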

ppanchad-amd added the "generic" (Build error, or some other issue not caused by an LLVM bug) and "Under Investigation" labels Dec 23, 2024

arsenm mentioned this issue Jan 8, 2025

AMDGPU: Reduce 64-bit add width if low bits are known 0 llvm/llvm-project#122049

Merged

arsenm closed this as completed in llvm/llvm-project#122049 Jan 8, 2025
