AIE2: Small matmul numerical error #342

newling · 2025-02-07T01:32:23Z

I am observing numerical issues in a small matrix multiplication running through the iree-amd-aie compiler (with peano doing lowering from llvm IR). There is only a numerical error when opt is run with optimization level 1: with -O0, -O2, -O3 the matmul output is correct, but at -O1 it is incorrect.

The numerical error seems to be triggered when I increase an alignment from 32 to 64, although this might not be directly related. Changing from

@buff_3 = external local_unnamed_addr global [1024 x bfloat], align 64

to

@buff_3 = external local_unnamed_addr global [1024 x bfloat], align 32

makes the numerics correct. But I am very confident that the buffer is aligned in memory to 64 bytes, so I am not sure the alignment is directly related to the numerical error. All of the .ll and .opt.ll files related to this issue that seem useful are in this zip file.

They're quite small (the matmul has operands A: 8x32 and B: 32x32, bf16 -> f32).

| Lines | File | Description |
| --- | --- | --- |
| 154 | `input.ll` | The file that is fed into `opt` |
| 161 | `input.opt.pass.O0.ll` | `opt -O0 input.ll`, passes |
| 386 | `input.opt.fail.O1.ll` | `opt -O1 input.ll`, fails |
| 317 | `input.opt.pass.O2.ll` | `opt -O2 input.ll`, passes |
| 317 | `input.opt.pass.O3.ll` | `opt -O3 input.ll`, passes |
| 154 | `input.modified.ll` | Same as `input.ll`, but align 32 instead of align 64 |
| 386 | `input.modified.opt.pass.O1.ll` | `opt -O1 input.modified.ll`, passes |

I compile with

llvm-aie/bin/opt -vectorize-loops=false -vectorize-slp=false --two-entry-phi-node-folding-threshold=10 -mandatory-inlining-before-opt=false -basic-aa-full-phi-analysis=true -basic-aa-max-lookup-search-depth=10 -O1 --inline-threshold=10 --disable-builtin=memset -S input.ll -o input.opt.ll

And then

llc input.opt.ll -O2 --march=aie2 --function-sections --filetype=obj -o input.o
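To regenerate the per-level outputs listed in the table, the `opt` invocation can be parameterized by optimization level. The following is only a sketch: the flags are copied from the command above, while the helper name `opt_command` and the output naming scheme `input.opt.O<level>.ll` are assumptions made for illustration. It builds the argument list without running anything.

```python
import shlex

# Flags copied from the opt command above, split around the -O<level> flag
# so that the original ordering is preserved.
FLAGS_BEFORE_LEVEL = [
    "-vectorize-loops=false",
    "-vectorize-slp=false",
    "--two-entry-phi-node-folding-threshold=10",
    "-mandatory-inlining-before-opt=false",
    "-basic-aa-full-phi-analysis=true",
    "-basic-aa-max-lookup-search-depth=10",
]
FLAGS_AFTER_LEVEL = ["--inline-threshold=10", "--disable-builtin=memset"]


def opt_command(level, src="input.ll", opt="llvm-aie/bin/opt"):
    """Build the opt argv for one optimization level (hypothetical helper)."""
    # Output name is an assumed convention, not the one used in the zip file.
    out = src.removesuffix(".ll") + f".opt.O{level}.ll"
    return [opt, *FLAGS_BEFORE_LEVEL, f"-O{level}", *FLAGS_AFTER_LEVEL,
            "-S", src, "-o", out]


# Print one command line per level; pass each list to subprocess.run to execute.
for level in range(4):
    print(shlex.join(opt_command(level)))
```

Each list can be handed to `subprocess.run(..., check=True)` once the path to `opt` is adjusted for the local install.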

And then a .elf is generated and we run it with real data on HW. This example isn't important in itself, but there's a related (larger) reproducer that I think has the same problem (and fails at -O2 and -O3). So if I can get to the bottom of this small reproducer, I'll hopefully be closer to solving the larger problem.

So questions:

- The files input.modified.opt.pass.O1.ll and input.opt.fail.O1.ll look like they're the same except for a permutation of lines. Are they actually different, and why did the change in alignment trigger them to be different?
- Any other insights, or suggestions of what to look at?
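One way to test the "same except for a permutation of lines" impression is to compare the two files as multisets of lines. This is a minimal sketch, not part of the original workflow: the function names are hypothetical, and the two short strings stand in for the real `.ll` files. Note that reordering instructions usually renumbers SSA values (`%100`, `%101`, ...), so a raw line comparison can report differences that are only renumbering.

```python
from collections import Counter


def line_multiset(text: str) -> Counter:
    """Count non-empty lines, ignoring leading and trailing whitespace."""
    return Counter(line.strip() for line in text.splitlines() if line.strip())


def differing_lines(a: str, b: str) -> tuple[list[str], list[str]]:
    """Return (lines only in a, lines only in b), keeping multiplicity."""
    ca, cb = line_multiset(a), line_multiset(b)
    return sorted((ca - cb).elements()), sorted((cb - ca).elements())


# Hypothetical two-line snippets standing in for the real .ll files.
passing = "%1 = load i32, ptr @a, align 32\n%2 = load i32, ptr @b, align 32\n"
failing = "%2 = load i32, ptr @b, align 32\n%1 = load i32, ptr @a, align 64\n"

only_pass, only_fail = differing_lines(passing, failing)
print(only_pass)  # lines present only in the passing text
print(only_fail)  # lines present only in the failing text
```

If both returned lists are empty, the files are a pure permutation of each other; otherwise the lists show exactly which lines differ, such as a changed `align` value.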


martien-de-jong · 2025-02-11T09:16:45Z

First observation is that the loads have 64 byte alignment, but they step with 32 bytes offset. That can't be right?

martien-de-jong · 2025-02-11T09:34:15Z

Second observation is that we load 32 512 bit vectors outside of the loop. That's not going to fit in registers, so it's going to spill massively. When the loads are interleaved in the loop body, we will use the same register multiple times.
I could imagine that the stack segment isn't sufficient, corrupting data.
You can check this by looking at the stack pointer adjustment at the start of the function when disassembling, using llvm-objdump -d

newling · 2025-02-12T00:38:15Z

Thanks for taking a look!

> First observation is that the loads have 64 byte alignment, but they step with 32 bytes offset. That can't be right?

I might be missing something but the step shouldn't ever be 32 bytes -- 32 bfloat16s is 64 bytes.

> Second observation is that we load 32 512 bit vectors outside of the loop. That's not going to fit in registers, so it's going to spill massively.

That's what looked most troubling to me. I'll take a look at llvm-objdump -d tomorrow and post the .o files here too.

martien-de-jong · 2025-02-12T08:26:37Z

to wit, these are the first four loads in input.opt.fail.O1.ll:

  %100 = load <32 x bfloat>, ptr @buff_3, align 64
  %101 = load <32 x bfloat>, ptr getelementptr inbounds ([1024 x bfloat], ptr @buff_3, i20 0, i20 32), align 64
  %102 = load <32 x bfloat>, ptr getelementptr inbounds ([1024 x bfloat], ptr @buff_3, i20 0, i20 64), align 64
  %103 = load <32 x bfloat>, ptr getelementptr inbounds ([1024 x bfloat], ptr @buff_3, i20 0, i20 96), align 64

they have offsets 0, 32, 64, 96 from the same aligned buff_3. Having said that, I don't understand why they aren't causing problems in input.modified.opt.pass.O1.ll

newling · 2025-02-12T18:38:06Z

> to wit, these are the first four loads in input.opt.fail.O1.ll:
>
>   %100 = load <32 x bfloat>, ptr @buff_3, align 64
>   %101 = load <32 x bfloat>, ptr getelementptr inbounds ([1024 x bfloat], ptr @buff_3, i20 0, i20 32), align 64
>   %102 = load <32 x bfloat>, ptr getelementptr inbounds ([1024 x bfloat], ptr @buff_3, i20 0, i20 64), align 64
>   %103 = load <32 x bfloat>, ptr getelementptr inbounds ([1024 x bfloat], ptr @buff_3, i20 0, i20 96), align 64
>
> they have offsets 0, 32, 64, 96 from the same aligned buff_3. Having said that, I don't understand why they aren't causing problems in input.modified.opt.pass.O1.ll

Hmm, my understanding from https://llvm.org/docs/GetElementPtr.html#what-is-dereferenced-by-gep is that those offsets are counted in number of elements, not number of bytes.
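The arithmetic behind that reading can be checked quickly. The sketch below assumes that `bfloat` occupies 2 bytes and that `@buff_3` itself starts on a 64-byte boundary, as its declaration states; the helper names are made up for illustration.

```python
BFLOAT_BYTES = 2  # bfloat is a 16-bit type


def gep_byte_offset(element_index: int, element_bytes: int = BFLOAT_BYTES) -> int:
    """Byte offset of element `element_index` within an array of bfloat."""
    return element_index * element_bytes


def is_aligned(byte_offset: int, align: int) -> bool:
    """True if an offset from an `align`-aligned base is itself aligned."""
    return byte_offset % align == 0


# Last GEP index of the first four loads in input.opt.fail.O1.ll.
indices = [0, 32, 64, 96]
offsets = [gep_byte_offset(i) for i in indices]

print(offsets)                                # byte offsets of the four loads
print([is_aligned(o, 64) for o in offsets])   # alignment check against 64 bytes
```

With element-counted indices the loads sit at byte offsets 0, 64, 128 and 192, all multiples of 64, so `align 64` is consistent. Only if the indices were bytes would the second load land at offset 32 and break the alignment claim.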

newling · 2025-02-12T19:58:14Z

> I could imagine that the stack segment isn't sufficient, corrupting data.

You're spot on here, @martien-de-jong. If I double the stack size that we set here to 0x800 (i.e. from 1K to 2K), the numerics are good again (FYI @jtuyls). I'm going to look at and think about whether we can delay setting the stack size until after the .o file is created (any ideas welcome here!)
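A rough bound shows why 1K was too small. This is a back-of-the-envelope sketch under a pessimistic assumption: that all 32 of the 512-bit vectors loaded outside the loop are spilled at the same time. The real frame size depends on register allocation and on other stack use, so the number is an upper bound on vector spill alone, not a measurement.

```python
VECTOR_BITS = 512      # each load is <32 x bfloat> = 512 bits
NUM_VECTORS = 32       # vectors loaded outside the loop
OLD_STACK = 0x400      # 1K, the previous stack size
NEW_STACK = 0x800      # 2K, the doubled stack size

bytes_per_vector = VECTOR_BITS // 8
# Assumption: every vector is spilled simultaneously (worst case).
worst_case_spill = NUM_VECTORS * bytes_per_vector


def vectors_that_fit(stack_bytes: int) -> int:
    """How many 512-bit spill slots fit if the stack held nothing else."""
    return stack_bytes // bytes_per_vector


print(worst_case_spill)             # bytes needed to spill all 32 vectors
print(vectors_that_fit(OLD_STACK))  # slots available in a 1K stack
print(vectors_that_fit(NEW_STACK))  # slots available in a 2K stack
```

Under that assumption, spilling all 32 vectors needs 2048 bytes, while a 1K stack has room for at most 16 such slots, which is consistent with the corruption disappearing once the stack was doubled.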

newling · 2025-02-12T22:42:10Z

@martien-de-jong feel free to close this issue, I've created a follow-up question here: #350

newling mentioned this issue Feb 12, 2025

[UKernel] Add ukernel to be compiled through peano nod-ai/iree-amd-aie#1097

Merged

newling mentioned this issue Feb 12, 2025

Getting upper bound on stack size from .o/.asm : possible? #350

Open
