AIE2: Small matmul numerical error #342

Open
newling opened this issue Feb 7, 2025 · 7 comments

newling commented Feb 7, 2025

I am observing numerical issues in a small matrix multiplication running through the iree-amd-aie compiler (with peano doing the lowering from LLVM IR). The numerical error appears only when opt is run at optimization level 1: with -O0, -O2, and -O3 the matmul output is correct, but at -O1 it is incorrect.

The numerical error seems to be triggered when I increase an alignment from 32 to 64, although this might not be directly related. Changing from

@buff_3 = external local_unnamed_addr global [1024 x bfloat], align 64 

to

@buff_3 = external local_unnamed_addr global [1024 x bfloat], align 32 

makes the numerics correct. But I am very confident that the buffer is aligned in memory to 64 bytes, so I am not sure this is directly related to the numerical error. All of the .ll and .opt.ll files related to this issue that seem useful are in this zip file

They're quite small (the matmul has operands A: 8x32 and B: 32x32, bf16 -> f32).

| Lines | File | Description |
|---|---|---|
| 154 | input.ll | The file that is fed into opt |
| 161 | input.opt.pass.O0.ll | opt -O0 input.ll, passes |
| 386 | input.opt.fail.O1.ll | opt -O1 input.ll, fails |
| 317 | input.opt.pass.O2.ll | opt -O2 input.ll, passes |
| 317 | input.opt.pass.O3.ll | opt -O3 input.ll, passes |
| 154 | input.modified.ll | Same as input.ll, but align 32 instead of align 64 |
| 386 | input.modified.opt.pass.O1.ll | opt -O1 input.modified.ll, passes |

I compile with

llvm-aie/bin/opt -vectorize-loops=false -vectorize-slp=false --two-entry-phi-node-folding-threshold=10 -mandatory-inlining-before-opt=false -basic-aa-full-phi-analysis=true -basic-aa-max-lookup-search-depth=10 -O1 --inline-threshold=10 --disable-builtin=memset -S input.ll -o input.opt.ll

And then

llc input.opt.ll -O2 --march=aie2 --function-sections --filetype=obj -o input.o

Then a .elf is generated and we run it with real data on HW. This example isn't important in itself, but there's a related (larger) reproducer that I think has the same problem (and fails at -O2 and -O3). So if I can get to the bottom of this small reproducer, I'll hopefully be closer to solving the larger problem.

So questions:

  1. the files input.modified.opt.pass.O1.ll and input.opt.fail.O1.ll look like they're the same except for a permutation of lines. Are they actually different, and why did the change in alignment trigger them to be different?

  2. any other insights, or suggestions of what to look at?
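
For question 1, one quick check is to compare the two files as multisets of lines, so that a pure reordering yields no differences. The following is a minimal sketch, not part of the original report: the helper name `diff_ignoring_order` and the two toy IR strings are made up for illustration. If the optimizer also renumbered SSA values, a real reordering would still show up as differences here.

```python
from collections import Counter


def diff_ignoring_order(text_a: str, text_b: str):
    """Return the lines that occur more often in one text than in the other.

    Line order is ignored, so two texts that differ only by a permutation
    of their lines produce two empty lists.
    """
    count_a = Counter(l.strip() for l in text_a.splitlines() if l.strip())
    count_b = Counter(l.strip() for l in text_b.splitlines() if l.strip())
    only_a = sorted((count_a - count_b).elements())
    only_b = sorted((count_b - count_a).elements())
    return only_a, only_b


# Toy inputs (hypothetical, not taken from the attached files): same lines in
# a different order, except for the alignment on one load.
a = "%1 = load i32, ptr @x, align 64\n%2 = load i32, ptr @y, align 4\n"
b = "%2 = load i32, ptr @y, align 4\n%1 = load i32, ptr @x, align 32\n"

only_a, only_b = diff_ignoring_order(a, b)
print(only_a)
print(only_b)
```

To apply it to the real files, read input.opt.fail.O1.ll and input.modified.opt.pass.O1.ll into strings and pass them in; whatever is printed is what actually differs beyond line order.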

Collaborator

martien-de-jong commented Feb 11, 2025

First observation is that the loads have 64 byte alignment, but they step with 32 bytes offset. That can't be right?

Collaborator

martien-de-jong commented Feb 11, 2025

Second observation is that we load 32 512 bit vectors outside of the loop. That's not going to fit in registers, so it's going to spill massively. When the loads are interleaved in the loop body, we will use the same register multiple times.
I could imagine that the stack segment isn't sufficient, corrupting data.
You can check this by disassembling with llvm-objdump -d and looking at the stack pointer adjustment at the start of the function.

Author

newling commented Feb 12, 2025

Thanks for taking a look!

> First observation is that the loads have 64 byte alignment, but they step with 32 bytes offset. That can't be right?

I might be missing something but the step shouldn't ever be 32 bytes -- 32 bfloat16s is 64 bytes.

> Second observation is that we load 32 512 bit vectors outside of the loop. That's not going to fit in registers, so it's going to spill massively.

That's what looked most troubling to me. I'll take a look at llvm-objdump -d tomorrow and post the .o files here too.

@martien-de-jong
Collaborator

to wit, these are the first four loads in input.opt.fail.O1.ll:

  %100 = load <32 x bfloat>, ptr @buff_3, align 64
  %101 = load <32 x bfloat>, ptr getelementptr inbounds ([1024 x bfloat], ptr @buff_3, i20 0, i20 32), align 64
  %102 = load <32 x bfloat>, ptr getelementptr inbounds ([1024 x bfloat], ptr @buff_3, i20 0, i20 64), align 64
  %103 = load <32 x bfloat>, ptr getelementptr inbounds ([1024 x bfloat], ptr @buff_3, i20 0, i20 96), align 64

they have offsets 0, 32, 64, 96 from the same aligned buff_3. Having said that, I don't understand why they aren't causing problems in input.modified.opt.pass.O1.ll

Author

newling commented Feb 12, 2025

> to wit, these are the first four loads in input.opt.fail.O1.ll:
>
>   %100 = load <32 x bfloat>, ptr @buff_3, align 64
>   %101 = load <32 x bfloat>, ptr getelementptr inbounds ([1024 x bfloat], ptr @buff_3, i20 0, i20 32), align 64
>   %102 = load <32 x bfloat>, ptr getelementptr inbounds ([1024 x bfloat], ptr @buff_3, i20 0, i20 64), align 64
>   %103 = load <32 x bfloat>, ptr getelementptr inbounds ([1024 x bfloat], ptr @buff_3, i20 0, i20 96), align 64
>
> they have offsets 0, 32, 64, 96 from the same aligned buff_3. Having said that, I don't understand why they aren't causing problems in input.modified.opt.pass.O1.ll

Hmm, my understanding from https://llvm.org/docs/GetElementPtr.html#what-is-dereferenced-by-gep is that those offsets are counted in number of elements, not number of bytes.
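
To make the element-versus-byte point concrete, here is a small sketch of the arithmetic for the four loads quoted above. It assumes a bfloat occupies 2 bytes and that @buff_3 itself starts at a 64-byte-aligned address; the names `gep_byte_offset`, `BFLOAT_BYTES`, and `ALIGN` are illustrative only.

```python
BFLOAT_BYTES = 2  # a bfloat is 16 bits wide
ALIGN = 64        # alignment claimed on the loads, in bytes


def gep_byte_offset(element_index: int, element_bytes: int = BFLOAT_BYTES) -> int:
    """Byte offset selected by the last GEP index into [1024 x bfloat].

    The leading `i20 0` index selects the array itself and contributes
    nothing, so only the element index is scaled by the element size.
    """
    return element_index * element_bytes


element_indices = [0, 32, 64, 96]
byte_offsets = [gep_byte_offset(i) for i in element_indices]
all_aligned = all(offset % ALIGN == 0 for offset in byte_offsets)

print(byte_offsets)
print(all_aligned)
```

Under those assumptions the loads step by 64 bytes, so an `align 64` annotation on each of them is consistent with the addresses.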

Author

newling commented Feb 12, 2025

> I could imagine that the stack segment isn't sufficient, corrupting data.

You're spot on here, @martien-de-jong. If I double the stack size that we set here to 0x800 (i.e. from 1K to 2K), the numerics are good again (FYI @jtuyls). I'm going to think about whether we can delay setting the stack size until after the .o file is created (any ideas welcome here!)
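
A rough back-of-the-envelope check of why 1K could be too small, based only on the numbers in this thread (32 vectors of 512 bits hoisted out of the loop, stack sizes 0x400 and 0x800). This is a sketch under stated assumptions: `vectors_in_registers` is a hypothetical parameter, since the number of vector registers the allocator can actually keep live is not stated here, and any other stack usage in the function is ignored.

```python
NUM_VECTORS = 32    # loads hoisted out of the loop in the -O1 output
VECTOR_BITS = 512   # each load is <32 x bfloat>

OLD_STACK_BYTES = 0x400  # 1K, the size that produced wrong numerics
NEW_STACK_BYTES = 0x800  # 2K, the size that produced correct numerics


def worst_case_spill_bytes(num_vectors: int, vector_bits: int,
                           vectors_in_registers: int) -> int:
    """Bytes needed if every vector that is not kept in a register is spilled."""
    spilled = max(num_vectors - vectors_in_registers, 0)
    return spilled * vector_bits // 8


live_bytes = worst_case_spill_bytes(NUM_VECTORS, VECTOR_BITS, 0)

# Smallest number of vectors that would have to stay in registers for the
# remaining spills alone to fit in the old 1K stack.
min_in_registers = next(
    k for k in range(NUM_VECTORS + 1)
    if worst_case_spill_bytes(NUM_VECTORS, VECTOR_BITS, k) <= OLD_STACK_BYTES
)

print(live_bytes, OLD_STACK_BYTES, NEW_STACK_BYTES)
print(min_in_registers)
```

The 32 vectors add up to 2048 bytes, which is the whole of the 2K stack and twice the 1K one, so unless at least half of them stay in registers the spills alone overflow the original stack. The stack pointer adjustment in the llvm-objdump -d output gives the real frame size.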

Author

newling commented Feb 12, 2025

@martien-de-jong feel free to close this issue; I've created a follow-up question here: #350
