AIE2: Small matmul numerical error #342
Comments
My first observation is that the loads have 64-byte alignment, but they step by 32-byte offsets. That can't be right?
My second observation is that we load 32 512-bit vectors outside of the loop. That's not going to fit in registers, so it's going to spill massively. When the loads are interleaved in the loop body, we can reuse the same register multiple times.
Thanks for taking a look!
I might be missing something but the step shouldn't ever be 32 bytes -- 32 bfloat16s is 64 bytes.
That's what looked most troubling to me. I'll take a look at
To wit, these are the first four loads in input.opt.fail.O1.ll:
They have offsets 0, 32, 64, 96 from the same aligned buff_3. Having said that, I don't understand why they aren't causing problems in input.modified.opt.pass.O1.ll
Hmm, my understanding from https://llvm.org/docs/GetElementPtr.html#what-is-dereferenced-by-gep is that those offsets are counted in number of elements, not number of bytes.
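As a rough check of the element-versus-byte reading, the sketch below converts element indices to byte offsets. It assumes the offsets 0, 32, 64, 96 quoted above index bfloat16 elements (2 bytes each), which the thread suggests but only the IR itself can confirm. The helper name `gep_byte_offset` is made up for illustration.

```python
BF16_BYTES = 2  # a bfloat16 element occupies 2 bytes


def gep_byte_offset(element_index, element_bytes=BF16_BYTES):
    """Byte offset of a GEP index that counts elements (hypothetical helper)."""
    return element_index * element_bytes


# Indices quoted in the thread, ASSUMED to be element counts rather than bytes.
indices = [0, 32, 64, 96]
byte_offsets = [gep_byte_offset(i) for i in indices]
aligned_to_64 = [offset % 64 == 0 for offset in byte_offsets]

print(byte_offsets)   # [0, 64, 128, 192]
print(aligned_to_64)  # [True, True, True, True]
```

Under that assumption each load steps by 64 bytes, so an align 64 annotation would be consistent with a buffer that really is 64-byte aligned.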
You're spot on here @martien-de-jong. If I double the stack size that we set here to 0x800 (i.e. from 1K to 2K), the numerics are good again (FYI @jtuyls). I'm going to look into whether we can delay setting the stack size until after the .o file is created (any ideas welcome here!)
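A back-of-the-envelope sketch of why the stack size could matter, combining the spilling observation earlier in the thread with the stack sizes mentioned here. It is an upper bound only: it assumes every one of the 32 vectors is spilled at once and ignores any other stack usage, neither of which the thread confirms. The 0x400 value is inferred from "1K".

```python
VECTOR_BITS = 512   # width of each vector load, per the thread
NUM_VECTORS = 32    # vectors loaded ahead of the loop, per the thread
OLD_STACK = 0x400   # 1K, inferred from "from 1K to 2K"
NEW_STACK = 0x800   # 2K, the value that made the numerics correct


def worst_case_spill_bytes(num_vectors, vector_bits):
    """Bytes needed if every vector were spilled (assumed worst case)."""
    return num_vectors * vector_bits // 8


spill = worst_case_spill_bytes(NUM_VECTORS, VECTOR_BITS)
print(spill)               # 2048
print(spill > OLD_STACK)   # True: cannot fit in a 1K stack
print(spill <= NEW_STACK)  # True: fits in 2K only with nothing else on the stack
```

The real spill footprint depends on how many vectors the register allocator keeps in registers, so this only shows that a 1K stack is plausibly too small.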
@martien-de-jong feel free to close this issue; I've created a follow-up question here: #350
I am observing numerical issues in a small matrix multiplication running through the iree-amd-aie compiler (with peano doing the lowering from LLVM IR). There is a numerical error only when opt is run with optimization level 1: with -O0, -O2, and -O3 the matmul output is correct, but at -O1 it is incorrect.
The numerical error seems to be triggered when I increase an alignment from 32 to 64, although this might not be directly related. Changing from
to
makes the numerics correct. But I am very confident that the buffer is aligned in memory to 64 bytes, so I am not sure this is directly related to the numerical error. All of the .ll and .opt.ll files related to this issue that seem useful are in this zip file
They're quite small (the matmul has operands A: 8x32 and B: 32x32, bf16 -> f32).
- input.ll: the input to opt
- input.opt.pass.O0.ll: opt -O0 input.ll, passes
- input.opt.fail.O1.ll: opt -O1 input.ll, fails
- input.opt.pass.O2.ll: opt -O2 input.ll, passes
- input.opt.pass.O3.ll: opt -O3 input.ll, passes
- input.modified.ll: input.ll, but align 32 instead of align 64
- input.modified.opt.pass.O1.ll: opt -O1 input.modified.ll, passes
I compile with
And then
And then a .elf is generated and we run it with real data on HW. This example isn't important in itself, but there's a related (larger) reproducer that I think has the same problem (and fails at -O2 and -O3). So if I can get to the bottom of this small reproducer, I'll hopefully be closer to solving the larger problem.
So questions:
- The files input.modified.opt.pass.O1.ll and input.opt.fail.O1.ll look like they're the same except for a permutation of lines. Are they actually different, and why did the change in alignment trigger them to be different?
- Any other insights, or suggestions of what to look at?
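One way to test the "same except for a permutation of lines" impression is to compare the two files as multisets of lines. This is a minimal sketch, not a semantic IR diff: the function names are made up, and the two example strings are invented stand-ins rather than lines from the actual files.

```python
from collections import Counter


def line_multiset(text):
    """Count stripped, non-empty lines so that line order is ignored."""
    return Counter(line.strip() for line in text.splitlines() if line.strip())


def diff_ignoring_order(text_a, text_b):
    """Return (lines only in a, lines only in b), ignoring line order."""
    a, b = line_multiset(text_a), line_multiset(text_b)
    return sorted((a - b).elements()), sorted((b - a).elements())


# Invented example text, NOT taken from the real .ll files.
example_a = "%a = load <32 x bfloat>, ptr %p, align 64\n%b = add i32 %x, 1\n"
example_b = "%b = add i32 %x, 1\n%a = load <32 x bfloat>, ptr %p, align 32\n"

only_a, only_b = diff_ignoring_order(example_a, example_b)
print(only_a)  # ['%a = load <32 x bfloat>, ptr %p, align 64']
print(only_b)  # ['%a = load <32 x bfloat>, ptr %p, align 32']
```

To apply it to the real files, read each with `open(path).read()` and pass the contents in. Two empty lists would mean the files differ only in line order, although in LLVM IR a reordering of instructions can still change scheduling and register pressure.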