Error: error when utilizing the data reuse function of the ObjectFIFO #1659
Comments
Hello! Thank you for your patience! Your code looks good to me. I believe this is the same issue as mentioned in #1556 where the MLIR loop unrolling produces erroneous IR when the number of iterations in the loop is less than the unroll factor. Would you be able to increase the size of your loop to > 4 in order to confirm this? |
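To make the condition in the comment above concrete, here is a minimal, generic sketch of how unrolling by a fixed factor splits a loop into fully unrolled groups plus leftover iterations. It is not the mlir-aie or MLIR implementation; `split_trip_count` is a hypothetical helper, and the factor 4 is simply the value suggested in the comment.

```python
def split_trip_count(trip_count: int, unroll_factor: int) -> tuple[int, int]:
    """Return (unrolled_groups, remainder_iterations) for a generic unroll.

    Hypothetical helper for illustration only; it does not call into MLIR.
    """
    if trip_count < 0 or unroll_factor < 1:
        raise ValueError("trip_count must be >= 0 and unroll_factor >= 1")
    return divmod(trip_count, unroll_factor)


for n in (2, 4, 6):
    groups, rest = split_trip_count(n, 4)
    print(f"trip_count={n}: {groups} unrolled group(s), {rest} remainder iteration(s)")
```

With 2 iterations and a factor of 4 there are no full groups at all, so the whole loop falls into the remainder. That is the edge case the comment refers to, and it is why a loop with more than 4 iterations is a useful check.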
Hi, Sure, I changed my My input is
The running result is
However, if I changed the
|
Hi again, I also found that I only changed the objectfifo size to 64 or smaller, and the corresponding part in the Input:
Output and execution time:
import sys

from aie.dialects.aie import *
from aie.dialects.aiex import *
from aie.extras.dialects.ext import memref, arith
from aie.dialects.scf import *
from aie.extras.context import mlir_mod_ctx
from aie.ir import MemRefType, TypeAttr

IN_SIZE = 1024
# the only part I changed
BLOCK_SIZE = 64
BLOCK = IN_SIZE // BLOCK_SIZE


def my_vector_bias_add():
    @device(AIEDevice.npu1_1col)
    def device_body():
        memRef_mem_tile_ty = T.memref(BLOCK_SIZE, T.f32())
        memRef_aie_tile_ty = T.memref(BLOCK_SIZE, T.f32())

        # Tile declarations
        ShimTile = tile(0, 0)
        MemTile = tile(0, 1)
        ComputeTile2 = tile(0, 2)

        # kernel definitions
        add = external_func(
            "add",
            inputs=[
                memRef_aie_tile_ty,
                memRef_aie_tile_ty,
                memRef_aie_tile_ty,
                memRef_aie_tile_ty,
                T.i32(),
            ],
        )

        # AIE-array data movement with object fifos
        # Input
        of_in0 = object_fifo("in0", ShimTile, MemTile, 4, memRef_mem_tile_ty)
        of_in1 = object_fifo("in1", MemTile, ComputeTile2, 2, memRef_aie_tile_ty)
        object_fifo_link(of_in0, of_in1)

        # Output
        of_out0 = object_fifo("out0", MemTile, ShimTile, 2, memRef_mem_tile_ty)
        of_out1 = object_fifo("out1", ComputeTile2, MemTile, 2, memRef_aie_tile_ty)
        object_fifo_link(of_out1, of_out0)

        # Add
        @core(ComputeTile2, "add.o")
        def core_body():
            for _ in for_(sys.maxsize):
                # pre-amble: top row
                elementActivactionsIn = of_in1.acquire(ObjectFifoPort.Consume, 2)
                element0ActivactionsOut = of_out1.acquire(ObjectFifoPort.Produce, 1)
                res = call(
                    add,
                    [
                        elementActivactionsIn[0],
                        elementActivactionsIn[0],
                        elementActivactionsIn[1],
                        element0ActivactionsOut,
                        BLOCK_SIZE,
                    ],
                )
                objectfifo_release(ObjectFifoPort.Produce, "out1", 1)

                # middle
                for _ in for_(BLOCK - 2):
                    elementActivactionsIn = of_in1.acquire(ObjectFifoPort.Consume, 3)
                    element0ActivactionsOut = of_out1.acquire(ObjectFifoPort.Produce, 1)
                    res = call(
                        add,
                        [
                            elementActivactionsIn[0],
                            elementActivactionsIn[1],
                            elementActivactionsIn[2],
                            element0ActivactionsOut,
                            BLOCK_SIZE,
                        ],
                    )
                    objectfifo_release(ObjectFifoPort.Consume, "in1", 1)
                    objectfifo_release(ObjectFifoPort.Produce, "out1", 1)
                    yield_([])

                # last part
                elementActivactionsIn = of_in1.acquire(ObjectFifoPort.Consume, 2)
                element0ActivactionsOut = of_out1.acquire(ObjectFifoPort.Produce, 1)
                res = call(
                    add,
                    [
                        elementActivactionsIn[0],
                        elementActivactionsIn[1],
                        elementActivactionsIn[1],
                        element0ActivactionsOut,
                        BLOCK_SIZE,
                    ],
                )
                objectfifo_release(ObjectFifoPort.Consume, "in1", 2)
                objectfifo_release(ObjectFifoPort.Produce, "out1", 1)
                yield_([])

        # instruction stream generation
        tensor_ty = T.memref(IN_SIZE, T.f32())

        @runtime_sequence(tensor_ty, tensor_ty)
        def sequence(inTensor, outTensor):
            npu_dma_memcpy_nd(
                metadata="out0", bd_id=1, mem=outTensor, sizes=[1, 1, 1, IN_SIZE]
            )
            npu_dma_memcpy_nd(
                metadata="in0", bd_id=0, mem=inTensor, sizes=[1, 1, 1, IN_SIZE]
            )
            npu_sync(column=0, row=0, direction=0, channel=0)


# Declares that subsequent code is in mlir-aie context
with mlir_mod_ctx() as ctx:
    my_vector_bias_add()
    res = ctx.module.operation.verify()
    if res == True:
        print(ctx.module)
    else:
        print(res)
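As a sanity check on the acquire/release pattern in the design above, here is a small plain-Python model of the consumer side. It is only a sketch under stated assumptions: `WindowFifoModel` and `trace_windows` are hypothetical names, acquiring `n` objects is assumed to top up the objects already held rather than fetch `n` new ones, and FIFO depth, DMA transfers, and locks are ignored.

```python
from collections import deque


class WindowFifoModel:
    """Toy consumer-side model (hypothetical, for illustration only).

    acquire(n) tops up to n held objects; release(k) drops the k oldest.
    """

    def __init__(self, rows):
        self._pending = deque(rows)  # objects not yet acquired
        self._held = deque()         # objects currently held by the consumer

    def acquire(self, n):
        while len(self._held) < n:
            self._held.append(self._pending.popleft())
        return list(self._held)[:n]

    def release(self, k):
        for _ in range(k):
            self._held.popleft()


def trace_windows(num_rows):
    """Replay the pre-amble / middle / last-part pattern for one pass."""
    fifo = WindowFifoModel(range(num_rows))
    windows = []

    rows = fifo.acquire(2)  # pre-amble: top row, row 0 passed twice
    windows.append((rows[0], rows[0], rows[1]))

    for _ in range(num_rows - 2):  # middle: slide the window by one row
        rows = fifo.acquire(3)
        windows.append((rows[0], rows[1], rows[2]))
        fifo.release(1)

    rows = fifo.acquire(2)  # last part: bottom row passed twice
    windows.append((rows[0], rows[1], rows[1]))
    fifo.release(2)
    return windows


print(trace_windows(4))  # [(0, 0, 1), (0, 1, 2), (1, 2, 3), (2, 3, 3)]
```

Each tuple lists the row indices that would be passed to the kernel in one call, so any difference between this trace and the hardware output points at the lowering rather than at the acquire/release bookkeeping.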
Hello again! Thank you for taking the time to make all of these different tests, it's very helpful! The fix for the first issue regarding the loop unrolling (#1568) is ready for review and should be merged in soon. Once it is, could you please verify that it indeed solves your original error, then also check whether it solves any of the other issues you've identified? |
Hi, Sure, thank you very much! |
Hi, Will there be any documentation on this? It seems like a lot of things are happening in the compiler that we are not aware of. Thanks a lot! |
Hello, For the lower level workings of the compiler, the documentation is usually available in the MLIR tablegen files which can be found in the For this particular case here are a few places where this is mentioned: lowering passes tablegen file, design patterns description, the feature's corresponding tests. I hope this helps! |
PR with the fix has been merged! |
Hi again, The issue still exists. Do I only need to run |
Hello. Thank you for taking the time to test. This is my bad: I forgot that the quick_setup script might need to be updated to point to the build version with the fix. I'll follow up once the update is in. |
Hello, Okay, thank you very much! |
@ngdymx Can you run |
Hi team, I just wanted to kindly ask for an update on this issue when you get a chance. Of course, I know you’re likely balancing other responsibilities, so no pressure at all. Please let me know if there’s anything I can do to assist. Thank you very much! |
Hi team,
I am trying to implement a line-buffer architecture with the data reuse function of the ObjectFIFO, here is a simple example I used to test the architecture.
The input is four vector<int32_t, 128>; each vector is one row and is filled with its row index (all values in vector 0, i.e. row 0, are 0, and all values in vector 1, i.e. row 1, are 1), as shown in the pseudo code In[4] = [0, 1, 2, 3]. I am trying to implement a line buffer function as described in the following pseudo code:
Then, I tried to mimic the code under programming_examples/ml/bottleneck; the central part is shown below.
The input is printed as follows:
The running result is shown below, and it does not match what I expected.
Then, I manually unrolled the for loop in the aie2.py code and got the correct result, shown below.
Is there something special in the for loop? Please help me review it. Thank you very much!
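For anyone reproducing this, a host-side reference model can help when comparing against the NPU output. The sketch below is an assumption-laden illustration, not the poster's kernel: it assumes the kernel adds three rows element-wise and that the first and last rows are duplicated at the edges. `line_buffer_reference` and `WIDTH` are hypothetical names.

```python
def line_buffer_reference(rows):
    """Sum each row with its neighbours, repeating the edge rows.

    Assumption: the kernel is an element-wise three-row add.
    """
    last = len(rows) - 1
    out = []
    for i in range(len(rows)):
        above = rows[max(i - 1, 0)]     # row 0 is reused at the top edge
        below = rows[min(i + 1, last)]  # last row is reused at the bottom edge
        out.append([a + b + c for a, b, c in zip(above, rows[i], below)])
    return out


WIDTH = 128
inputs = [[r] * WIDTH for r in range(4)]  # row r is filled with the value r
expected = line_buffer_reference(inputs)
print([row[0] for row in expected])       # [1, 3, 6, 8]
```

Under those assumptions every element of output row 0 is 1, row 1 is 3, row 2 is 6, and row 3 is 8; a different kernel or edge policy would change these values.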
I have attached the necessary files for your testing below:
Original aie2.py:
Unroll version of aie2.py:
The kernel code is shown below:
And the following is my host.cpp:
The following is the Makefile I used: