[WIP] Pass to align transfer_reads #867
base: main
Conversation
Could you rename your PRs to be more descriptive of what they actually add? I was looking at the commit history earlier and
// `alignBits`: transfer_reads must be aligned to this number of bits,
// otherwise the reads will be scalarized.
//
// `maxVectorSizeBits`: the maximum size of a vector that can be read in a
// single transfer_read operation.
int alignBits = 256;
int maxVectorSizeBits = 1024;
We should retrieve this from the AMDAIEDeviceModel. Ideally, you check whether it can be retrieved from aie-rt and otherwise add it here:
struct AMDAIEDeviceConfig {
There's nothing there related to the core, are you sure that's a good place to add it? We already have vector-size related stuff elsewhere, like the matmul instruction size:
iree-amd-aie/compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIEUtils.cpp
Line 50 in c6156e3
const auto &getIntegerMatmulInstructionSizeMap() {
Yeah, at least all hardware related information is in one place then. Doing this will make it really easy to support multiple architectures. I know we currently have hardcoded information all over the place, but that will make it really cumbersome/buggy to support multiple/new architectures. All that information, like the map you're referencing, should be moved to the device model.
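As a rough sketch of what this could look like (this is not the actual AMDAIEDeviceModel API; the member and getter names below are hypothetical), the constants could live next to the other device parameters and be queried by the pass instead of being hardcoded:

#include <cstdint>

struct AMDAIEDeviceConfig {
  // Vector loads from tile memory must be aligned to this many bits;
  // unaligned reads fall back to scalarized loads.
  uint32_t vectorLoadAlignmentBits = 256;
  // Maximum number of bits a single transfer_read may load at once.
  uint32_t maxVectorLoadSizeBits = 1024;
};

// The pass would then do something along the lines of (hypothetical getter):
//   AMDAIEDeviceConfig cfg = deviceModel.getDeviceConfig();
//   int alignBits = cfg.vectorLoadAlignmentBits;
//   int maxVectorSizeBits = cfg.maxVectorLoadSizeBits;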
/**
 * This pass ensures that reads from AIE tile memory are aligned according to
 * hardware constraints. For example, suppose we have 128 bytes in tile memory,
 * represented in hex as:
 *
 * 0x00 0x01 ... 0x7E 0x7F
 *
 * On AIE-2, the (vector) read instructions from the tile memory into registers
 * must be aligned to 256 bits (32 bytes). So if we want to read 64 bytes
 * starting from 0x00 that is fine, but if we want to read 64 bytes starting
 * from 0x01, then we cannot use a vector read instruction directly. To work
 * around this constraint, we do the following:
 *
 * 1. Perform a wider read that loads 128 bytes (2x as many as we want)
 * starting from 0x00 into a larger register. That is, bytes 0x00-0x7F are
 * loaded, so we have 1 'junk' byte at the beginning and 63 'junk' bytes at
 * the end.
 *
 * 2. Extract the target bytes 0x01 ... 0x40 from the larger register into a
 * smaller register in 2 steps, using 2 AIE-specific instructions:
 *
 * a) Extract:
 * https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorconv__elem.html
 *
 * b) Shift:
 * https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorop__shift.html
 *
 * First, we use the extract instruction to split the 128 bytes that were read
 * into two halves, 0x00-0x3F and 0x40-0x7F, each in its own 64-byte register.
 * Then, we use a shift operation to combine the upper 63 bytes from the first
 * half and the lowest byte from the second half into a new 64-byte register.
 * This new register contains exactly the 64 bytes we want to read, starting
 * from 0x01.
 *
 * If we want to read 32 bytes starting from 0x01, we can use a similar
 * approach. The only consideration is that the shift operation requires
 * 64-byte inputs, so the order of the shift and extract operations is
 * reversed.
 *
 * We do not currently support unaligned reads of vectors which are not 32 or
 * 64 bytes in length.
 *
 * TODO(newling) use this same approach to align writes to unaligned memory.
 * */
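To make the address arithmetic in step 1 concrete, here is a small standalone sketch (illustrative only, not part of the pass) that rounds the requested start down to the 32-byte boundary and records how far into the wide load the target bytes begin, using the numbers from the example above:

#include <cstdint>
#include <cstdio>

int main() {
  const uint32_t alignBytes = 32;  // 256-bit alignment on AIE-2
  const uint32_t start = 0x01;     // requested (unaligned) start byte
  const uint32_t sizeBytes = 64;   // requested read size in bytes

  // Round the start down to the alignment boundary; the remainder is how
  // far the extract+shift sequence must shift within the wide load.
  uint32_t alignedStart = start & ~(alignBytes - 1);  // 0x00
  uint32_t offsetBytes = start - alignedStart;        // 1
  uint32_t wideSizeBytes = 2 * sizeBytes;             // 128-byte aligned read

  std::printf("wide read [0x%02X, 0x%02X), then extract+shift by %u byte(s)\n",
              alignedStart, alignedStart + wideSizeBytes, offsetBytes);
  return 0;
}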
IREE uses /// everywhere:
/// This pass ensures that reads from AIE tile memory are aligned according to
/// hardware constraints. For example, suppose we have 128 bytes in tile
/// memory, represented in hex as:
///
/// 0x00 0x01 ... 0x7E 0x7F
///
/// On AIE-2, the (vector) read instructions from the tile memory into
/// registers must be aligned to 256 bits (32 bytes). So if we want to read 64
/// bytes starting from 0x00 that is fine, but if we want to read 64 bytes
/// starting from 0x01, then we cannot use a vector read instruction directly.
/// To work around this constraint, we do the following:
///
/// 1. Perform a wider read that loads 128 bytes (2x as many as we want)
/// starting from 0x00 into a larger register. That is, bytes 0x00-0x7F are
/// loaded, so we have 1 'junk' byte at the beginning and 63 'junk' bytes at
/// the end.
///
/// 2. Extract the target bytes 0x01 ... 0x40 from the larger register into a
/// smaller register in 2 steps, using 2 AIE-specific instructions:
///
/// a) Extract:
/// https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorconv__elem.html
///
/// b) Shift:
/// https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorop__shift.html
///
/// First, we use the extract instruction to split the 128 bytes that were read
/// into two halves, 0x00-0x3F and 0x40-0x7F, each in its own 64-byte register.
/// Then, we use a shift operation to combine the upper 63 bytes from the first
/// half and the lowest byte from the second half into a new 64-byte register.
/// This new register contains exactly the 64 bytes we want to read, starting
/// from 0x01.
///
/// If we want to read 32 bytes starting from 0x01, we can use a similar
/// approach. The only consideration is that the shift operation requires
/// 64-byte inputs, so the order of the shift and extract operations is
/// reversed.
///
/// We do not currently support unaligned reads of vectors which are not 32 or
/// 64 bytes in length.
///
/// TODO(newling) use this same approach to align writes to unaligned memory.
For now I'd prefer to keep this consistent with the style of the file (which uses /** */ commenting). I intend to align the style of the aievec code that has been copied and pasted from mlir-aie to iree-amd-aie in a future PR. Specifically, I'd like to create a Passes.td file for aievec.
I see, sure. At some point we should align it.
bool allInBounds =
    std::all_of(inBounds.begin(), inBounds.end(), [](bool b) { return b; });

if (shortBytes != 64 && shortBytes != 32) {
These constants are architecture dependent as well, right?
Yes, well actually it's just 64 that is -- it's the number of bytes that the shift instruction operands must be. Good spot -- let me clean up the architecture-specific parts and re-request a review.
loc, longType, readOp.getSource(), SmallVector<Value>{newIndex}, padding,
SmallVector<bool>{allInBounds});

Value replacement = [&]() -> Value {
Why the lambda? You could just do:
ShiftOp shiftOp;
if (shortBytes == 64) {
  ...
} else if (shortBytes == 32) {
  ...
} else {
  return failure()...;
}
Ok. I think lambdas are great but ok, ok.
Note that I've already checked that shortBytes is either 32 or 64 at this point -- I do this before creating longVec (because I like to make all IR checks before modifying IR). So this assertion should never be hit (in general this is true whenever I use assert. If one of my asserts is hit, it's because I made a mistake).
But let me rejiggle this to not use the lambda.
Note that I've already checked that shortBytes is either 32 or 64 at this point -- I do this before creating longVec (because I like to make all IR checks before modifying IR). So this assertion should never be hit (in general this is true whenever I use assert. If one of my asserts is hit, it's because I made a mistake).
Ok, that's the same way I try to use error vs assert. assert makes sense then.
return rewriter.createOrFold<ShiftOp>(loc, shortType, low, upp,
                                      offsetBytes_i32);
}
assert(shortBytes == 32 && "Expect nb bytes 64 or 32.");
I think it would be better to throw an error, as shortBytes depends on data type + shape, so this might actually be hit in a use case from what I can tell?
Are you planning to add an e2e testcase?
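For reference, reporting this as a recoverable failure instead of asserting could look roughly like the fragment below. This is only a sketch, not the PR's code: it assumes it sits inside the existing matchAndRewrite, with readOp and rewriter coming from the surrounding pattern.

// Fail the pattern gracefully for unsupported sizes instead of asserting.
if (shortBytes != 64 && shortBytes != 32) {
  return rewriter.notifyMatchFailure(
      readOp, "unaligned reads only supported for 32- or 64-byte vectors");
}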
[UPDATE: will split this PR into 3]
This PR adds the necessary pass to align transfer_read. It is based on mlir-aie's aievec, see: https://github.com/Xilinx/mlir-aie/blob/d3da586305ebc22e5ecdf1d3e682b44853436e91/lib/Dialect/AIEVec/Transforms/VectorToVectorConversions.cpp#L123
Some changes were needed for our use case, however. The main one is that the lowering in this PR skips the vector.extract_strided_slice operation, because we have an offset which is not constant. That is, the offsets in https://mlir.llvm.org/docs/Dialects/Vector/#vectorextract_strided_slice-vectorextractstridedsliceop cannot be integers for us, because they are determined from loop induction variables. The pass implemented here goes straight to aievec extract and shift operations, where mlir Values are used for offsets.
Also included in this PR: an aievec.shift folder. I can make this a separate PR if preferred.
This PR enables vectorization for convolution and resolves #820