[WIP] Pass to align transfer_reads #867

Open · wants to merge 2 commits into base: main
Conversation

@newling (Contributor) commented Oct 30, 2024

[UPDATE: will split this PR into 3]

This PR adds the necessary pass to align transfer_read. It is based on mlir-aie's aievec, see:

https://github.com/Xilinx/mlir-aie/blob/d3da586305ebc22e5ecdf1d3e682b44853436e91/lib/Dialect/AIEVec/Transforms/VectorToVectorConversions.cpp#L123

Some changes were needed for our use case, however. The main one is that the lowering in this PR skips the vector.extract_strided_slice operation, because our offsets are not constant: the offsets in https://mlir.llvm.org/docs/Dialects/Vector/#vectorextract_strided_slice-vectorextractstridedsliceop must be static integers, but ours are determined from loop induction variables. The pass implemented here therefore goes straight to the aievec extract and shift operations, which take MLIR Values as offsets.

Also included in this PR: an aievec.shift folder. I can make this a separate PR if preferred.

This PR enables vectorization for convolution and resolves #820

@newling force-pushed the vectorized_convolution_with_larger_transfer_reads branch from 501ca32 to b61f8aa on October 31, 2024 18:58
@newling changed the title from "[WIP] Vectorized convolution with larger transfer reads" to "Towards vectorized convolution (PR 3 of 3)" on Oct 31, 2024
@newling force-pushed the vectorized_convolution_with_larger_transfer_reads branch from dcb4fb4 to 50710bc on October 31, 2024 21:41
@newling marked this pull request as ready for review on October 31, 2024 21:41
@newling requested review from jsetoain and removed review requests for makslevental and nirvedhmeshram on October 31, 2024 21:42
jtuyls (Collaborator) commented Oct 31, 2024

Could you rename your PRs to be more descriptive of what they actually add? I was looking at the commit history earlier and Towards vectorized convolution (second PR) isn't a great commit message to understand what was added.

@newling changed the title from "Towards vectorized convolution (PR 3 of 3)" to "Full vectorized convolution support (align transfer_reads)" on Oct 31, 2024
@newling changed the title from "Full vectorized convolution support (align transfer_reads)" to "Pass to align transfer_reads" on Oct 31, 2024
Comment on lines 940 to 946
//`alignBits`: transfer_reads must be aligned to this number of bits,
// otherwise the reads will be scalarized.
//
//`maxVectorSizeBits`: the maximum size of a vector that can be read in a
// single transfer_read operation.
int alignBits = 256;
int maxVectorSizeBits = 1024;
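[Editor's note: the two thresholds quoted above can be read as a simple predicate. The sketch below is a hypothetical Python model, not code from the PR; the names and the exact gating logic are assumptions for illustration.]

```python
# Model of the two thresholds above (hypothetical helper, not code from the
# PR): a transfer_read lowers to a single aligned vector read only if it
# starts on an `alignBits` boundary and is no wider than `maxVectorSizeBits`.

ALIGN_BITS = 256             # reads must start on a 256-bit boundary
MAX_VECTOR_SIZE_BITS = 1024  # widest single transfer_read

def single_vector_read_ok(offset_bits: int, size_bits: int) -> bool:
    return offset_bits % ALIGN_BITS == 0 and size_bits <= MAX_VECTOR_SIZE_BITS

# A 512-bit read at an aligned offset: fine.
assert single_vector_read_ok(0, 512)
# The same read starting 8 bits later is unaligned: it would be scalarized,
# or rewritten by the alignment pass this PR adds.
assert not single_vector_read_ok(8, 512)
# A 2048-bit read is too wide for a single transfer_read.
assert not single_vector_read_ok(0, 2048)
```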
Collaborator:

We should retrieve this from the AMDAIEDeviceModel. Ideally, you check whether it can be retrieved from aie-rt and otherwise add it here:

newling (Contributor Author):

There's nothing there related to the core, are you sure that's a good place to add it? We already have vector size related stuff elsewhere like matmul instruction size:

const auto &getIntegerMatmulInstructionSizeMap() {

Collaborator:

Yeah, at least all hardware related information is in one place then. Doing this will make it really easy to support multiple architectures. I know we currently have hardcoded information all over the place, but that will make it really cumbersome/buggy to support multiple/new architectures. All that information, like the map you're referencing, should be moved to the device model.

Comment on lines +44 to +86
/**
* This pass ensures that reads from AIE tile memory are aligned according to
* hardware constraints. For example, suppose we have 128 bytes in tile memory,
* represented in hex as:
*
* 0x00 0x01 ... 0x7E 0x7F
*
* On AIE-2, the (vector) read instructions from the tile memory into registers
* must be aligned to 256-bits (32-bytes). So if we want to read 64 bytes
* starting from 0x00 that is fine, but if we want to read 64 bytes starting
* from 0x01, then we cannot use a vector read instruction directly. To work
* around this constraint, we do the following:
*
* 1. Perform a wider read that loads 128 bytes (2x as many as we want)
* starting from 0x00 into a larger register. That is, bytes 0x00-0x7F are
* loaded, so we have 1 'junk' byte at the beginning and 63 'junk' bytes at
* the end.
*
* 2. Extract the target bytes 0x01 ... 0x40 from the larger register into a
* smaller register in 2 steps, using 2 AIE specific instructions:
*
* a) Extract:
* https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorconv__elem.html
*
* b) Shift:
* https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorop__shift.html
*
* First, we use the extract instruction to split the 128 bytes just read into
* two halves, 0x00-0x3F and 0x40-0x7F, each in its own 64-byte register.
* Then, we use a shift operation to combine the upper 63 bytes from the first
* half and the lowest byte from the second half into a new 64-byte register.
* This new register contains exactly the 64 bytes we want to read, starting
* from 0x01.
*
* If we want to read 32 bytes starting from 0x01, we can use a similar
* approach. The only consideration is that the shift operation requires
* 64-byte inputs, so the order of the shift and extract operations is
* reversed.
*
* We do not currently support unaligned reads of vectors that are not 32 or
* 64 bytes in length.
*
* TODO(newling) use this same approach to align writes to unaligned memory.
* */
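[Editor's note: the wide-read / extract / shift recipe in the comment above can be simulated on plain byte arrays. The Python sketch below is an illustration of the 64-byte case only; the function name is hypothetical, the real pass emits aievec ops on vectors, and for 32-byte reads the shift/extract order is reversed as the comment notes.]

```python
# Simulation of the alignment trick: read 64 bytes at an arbitrary offset
# using only aligned accesses, via a wider read, an extract, and a shift.

ALIGN_BYTES = 32  # AIE-2 vector loads must be 256-bit (32-byte) aligned

def unaligned_read_64(memory: bytes, offset: int) -> bytes:
    """Read 64 bytes at `offset` without issuing any unaligned load."""
    # 1. Wider read: drop down to the aligned address at/below `offset` and
    #    load 128 bytes (2x the requested width) into a 'long' register.
    base = (offset // ALIGN_BYTES) * ALIGN_BYTES
    wide = memory[base : base + 128]
    # 2a. Extract: split the long register into two 64-byte halves.
    low, upp = wide[:64], wide[64:128]
    # 2b. Shift: take 64 contiguous bytes out of the concatenated halves,
    #     starting at the residual (sub-alignment) offset.
    shift = offset - base
    return (low + upp)[shift : shift + 64]

memory = bytes(range(128))                           # bytes 0x00 .. 0x7F
assert unaligned_read_64(memory, 1) == memory[1:65]  # 0x01 .. 0x40
assert unaligned_read_64(memory, 0) == memory[0:64]  # already aligned
```

For offset 0x01 the residual shift is 1, so the result combines the upper 63 bytes of the first half with the lowest byte of the second half, matching the description above.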
Collaborator:

IREE uses /// everywhere:

Suggested change
/// This pass ensures that reads from AIE tile memory are aligned according to
/// hardware constraints. For example, suppose we have 128 bytes in tile memory,
/// represented in hex as:
///
/// 0x00 0x01 ... 0x7E 0x7F
///
/// On AIE-2, the (vector) read instructions from the tile memory into registers
/// must be aligned to 256-bits (32-bytes). So if we want to read 64 bytes
/// starting from 0x00 that is fine, but if we want to read 64 bytes starting
/// from 0x01, then we cannot use a vector read instruction directly. To work
/// around this constraint, we do the following:
///
/// 1. Perform a wider read that loads 128 bytes (2x as many as we want)
/// starting from 0x00 into a larger register. That is, bytes 0x00-0x7F are
/// loaded, so we have 1 'junk' byte at the beginning and 63 'junk' bytes at
/// the end.
///
/// 2. Extract the target bytes 0x01 ... 0x40 from the larger register into a
/// smaller register in 2 steps, using 2 AIE specific instructions:
///
/// a) Extract:
/// https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorconv__elem.html
///
/// b) Shift:
/// https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorop__shift.html
///
/// First, we use the extract instruction to split the 128 bytes just read into
/// two halves, 0x00-0x3F and 0x40-0x7F, each in its own 64-byte register.
/// Then, we use a shift operation to combine the upper 63 bytes from the first
/// half and the lowest byte from the second half into a new 64-byte register.
/// This new register contains exactly the 64 bytes we want to read, starting
/// from 0x01.
///
/// If we want to read 32 bytes starting from 0x01, we can use a similar
/// approach. The only consideration is that the shift operation requires
/// 64-byte inputs, so the order of the shift and extract operations is
/// reversed.
///
/// We do not currently support unaligned reads of vectors that are not 32 or
/// 64 bytes in length.
///
/// TODO(newling) use this same approach to align writes to unaligned memory.

newling (Contributor Author):

For now I'd prefer to keep this consistent with the style of the file (which uses /** */ commenting). I intend to align the style of the aievec code that has been copied from mlir-aie to iree-amd-aie in a future PR. Specifically, I'd like to create a Passes.td file for aievec.

Collaborator:

I see, sure. At some point we should align it.

bool allInBounds =
std::all_of(inBounds.begin(), inBounds.end(), [](bool b) { return b; });

if (shortBytes != 64 && shortBytes != 32) {
Collaborator:

These constants are architecture dependent as well, right?

newling (Contributor Author):

Yes, well actually it's just 64 that is -- it's the number of bytes that the shift instruction's operands must be. Good spot -- let me clean up the architecture-specific parts, and re-request a review.

loc, longType, readOp.getSource(), SmallVector<Value>{newIndex}, padding,
SmallVector<bool>{allInBounds});

Value replacement = [&]() -> Value {
Collaborator:

Why the lambda? You could just do:

ShiftOp shiftOp;
if (shortBytes == 64) {
  ...
} else if (shortBytes == 32) {
  ...
} else {
  return failure()...;
}

@newling (Contributor Author) commented Oct 31, 2024:

Ok. I think lambdas are great but ok, ok.

Note that I've already checked that shortBytes is either 32 or 64 at this point -- I do this before creating longVec (because I like to make all IR checks before modifying IR). So this assertion should never be hit (in general this is true whenever I use assert. If one of my asserts is hit, it's because I made a mistake).

But let me rejiggle this to not use the lambda.

Collaborator:

> Note that I've already checked that shortBytes is either 32 or 64 at this point -- I do this before creating longVec (because I like to make all IR checks before modifying IR). So this assertion should never be hit (in general this is true whenever I use assert. If one of my asserts is hit, it's because I made a mistake).

Ok, that's the same way I try to use error vs assert. assert makes sense then.

return rewriter.createOrFold<ShiftOp>(loc, shortType, low, upp,
offsetBytes_i32);
}
assert(shortBytes == 32 && "Expect nb bytes 64 or 32.");
Collaborator:

I think it would be better to throw an error as shortBytes depends on data type + shape, so this might actually be hit in a use case from what I can tell?

jtuyls (Collaborator) commented Oct 31, 2024

> This PR enables vectorization for convolution and resolves #820

Are you planning to add an e2e testcase?

@newling force-pushed the vectorized_convolution_with_larger_transfer_reads branch from ce65c04 to 0db125f on November 1, 2024 21:26
@newling changed the title from "Pass to align transfer_reads" to "[WIP] Pass to align transfer_reads" on Nov 1, 2024
@newling force-pushed the vectorized_convolution_with_larger_transfer_reads branch from 0db125f to 19b468a on November 1, 2024 22:17
Successfully merging this pull request may close these issues:

Numerics issue with vectorized conv2d