[WIP] Pass to align transfer_reads #867

Open · wants to merge 2 commits into base: main
Conversation

@newling (Contributor) commented Oct 30, 2024

[UPDATE: will split this PR into 3]

This PR adds the necessary pass to align transfer_read. It is based on mlir-aie's aievec, see:

https://github.com/Xilinx/mlir-aie/blob/d3da586305ebc22e5ecdf1d3e682b44853436e91/lib/Dialect/AIEVec/Transforms/VectorToVectorConversions.cpp#L123

Some changes were needed for our use case, however. The main one is that the lowering in this PR skips the vector.extract_strided_slice operation, because our offsets are not constant: the offsets in https://mlir.llvm.org/docs/Dialects/Vector/#vectorextract_strided_slice-vectorextractstridedsliceop must be static integers, but ours are determined from loop induction variables. The pass implemented here therefore goes straight to the aievec extract and shift operations, which take MLIR Values as offsets.

Also included in this PR: an aievec.shift folder. I can make this a separate PR if preferred.

This PR enables vectorization for convolution and resolves #820

@newling force-pushed the vectorized_convolution_with_larger_transfer_reads branch from 501ca32 to b61f8aa on October 31, 2024 18:58
@newling changed the title from "[WIP] Vectorized convolution with larger transfer reads" to "Towards vectorized convolution (PR 3 of 3)" on Oct 31, 2024
@newling force-pushed the vectorized_convolution_with_larger_transfer_reads branch from dcb4fb4 to 50710bc on October 31, 2024 21:41
@newling marked this pull request as ready for review on October 31, 2024 21:41
@newling requested review from jsetoain and removed review requests for makslevental and nirvedhmeshram on October 31, 2024 21:42
jtuyls (Collaborator) commented Oct 31, 2024

Could you rename your PRs to be more descriptive of what they actually add? I was looking at the commit history earlier and Towards vectorized convolution (second PR) isn't a great commit message to understand what was added.

@newling changed the title from "Towards vectorized convolution (PR 3 of 3)" to "Full vectorized convolution support (align transfer_reads)" on Oct 31, 2024
@newling changed the title from "Full vectorized convolution support (align transfer_reads)" to "Pass to align transfer_reads" on Oct 31, 2024
Comment on lines 940 to 946
//`alignBits`: transfer_reads must be aligned to this number of bits,
// otherwise the reads will be scalarized.
//
//`maxVectorSizeBits`: the maximum size of a vector that can be read in a
// single transfer_read operation.
int alignBits = 256;
int maxVectorSizeBits = 1024;
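[Editor's note: the two thresholds quoted above can be read as a simple predicate. The sketch below is a hypothetical Python model, not code from the PR; the names and the exact gating logic are assumptions for illustration.]

```python
# Model of the two thresholds above (hypothetical helper, not code from the
# PR): a transfer_read lowers to a single aligned vector read only if it
# starts on an `alignBits` boundary and is no wider than `maxVectorSizeBits`.

ALIGN_BITS = 256             # reads must start on a 256-bit boundary
MAX_VECTOR_SIZE_BITS = 1024  # widest single transfer_read

def single_vector_read_ok(offset_bits: int, size_bits: int) -> bool:
    return offset_bits % ALIGN_BITS == 0 and size_bits <= MAX_VECTOR_SIZE_BITS

# A 512-bit read at an aligned offset: fine.
assert single_vector_read_ok(0, 512)
# The same read starting 8 bits later is unaligned: it would be scalarized,
# or rewritten by the alignment pass this PR adds.
assert not single_vector_read_ok(8, 512)
# A 2048-bit read is too wide for a single transfer_read.
assert not single_vector_read_ok(0, 2048)
```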
Collaborator:

We should retrieve this from the AMDAIEDeviceModel. Ideally, you check whether it can be retrieved from aie-rt and otherwise add it here:

newling (Contributor Author):

There's nothing there related to the core, are you sure that's a good place to add it? We already have vector size related stuff elsewhere like matmul instruction size:

const auto &getIntegerMatmulInstructionSizeMap() {

Collaborator:

Yeah, at least all hardware related information is in one place then. Doing this will make it really easy to support multiple architectures. I know we currently have hardcoded information all over the place, but that will make it really cumbersome/buggy to support multiple/new architectures. All that information, like the map you're referencing, should be moved to the device model.

Comment on lines +44 to +86
/**
* This pass ensures that reads from AIE tile memory are aligned according to
* hardware constraints. For example, suppose we have 128 bytes in tile memory,
* represented in hex as:
*
* 0x00 0x01 ... 0x7E 0x7F
*
* On AIE-2, the (vector) read instructions from the tile memory into registers
* must be aligned to 256-bits (32-bytes). So if we want to read 64 bytes
* starting from 0x00 that is fine, but if we want to read 64 bytes starting
* from 0x01, then we cannot use a vector read instruction directly. To work
* around this constraint, we do the following:
*
* 1. Perform a wider read that loads 128 bytes (2x as many as we want)
* starting from 0x00 into a larger register. That is, bytes 0x00-0x7F are
* loaded, so we have 1 'junk' byte at the beginning and 63 'junk' bytes at
* the end.
*
* 2. Extract the target bytes 0x01 ... 0x40 from the larger register into a
* smaller register in 2 steps, using 2 AIE specific instructions:
*
* a) Extract:
* https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorconv__elem.html
*
* b) Shift:
* https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorop__shift.html
*
* First, we use the extract instruction to split the 128 bytes just read into
* two halves, 0x00-0x3F and 0x40-0x7F, each in its own 64-byte register.
* Then, we use a shift operation to combine the upper 63 bytes from the first
* half and the lowest byte from the second half into a new 64-byte register.
* This new register contains exactly the 64 bytes we want to read, starting
* from 0x01.
*
* If we want to read 32 bytes starting from 0x01, we can use a similar
* approach. The only consideration is that the shift operation requires
* 64-byte inputs, so the order of the shift and extract operations is
* reversed.
*
* We do not currently support unaligned reads of vectors that are not 32 or
* 64 bytes in length.
*
* TODO(newling) use this same approach to align writes to unaligned memory.
* */
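[Editor's note: the wide-read / extract / shift recipe in the comment above can be simulated on plain byte arrays. The Python sketch below is an illustration of the 64-byte case only; the function name is hypothetical, the real pass emits aievec ops on vectors, and for 32-byte reads the shift/extract order is reversed as the comment notes.]

```python
# Simulation of the alignment trick: read 64 bytes at an arbitrary offset
# using only aligned accesses, via a wider read, an extract, and a shift.

ALIGN_BYTES = 32  # AIE-2 vector loads must be 256-bit (32-byte) aligned

def unaligned_read_64(memory: bytes, offset: int) -> bytes:
    """Read 64 bytes at `offset` without issuing any unaligned load."""
    # 1. Wider read: drop down to the aligned address at/below `offset` and
    #    load 128 bytes (2x the requested width) into a 'long' register.
    base = (offset // ALIGN_BYTES) * ALIGN_BYTES
    wide = memory[base : base + 128]
    # 2a. Extract: split the long register into two 64-byte halves.
    low, upp = wide[:64], wide[64:128]
    # 2b. Shift: take 64 contiguous bytes out of the concatenated halves,
    #     starting at the residual (sub-alignment) offset.
    shift = offset - base
    return (low + upp)[shift : shift + 64]

memory = bytes(range(128))                           # bytes 0x00 .. 0x7F
assert unaligned_read_64(memory, 1) == memory[1:65]  # 0x01 .. 0x40
assert unaligned_read_64(memory, 0) == memory[0:64]  # already aligned
```

For offset 0x01 the residual shift is 1, so the result combines the upper 63 bytes of the first half with the lowest byte of the second half, matching the description above.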
Collaborator:

IREE uses /// everywhere:

Suggested change
/// This pass ensures that reads from AIE tile memory are aligned according to
/// hardware constraints. For example, suppose we have 128 bytes in tile memory,
/// represented in hex as:
///
/// 0x00 0x01 ... 0x7E 0x7F
///
/// On AIE-2, the (vector) read instructions from the tile memory into registers
/// must be aligned to 256-bits (32-bytes). So if we want to read 64 bytes
/// starting from 0x00 that is fine, but if we want to read 64 bytes starting
/// from 0x01, then we cannot use a vector read instruction directly. To work
/// around this constraint, we do the following:
///
/// 1. Perform a wider read that loads 128 bytes (2x as many as we want)
/// starting from 0x00 into a larger register. That is, bytes 0x00-0x7F are
/// loaded, so we have 1 'junk' byte at the beginning and 63 'junk' bytes at
/// the end.
///
/// 2. Extract the target bytes 0x01 ... 0x40 from the larger register into a
/// smaller register in 2 steps, using 2 AIE specific instructions:
///
/// a) Extract:
/// https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorconv__elem.html
///
/// b) Shift:
/// https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorop__shift.html
///
/// First, we use the extract instruction to split the 128 bytes just read into
/// two halves, 0x00-0x3F and 0x40-0x7F, each in its own 64-byte register.
/// Then, we use a shift operation to combine the upper 63 bytes from the first
/// half and the lowest byte from the second half into a new 64-byte register.
/// This new register contains exactly the 64 bytes we want to read, starting
/// from 0x01.
///
/// If we want to read 32 bytes starting from 0x01, we can use a similar
/// approach. The only consideration is that the shift operation requires
/// 64-byte inputs, so the order of the shift and extract operations is
/// reversed.
///
/// We do not currently support unaligned reads of vectors that are not 32 or
/// 64 bytes in length.
///
/// TODO(newling) use this same approach to align writes to unaligned memory.

newling (Contributor Author):

For now I'd prefer to keep this consistent with the style of the file (which uses /** */ commenting). I intend to align the style of the aievec code that has been copied from mlir-aie to iree-amd-aie in a future PR. Specifically, I'd like to create a Passes.td file for aievec.

Collaborator:

I see, sure. At some point we should align it.

bool allInBounds =
std::all_of(inBounds.begin(), inBounds.end(), [](bool b) { return b; });

if (shortBytes != 64 && shortBytes != 32) {
Collaborator:

These constants are architecture dependent as well, right?

newling (Contributor Author):

Yes, well actually it's just 64 that is -- it's the number of bytes that the shift instruction's operands must be. Good spot -- let me clean up the architecture-specific parts, and re-request a review.

loc, longType, readOp.getSource(), SmallVector<Value>{newIndex}, padding,
SmallVector<bool>{allInBounds});

Value replacement = [&]() -> Value {
Collaborator:

Why the lambda? You could just do:

ShiftOp shiftOp;
if (shortBytes == 64) {
  ...
} else if (shortBytes == 32) {
  ...
} else {
  return failure()...;
}

@newling (Contributor Author) commented Oct 31, 2024:

Ok. I think lambdas are great but ok, ok.

Note that I've already checked that shortBytes is either 32 or 64 at this point -- I do this before creating longVec (because I like to make all IR checks before modifying IR). So this assertion should never be hit (in general this is true whenever I use assert. If one of my asserts is hit, it's because I made a mistake).

But let me rejiggle this to not use the lambda.

Collaborator:

> Note that I've already checked that shortBytes is either 32 or 64 at this point -- I do this before creating longVec (because I like to make all IR checks before modifying IR). So this assertion should never be hit (in general this is true whenever I use assert. If one of my asserts is hit, it's because I made a mistake).

Ok, that's the same way I try to use error vs assert. assert makes sense then.

return rewriter.createOrFold<ShiftOp>(loc, shortType, low, upp,
offsetBytes_i32);
}
assert(shortBytes == 32 && "Expect nb bytes 64 or 32.");
Collaborator:

I think it would be better to throw an error as shortBytes depends on data type + shape, so this might actually be hit in a use case from what I can tell?

jtuyls (Collaborator) commented Oct 31, 2024

> This PR enables vectorization for convolution and resolves #820

Are you planning to add an e2e testcase?

@newling force-pushed the vectorized_convolution_with_larger_transfer_reads branch from ce65c04 to 0db125f on November 1, 2024 21:26
@newling changed the title from "Pass to align transfer_reads" to "[WIP] Pass to align transfer_reads" on Nov 1, 2024
@newling force-pushed the vectorized_convolution_with_larger_transfer_reads branch from 0db125f to 19b468a on November 1, 2024 22:17
Successfully merging this pull request may close these issues:

Numerics issue with vectorized conv2d