Xilinx · hunhoffe · Jul 31, 2024 · Jul 19, 2024
@@ -4,26 +4,30 @@ These programming examples are provided so that application programmers can lear
 
 ## [2-Dimensional Shim DMA Passthrough](shim_dma_2d)
 
-This example demonstrates how data may be moved using shim DMA operations. It also includes extra infrastructure that illustrates different ways to compile, build, run, and test programs written using the mlir-air python bindings.
+This example demonstrates how data may be moved using shim DMA operations. It also includes extra infrastructure that illustrates different ways to compile, build, run, and test programs written using the mlir-air python bindings on an NPU.
 
 ## [Passthrough Examples](passthrough)
 
-Three examples that copy data from the input to the output (a data passthrough). The data movement is done through either DMA or Channels, and there is a simple example of calling a an external function which performs a vectorized memcopy.
+This directory contains three examples that each copy data from the input to the output (a data passthrough). The data movement is done through either DMA or Channels, and there is a simple example of calling an external function which performs a vectorized memcopy.
 
 ## [Channel Examples](channel_examples)
 
-This is a collection of simple examples that illustrate how to use channels.
+This is a collection of simple examples that illustrate how to use *channels*. At a high level, channels are the abstraction for data movement in mlir-air. Some of the examples are experimental works-in-progress.
 
 ## [Matrix Scalar Addition](matrix_scalar_add)
 
-This example provides logic to divide in input 2D matrix into *tiles* of data, and add a value to every element in every tile. It includes some description of the fundamental concepts of mlir-air, including *launches*, *herds*, and *channels*.
+This example provides logic to divide an input 2D matrix into *tiles* of data, and add a value to every element in every tile. It includes some description of the fundamental concepts of mlir-air, including *launches*, *herds*, and *channels*. There are five different implementations of this example, some of which are experimental (and are currently works-in-progress).
 
 ## [Data Transfer Transpose](data_transfer_transpose)
 
-Transposes a matrix with using either Channels or `dma_memcpy_nd`.
+Transposes a matrix using either air channels or `dma_memcpy_nd`.
+
+## [Segment Alloc](segment_alloc)
+
+While *workers* (compute units managed as part of a *herd*) are able to allocate L1 memory, they are not able to allocate L2 memory; L2 allocation must be done in the *segment*. This example shows how a segment can allocate L2 memory which is then accessed within the herd.
 
 ## [WIP: Multi-Segment Examples](multi_segment)
 
-This is a collection of simple examples that illustrate how to use multiple segments. 
+This is a collection of simple examples that illustrate how to use multiple segments.
 
 Warning: This example is a work-in-progress.
@@ -1,67 +1,61 @@
 # Channel Examples
 
-This example focuses on one of the key abstractions of air: *channels*. This is a collection of examples that use channels in various ways. The patterns shown here may be used to create more complex examples.
+This collection of examples focuses on one of the key abstractions of air: *channels*. The patterns shown here may be used to create more complex examples.
 
 ## Running and Testing
 
 #### ```herd-to-herd```: Using a channel to pass data between herds.
 
-There are two part of this example: two herds within one segment (single segment), and one herd per segment for two segments (multi-segment)
+There are two parts to this example: two herds within one segment (single segment), and one herd per segment for two segments (multi-segment).
 
-The single segment example example ([herd_to_herd/single_segment/herd_to_herd.py](herd_to_herd/single_segment/herd_to_herd.py)) defines two `herd`s within the same `launch` + `segment`. There is a *producer herd*, which writes data to a `Herd2Herd` channel, and a *consumer herd*, which reads data form the `Herd2Herd` channel.
-
-```bash
-cd herd_to_herd/single_segment
-make clean && make
-```
+The single segment example ([herd_to_herd/single_segment/herd_to_herd.py](herd_to_herd/single_segment/herd_to_herd.py)) defines two *herds* within the same *launch* and *segment*. There is a *producer herd*, which writes data to a `Herd2Herd` channel, and a *consumer herd*, which reads data from the `Herd2Herd` channel.
 
 The multi-segment example ([herd_to_herd/multi_segment/herd_to_herd.py](herd_to_herd/multi_segment/herd_to_herd.py)) defines two `segment`s, each with one `herd`, within the same `launch`. There is a *producer_segment* with a *producer herd*, which writes data to a `Herd2Herd` channel, and a *consumer_segment* with a *consumer herd*, which reads data from the `Herd2Herd` channel.
 
 Warning: The multi-segment example is a work in progress!
 
-```bash
-cd herd_to_herd/multi_segment
-make clean && make
-```
-
 #### ```channel-size```: Use the channel size argument
 
 This example ([channel_size/channel_size.py](channel_size/channel_size.py)) is a data passthrough example using the same tiling structure as the [matrix_scalar_add/multi_core_channel](../matrix_scalar_add/multi_core_channel.py) examples, only instead of using a separately defined channel for each tile/core, a bundle of channels is created (using the `ChannelOp` `size` parameter) and indexed into (the `ChannelGet` and `ChannelPut` `indices` parameter).
 
-```bash
-cd channel_size
-make clean && make
-```
-
 #### ```hierarchical```: Use channels for sending data from Launch to Segment to Herd and back again
 
 This example ([hierarchical/hierarchical.py](hierarchical/hierarchical.py)) is a data passthrough example that uses a channel to send data from Launch to Segment (L3->L2 memory) and then from Segment to Herd (L2->L1 memory). The data is then sent back on an analogous path.
 
-```bash
-cd hierarchical
-make clean && make
-```
-
 #### WIP: ```worker-to-self```:
 
 This example ([worker_to_self/worker_to_self.py](worker_to_self/worker_to_self.py)) is a work-in-progress data passthrough example using the same tiling structure as the [matrix_scalar_add/multi_core_channel](../matrix_scalar_add/multi_core_channel.py) examples, only the sole worker in the herd does some extra shuffling between input and output by putting the current data tile into a channel and then getting it from the same channel.
 
-WARNING: This example currently fails because it is assumed channel gets/parts are not from the same memory region, and this example breaks this assumption.
-
-```bash
-cd worker_to_self
-make clean && make
-```
+WARNING: This example currently fails for unknown reasons.
 
 #### WIP: ```worker-to-worker```:
 
 This example ([worker_to_worker/worker_to_worker.py](worker_to_worker/worker_to_worker.py)) is a work-in-progress data passthrough example using the same tiling structure as the [matrix_scalar_add/multi_core_channel](../matrix_scalar_add/multi_core_channel.py) examples, only each worker trades a tile of input data with another worker in the herd by sending it via a channel.
 
 WARNING: This example currently fails for unknown reasons.
 
+#### Usage (For All Examples)
+
+To generate AIR MLIR from Python:
 ```bash
-cd worker_to_worker
+cd <example_dir>
+make clean && make print
+```
+
+To run:
+```bash
+cd <example_dir>
 make clean && make
-``
+```
 
-#### WIP: more examples!
+To run with verbose output:
+```bash
+cd <example_dir>
+python <example_file>.py -v
+```
+
+You may be able to configure examples (data types, sizes); to get additional usage information, run:
+```bash
+cd <example_dir>
+python <example_file>.py -h
+```
@@ -1,12 +1,17 @@
-# Copyright (C) 2022, Advanced Micro Devices, Inc.
+# (c) Copyright 2024 Advanced Micro Devices, Inc.
 # SPDX-License-Identifier: MIT
 srcdir := $(shell dirname $(realpath $(firstword $(MAKEFILE_LIST))))
 
 targetname := $(shell basename ${srcdir})
 
+all: run
+
+print:
+	${powershell} python3 ${srcdir}/channel_size.py -p
+
 run:
-	mkdir -p build
-	cd build && ${powershell} python3 ${srcdir}/run.py
+	mkdir -p ${srcdir}/build
+	cd ${srcdir}/build && ${powershell} python3 ${srcdir}/channel_size.py
 
 clean:
-	rm -rf build __pycache__
+	rm -rf ${srcdir}/build ${srcdir}/__pycache__
@@ -1,33 +1,39 @@
 # Copyright (C) 2024, Advanced Micro Devices, Inc.
 # SPDX-License-Identifier: MIT
+import argparse
+import numpy as np
 
 from air.ir import *
 from air.dialects.air import *
 from air.dialects.memref import AllocOp, DeallocOp, load, store
 from air.dialects.func import FuncOp
 from air.dialects.scf import for_, yield_
+from air.backend.xrt_runner import XRTRunner, type_mapper
 
 range_ = for_
 
-IMAGE_WIDTH = 32
+IMAGE_WIDTH = 48
 IMAGE_HEIGHT = 16
-IMAGE_SIZE = [IMAGE_WIDTH, IMAGE_HEIGHT]
+IMAGE_SIZE = [IMAGE_HEIGHT, IMAGE_WIDTH]
 
 TILE_WIDTH = 16
 TILE_HEIGHT = 8
-TILE_SIZE = [TILE_WIDTH, TILE_HEIGHT]
+TILE_SIZE = [TILE_HEIGHT, TILE_WIDTH]
 
-assert IMAGE_WIDTH % TILE_WIDTH == 0
 assert IMAGE_HEIGHT % TILE_HEIGHT == 0
+assert IMAGE_WIDTH % TILE_WIDTH == 0
+
+INOUT_DATATYPE = np.int32
 
 
 @module_builder
 def build_module():
-    memrefTyInOut = MemRefType.get(IMAGE_SIZE, T.i32())
+    xrt_dtype = type_mapper(INOUT_DATATYPE)
+    memrefTyInOut = MemRefType.get(IMAGE_SIZE, xrt_dtype)
 
     # Create an input/output channel pair per worker
-    ChannelOp("ChanIn", size=[IMAGE_WIDTH // TILE_WIDTH, IMAGE_HEIGHT // TILE_HEIGHT])
-    ChannelOp("ChanOut", size=[IMAGE_WIDTH // TILE_WIDTH, IMAGE_HEIGHT // TILE_HEIGHT])
+    ChannelOp("ChanIn", size=[IMAGE_HEIGHT // TILE_HEIGHT, IMAGE_WIDTH // TILE_WIDTH])
+    ChannelOp("ChanOut", size=[IMAGE_HEIGHT // TILE_HEIGHT, IMAGE_WIDTH // TILE_WIDTH])
 
     # We will send an image worth of data in and out
     @FuncOp.from_py_func(memrefTyInOut, memrefTyInOut)
@@ -40,32 +46,32 @@ def launch_body(a, b):
             # Transfer one tile of data per worker
             for h in range(IMAGE_HEIGHT // TILE_HEIGHT):
                 for w in range(IMAGE_WIDTH // TILE_WIDTH):
-                    offset0 = IMAGE_HEIGHT * h
-                    offset1 = IMAGE_HEIGHT * w
+                    offset0 = TILE_HEIGHT * h
+                    offset1 = TILE_WIDTH * w
 
                     # Put data into the channel tile by tile
                     ChannelPut(
                         "ChanIn",
                         a,
-                        indices=[w, h],
+                        indices=[h, w],
                         offsets=[offset0, offset1],
-                        sizes=[TILE_HEIGHT, TILE_WIDTH],
+                        sizes=TILE_SIZE,
                         strides=[IMAGE_WIDTH, 1],
                     )
 
             # Transfer one tile of data per worker
             for h in range(IMAGE_HEIGHT // TILE_HEIGHT):
                 for w in range(IMAGE_WIDTH // TILE_WIDTH):
-                    offset0 = IMAGE_HEIGHT * h
-                    offset1 = IMAGE_HEIGHT * w
+                    offset0 = TILE_HEIGHT * h
+                    offset1 = TILE_WIDTH * w
 
                     # Write data back out to the channel tile by tile
                     ChannelGet(
                         "ChanOut",
                         b,
-                        indices=[w, h],
+                        indices=[h, w],
                         offsets=[offset0, offset1],
-                        sizes=[TILE_HEIGHT, TILE_WIDTH],
+                        sizes=TILE_SIZE,
                         strides=[IMAGE_WIDTH, 1],
                     )
 
@@ -75,7 +81,7 @@ def segment_body():
 
                 @herd(
                     name="xaddherd",
-                    sizes=[IMAGE_WIDTH // TILE_WIDTH, IMAGE_HEIGHT // TILE_HEIGHT],
+                    sizes=[IMAGE_HEIGHT // TILE_HEIGHT, IMAGE_WIDTH // TILE_WIDTH],
                 )
                 def herd_body(th, tw, _sx, _sy):
 
@@ -85,7 +91,7 @@ def herd_body(th, tw, _sx, _sy):
                     # This is the type definition of the tile
                     tile_type = MemRefType.get(
                         shape=TILE_SIZE,
-                        element_type=T.i32(),
+                        element_type=xrt_dtype,
                         memory_space=mem_space,
                     )
 
@@ -94,11 +100,11 @@ def herd_body(th, tw, _sx, _sy):
                     tile_out = AllocOp(tile_type, [], [])
 
                     # Copy a tile from the input image (a) into the L1 memory region (tile_in)
-                    ChannelGet("ChanIn", tile_in, indices=[tw, th])
+                    ChannelGet("ChanIn", tile_in, indices=[th, tw])
 
                     # Access every value in the tile
-                    for j in range_(TILE_HEIGHT):
-                        for i in range_(TILE_WIDTH):
+                    for i in range_(TILE_HEIGHT):
+                        for j in range_(TILE_WIDTH):
                             # Load the input value from tile_in
                             val = load(tile_in, [i, j])
 
@@ -108,13 +114,46 @@ def herd_body(th, tw, _sx, _sy):
                         yield_([])
 
                     # Copy the output tile into the output
-                    ChannelPut("ChanOut", tile_out, indices=[tw, th])
+                    ChannelPut("ChanOut", tile_out, indices=[th, tw])
 
                     # Deallocate our L1 buffers
                     DeallocOp(tile_in)
                     DeallocOp(tile_out)
 
 
 if __name__ == "__main__":
-    module = build_module()
-    print(module)
+    parser = argparse.ArgumentParser(
+        prog="channel_size.py",
+        description="Builds, runs, and tests the channel_size example",
+    )
+    parser.add_argument(
+        "-v",
+        "--verbose",
+        action="store_true",
+    )
+    parser.add_argument(
+        "-p",
+        "--print-module-only",
+        action="store_true",
+    )
+    args = parser.parse_args()
+
+    mlir_module = build_module()
+    if args.print_module_only:
+        print(mlir_module)
+        exit(0)
+
+    input_matrix = np.random.randint(
+        low=np.iinfo(INOUT_DATATYPE).min,
+        high=np.iinfo(INOUT_DATATYPE).max,
+        size=IMAGE_SIZE,
+        dtype=INOUT_DATATYPE,
+    )
+    output_matrix = input_matrix.copy()
+
+    runner = XRTRunner(verbose=args.verbose, experimental_passes=True)
+    exit(
+        runner.run_test(
+            mlir_module, inputs=[input_matrix], expected_outputs=[output_matrix]
+        )
+    )
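
The indexing changes in the `channel_size.py` hunks above (row-major `[h, w]` indices and tile-based offsets) can be sanity-checked without hardware. What follows is a minimal pure-Python sketch, not the mlir-air API: `make_bundle`, `read_tile`, `write_tile`, `passthrough`, and `max_extent` are hypothetical helpers introduced only for illustration. Each channel in the bundle is modelled as a FIFO keyed by `(h, w)`, on the assumption that a put and a get with the same indices are paired. The constants are the post-change values from the diff.

```python
from collections import deque

# Post-change constants from channel_size.py
IMAGE_HEIGHT, IMAGE_WIDTH = 16, 48
TILE_HEIGHT, TILE_WIDTH = 8, 16
ROWS = IMAGE_HEIGHT // TILE_HEIGHT  # bundle size along the height
COLS = IMAGE_WIDTH // TILE_WIDTH  # bundle size along the width


def make_bundle(rows, cols):
    """Hypothetical stand-in for a channel bundle: one FIFO per (h, w) index."""
    return {(h, w): deque() for h in range(rows) for w in range(cols)}


def read_tile(image, h, w):
    """Copy tile (h, w) out of a row-major image using the corrected offsets."""
    r0, c0 = TILE_HEIGHT * h, TILE_WIDTH * w
    return [row[c0 : c0 + TILE_WIDTH] for row in image[r0 : r0 + TILE_HEIGHT]]


def write_tile(image, h, w, tile):
    """Write a tile back into the image at the same offsets."""
    r0, c0 = TILE_HEIGHT * h, TILE_WIDTH * w
    for i, tile_row in enumerate(tile):
        image[r0 + i][c0 : c0 + TILE_WIDTH] = tile_row


def passthrough(a):
    """Model the launch -> herd -> launch data path of the example."""
    chan_in = make_bundle(ROWS, COLS)
    chan_out = make_bundle(ROWS, COLS)
    b = [[0] * IMAGE_WIDTH for _ in range(IMAGE_HEIGHT)]

    # Launch side: put one tile per worker into the input bundle
    for h in range(ROWS):
        for w in range(COLS):
            chan_in[(h, w)].append(read_tile(a, h, w))

    # Herd side: worker (th, tw) gets its tile, copies it, and puts it back
    for th in range(ROWS):
        for tw in range(COLS):
            tile_in = chan_in[(th, tw)].popleft()
            tile_out = [list(row) for row in tile_in]
            chan_out[(th, tw)].append(tile_out)

    # Launch side: get each tile from the output bundle and place it in b
    for h in range(ROWS):
        for w in range(COLS):
            write_tile(b, h, w, chan_out[(h, w)].popleft())
    return b


def max_extent(step0, step1):
    """Largest (row, column) bound touched when offsets are step0*h and step1*w."""
    return ((ROWS - 1) * step0 + TILE_HEIGHT, (COLS - 1) * step1 + TILE_WIDTH)


if __name__ == "__main__":
    a = [[r * IMAGE_WIDTH + c for c in range(IMAGE_WIDTH)] for r in range(IMAGE_HEIGHT)]
    assert passthrough(a) == a
    print(max_extent(TILE_HEIGHT, TILE_WIDTH))  # (16, 48): stays inside the image
    print(max_extent(IMAGE_HEIGHT, IMAGE_HEIGHT))  # (24, 48): old offsets overrun the rows
```

In this model, the removed offsets (`IMAGE_HEIGHT * h`, `IMAGE_HEIGHT * w`) reach row 24 of a 16-row image, while the tile-based offsets cover the image exactly once.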