XRTRunner Utility Class & Programming Examples Cleanup #673

Merged
merged 37 commits into from
Jul 31, 2024
2b24c89
Introduce XRTRunner
hunhoffe Jul 19, 2024
50e020a
Fixup a few bugs
hunhoffe Jul 19, 2024
155fc27
migrate passthrough_channel to xrtrunner
hunhoffe Jul 19, 2024
298c0bc
migrate passthrough_kernel to use xrtrunner
hunhoffe Jul 19, 2024
7d3f247
Fix up passthrough kernel structure and comments
hunhoffe Jul 19, 2024
d51a2cb
Merge branch 'main' into xrt-runner-utility
hunhoffe Jul 19, 2024
d13e14f
clean up other passthrough examples
hunhoffe Jul 19, 2024
273ffa9
Continue cleaning up passthrough example
hunhoffe Jul 19, 2024
5784a40
Start to clean up matrix scalar add
hunhoffe Jul 20, 2024
c1d8893
Continue fixing up matrix scalar add example
hunhoffe Jul 20, 2024
41cdb31
Get multi-launch example ported to new format
hunhoffe Jul 20, 2024
9c88fd3
Clean up multi-core-dma; use tile_size where appropriate
hunhoffe Jul 20, 2024
569a2ac
fixup multi core channel
hunhoffe Jul 20, 2024
0585149
Fixing up multi launch example, not working currently
hunhoffe Jul 20, 2024
5c522d8
Merge branch 'main' into xrt-runner-utility
hunhoffe Jul 24, 2024
69517a0
Rewrite multi-launch channel in a way that makes more sense; it still…
hunhoffe Jul 29, 2024
e48465e
Mark multi launch test to fail
hunhoffe Jul 29, 2024
a6ae88b
Clean up shim dma 2d example
hunhoffe Jul 29, 2024
14cd7fc
Clean up segment alloc example
hunhoffe Jul 29, 2024
59935cc
Merge branch 'main' into xrt-runner-utility
hunhoffe Jul 29, 2024
5091c24
Clean up segment_alloc code a little bit more
hunhoffe Jul 29, 2024
505b5c1
update multi-segment dma example
hunhoffe Jul 29, 2024
609709b
finish cleaning up multi segment
hunhoffe Jul 29, 2024
2c8c3dd
Test different data types with transpose DMA
hunhoffe Jul 29, 2024
8cdef30
add mapping for bfloat16
hunhoffe Jul 29, 2024
bb8d107
Clean up makefiles
hunhoffe Jul 29, 2024
8e3c4a7
Fix some bugs
hunhoffe Jul 29, 2024
ee31b76
Clean up channel herd_to_herd examples
hunhoffe Jul 29, 2024
9af812a
revert bad makefile changes
hunhoffe Jul 29, 2024
e3706bb
Fix up channel size example
hunhoffe Jul 29, 2024
17c1347
Fixup datatype mismatch between uint32 and int32 in programming examples
hunhoffe Jul 29, 2024
a69df68
Fix up hierarchical example
hunhoffe Jul 29, 2024
f5f9214
clean up worker to worker
hunhoffe Jul 29, 2024
959bb30
Merge branch 'main' into xrt-runner-utility
hunhoffe Jul 30, 2024
497a591
update programming example documentation
hunhoffe Jul 30, 2024
b48b8fc
Merge branch 'main' into xrt-runner-utility
hunhoffe Jul 30, 2024
982a33a
Merge branch 'main' into xrt-runner-utility
hunhoffe Jul 31, 2024
16 changes: 10 additions & 6 deletions programming_examples/README.md
@@ -4,26 +4,30 @@ These programming examples are provided so that application programmers can lear

## [2-Dimensional Shim DMA Passthrough](shim_dma_2d)

This example demonstrates how data may be moved using shim DMA operations. It also includes extra infrastructure that illustrates different ways to compile, build, run, and test programs written using the mlir-air python bindings.
This example demonstrates how data may be moved using shim DMA operations. It also includes extra infrastructure that illustrates different ways to compile, build, run, and test programs written using the mlir-air python bindings on an NPU.

## [Passthrough Examples](passthrough)

Three examples that copy data from the input to the output (a data passthrough). The data movement is done through either DMA or Channels, and there is a simple example of calling an external function which performs a vectorized memcopy.
This directory contains three examples that each copy data from the input to the output (a data passthrough). The data movement is done through either DMA or Channels, and there is a simple example of calling an external function which performs a vectorized memcopy.

## [Channel Examples](channel_examples)

This is a collection of simple examples that illustrate how to use channels.
This is a collection of simple examples that illustrate how to use *channels*. At a high level, channels are the abstraction for data movement in mlir-air. Some of the examples are experimental works-in-progress.

## [Matrix Scalar Addition](matrix_scalar_add)

This example provides logic to divide in input 2D matrix into *tiles* of data, and add a value to every element in every tile. It includes some description of the fundamental concepts of mlir-air, including *launches*, *herds*, and *channels*.
This example provides logic to divide an input 2D matrix into *tiles* of data, and add a value to every element in every tile. It includes some description of the fundamental concepts of mlir-air, including *launches*, *herds*, and *channels*. There are five different implementations of this example, some of which are experimental (and are currently works-in-progress).
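
As a rough illustration of the tiling idea described above, the following is a minimal pure-Python sketch, not the mlir-air implementation. The function name `add_scalar_by_tile` and the 4x4 test matrix are hypothetical and chosen only for this sketch; the real examples express the same pattern with launches, herds, and channels.

```python
def add_scalar_by_tile(matrix, tile_height, tile_width, value):
    """Return a copy of `matrix` with `value` added, visiting one tile at a time."""
    height, width = len(matrix), len(matrix[0])
    # The matrix must divide evenly into tiles, as the examples also assert.
    assert height % tile_height == 0 and width % tile_width == 0

    result = [row[:] for row in matrix]
    for tile_row in range(height // tile_height):
        for tile_col in range(width // tile_width):
            # Top-left corner of the current tile.
            row0 = tile_row * tile_height
            col0 = tile_col * tile_width
            # Add the value to every element inside this tile.
            for i in range(row0, row0 + tile_height):
                for j in range(col0, col0 + tile_width):
                    result[i][j] = matrix[i][j] + value
    return result


image = [[r * 4 + c for c in range(4)] for r in range(4)]
tiled = add_scalar_by_tile(image, tile_height=2, tile_width=2, value=10)
```

In the hardware examples each tile would be handled by a separate worker; here the tiles are simply visited in sequence.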

## [Data Transfer Transpose](data_transfer_transpose)

Transposes a matrix using either Channels or `dma_memcpy_nd`.
Transposes a matrix using either air channels or `dma_memcpy_nd`.
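
One common way to express a transpose as a data-movement pattern is to read a row-major buffer with swapped sizes and strides. The sketch below models that idea in plain Python; `strided_read` is a hypothetical helper written for this illustration, and whether the example in this directory uses exactly this access pattern is an assumption.

```python
def strided_read(flat, offset, sizes, strides):
    """Gather elements from a flat buffer using 2D sizes and strides (in elements)."""
    return [
        flat[offset + i * strides[0] + j * strides[1]]
        for i in range(sizes[0])
        for j in range(sizes[1])
    ]


ROWS, COLS = 2, 3
flat_in = list(range(ROWS * COLS))  # row-major 2x3 matrix: [[0, 1, 2], [3, 4, 5]]

# Read COLS rows of ROWS elements each: step by 1 in the outer dimension
# and by COLS in the inner dimension, which walks down the input columns.
flat_out = strided_read(flat_in, offset=0, sizes=[COLS, ROWS], strides=[1, COLS])
```

Here `flat_out` holds the 3x2 transpose in row-major order.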

## [Segment Alloc](segment_alloc)

While *workers* (compute units managed as part of a *herd*) are able to allocate L1 memory, they are not able to allocate L2 memory. This must be done in the *segment*. This example shows how a segment can allocate L2 memory which is then accessed within the herd.

## [WIP: Multi-Segment Examples](multi_segment)

This is a collection of simple examples that illustrate how to use multiple segments.
This is a collection of simple examples that illustrate how to use multiple segments.

Warning: This example is a work-in-progress.
58 changes: 26 additions & 32 deletions programming_examples/channel_examples/README.md
@@ -1,67 +1,61 @@
# Channel Examples

This example focuses on one of the key abstractions of air: *channels*. This is a collection of examples that use channels in various ways. The patterns shown here may be used to create more complex examples.
This collection of examples focuses on one of the key abstractions of air: *channels*. The patterns shown here may be used to create more complex examples.

## Running and Testing

#### ```herd-to-herd```: Using a channel to pass data between herds.

There are two parts of this example: two herds within one segment (single segment), and one herd per segment for two segments (multi-segment)
There are two parts of this example: two herds within one segment (single segment), and one herd per segment for two segments (multi-segment).

The single segment example ([herd_to_herd/single_segment/herd_to_herd.py](herd_to_herd/single_segment/herd_to_herd.py)) defines two `herd`s within the same `launch` + `segment`. There is a *producer herd*, which writes data to a `Herd2Herd` channel, and a *consumer herd*, which reads data from the `Herd2Herd` channel.

```bash
cd herd_to_herd/single_segment
make clean && make
```
The single segment example ([herd_to_herd/single_segment/herd_to_herd.py](herd_to_herd/single_segment/herd_to_herd.py)) defines two *herds* within the same *launch* and *segment*. There is a *producer herd*, which writes data to a `Herd2Herd` channel, and a *consumer herd*, which reads data from the `Herd2Herd` channel.
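
The producer/consumer relationship can be pictured with an ordinary FIFO queue. This is only a conceptual sketch in plain Python: the `deque` stands in for the `Herd2Herd` channel, the `producer` and `consumer` functions are hypothetical, and FIFO ordering is an assumption of this model rather than a statement about the hardware.

```python
from collections import deque

# Stand-in for the Herd2Herd channel (assumed FIFO for this sketch).
herd2herd = deque()


def producer(data):
    """Producer herd: write each item into the channel."""
    for item in data:
        herd2herd.append(item)


def consumer(count):
    """Consumer herd: read `count` items back out of the channel."""
    return [herd2herd.popleft() for _ in range(count)]


producer([1, 2, 3])
received = consumer(3)
```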

The multi-segment example ([herd_to_herd/multi_segment/herd_to_herd.py](herd_to_herd/multi_segment/herd_to_herd.py)) defines two `segment`s, each with one `herd`, within the same `launch`. There is a *producer_segment* with a *producer herd*, which writes data to a `Herd2Herd` channel, and a *consumer_segment* with a *consumer herd*, which reads data from the `Herd2Herd` channel.

Warning: The multi-segment example is a work in progress!

```bash
cd herd_to_herd/multi_segment
make clean && make
```

#### ```channel-size```: Use the channel size argument

This example ([channel_size/channel_size.py](channel_size/channel_size.py)) is a data passthrough example using the same tiling structure as the [matrix_scalar_add/multi_core_channel](../matrix_scalar_add/multi_core_channel.py) examples, only instead of using a separately defined channel for each tile/core, a bundle of channels is created (using the `ChannelOp` `size` parameter) and indexed into (the `ChannelGet` and `ChannelPut` `indices` parameter).
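
To make the "bundle of channels" idea concrete, here is a small plain-Python model, not the mlir-air API. The helpers `make_channel_bundle`, `channel_put`, and `channel_get` are hypothetical names for this sketch; they mimic a channel created with a `size` and accessed through `indices`.

```python
from collections import deque


def make_channel_bundle(size):
    """Create one FIFO per (h, w) index in a 2D bundle of the given size."""
    return {(h, w): deque() for h in range(size[0]) for w in range(size[1])}


def channel_put(bundle, indices, tile):
    """Put a tile into the channel selected by `indices`."""
    bundle[tuple(indices)].append(tile)


def channel_get(bundle, indices):
    """Get the next tile from the channel selected by `indices`."""
    return bundle[tuple(indices)].popleft()


# One channel per worker in a 2x3 grid, instead of six separately named channels.
chan_in = make_channel_bundle([2, 3])
for h in range(2):
    for w in range(3):
        channel_put(chan_in, [h, w], f"tile({h},{w})")

# The worker at position (1, 2) reads only from its own channel.
tile_for_worker = channel_get(chan_in, [1, 2])
```

The put and get sides must agree on the index order; the diff in this PR switches both sides from `[w, h]` to `[h, w]` together.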

```bash
cd channel_size
make clean && make
```

#### ```hierarchical```: Use channels for sending data from Launch to Segment to Herd and back again

This example ([hierarchical/hierarchical.py](hierarchical/hierarchical.py)) is a data passthrough example that uses a channel to send data from Launch to Segment (L3->L2 memory) and then from Segment to Herd (L2->L1 memory). The data is then sent back on an analogous path.

```bash
cd hierarchical
make clean && make
```

#### WIP: ```worker-to-self```:

This example ([worker_to_self/worker_to_self.py](worker_to_self/worker_to_self.py)) is a work-in-progress data passthrough example using the same tiling structure as the [matrix_scalar_add/multi_core_channel](../matrix_scalar_add/multi_core_channel.py) examples, only the sole worker in the herd does some extra shuffling between input and output by putting the current data tile into a channel and then getting it from the same channel.

WARNING: This example currently fails because it is assumed channel gets/puts are not from the same memory region, and this example breaks this assumption.

```bash
cd worker_to_self
make clean && make
```
WARNING: This example currently fails for unknown reasons.

#### WIP: ```worker-to-worker```:

This example ([worker_to_worker/worker_to_worker.py](worker_to_worker/worker_to_worker.py)) is a work-in-progress data passthrough example using the same tiling structure as the [matrix_scalar_add/multi_core_channel](../matrix_scalar_add/multi_core_channel.py) examples, only each worker trades a tile of input data to another worker in the herd by sending it via channel.

WARNING: This example currently fails for unknown reasons.

#### Usage (For All Examples)

To generate AIR MLIR from Python:
```bash
cd worker_to_worker
cd <example_dir>
make clean && make print
```

To run:
```bash
cd <example_dir>
make clean && make
``
```

#### WIP: more examples!
To run with verbose output:
```bash
cd <example_dir>
python <example_file>.py -v
```

You may be able to configure examples (data types, sizes); to get additional usage information, run:
```bash
cd <example_dir>
python <example_file>.py -h
```
13 changes: 9 additions & 4 deletions programming_examples/channel_examples/channel_size/Makefile
@@ -1,12 +1,17 @@
# Copyright (C) 2022, Advanced Micro Devices, Inc.
# (c) Copyright 2024 Advanced Micro Devices, Inc.
# SPDX-License-Identifier: MIT
srcdir := $(shell dirname $(realpath $(firstword $(MAKEFILE_LIST))))

targetname := $(shell basename ${srcdir})

all: run

print:
	${powershell} python3 ${srcdir}/channel_size.py -p

run:
	mkdir -p build
	cd build && ${powershell} python3 ${srcdir}/run.py
	mkdir -p ${srcdir}/build
	cd ${srcdir}/build && ${powershell} python3 ${srcdir}/channel_size.py

clean:
	rm -rf build __pycache__
	rm -rf ${srcdir}/build ${srcdir}/__pycache__
@@ -1,33 +1,39 @@
# Copyright (C) 2024, Advanced Micro Devices, Inc.
# SPDX-License-Identifier: MIT
import argparse
import numpy as np

from air.ir import *
from air.dialects.air import *
from air.dialects.memref import AllocOp, DeallocOp, load, store
from air.dialects.func import FuncOp
from air.dialects.scf import for_, yield_
from air.backend.xrt_runner import XRTRunner, type_mapper

range_ = for_

IMAGE_WIDTH = 32
IMAGE_WIDTH = 48
IMAGE_HEIGHT = 16
IMAGE_SIZE = [IMAGE_WIDTH, IMAGE_HEIGHT]
IMAGE_SIZE = [IMAGE_HEIGHT, IMAGE_WIDTH]

TILE_WIDTH = 16
TILE_HEIGHT = 8
TILE_SIZE = [TILE_WIDTH, TILE_HEIGHT]
TILE_SIZE = [TILE_HEIGHT, TILE_WIDTH]

assert IMAGE_WIDTH % TILE_WIDTH == 0
assert IMAGE_HEIGHT % TILE_HEIGHT == 0
assert IMAGE_WIDTH % TILE_WIDTH == 0

INOUT_DATATYPE = np.int32


@module_builder
def build_module():
    memrefTyInOut = MemRefType.get(IMAGE_SIZE, T.i32())
    xrt_dtype = type_mapper(INOUT_DATATYPE)
    memrefTyInOut = MemRefType.get(IMAGE_SIZE, xrt_dtype)

    # Create an input/output channel pair per worker
    ChannelOp("ChanIn", size=[IMAGE_WIDTH // TILE_WIDTH, IMAGE_HEIGHT // TILE_HEIGHT])
    ChannelOp("ChanOut", size=[IMAGE_WIDTH // TILE_WIDTH, IMAGE_HEIGHT // TILE_HEIGHT])
    ChannelOp("ChanIn", size=[IMAGE_HEIGHT // TILE_HEIGHT, IMAGE_WIDTH // TILE_WIDTH])
    ChannelOp("ChanOut", size=[IMAGE_HEIGHT // TILE_HEIGHT, IMAGE_WIDTH // TILE_WIDTH])

    # We will send an image worth of data in and out
    @FuncOp.from_py_func(memrefTyInOut, memrefTyInOut)
@@ -40,32 +46,32 @@ def launch_body(a, b):
            # Transfer one tile of data per worker
            for h in range(IMAGE_HEIGHT // TILE_HEIGHT):
                for w in range(IMAGE_WIDTH // TILE_WIDTH):
                    offset0 = IMAGE_HEIGHT * h
                    offset1 = IMAGE_HEIGHT * w
                    offset0 = TILE_HEIGHT * h
                    offset1 = TILE_WIDTH * w

                    # Put data into the channel tile by tile
                    ChannelPut(
                        "ChanIn",
                        a,
                        indices=[w, h],
                        indices=[h, w],
                        offsets=[offset0, offset1],
                        sizes=[TILE_HEIGHT, TILE_WIDTH],
                        sizes=TILE_SIZE,
                        strides=[IMAGE_WIDTH, 1],
                    )

            # Transfer one tile of data per worker
            for h in range(IMAGE_HEIGHT // TILE_HEIGHT):
                for w in range(IMAGE_WIDTH // TILE_WIDTH):
                    offset0 = IMAGE_HEIGHT * h
                    offset1 = IMAGE_HEIGHT * w
                    offset0 = TILE_HEIGHT * h
                    offset1 = TILE_WIDTH * w

                    # Write data back out to the channel tile by tile
                    ChannelGet(
                        "ChanOut",
                        b,
                        indices=[w, h],
                        indices=[h, w],
                        offsets=[offset0, offset1],
                        sizes=[TILE_HEIGHT, TILE_WIDTH],
                        sizes=TILE_SIZE,
                        strides=[IMAGE_WIDTH, 1],
                    )

@@ -75,7 +81,7 @@ def segment_body():

                @herd(
                    name="xaddherd",
                    sizes=[IMAGE_WIDTH // TILE_WIDTH, IMAGE_HEIGHT // TILE_HEIGHT],
                    sizes=[IMAGE_HEIGHT // TILE_HEIGHT, IMAGE_WIDTH // TILE_WIDTH],
                )
                def herd_body(th, tw, _sx, _sy):

@@ -85,7 +91,7 @@ def herd_body(th, tw, _sx, _sy):
                    # This is the type definition of the tile
                    tile_type = MemRefType.get(
                        shape=TILE_SIZE,
                        element_type=T.i32(),
                        element_type=xrt_dtype,
                        memory_space=mem_space,
                    )

@@ -94,11 +100,11 @@ def herd_body(th, tw, _sx, _sy):
                    tile_out = AllocOp(tile_type, [], [])

                    # Copy a tile from the input image (a) into the L1 memory region (tile_in)
                    ChannelGet("ChanIn", tile_in, indices=[tw, th])
                    ChannelGet("ChanIn", tile_in, indices=[th, tw])

                    # Access every value in the tile
                    for j in range_(TILE_HEIGHT):
                        for i in range_(TILE_WIDTH):
                    for i in range_(TILE_HEIGHT):
                        for j in range_(TILE_WIDTH):
                            # Load the input value from tile_in
                            val = load(tile_in, [i, j])

@@ -108,13 +114,46 @@ def herd_body(th, tw, _sx, _sy):
                        yield_([])

                    # Copy the output tile into the output
                    ChannelPut("ChanOut", tile_out, indices=[tw, th])
                    ChannelPut("ChanOut", tile_out, indices=[th, tw])

                    # Deallocate our L1 buffers
                    DeallocOp(tile_in)
                    DeallocOp(tile_out)


if __name__ == "__main__":
    module = build_module()
    print(module)
    parser = argparse.ArgumentParser(
        prog="run.py",
        description="Builds, runs, and tests the channel_size example",
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
    )
    parser.add_argument(
        "-p",
        "--print-module-only",
        action="store_true",
    )
    args = parser.parse_args()

    mlir_module = build_module()
    if args.print_module_only:
        print(mlir_module)
        exit(0)

    input_matrix = np.random.randint(
        low=np.iinfo(INOUT_DATATYPE).min,
        high=np.iinfo(INOUT_DATATYPE).max,
        size=IMAGE_SIZE,
        dtype=INOUT_DATATYPE,
    )
    output_matrix = input_matrix.copy()

    runner = XRTRunner(verbose=args.verbose, experimental_passes=True)
    exit(
        runner.run_test(
            mlir_module, inputs=[input_matrix], expected_outputs=[output_matrix]
        )
    )
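
The offset change in this diff (from `IMAGE_HEIGHT * h` / `IMAGE_HEIGHT * w` to `TILE_HEIGHT * h` / `TILE_WIDTH * w`) can be sanity-checked with a small plain-Python model. This sketch assumes that each offset is multiplied by the stride of its dimension to form a flat row-major index; that interpretation, along with the helper names `tile_elements` and `covered`, is an assumption made for illustration and is not taken from the mlir-air sources.

```python
IMAGE_HEIGHT, IMAGE_WIDTH = 16, 48
TILE_HEIGHT, TILE_WIDTH = 8, 16


def tile_elements(offsets, sizes, strides):
    """Flat row-major element indices touched by one tile access (assumed model)."""
    base = offsets[0] * strides[0] + offsets[1] * strides[1]
    return [
        base + i * strides[0] + j * strides[1]
        for i in range(sizes[0])
        for j in range(sizes[1])
    ]


def covered(row_step, col_step):
    """Collect indices touched when tile (h, w) starts at (row_step * h, col_step * w)."""
    touched = []
    for h in range(IMAGE_HEIGHT // TILE_HEIGHT):
        for w in range(IMAGE_WIDTH // TILE_WIDTH):
            touched += tile_elements(
                offsets=[row_step * h, col_step * w],
                sizes=[TILE_HEIGHT, TILE_WIDTH],
                strides=[IMAGE_WIDTH, 1],
            )
    return touched


# Offsets as written after this change: every element is visited exactly once.
new_indices = covered(TILE_HEIGHT, TILE_WIDTH)

# Offsets as written before this change, applied to the same image dimensions:
# the second tile row starts at row 16, which is past the end of a 16-row image.
old_indices = covered(IMAGE_HEIGHT, IMAGE_HEIGHT)
```

Under this model, the corrected offsets cover the whole image without overlap, while the earlier offsets index beyond the buffer.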