Update readme
zhanglx13 committed Aug 18, 2024
1 parent cba3d19 commit c550c5b
Showing 1 changed file with 17 additions and 1 deletion.
18 changes: 17 additions & 1 deletion python/perf-kernels/tune_gemm/README.md
@@ -3,6 +3,7 @@
## matmul kernel

The matmul kernel implementation can be found as [matmul_kernel.py](https://github.com/ROCm/triton/blob/main_perf/python/perf-kernels/tune_gemm/matmul_kernel.py), which includes the following features:
- XCD-based pid remapping
- grouping order of workgroup ids, controlled by `GROUP_SIZE_M`, which
implements the L2 cache optimization introduced in the [tutorial](https://triton-lang.org/main/getting-started/tutorials/03-matrix-multiplication.html#l2-cache-optimizations).
- split-k algorithm, which is controlled by `SPLIT_K`.
@@ -144,7 +145,7 @@ The default value is 1000.

The general idea of the tuning script can be summarized as
- Compile all the kernels in the tuning space in parallel.
- Divide the tuning space into tasks and invoke `rocprofv2` once per
- Divide the tuning space into tasks and invoke `rocprof` once per
task. This reduces the invocation overhead of the profiler.
- Profile tasks in parallel on multiple GPUs.
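
The task-splitting step can be sketched as follows. This is only an illustration of the idea, not the tuning script's actual code: the helper `split_into_tasks`, the round-robin GPU assignment, and the placeholder configs are all assumptions.

```python
def split_into_tasks(configs, num_tasks):
    """Split configs into at most num_tasks contiguous chunks of near-equal size.

    Hypothetical helper: each chunk would be profiled with a single
    profiler invocation, so the launch cost is paid once per chunk.
    """
    num_tasks = max(1, min(num_tasks, len(configs)))
    base, extra = divmod(len(configs), num_tasks)
    tasks, start = [], 0
    for i in range(num_tasks):
        # The first `extra` chunks take one additional config each.
        size = base + (1 if i < extra else 0)
        tasks.append(configs[start:start + size])
        start += size
    return tasks


# Example: 10 placeholder configs, 4 tasks, 2 GPUs (assigned round-robin).
tasks = split_into_tasks(list(range(10)), num_tasks=4)
gpu_of_task = [i % 2 for i in range(len(tasks))]
```

With more configs per task, fewer profiler launches are needed; with more tasks, the work spreads more evenly across GPUs.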

@@ -309,6 +310,21 @@ places:
- Statically set `device` and `stream` in the [jit.py](https://github.com/triton-lang/triton/blob/fd691c67ac20958a67693358186d877790f5f48f/python/triton/runtime/jit.py#L588-L589)


# GEMM Tuning Script v3.4

## API changes

No API changes

## Implementation changes

- The `matmul_kernel` now supports XCD-based pid remapping. Details and experiments
will be added later.
- Switched back to rocprofv1. Check [ticket#228](https://github.com/ROCm/triton-internal/issues/228) for more details.
- Improved the post-processing logic to filter out the "spikes" in the profiling results.
- Reduced the number of iterations in tuning mode and benchmark mode (to 120 and 200, respectively).
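
The README does not describe how the "spikes" are filtered. One plausible approach, shown here purely as an assumption and not as the script's actual logic, is to drop samples far above the median before averaging. The function name `filter_spikes` and the `tolerance` threshold are hypothetical.

```python
from statistics import median


def filter_spikes(times_us, tolerance=1.5):
    """Average kernel times after dropping samples above tolerance * median.

    Hypothetical post-processing step: the real tuning script may use a
    different rule to detect spikes.
    """
    med = median(times_us)
    kept = [t for t in times_us if t <= tolerance * med]
    return sum(kept) / len(kept)


# Example: one outlier among otherwise stable measurements.
avg = filter_spikes([10.0, 10.0, 10.0, 50.0])
```

A median-based threshold is used in this sketch because a single large outlier shifts the median far less than it shifts the mean.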


# One config running script

`one_config.py` is a script that runs one given matmul config.