Copy *tune_gemm* from triton-mlir branch to main_perf branch
The source commit in `triton-mlir` branch is the following one:
```
commit 59d6be1
Author: xiaohuguo2023 <[email protected]>
Date:   Wed Jul 17 19:45:58 2024 +0100

    replace rocprof with rocprofv2 for the tune gemm script (#613)

    * replace rocprof with rocprofv2

    * remove comment

    * add link for rocprofv2 output header bug
```

*tune_gemm* was copied from the source branch directory `scripts/amd/gemm`
to the destination branch directory `python/perf-kernels/tune_gemm`.

The SHA-256 hashes of *tune_gemm* files are the following ones:
```
7beb6b677fa57f535f1341d682fd0d9f939d48db9c2695165a4ce5f9137013b2  README.md
46ab93978fee33f75df23332f12546dae7910478c391f08b7b1ebd415d8266b7  icache_flush.py
f18711544641b810a652e6a6629bfa2b613f6ade87399e88fdf05b81d4af58a4  matmul.py
898f72767fec91188ce2502976e52679f50c094b594c258be9f7437724e29d1a  matmul_kernel.py
2812b40183637bc8d7e47d283c7d66b1792134a43de76f3eacf7b9b3e1c2431a  one_config.py
0ac09c33b0173cea06ddabbf9f4e3afa1816781dea4fdcce5894a7e7d6a80e19  rocprof_gemm.py
23cfd7259773fc07e007ee53842ea75391abca0b904d09ff1bd5f3ec2ca41d5d  tune_gemm.py
e787f35d750b869f113b3c01692f64243a9cb8a71a18ade2f0465f614f7284e4  tune_gemm.sh
```

The files were kept as-is, even though `pre-commit` wanted to modify them.
brunomazzottiamd committed Jul 19, 2024
1 parent dbe1173 commit c492a1c
Showing 8 changed files with 2,056 additions and 0 deletions.
161 changes: 161 additions & 0 deletions python/perf-kernels/tune_gemm/README.md
@@ -0,0 +1,161 @@
# GEMM tuning script v2

This is v2 of the GEMM tuning script. It is based on @scxiao's v1 (https://github.com/ROCmSoftwarePlatform/triton/pull/309) and @alefimov-amd's thread pool work (https://github.com/ROCmSoftwarePlatform/triton/pull/310).

### Main features
- `rocprof` is used to measure the time for kernels in the full tuning space
- Each kernel is executed 10 times and the execution time of the last instance is used
- All kernels are compiled in parallel
- Two modes for correctness checking
    - During tuning, check correctness with the best perf_config for the current gemm size.
    - Without tuning, check correctness based on the tuning results, which include the best perf_config for each gemm size.
- The process takes about 30 - 40 minutes for the full tuning space with ~15000 configs
- Limitations
    - For now, only fp16 inputs are supported. Extending to other types should be trivial, but mixed input types may require some work.

### Usage
Go to the script dir
```bash
cd triton/scripts/amd/gemm/
```

1. Tune gemm sizes given in a yaml file and check correctness on the way
```bash
python tune_gemm.py --gemm_size_file input_gemm_sizes.yaml --compare
```

2. Tune a single gemm size
```bash
python tune_gemm.py -m 16 -n 16 -k 16
```

3. Choose the file to store tuning results
```bash
python tune_gemm.py --gemm_size_file input_gemm_sizes.yaml --o output_tuning.yaml
```

4. Only check correctness given the tuning results
```bash
python tune_gemm.py --gemm_size_file output_tuning.yaml --compare_wo_tuning
```
Note that the tuning results file is provided as the `gemm_size_file` in this scenario.

### Overview of implementations

Workflow of the tuning process
1. Generate the full tuning space. For now, the `range`s for each tuning parameter are hard-coded.
2. Prune the tuning space according to the current GEMM size and some rules (see the sketch after this list):
    - BLOCK_SIZE must be equal to or larger than the mfma instruction size.
    - SPLIT_K * BLOCK_SIZE_K must divide K. Therefore, we do not need EVEN_K in the kernel.
    - When split-k is not needed, i.e. when both M and N are large, SPLIT_K must be 1.
    - GROUP_M * BLOCK_SIZE_M must be smaller than M. Otherwise, GROUP_M must be 1.
    - When BLOCK_SIZE_K = 128, neither BLOCK_SIZE_M nor BLOCK_SIZE_N can be 128; otherwise too much LDS would be required. **Needs further investigation**
    - Skip BLOCK_SIZE_M or BLOCK_SIZE_N if it is more than 2 times larger than M or N, respectively.
3. Open a file `generated_kernel{M}-{N}-{K}-{gpuid}.py` and write the following into the file:
    1. For each config in the pruned space, generate a kernel named `matmul_kernel_{configStr}`, where `configStr` contains the gemm size and the tuning parameters.
    2. Generate a `matmul` function for each config in a similar way.
    3. Generate a `try_config` function for each `matmul` function.
    4. Generate `test_gemm`, which does the following:
        1. Add all `try_config` functions to the thread pool with `thread_pool.apply_async(try_config)`. This is used to compile all kernels in parallel.
        2. Call each `matmul` function in a for loop of 10 iterations.
    5. Generate the `main` function.
4. Run the generated script with 16 workers. This compiles all kernels in parallel.
5. Invoke `rocprof` on the generated script.
6. Post-process `results.csv` by extracting the execution time of the last instance of each kernel. Pick the best one, write it to a file, and return.
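
The pruning rules in step 2 can be read as a simple filter over candidate configs. Below is a minimal sketch of such a filter; the dictionary keys, the default mfma size, and the "both M and N are large" threshold are assumptions made for illustration, not the script's actual code:
```python
def prune_configs(M, N, K, configs, mfma_size=16, large_dim=1024):
    pruned = []
    for c in configs:
        BM, BN, BK = c["BLOCK_SIZE_M"], c["BLOCK_SIZE_N"], c["BLOCK_SIZE_K"]
        SPLIT_K, GROUP_M = c["SPLIT_K"], c["GROUP_M"]
        # Block sizes must be at least the mfma instruction size.
        if BM < mfma_size or BN < mfma_size:
            continue
        # SPLIT_K * BLOCK_SIZE_K must divide K, so EVEN_K is not needed.
        if K % (SPLIT_K * BK) != 0:
            continue
        # When both M and N are large, split-k is not needed: SPLIT_K must be 1.
        if M >= large_dim and N >= large_dim and SPLIT_K != 1:
            continue
        # GROUP_M * BLOCK_SIZE_M must be smaller than M; otherwise GROUP_M must be 1.
        if GROUP_M * BM >= M and GROUP_M != 1:
            continue
        # With BLOCK_SIZE_K = 128, 128-wide M/N tiles would need too much LDS.
        if BK == 128 and (BM == 128 or BN == 128):
            continue
        # Skip tiles more than 2x larger than the corresponding problem dimension.
        if BM > 2 * M or BN > 2 * N:
            continue
        pruned.append(c)
    return pruned
```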

# GEMM Tuning Script v3

### API changes

- Input and output data types can be provided as `-dtype_a`, `-dtype_b`, and `-dtype_c`.
The provided types must be one of ['fp32', 'fp16', 'bf16', 'fp8', 'bf8', 'int8'].
- Row/column major-ness of operands a and b can be provided as `-col_a` and `-col_b`.
If set, the corresponding operand is treated as column major.
The major-ness is considered part of the problem input, so it should be included in the input yaml file.
Note that in the yaml file, the user should set `rowMajorA` and `rowMajorB`, as shown in the example below.
- `--benchmark` is used to control whether the perf config in the input yaml file is used as the tuning space.
- `--jobs` is used to control the number of .py files generated for the kernels (see the sketch below).
Note that this can be different from `ngpus`; it usually means multiple kernel files
will be profiled on each GPU.
This is necessary to keep each file "small" in terms of execution time.
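
As a rough picture of how `--jobs` and `ngpus` interact, the hypothetical helper below splits the configs round-robin into `jobs` kernel files and assigns the files to the available GPUs. It is a sketch of the idea only, not the script's actual logic:
```python
def split_into_jobs(configs, jobs, gpu_ids):
    # One slice of the config list per generated .py file (round-robin).
    files = [configs[i::jobs] for i in range(jobs)]
    # Each file is profiled on one GPU; with jobs > len(gpu_ids),
    # several files end up on the same GPU.
    assignment = {file_id: gpu_ids[file_id % len(gpu_ids)] for file_id in range(jobs)}
    return files, assignment

# Example: 8 kernel files distributed over GPUs 4,5,6,7 -> 2 files per GPU.
files, assignment = split_into_jobs(list(range(1000)), jobs=8, gpu_ids=[4, 5, 6, 7])
```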

### Implementation changes
- `gen_input` is used to generate matmul inputs.
- Time measurement
    - In benchmark mode, the kernel is executed 1000 times.
    - In tuning mode, each kernel is executed 200 times. We cannot afford larger runs since rocprof hangs if the session takes too long.
    - In both tuning and benchmark modes, the kernel time is measured as the average execution time of the last 100 instances (see the sketch after this list).
- Added error recovery. This helps when rocprof crashes in multi-processing mode.
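
The measured kernel time therefore boils down to averaging the tail of the per-instance execution times. A minimal sketch, assuming the per-instance durations have already been extracted from the profiler output:
```python
def measured_kernel_time(durations_us, tail=100):
    # Earlier instances act as warm-up; only the last `tail` instances count.
    tail_samples = durations_us[-tail:]
    return sum(tail_samples) / len(tail_samples)
```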


### Example Usage

Let's say we have an input yaml file, named `gemm_input.yaml`, that contains the following configs
```yaml
- {'M': 4864, 'N': 4096, 'K': 8192, 'rowMajorA': 'T', 'rowMajorB': 'N'}
- {'M': 8192, 'N': 8192, 'K': 8192, 'rowMajorA': 'T', 'rowMajorB': 'N'}
```
1. Tune with bf8 input types on GPUs 4,5,6,7 and save the output to `output.yaml`
```bash
python ./tune_gemm.py --gemm_size_file gemm_input.yaml -dtype_a bf8 -dtype_b bf8 --gpu_ids 4,5,6,7 --o output.yaml
```

2. Check the correctness of the tuned configs
```bash
python ./tune_gemm.py --gemm_size_file output.yaml -dtype_a bf8 -dtype_b bf8 --compare_wo_tuning
```

3. Run benchmark of the tuned configs
```bash
python ./tune_gemm.py --gemm_size_file output.yaml -dtype_a bf8 -dtype_b bf8 --benchmark
```

A sample output from `benchmark` looks like
```bash
Benchmarking gemm with bf8 inputs (peak tflops: 1298)
trans M N K TFLOPS Efficiency
NT 4864 4096 8192 841.22 65%
NT 8192 8192 8192 745.31 57%
```
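
The TFLOPS and Efficiency columns follow from the GEMM flop count and the peak throughput. A minimal sketch of the arithmetic (the peak value is taken from the sample output above):
```python
def gemm_tflops(M, N, K, time_us):
    # A GEMM performs 2 * M * N * K floating-point operations.
    return 2.0 * M * N * K / (time_us * 1e-6) / 1e12

def efficiency(tflops, peak_tflops=1298):
    # e.g. 841.22 / 1298 ~= 0.65 -> reported as 65%
    return tflops / peak_tflops
```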

# GEMM Tuning Script v3.1

### API changes

- Added `matrix_instr_nonkdim` into the tuning space. Now we can tune mfma instruction size.


# One config running script

`one_config.py` is a script that runs one given matmul config.
It is an interface to the `tune_gemm.py` functionality and can be used for Triton debugging.

### Usage

This script supports two methods to specify configuration parameters.

Variant 1: separate command-line arguments.

```bash
python one_config.py -m 256 -n 256 -k 256 --block_m 64 --block_n 64 --block_k 64 --group_m 1 --split_k 2 --num_warps 2 --num_stages 0 --waves_per_eu 0 --matrix_instr_nonkdim 16 --kpack 2
```

Variant 2: one-line config description.
This is the format in which configs are printed by the `tune_gemm.py` script:

```bash
python one_config.py --config_str M16_N8_K128_BM64_BN64_BK64_GM1_SK2_nW2_nS0_EU0_kP2_mfma16
```
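
The config string packs each tuning parameter behind a short prefix. A minimal parsing sketch is shown below; the prefix-to-parameter mapping is inferred from the example above and from the flags in Variant 1, so treat it as an assumption rather than the script's own parser:
```python
import re

# Prefixes inferred from "M16_N8_K128_BM64_BN64_BK64_GM1_SK2_nW2_nS0_EU0_kP2_mfma16".
_FIELDS = {
    "M": "m", "N": "n", "K": "k",
    "BM": "block_m", "BN": "block_n", "BK": "block_k",
    "GM": "group_m", "SK": "split_k", "nW": "num_warps", "nS": "num_stages",
    "EU": "waves_per_eu", "kP": "kpack", "mfma": "matrix_instr_nonkdim",
}

def parse_config_str(config_str):
    params = {}
    for token in config_str.split("_"):
        prefix, value = re.match(r"([A-Za-z]+)(\d+)$", token).groups()
        params[_FIELDS[prefix]] = int(value)
    return params

print(parse_config_str("M16_N8_K128_BM64_BN64_BK64_GM1_SK2_nW2_nS0_EU0_kP2_mfma16"))
```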

# GEMM Tuning Script v3.2

### API changes

- Added `--rotating_tensor <value>` to use rotating memory blocks in each iteration; the value is the size in MB. The default is 0 MB (see the sketch after this list).
- Added `--icache_flush` to flush the icache in each iteration.
Note that icache flushing requires the `hip-python` module, which can be installed with:
`python3 -m pip install -i https://test.pypi.org/simple hip-python~=$rocm_version`
Rotating tensors and icache flushing make the perf numbers closer to those of real applications.
- Added `--bias_vector` to support kernel execution with bias (the bias vector has the same size as the number of rows of the output matrix,
so each element of the bias vector is added to all elements of the corresponding row of the output matrix).
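
The rotating-tensor option keeps several copies of the input matrices and cycles through them between iterations, so repeated launches do not keep hitting the same cached data. A minimal sketch of the idea, assuming torch tensors; the way the number of copies is derived from the requested size in MB is an assumption, not the script's exact logic:
```python
import torch

def make_rotating_inputs(a, b, rotating_mb):
    # Replicate (a, b) until the copies cover roughly `rotating_mb` MB.
    bytes_per_set = a.element_size() * a.numel() + b.element_size() * b.numel()
    copies = max(1, int(rotating_mb * 1024 * 1024 // bytes_per_set))
    return [(a.clone(), b.clone()) for _ in range(copies)]

def run_with_rotation(kernel, buffers, iters):
    for i in range(iters):
        a_i, b_i = buffers[i % len(buffers)]  # rotate through the copies each iteration
        kernel(a_i, b_i)
```
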
94 changes: 94 additions & 0 deletions python/perf-kernels/tune_gemm/icache_flush.py
@@ -0,0 +1,94 @@
import ctypes
import array
import random
import math

# the hip module can be installed as
# `python3 -m pip install -i https://test.pypi.org/simple hip-python~=$rocm_version`
# more information about hip-python is at: https://github.com/ROCm/hip-python
from hip import hip, hiprtc

def hip_check(call_result):
    err = call_result[0]
    result = call_result[1:]
    if len(result) == 1:
        result = result[0]

    if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
        raise RuntimeError(str(err))
    elif (
        isinstance(err, hiprtc.hiprtcResult)
        and err != hiprtc.hiprtcResult.HIPRTC_SUCCESS
    ):
        raise RuntimeError(str(err))

    return result

# S_ICACHE_INV Invalidate entire first level instruction cache.
# There must be 16 separate S_NOP instructions or a jump/branch instruction
# after this instruction to ensure the internal instruction buffers are also
# invalidated.
def gen_kernel():
    source = b"""\
extern "C" __global__ void icache_flush_kernel() {
asm __volatile__("s_icache_inv");
asm __volatile__("s_nop 0");
asm __volatile__("s_nop 0");
asm __volatile__("s_nop 0");
asm __volatile__("s_nop 0");
asm __volatile__("s_nop 0");
asm __volatile__("s_nop 0");
asm __volatile__("s_nop 0");
asm __volatile__("s_nop 0");
asm __volatile__("s_nop 0");
asm __volatile__("s_nop 0");
asm __volatile__("s_nop 0");
asm __volatile__("s_nop 0");
asm __volatile__("s_nop 0");
asm __volatile__("s_nop 0");
asm __volatile__("s_nop 0");
asm __volatile__("s_nop 0");
}
"""

    # print(f"source = {source}")
    prog = hip_check(hiprtc.hiprtcCreateProgram(source, b"icache_flush_kernel", 0, [], []))
    progs = hip.hipDeviceProp_t()
    hip_check(hip.hipGetDeviceProperties(progs, 0))
    arch = progs.gcnArchName
    cflags = [b"--offload-arch="+arch]
    err, = hiprtc.hiprtcCompileProgram(prog, len(cflags), cflags)
    if err != hiprtc.hiprtcResult.HIPRTC_SUCCESS:
        log_size = hip_check(hiprtc.hiprtcGetProgramLogSize(prog))
        log = bytearray(log_size)
        hip_check(hiprtc.hiprtcGetProgramLog(prog, log))
        print(f"log = {log.decode()}, err = {err}")
        raise RuntimeError(log.decode())

    code_size = hip_check(hiprtc.hiprtcGetCodeSize(prog))
    code = bytearray(code_size)
    hip_check(hiprtc.hiprtcGetCode(prog, code))
    module = hip_check(hip.hipModuleLoadData(code))
    kernel = hip_check(hip.hipModuleGetFunction(module, b"icache_flush_kernel"))

    return kernel

kernel = gen_kernel()
progs = hip.hipDeviceProp_t()
hip_check(hip.hipGetDeviceProperties(progs, 0))
cu_num = progs.multiProcessorCount

def icache_flush():
    block = hip.dim3(x=64)
    grid = hip.dim3(cu_num * 60)

    hip_check(hip.hipModuleLaunchKernel(
        kernel,
        *grid,
        *block,
        sharedMemBytes=0,
        stream=None,
        kernelParams=None,
        extra=()
    ))
