
Replace rocprof with rocprofv2 for the tune_gemm script #613

Merged
merged 3 commits into triton-mlir on Jul 17, 2024

Conversation

xiaohuguo2023 (Member)

rocprofv2 is much faster than rocprof.

rocprofv2 naturally supports Python.

For 8192x8192x8192, it reduces the tuning time from more than an hour to 5.22 minutes with the full tuning space.

@@ -392,9 +392,14 @@ def main():


def extract_kernel_time(M, N, K, config, df, bias_size):
    # Correct the header by removing 'sig' and 'obj' to reduce number from 21 to 19
    # once the bug is fixed, we should not need below two lines

Can you add the rocprof issue here so that people know what you are referring to? ROCm/rocprofiler#144

xiaohuguo2023 (Member Author)
done
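For context on the workaround discussed above, here is a minimal sketch of what correcting the header could look like, assuming the rocprofv2 results CSV is read with pandas; the function name and column handling below are illustrative, not the PR's actual code:

import pandas as pd

# Illustrative workaround for ROCm/rocprofiler#144 (not the PR's exact lines):
# the kernel-trace CSV header lists 21 fields, including 'sig' and 'obj',
# while each data row only carries 19 values.
def read_kernel_trace(csv_path):
    # Read only the header row so the two bogus fields can be dropped.
    header = pd.read_csv(csv_path, nrows=0).columns.tolist()
    fixed_header = [col for col in header if col.strip() not in ("sig", "obj")]
    # Re-read the data rows against the corrected 19-column header.
    return pd.read_csv(csv_path, skiprows=1, names=fixed_header)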

@@ -409,7 +414,7 @@ def profile_batch_kernels(M, N, K, gpuid, gpus, jobs, verbose):
kernel_name = generated_kernel_name(M, N, K, jobId)
if verbose:
    print(f"profiling {kernel_name} on GPU {gpuid}")
- run_bash_command_wrapper(f"rocprof --stats -o results-{jobId}.csv python {kernel_name}", capture=(verbose < 2))
+ run_bash_command_wrapper(f"rocprofv2 --plugin file --plugin-version 1 --kernel-trace -o {jobId} python {generated_kernel_name(M, N, K, jobId)}", capture=(verbose < 2))
zhanglx13 commented Jul 17, 2024
generated_kernel_name(M, N, K, jobId) is the same as kernel_name; the latter is simpler.
Also, why did you change the output from results-{jobId}.csv to {jobId}? Is it automatically expanded to results_{jobId}.csv?

xiaohuguo2023 (Member Author) commented Jul 17, 2024
Yes, it is automatically expanded to results_{jobId}.csv.
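To illustrate the reviewer's suggestion, the new line inside profile_batch_kernels could reuse the already-computed kernel_name; a sketch of that simplification (not the merged change), with the output naming behavior noted:

# Sketch only: reuse kernel_name instead of calling generated_kernel_name() again.
# With the file plugin, "-o {jobId}" is expanded by rocprofv2 to results_{jobId}.csv.
run_bash_command_wrapper(
    f"rocprofv2 --plugin file --plugin-version 1 --kernel-trace -o {jobId} python {kernel_name}",
    capture=(verbose < 2))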

vgokhale (Collaborator)

Have you checked performance on a few shapes to confirm before and after is the same?

xiaohuguo2023 (Member Author) commented Jul 17, 2024

yes, I did @vgokhale

Here are the results for tune_streamk using rocprof:

xiaohugu@banff-cyxtera-s79-2:~/work/persistent-kernels$ cat tuning_results_main@d7a28b6_07-15-2024-09\:32\:03.yaml
- {'M': 8192, 'N': 8192, 'K': 8192, 'rowMajorA': 'T', 'rowMajorB': 'N', 'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 32, 'num_warps': 8, 'num_stages': 0, 'waves_per_eu': 0, 'matrix_instr_nonkdim': 16, 'kpack': 2} # TFLOPS: 453.75 time(us): 2423.1

and tune_streamk using rocprofv2:

xiaohugu@banff-cyxtera-s79-2:~/work/persistent-kernels$ cat tuning_results_main@4bb2b95_07-16-2024-22\:03\:33.yaml
- {'M': 8192, 'N': 8192, 'K': 8192, 'rowMajorA': 'T', 'rowMajorB': 'N', 'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 32, 'num_warps': 8, 'num_stages': 0, 'waves_per_eu': 0, 'matrix_instr_nonkdim': 16, 'kpack': 2} # TFLOPS: 452.34 time(us): 2430.7

tune_gemm with rocprof on smc300x:

xiaohugu@smc300x-ccs-aus-GPUF292:~/work/triton-pr/scripts/amd/gemm$ cat tuning_results_tune_streamk@0bda85cd6_07-14-2024-22\:23\:39.yaml
- {'M': 8192, 'N': 8192, 'K': 8192, 'rowMajorA': 'T', 'rowMajorB': 'N', 'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 32, 'SPLIT_K': 1, 'num_warps': 8, 'num_stages': 0, 'waves_per_eu': 0, 'matrix_instr_nonkdim': 16, 'kpack': 2} # TFLOPS: 467.55 time(us): 2351.6

tune_gemm with rocprofv2 on smc300x:

xiaohugu@banff-cyxtera-s79-2:~/work/triton-mlir-pr/scripts/amd/gemm$ cat tuning_results_tune_gemm_opt@2ddcac6a3_07-17-2024-11\:21\:22.yaml
- {'M': 8192, 'N': 8192, 'K': 8192, 'rowMajorA': 'T', 'rowMajorB': 'N', 'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 1, 'SPLIT_K': 1, 'num_warps': 8, 'num_stages': 0, 'waves_per_eu': 0, 'matrix_instr_nonkdim': 16, 'kpack': 2} # TFLOPS: 467.96 time(us): 2349.6

zhanglx13 merged commit 59d6be1 into triton-mlir on Jul 17, 2024
2 of 3 checks passed
3 participants