
Replace rocprof with rocprofv2 for the tune_gemm script #613

Merged
merged 3 commits into triton-mlir on Jul 17, 2024

Conversation

xiaohuguo2023 (Member)

rocprofv2 is much faster than rocprof.

rocprofv2 naturally supports Python.

For 8192x8192x8192, it reduces the tuning time from more than an hour to 5.22 minutes with the full tuning space.

@@ -392,9 +392,14 @@ def main():


def extract_kernel_time(M, N, K, config, df, bias_size):
    # Correct the header by removing 'sig' and 'obj' to reduce number from 21 to 19
    # once the bug is fixed, we should not need below two lines

Can you add the rocprof issue here so that people know what you are referring to? ROCm/rocprofiler#144

xiaohuguo2023 (Member Author)
done
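For context on the workaround discussed above, here is a minimal sketch of what correcting the header could look like, assuming the rocprofv2 results CSV is read with pandas; the function name and column handling below are illustrative, not the PR's actual code:

import pandas as pd

# Illustrative workaround for ROCm/rocprofiler#144 (not the PR's exact lines):
# the kernel-trace CSV header lists 21 fields, including 'sig' and 'obj',
# while each data row only carries 19 values.
def read_kernel_trace(csv_path):
    # Read only the header row so the two bogus fields can be dropped.
    header = pd.read_csv(csv_path, nrows=0).columns.tolist()
    fixed_header = [col for col in header if col.strip() not in ("sig", "obj")]
    # Re-read the data rows against the corrected 19-column header.
    return pd.read_csv(csv_path, skiprows=1, names=fixed_header)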

@@ -409,7 +414,7 @@ def profile_batch_kernels(M, N, K, gpuid, gpus, jobs, verbose):
kernel_name = generated_kernel_name(M, N, K, jobId)
if verbose:
    print(f"profiling {kernel_name} on GPU {gpuid}")
- run_bash_command_wrapper(f"rocprof --stats -o results-{jobId}.csv python {kernel_name}", capture=(verbose < 2))
+ run_bash_command_wrapper(f"rocprofv2 --plugin file --plugin-version 1 --kernel-trace -o {jobId} python {generated_kernel_name(M, N, K, jobId)}", capture=(verbose < 2))
zhanglx13 commented Jul 17, 2024
generated_kernel_name(M, N, K, jobId) is the same as kernel_name; the latter is simpler.
Also, why did you change the output from results-{jobId}.csv to {jobId}? Is it automatically expanded to results_{jobId}.csv?

xiaohuguo2023 (Member Author) commented Jul 17, 2024
Yes, it is automatically expanded to results_{jobId}.csv.
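To illustrate the reviewer's suggestion, the new line inside profile_batch_kernels could reuse the already-computed kernel_name; a sketch of that simplification (not the merged change), with the output naming behavior noted:

# Sketch only: reuse kernel_name instead of calling generated_kernel_name() again.
# With the file plugin, "-o {jobId}" is expanded by rocprofv2 to results_{jobId}.csv.
run_bash_command_wrapper(
    f"rocprofv2 --plugin file --plugin-version 1 --kernel-trace -o {jobId} python {kernel_name}",
    capture=(verbose < 2))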

vgokhale (Collaborator)

Have you checked performance on a few shapes to confirm before and after is the same?

xiaohuguo2023 (Member Author) commented Jul 17, 2024

yes, I did @vgokhale

Here are the results for tune_streamk using rocprof:

xiaohugu@banff-cyxtera-s79-2:~/work/persistent-kernels$ cat tuning_results_main@d7a28b6_07-15-2024-09\:32\:03.yaml
- {'M': 8192, 'N': 8192, 'K': 8192, 'rowMajorA': 'T', 'rowMajorB': 'N', 'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 32, 'num_warps': 8, 'num_stages': 0, 'waves_per_eu': 0, 'matrix_instr_nonkdim': 16, 'kpack': 2} # TFLOPS: 453.75 time(us): 2423.1

and tune_streamk using rocprofv2:

xiaohugu@banff-cyxtera-s79-2:~/work/persistent-kernels$ cat tuning_results_main@4bb2b95_07-16-2024-22\:03\:33.yaml
- {'M': 8192, 'N': 8192, 'K': 8192, 'rowMajorA': 'T', 'rowMajorB': 'N', 'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 32, 'num_warps': 8, 'num_stages': 0, 'waves_per_eu': 0, 'matrix_instr_nonkdim': 16, 'kpack': 2} # TFLOPS: 452.34 time(us): 2430.7

tune_gemm with rocprof on smc300x:

xiaohugu@smc300x-ccs-aus-GPUF292:~/work/triton-pr/scripts/amd/gemm$ cat tuning_results_tune_streamk@0bda85cd6_07-14-2024-22\:23\:39.yaml
- {'M': 8192, 'N': 8192, 'K': 8192, 'rowMajorA': 'T', 'rowMajorB': 'N', 'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 32, 'SPLIT_K': 1, 'num_warps': 8, 'num_stages': 0, 'waves_per_eu': 0, 'matrix_instr_nonkdim': 16, 'kpack': 2} # TFLOPS: 467.55 time(us): 2351.6

tune_gemm with rocprofv2 on smc300x:

xiaohugu@banff-cyxtera-s79-2:~/work/triton-mlir-pr/scripts/amd/gemm$ cat tuning_results_tune_gemm_opt@2ddcac6a3_07-17-2024-11\:21\:22.yaml
- {'M': 8192, 'N': 8192, 'K': 8192, 'rowMajorA': 'T', 'rowMajorB': 'N', 'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 1, 'SPLIT_K': 1, 'num_warps': 8, 'num_stages': 0, 'waves_per_eu': 0, 'matrix_instr_nonkdim': 16, 'kpack': 2} # TFLOPS: 467.96 time(us): 2349.6

zhanglx13 merged commit 59d6be1 into triton-mlir on Jul 17, 2024
2 of 3 checks passed
3 participants