For a skinny matrix multiply like m=16, k=5120, n=1792, CUTLASS seems unable to reach peak bandwidth:
With the profiler I get cutlass_tensorop_s16816gemm_f16_64x64_64x5_tn_align8, which achieves 374.597 GB/s bandwidth at a 46 us runtime.
With cuBLAS running inside PyTorch, the same problem reaches 27 us. [cuBLAS chooses ampere_fp16_s16816gemm_fp16_64x64_ldg8_f2f_stages_64x5_nn here.]
So how can the CUTLASS config be tuned further to reach the same speed cuBLAS currently gets?
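As a sanity check on those numbers, here is a back-of-envelope effective-bandwidth estimate (assuming fp16 operands at 2 bytes per element, each tensor touched once, and that the profiler reports GiB/s — which makes the 46 us figure line up almost exactly with the reported 374.6):

```python
# Effective bandwidth for the m=16, k=5120, n=1792 fp16 GEMM.
# A skinny GEMM like this is dominated by streaming B from HBM,
# so bytes moved ~= all of A, B and C read/written once.
m, k, n = 16, 5120, 1792
bytes_moved = 2 * (m * k + k * n + m * n)  # 2 bytes per fp16 element

def eff_bw_gib(runtime_s):
    # GiB/s, matching how the CUTLASS profiler appears to report bandwidth
    return bytes_moved / runtime_s / 2**30

print(bytes_moved)               # 18571264 bytes (~17.7 MiB)
print(round(eff_bw_gib(46e-6)))  # ~376 GiB/s -- consistent with the 374.6 above
print(round(eff_bw_gib(27e-6)))  # ~641 GiB/s -- what the cuBLAS kernel sustains
```

So the gap to cuBLAS is roughly 1.7x in sustained bandwidth, not a small tuning margin.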
If M goes down from 16 to, say, 2, the numbers get worse:
the profiler picks cutlass_tensorop_s16816gemm_f16_64x64_32x10_tn_align2, with only 280 GB/s bandwidth and a 59 us runtime,
while the same problem in PyTorch still runs in 27 us with a different kernel. [cuBLAS chooses ampere_fp16_s16816gemm_fp16_128x64_ldg8_f2f_stages_64x3_nn + cublasLt::splitKreduce_kernel.]
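Since cuBLAS falls back to a split-K reduction at M=2, one thing worth trying is sweeping split-K slice counts in the CUTLASS profiler as well. A sketch of the invocation (flag spellings assume a CUTLASS 2.x cutlass_profiler build — verify against `cutlass_profiler --help` before relying on them):

```shell
# Sweep split-K for the m=2 case; more slices give more CTAs to hide
# memory latency when M is tiny, at the cost of a reduction pass.
./tools/profiler/cutlass_profiler \
  --operation=Gemm \
  --m=2 --n=1792 --k=5120 \
  --A=f16:column --B=f16:column --C=f16:column \
  --split_k_slices=1,2,4,8,16
```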
Thx~
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.