For a skinny matrix multiply like m=16, k=5120, n=1792, CUTLASS seems unable to reach peak bandwidth:
With the profiler I get cutlass_tensorop_s16816gemm_f16_64x64_64x5_tn_align8, which achieves 374.597 GB/s bandwidth at a 46 us runtime.
With cuBLAS running inside PyTorch, the same problem reaches 27 us. [cuBLAS chooses ampere_fp16_s16816gemm_fp16_64x64_ldg8_f2f_stages_64x5_nn here.]
So how can the CUTLASS config be tuned further to reach the same speed cuBLAS currently gets?
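As a sanity check on those numbers, here is a back-of-envelope effective-bandwidth estimate (assuming fp16 operands at 2 bytes per element, each tensor touched once, and that the profiler reports GiB/s — which makes the 46 us figure line up almost exactly with the reported 374.6):

```python
# Effective bandwidth for the m=16, k=5120, n=1792 fp16 GEMM.
# A skinny GEMM like this is dominated by streaming B from HBM,
# so bytes moved ~= all of A, B and C read/written once.
m, k, n = 16, 5120, 1792
bytes_moved = 2 * (m * k + k * n + m * n)  # 2 bytes per fp16 element

def eff_bw_gib(runtime_s):
    # GiB/s, matching how the CUTLASS profiler appears to report bandwidth
    return bytes_moved / runtime_s / 2**30

print(bytes_moved)               # 18571264 bytes (~17.7 MiB)
print(round(eff_bw_gib(46e-6)))  # ~376 GiB/s -- consistent with the 374.6 above
print(round(eff_bw_gib(27e-6)))  # ~641 GiB/s -- what the cuBLAS kernel sustains
```

So the gap to cuBLAS is roughly 1.7x in sustained bandwidth, not a small tuning margin.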
If M goes down from 16 to, say, 2, the numbers get worse:
the profiler picks cutlass_tensorop_s16816gemm_f16_64x64_32x10_tn_align2, with only 280 GB/s bandwidth and a 59 us runtime,
while the same problem in PyTorch still runs in 27 us with a different kernel. [cuBLAS chooses ampere_fp16_s16816gemm_fp16_128x64_ldg8_f2f_stages_64x3_nn + cublasLt::splitKreduce_kernel.]
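Since cuBLAS falls back to a split-K reduction at M=2, one thing worth trying is sweeping split-K slice counts in the CUTLASS profiler as well. A sketch of the invocation (flag spellings assume a CUTLASS 2.x cutlass_profiler build — verify against `cutlass_profiler --help` before relying on them):

```shell
# Sweep split-K for the m=2 case; more slices give more CTAs to hide
# memory latency when M is tiny, at the cost of a reduction pass.
./tools/profiler/cutlass_profiler \
  --operation=Gemm \
  --m=2 --n=1792 --k=5120 \
  --A=f16:column --B=f16:column --C=f16:column \
  --split_k_slices=1,2,4,8,16
```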
Thx~
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.