-
In , we wait for the two oldest stages of async copy groups to finish. Why do we need to wait for two, instead of one? If my read of the innermost
cutlass/include/cutlass/gemm/threadblock/mma_multistage.h
Lines 555 to 563 in 12f4108
So I want to know why we keep two stages of copy results always ready in the innermost loop.
-
oh maybe I got it. At the last warp_mma_k iteration of a given stage, we want to start loading from the next stage's copy results:
cutlass/include/cutlass/gemm/threadblock/mma_multistage.h
Lines 500 to 501 in 12f4108
so yeah, I can see why we need two stages of copy results always ready.
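To make that concrete, here is a rough host-side sketch of the access pattern (plain C++, not the CUTLASS code). It assumes the fragment for warp-level k-group k + 1 is fetched from shared memory while k-group k is being multiplied, and that shared-memory stages are reused round-robin. The names kStages, kWarpGemmIterations, Step, trace_mainloop and crosses_stage are stand-ins made up for this illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for Base::kStages and Base::kWarpGemmIterations.
constexpr int kStages = 3;
constexpr int kWarpGemmIterations = 2;

struct Step {
  int tile;           // threadblock-level k-tile being multiplied
  int warp_mma_k;     // warp-level k-group being multiplied
  int compute_slot;   // shared-memory stage the current fragment came from
  int prefetch_slot;  // shared-memory stage read for the next fragment
};

// Records, for every warp-level iteration, which stage feeds the math and
// which stage is read to prefetch the next fragment.
std::vector<Step> trace_mainloop(int num_tiles) {
  assert(num_tiles >= 0);
  std::vector<Step> steps;
  for (int tile = 0; tile < num_tiles; ++tile) {
    for (int k = 0; k < kWarpGemmIterations; ++k) {
      // On the last k-group the prefetch wraps around to k-group 0 of the
      // NEXT tile. For the final tile this points one past the end; the
      // sketch keeps it to show the pattern and does not model the tail.
      bool last = (k + 1 == kWarpGemmIterations);
      int prefetch_tile = last ? tile + 1 : tile;
      steps.push_back({tile, k, tile % kStages, prefetch_tile % kStages});
    }
  }
  return steps;
}

// True when a step reads from a different stage than the one it computes on,
// i.e. when the next stage's copy results must already be complete.
bool crosses_stage(const Step& s) { return s.compute_slot != s.prefetch_slot; }
```

Calling trace_mainloop(2) and checking crosses_stage on each step flags only the last warp_mma_k of each tile. That is the moment where the stage being consumed and the next stage both need to have landed in shared memory.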
-
Correct.
Yes, we do this.
-
You are correct. I misspoke.
-
Is the multi-stage pipeline discussed here related to the stages argument used to initialize the tile description?
-
yes
yes
sgemm on sm80 can use multiple stages.
-
multistage is only useful for >= sm80. you can profile to choose the best stage count for you. you can start from the ones used by the cutlass profiler.
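One thing that can narrow the candidates before profiling: each stage holds one threadblock tile of A and one of B in shared memory, so the footprint grows linearly with the stage count. Below is a back-of-the-envelope sketch in C++. The helpers smem_bytes and max_stages are made up for this example and are not CUTLASS APIs. It ignores any padding the real layouts may add, and the shared-memory budget is a number you supply for your own device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: shared memory needed by `stages` stages, each holding
// a tile_m x tile_k tile of A and a tile_n x tile_k tile of B (no padding).
constexpr std::size_t smem_bytes(int tile_m, int tile_n, int tile_k,
                                 int stages, std::size_t element_bytes) {
  return static_cast<std::size_t>(stages) *
         static_cast<std::size_t>(tile_m + tile_n) *
         static_cast<std::size_t>(tile_k) * element_bytes;
}

// Hypothetical helper: largest stage count that fits in a caller-supplied
// shared-memory budget (in bytes).
constexpr int max_stages(int tile_m, int tile_n, int tile_k,
                         std::size_t element_bytes, std::size_t smem_budget) {
  return static_cast<int>(
      smem_budget / smem_bytes(tile_m, tile_n, tile_k, 1, element_bytes));
}
```

For a 128x128x32 float tile, smem_bytes(128, 128, 32, 3, 4) gives 98304 bytes for three stages. Anything that passes this check still has to be profiled, since the fastest stage count is not necessarily the largest one that fits.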