-
Yes, with stages == 2 the K extent held in shared memory is doubled.
And no: the warps are not split into a copy half and a compute half; every warp does both the load and the compute. It works like the diagram below, the intention being to hide the load time behind the compute time.
[pipeline diagram]
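To make that concrete, here is a minimal sketch (mine, not CUTLASS's actual code) of a 2-stage mainloop in plain CUDA. It assumes half-precision A/B with float accumulation, the 128x128x32 threadblock tile and 128 threads (warps_m = warps_n = 2) from the question, and problem sizes that are exact multiples of the tile sizes so bounds checks can be skipped. The kernel name, the register-staged copy, and the scalar FMA loop (standing in for the warp-level tensor-core MMAs a real kernel would issue) are simplifications for illustration; only the buffering pattern is the point.

```cuda
#include <cuda_fp16.h>

constexpr int BM = 128, BN = 128, BK = 32;  // threadblock tile (M x N x K)
constexpr int STAGES = 2;                   // double buffering
constexpr int THREADS = 128;                // 4 warps: warps_m = 2, warps_n = 2
constexpr int TM = 8, TN = 16;              // per-thread patch of the C tile

__global__ void gemm_two_stage(const __half* A, const __half* B, float* C,
                               int M, int N, int K) {
    // Two copies of each operand tile: with stages == 2 the K extent resident
    // in shared memory is 2 * BK = 64 (about 32 KB total here in fp16).
    __shared__ __half smemA[STAGES][BM * BK];
    __shared__ __half smemB[STAGES][BK * BN];

    const int tid    = threadIdx.x;
    const int blockM = blockIdx.y * BM;
    const int blockN = blockIdx.x * BN;
    // 16 x 8 grid of threads; each thread owns a TM x TN patch of the C tile.
    const int tRow = (tid / 8) * TM;
    const int tCol = (tid % 8) * TN;

    // Per-thread staging registers for the NEXT tile. Every thread (and hence
    // every warp) takes part in the copy as well as in the math.
    __half regA[BM * BK / THREADS];  // 32 elements
    __half regB[BK * BN / THREADS];  // 32 elements
    float  acc[TM][TN] = {};

    // Issue the global loads of one k-tile into registers.
    auto loadGlobal = [&](int k0) {
        #pragma unroll
        for (int i = 0; i < BM * BK / THREADS; ++i) {
            int idx = tid + i * THREADS;  // element index inside the A tile
            regA[i] = A[(blockM + idx / BK) * K + k0 + idx % BK];
        }
        #pragma unroll
        for (int i = 0; i < BK * BN / THREADS; ++i) {
            int idx = tid + i * THREADS;  // element index inside the B tile
            regB[i] = B[(k0 + idx / BN) * N + blockN + idx % BN];
        }
    };
    // Commit the staged registers into one of the two shared-memory buffers.
    auto storeShared = [&](int stage) {
        #pragma unroll
        for (int i = 0; i < BM * BK / THREADS; ++i)
            smemA[stage][tid + i * THREADS] = regA[i];
        #pragma unroll
        for (int i = 0; i < BK * BN / THREADS; ++i)
            smemB[stage][tid + i * THREADS] = regB[i];
    };

    // Prologue: fill stage 0 before the mainloop starts.
    loadGlobal(0);
    storeShared(0);
    __syncthreads();

    for (int k0 = 0, read = 0; k0 < K; k0 += BK, read ^= 1) {
        // 1) Start fetching the NEXT k-tile; those loads are in flight while...
        if (k0 + BK < K) loadGlobal(k0 + BK);

        // 2) ...the tile already resident in shared memory is consumed. This
        //    math is what hides the global-load latency.
        for (int kk = 0; kk < BK; ++kk) {
            #pragma unroll
            for (int m = 0; m < TM; ++m) {
                #pragma unroll
                for (int n = 0; n < TN; ++n)
                    acc[m][n] += __half2float(smemA[read][(tRow + m) * BK + kk]) *
                                 __half2float(smemB[read][kk * BN + tCol + n]);
            }
        }

        // 3) Park the fetched tile in the OTHER buffer and make it visible.
        //    One barrier per k-iteration suffices because compute never reads
        //    the buffer that is being written.
        if (k0 + BK < K) {
            storeShared(read ^ 1);
            __syncthreads();
        }
    }

    // Epilogue: each thread writes out its own accumulators (no bounds checks;
    // M, N, K are assumed to be exact multiples of the tile sizes).
    for (int m = 0; m < TM; ++m)
        for (int n = 0; n < TN; ++n)
            C[(blockM + tRow + m) * N + blockN + tCol + n] = acc[m][n];
}
```

Every thread runs both loadGlobal and the accumulation loop, so there is no split into copy warps and compute warps; the second shared-memory buffer is what lets the next tile be staged while the current one is still being read, with a single __syncthreads() per k-iteration.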
-
Hi all,
I was wondering how double-buffering is realized to hide the global-to-shared data movement. Suppose I launch a kernel with a thread-block tile size of 128x128x32 and choose stages == 2. Is the K dimension doubled in this case, i.e., is the actual allocation in shared memory of size 128x128x64?
Adding to this, I want to know how the actual global-to-shared load latency is hidden. If we set warps_m=2 and warps_n=2, then in total we will launch 128 threads per thread block. How exactly are they divided into compute and copy? Are half of the warps doing compute while the other half is busy copying data for the next k-iteration? It would be great if someone could point out what is actually happening.
Thanks!