-
Yes, with stages == 2 the K extent held in shared memory is doubled.
And no: the warps are not split into a copy half and a compute half; every warp does both the load and the compute. It works like the diagram below, the intention being to hide the load time behind the compute time.
[pipeline diagram]
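To make that concrete, here is a minimal sketch (mine, not CUTLASS's actual code) of a 2-stage mainloop in plain CUDA. It assumes half-precision A/B with float accumulation, the 128x128x32 threadblock tile and 128 threads (warps_m = warps_n = 2) from the question, and problem sizes that are exact multiples of the tile sizes so bounds checks can be skipped. The kernel name, the register-staged copy, and the scalar FMA loop (standing in for the warp-level tensor-core MMAs a real kernel would issue) are simplifications for illustration; only the buffering pattern is the point.

```cuda
#include <cuda_fp16.h>

constexpr int BM = 128, BN = 128, BK = 32;  // threadblock tile (M x N x K)
constexpr int STAGES = 2;                   // double buffering
constexpr int THREADS = 128;                // 4 warps: warps_m = 2, warps_n = 2
constexpr int TM = 8, TN = 16;              // per-thread patch of the C tile

__global__ void gemm_two_stage(const __half* A, const __half* B, float* C,
                               int M, int N, int K) {
    // Two copies of each operand tile: with stages == 2 the K extent resident
    // in shared memory is 2 * BK = 64 (about 32 KB total here in fp16).
    __shared__ __half smemA[STAGES][BM * BK];
    __shared__ __half smemB[STAGES][BK * BN];

    const int tid    = threadIdx.x;
    const int blockM = blockIdx.y * BM;
    const int blockN = blockIdx.x * BN;
    // 16 x 8 grid of threads; each thread owns a TM x TN patch of the C tile.
    const int tRow = (tid / 8) * TM;
    const int tCol = (tid % 8) * TN;

    // Per-thread staging registers for the NEXT tile. Every thread (and hence
    // every warp) takes part in the copy as well as in the math.
    __half regA[BM * BK / THREADS];  // 32 elements
    __half regB[BK * BN / THREADS];  // 32 elements
    float  acc[TM][TN] = {};

    // Issue the global loads of one k-tile into registers.
    auto loadGlobal = [&](int k0) {
        #pragma unroll
        for (int i = 0; i < BM * BK / THREADS; ++i) {
            int idx = tid + i * THREADS;  // element index inside the A tile
            regA[i] = A[(blockM + idx / BK) * K + k0 + idx % BK];
        }
        #pragma unroll
        for (int i = 0; i < BK * BN / THREADS; ++i) {
            int idx = tid + i * THREADS;  // element index inside the B tile
            regB[i] = B[(k0 + idx / BN) * N + blockN + idx % BN];
        }
    };
    // Commit the staged registers into one of the two shared-memory buffers.
    auto storeShared = [&](int stage) {
        #pragma unroll
        for (int i = 0; i < BM * BK / THREADS; ++i)
            smemA[stage][tid + i * THREADS] = regA[i];
        #pragma unroll
        for (int i = 0; i < BK * BN / THREADS; ++i)
            smemB[stage][tid + i * THREADS] = regB[i];
    };

    // Prologue: fill stage 0 before the mainloop starts.
    loadGlobal(0);
    storeShared(0);
    __syncthreads();

    for (int k0 = 0, read = 0; k0 < K; k0 += BK, read ^= 1) {
        // 1) Start fetching the NEXT k-tile; those loads are in flight while...
        if (k0 + BK < K) loadGlobal(k0 + BK);

        // 2) ...the tile already resident in shared memory is consumed. This
        //    math is what hides the global-load latency.
        for (int kk = 0; kk < BK; ++kk) {
            #pragma unroll
            for (int m = 0; m < TM; ++m) {
                #pragma unroll
                for (int n = 0; n < TN; ++n)
                    acc[m][n] += __half2float(smemA[read][(tRow + m) * BK + kk]) *
                                 __half2float(smemB[read][kk * BN + tCol + n]);
            }
        }

        // 3) Park the fetched tile in the OTHER buffer and make it visible.
        //    One barrier per k-iteration suffices because compute never reads
        //    the buffer that is being written.
        if (k0 + BK < K) {
            storeShared(read ^ 1);
            __syncthreads();
        }
    }

    // Epilogue: each thread writes out its own accumulators (no bounds checks;
    // M, N, K are assumed to be exact multiples of the tile sizes).
    for (int m = 0; m < TM; ++m)
        for (int n = 0; n < TN; ++n)
            C[(blockM + tRow + m) * N + blockN + tCol + n] = acc[m][n];
}
```

Every thread runs both loadGlobal and the accumulation loop, so there is no split into copy warps and compute warps; the second shared-memory buffer is what lets the next tile be staged while the current one is still being read, with a single __syncthreads() per k-iteration.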
-
Hi all,
I was wondering how double-buffering is realized to hide the global-to-shared data movement. Suppose I launch a kernel with a thread-block tile size of 128x128x32 and choose stages == 2. Is the K dimension doubled in this case, i.e., is the actual allocation in shared memory of size 128x128x64?
Adding to this, I want to know how the actual global-to-shared load latency is hidden. If we set warps_m=2 and warps_n=2, then in total we will launch 128 threads per thread block. How exactly are they divided into compute and copy? Are half of the warps doing compute while the other half is busy copying data for the next k-iteration? It would be great if someone could point out what is actually happening.
Thanks!