Replies: 1 comment 2 replies
-
Good question. We don't currently use it because that instruction requires the source data to be in shared memory, whereas the current stream-K and split-K implementations in CUTLASS do not stage partial accumulations in shared memory before reducing them in global memory.
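For context, a minimal sketch of the reduction pattern described above (illustrative only, not the actual CUTLASS implementation; the kernel and parameter names like `splitk_reduce`, `partials`, and `tile_elems` are made up). Each split's partial tile lives in a global-memory workspace, and the reduction reads and writes global memory directly, so there is no shared-memory source for `cp.reduce.async.bulk` to consume:

```cuda
// Illustrative sketch only -- not CUTLASS code.
#include <cuda_runtime.h>

__global__ void splitk_reduce(const float* partials,   // [num_splits][tile_elems], global memory
                              float*       output,     // [tile_elems], global memory
                              int          num_splits,
                              int          tile_elems)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= tile_elems) { return; }

    // Accumulate the per-split partials in registers, then store the result.
    // Shared memory is never involved, which is the mismatch with
    // cp.reduce.async.bulk (whose source operand must be in shared memory).
    float acc = 0.0f;
    for (int s = 0; s < num_splits; ++s) {
        acc += partials[s * tile_elems + idx];
    }
    output[idx] = acc;
}
```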
-
I was just hopelessly scrolling the PTX ISA 8.3 page and found a section on the `cp.reduce.async.bulk` instructions. From my very amateur look, these instructions seem useful for many Split-K / Stream-K ideas, such as GEMM, FMHA, etc.; a rough sketch of what issuing one looks like is included at the end of this post.
Are there plans to support them in future versions of CuTe / CUTLASS?
Disclaimer: it's totally possible that I am embarrassing myself and these are already in CUTLASS. If that's the case, please educate me 😄.
Thanks!
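For anyone else curious, here is a rough sketch of issuing an add-reduction from shared memory into global memory via inline PTX. This assumes an sm_90+ target and the `cp.reduce.async.bulk.global.shared::cta.bulk_group` form from the PTX ISA 8.x documentation; the helper name `bulk_reduce_add_f32` is made up, this is not CUTLASS code, and the exact qualifiers and alignment requirements should be double-checked against the PTX ISA spec:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Rough sketch, not CUTLASS code. Note the source operand must be a
// shared-memory address, which is the limitation pointed out in the reply above.
__device__ void bulk_reduce_add_f32(float* gmem_dst, const float* smem_src, uint32_t bytes)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    // Convert the generic shared-memory pointer to a 32-bit shared-space address.
    uint32_t smem_addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_src));

    // Element-wise add-reduce the shared-memory buffer into global memory:
    // gmem_dst[i] += smem_src[i] over the given byte range.
    asm volatile(
        "cp.reduce.async.bulk.global.shared::cta.bulk_group.add.f32 [%0], [%1], %2;\n"
        :
        : "l"(gmem_dst), "r"(smem_addr), "r"(bytes)
        : "memory");

    // Completion is tracked with bulk async-group semantics.
    asm volatile("cp.async.bulk.commit_group;\n" ::: "memory");
    asm volatile("cp.async.bulk.wait_group.read 0;\n" ::: "memory");
#endif
}
```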