Replies: 1 comment 2 replies
-
Good question. We don't currently use it because that instruction requires the source data to be in shared memory, whereas the current stream-K and split-K implementations in CUTLASS do not stage partial accumulations in shared memory before reducing them in global memory.
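For context, a minimal sketch of the reduction pattern described above (illustrative only, not the actual CUTLASS implementation; the kernel and parameter names like `splitk_reduce`, `partials`, and `tile_elems` are made up). Each split's partial tile lives in a global-memory workspace, and the reduction reads and writes global memory directly, so there is no shared-memory source for `cp.reduce.async.bulk` to consume:

```cuda
// Illustrative sketch only -- not CUTLASS code.
#include <cuda_runtime.h>

__global__ void splitk_reduce(const float* partials,   // [num_splits][tile_elems], global memory
                              float*       output,     // [tile_elems], global memory
                              int          num_splits,
                              int          tile_elems)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= tile_elems) { return; }

    // Accumulate the per-split partials in registers, then store the result.
    // Shared memory is never involved, which is the mismatch with
    // cp.reduce.async.bulk (whose source operand must be in shared memory).
    float acc = 0.0f;
    for (int s = 0; s < num_splits; ++s) {
        acc += partials[s * tile_elems + idx];
    }
    output[idx] = acc;
}
```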
-
I was just hopelessly scrolling the PTX ISA 8.3 page and found a section on the `cp.reduce.async.bulk` instructions. From my very amateur look, these instructions seem useful for many Split-K / Stream-K ideas, such as GEMM, FMHA, etc.; a rough sketch of what issuing one looks like is included at the end of this post.
Are there plans to support them in future versions of CuTe / CUTLASS?
Disclaimer: it's totally possible that I am embarrassing myself and these are already in CUTLASS. If that's the case, please educate me 😄.
Thanks!
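For anyone else curious, here is a rough sketch of issuing an add-reduction from shared memory into global memory via inline PTX. This assumes an sm_90+ target and the `cp.reduce.async.bulk.global.shared::cta.bulk_group` form from the PTX ISA 8.x documentation; the helper name `bulk_reduce_add_f32` is made up, this is not CUTLASS code, and the exact qualifiers and alignment requirements should be double-checked against the PTX ISA spec:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Rough sketch, not CUTLASS code. Note the source operand must be a
// shared-memory address, which is the limitation pointed out in the reply above.
__device__ void bulk_reduce_add_f32(float* gmem_dst, const float* smem_src, uint32_t bytes)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    // Convert the generic shared-memory pointer to a 32-bit shared-space address.
    uint32_t smem_addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_src));

    // Element-wise add-reduce the shared-memory buffer into global memory:
    // gmem_dst[i] += smem_src[i] over the given byte range.
    asm volatile(
        "cp.reduce.async.bulk.global.shared::cta.bulk_group.add.f32 [%0], [%1], %2;\n"
        :
        : "l"(gmem_dst), "r"(smem_addr), "r"(bytes)
        : "memory");

    // Completion is tracked with bulk async-group semantics.
    asm volatile("cp.async.bulk.commit_group;\n" ::: "memory");
    asm volatile("cp.async.bulk.wait_group.read 0;\n" ::: "memory");
#endif
}
```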