Add support for dynamic offsets to DefaultEpilogue #1274
base: main
Conversation
Dynamic offsets in `DefaultEpilogue` allow moving pointer arithmetic to the device: the `C` and `D` pointers are shifted by offsets stored in device memory. Depends on #1273.
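A minimal sketch of the idea (the struct and field names below are illustrative, not the actual PR diff):

```cpp
#include <cstdint>

// Illustrative only: the epilogue arguments carry pointers to offsets that
// live in device memory, and the kernel applies them when forming the C/D
// pointers, so the host never needs to know the final addresses.
struct EpilogueArguments {
  float const*   ptr_C;     // base pointer for the source matrix C
  float*         ptr_D;     // base pointer for the destination matrix D
  int32_t const* offset_C;  // device-resident element offset for C (may be nullptr)
  int32_t const* offset_D;  // device-resident element offset for D (may be nullptr)
};

// Called on device at the start of the epilogue: shift the base pointers by
// the offsets read from device memory.
__device__ inline void apply_dynamic_offsets(EpilogueArguments& args) {
  if (args.offset_C) { args.ptr_C += *args.offset_C; }
  if (args.offset_D) { args.ptr_D += *args.offset_D; }
}
```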
@ezhulenev, what is the use case for this?
In XLA, inside loops (and in general inside control flow) we keep buffer offsets on device. This, for example, allows us to put two GEMMs writing at different offsets computed at run time into different If branches and capture both of them into a single CUDA graph (using the conditional graph nodes added in CUDA 12.3). Without dynamic offsets we would be forced to move the offset value to the host and build multiple CUDA graphs.
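For context, a tiny sketch of the device-side piece that makes this possible; the kernel and handle names are made up, and `cudaGraphSetConditional` is the device-side API that comes with CUDA 12.3 conditional graph nodes:

```cpp
#include <cuda_runtime.h>

// Hypothetical predicate kernel: a device-computed condition selects which
// IF-node body (and therefore which GEMM, writing at its own offset) runs
// inside one captured CUDA graph, with no round trip to the host.
__global__ void select_branch(cudaGraphConditionalHandle if_true,
                              cudaGraphConditionalHandle if_false,
                              const int* predicate) {
  cudaGraphSetConditional(if_true,  *predicate != 0);
  cudaGraphSetConditional(if_false, *predicate == 0);
}
```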
How does XLA handle this now when not using CUTLASS?
Well… it doesn't, which is why I'm looking at adding CUTLASS :) It does this for non-GEMM computations by compiling kernels, but for cuBLAS, for example, we are forced to materialize temporary buffers at known offsets, and the overhead adds up.
Gotcha, thanks.
I'm also considering keeping it in XLA as a template specialization, as this is a little bit too XLA-specific (especially the int32_t offsets; in general int64_t makes more sense, but it is harder to target from XLA).
@kadeng, does torch have this need?
Not at this moment, but the argument about improving CUDA graph reusability appears compelling.
I implemented this inside XLA with template specializations here: openxla/xla#7916, so I don't need it in CUTLASS right now. In general, though, I think it would be very useful if dynamic offsets could work with epilogues and also with inputs (I didn't look at how to make it work with TMA), and making this more generic and less XLA-focused is worthwhile, mostly because of CUDA graphs: they are getting more powerful with every CUDA release, and with on-device control flow this is really handy.
Hello! Before we go ahead with accepting this MR for the default epilogue, I wanted to ask some questions about its generality to some of the other epilogues we have. The default epilogue is what we call a direct-store epilogue: it uses no shared memory and therefore cannot swizzle its output stores, leading to suboptimal performance. Additionally, it does not support fusions via EVT. This epilogue was designed as a vanilla epilogue to aid in development of mainloops and is mostly a debugging tool rather than a zippy fusion+store API. We recommend using the TMA EVT epilogue on SM90 for best performance, or the SM70 vectorized epilogue on non-TMA architectures via the 3.x API. If we were to accept this MR, would you see it used in production workloads despite its suboptimal performance, or shall we discuss ways to generalize this to all of our epilogues, including the performant ones?
This is a general feature that we'd need for both inputs and outputs (epilogues). We know the "base" address at run time when we prepare TMA descriptors (when they are initialized inside CUTLASS from arguments), but the real address of the input/output buffers can depend on offsets computed on device (strides are known ahead of time). In the extreme, we can always set the "base" address to 0 (nullptr) and reuse the same CUDA graph with a CUTLASS kernel for all problems of the same shape at different memory locations; a rough sketch of that pattern is below.
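Something like this host-side pattern, assuming the captured kernel reads its output offset from a device buffer (all names here are hypothetical):

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// With a null base pointer baked into the captured graph, the device-resident
// offset alone encodes the absolute address, so one instantiated graph can be
// reused for every problem of the same shape, wherever its buffers live.
void run_for_buffer(cudaGraphExec_t graph_exec, cudaStream_t stream,
                    int64_t* d_offset_D,  // offset slot the captured kernel reads
                    float* d_output) {    // actual destination for this launch
  // Base is nullptr, so the "offset" is just the absolute element index of
  // the destination (cudaMalloc allocations are sizeof(float)-aligned).
  int64_t offset = reinterpret_cast<int64_t>(d_output) / sizeof(float);
  cudaMemcpyAsync(d_offset_D, &offset, sizeof(offset),
                  cudaMemcpyHostToDevice, stream);
  cudaGraphLaunch(graph_exec, stream);  // same graph, different destination
}
```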