Add support for dynamic offsets to DefaultEpilogue #1274
base: main
Conversation
Dynamic offsets in `DefaultEpilogue` allow moving pointer arithmetic to the device: the `C` and `D` pointers are shifted by offsets stored in device memory. Depends on #1273.
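A minimal sketch of the idea (the struct and field names below are illustrative, not the actual PR diff):

```cpp
#include <cstdint>

// Illustrative only: the epilogue arguments carry pointers to offsets that
// live in device memory, and the kernel applies them when forming the C/D
// pointers, so the host never needs to know the final addresses.
struct EpilogueArguments {
  float const*   ptr_C;     // base pointer for the source matrix C
  float*         ptr_D;     // base pointer for the destination matrix D
  int32_t const* offset_C;  // device-resident element offset for C (may be nullptr)
  int32_t const* offset_D;  // device-resident element offset for D (may be nullptr)
};

// Called on device at the start of the epilogue: shift the base pointers by
// the offsets read from device memory.
__device__ inline void apply_dynamic_offsets(EpilogueArguments& args) {
  if (args.offset_C) { args.ptr_C += *args.offset_C; }
  if (args.offset_D) { args.ptr_D += *args.offset_D; }
}
```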
@ezhulenev, what is the use case for this?
In XLA, inside loops (and in general inside control flow) we keep buffer offsets on device. This, for example, allows us to put two GEMMs writing at different offsets computed at run time into different If branches and capture both of them into a single CUDA graph (using the conditional graph nodes added in CUDA 12.3). Without dynamic offsets we would be forced to move the offset value to the host and build multiple CUDA graphs.
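For context, a tiny sketch of the device-side piece that makes this possible; the kernel and handle names are made up, and `cudaGraphSetConditional` is the device-side API that comes with CUDA 12.3 conditional graph nodes:

```cpp
#include <cuda_runtime.h>

// Hypothetical predicate kernel: a device-computed condition selects which
// IF-node body (and therefore which GEMM, writing at its own offset) runs
// inside one captured CUDA graph, with no round trip to the host.
__global__ void select_branch(cudaGraphConditionalHandle if_true,
                              cudaGraphConditionalHandle if_false,
                              const int* predicate) {
  cudaGraphSetConditional(if_true,  *predicate != 0);
  cudaGraphSetConditional(if_false, *predicate == 0);
}
```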
How does XLA handle this now when not using CUTLASS?
Well… it doesn't, which is why I'm looking at adding CUTLASS :) It does this for non-GEMM computations by compiling kernels, but for cuBLAS, for example, we are forced to materialize temporary buffers at known offsets, and the overhead adds up.
Gotcha, thanks.
I'm also considering keeping it in XLA as a template specialization, as this is a little bit too XLA-specific (especially the int32_t offsets; in general int64_t makes more sense, but it is harder to target from XLA).
@kadeng, does torch have this need?
Not at this moment, but the argument about improving CUDA graph reusability appears compelling.
I implemented this inside XLA with template specializations here: openxla/xla#7916, so I don't need it in CUTLASS right now. In general, though, I think it would be very useful if dynamic offsets could work with epilogues and also with inputs (I didn't look at how to make it work with TMA), and making this more generic and less XLA-focused is worthwhile, mostly because of CUDA graphs: they are getting more powerful with every CUDA release, and with on-device control flow this is really handy.
Hello! Before we go ahead with accepting this MR for the default epilogue, I wanted to ask some questions about its generality to some of the other epilogues we have. The default epilogue is what we call a direct-store epilogue: it uses no shared memory and therefore cannot swizzle its output stores, leading to suboptimal performance. Additionally, it does not support fusions via EVT. This epilogue was designed as a vanilla epilogue to aid in development of mainloops and is mostly a debugging tool rather than a zippy fusion+store API. We recommend using the TMA EVT epilogue on SM90 for best performance, or the SM70 vectorized epilogue on non-TMA architectures via the 3.x API. If we were to accept this MR, would you see it used in production workloads despite its suboptimal performance, or shall we discuss ways to generalize this to all of our epilogues, including the performant ones?
This is a general feature that we'd need for both inputs and outputs (epilogues). We know the "base" address at run time when we prepare TMA descriptors (when they are initialized inside CUTLASS from arguments), but the real address of the input/output buffers can depend on offsets computed on device (strides are known ahead of time). In the extreme, we can always set the "base" address to 0 (nullptr) and reuse the same CUDA graph with a CUTLASS kernel for all problems of the same shape at different memory locations; a rough sketch of that pattern is below.
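Something like this host-side pattern, assuming the captured kernel reads its output offset from a device buffer (all names here are hypothetical):

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// With a null base pointer baked into the captured graph, the device-resident
// offset alone encodes the absolute address, so one instantiated graph can be
// reused for every problem of the same shape, wherever its buffers live.
void run_for_buffer(cudaGraphExec_t graph_exec, cudaStream_t stream,
                    int64_t* d_offset_D,  // offset slot the captured kernel reads
                    float* d_output) {    // actual destination for this launch
  // Base is nullptr, so the "offset" is just the absolute element index of
  // the destination (cudaMalloc allocations are sizeof(float)-aligned).
  int64_t offset = reinterpret_cast<int64_t>(d_output) / sizeof(float);
  cudaMemcpyAsync(d_offset_D, &offset, sizeof(offset),
                  cudaMemcpyHostToDevice, stream);
  cudaGraphLaunch(graph_exec, stream);  // same graph, different destination
}
```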