CUTLASS 3.5.1 is an update to CUTLASS adding:

- Many performance optimizations, improvements, and bug fixes, including fixes for FlashAttention-2.
- Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1.
NOTICE:

- The upcoming CUTLASS 3.6 release will include a breaking refactor of the CUTLASS 3.x convolution `kernel::ConvUniversal` API to bring it in line with `gemm::GemmUniversal` (see the sketch after this notice). After this, the 3.x convolution API will no longer be considered a beta API.
- The upcoming CUTLASS 3.6 release will include a breaking refactor of the Hopper TMA pointer array batched epilogue in order to support grouped GEMMs.
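For context on what "bring it in line with `gemm::GemmUniversal`" means: in CUTLASS 3.x, the kernel layer composes a problem shape with collective mainloop and epilogue types. Below is a minimal sketch following the Hopper GEMM examples shipped with CUTLASS; the builder arguments (data types, alignments, schedules) are illustrative and vary per kernel, and the ConvUniversal refactor announced above is expected to mirror this structure rather than being shown here.

```cpp
// Sketch of the CUTLASS 3.x GemmUniversal composition (Hopper, builder API).
#include "cutlass/cutlass.h"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"

using ElementA           = cutlass::half_t;  // matrix A element type
using ElementB           = cutlass::half_t;  // matrix B element type
using ElementC           = float;            // matrix C/D element type
using ElementAccumulator = float;            // accumulator type

using TileShape    = cute::Shape<cute::_128, cute::_128, cute::_64>;  // CTA tile (M, N, K)
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;       // thread block cluster

// Epilogue collective, built first so the mainloop can carve out shared memory.
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    ElementAccumulator, ElementAccumulator,
    ElementC, cutlass::layout::ColumnMajor, 4,
    ElementC, cutlass::layout::ColumnMajor, 4,
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;

// Mainloop collective (TMA + WGMMA on Hopper, selected by the builder).
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    ElementA, cutlass::layout::RowMajor, 8,
    ElementB, cutlass::layout::ColumnMajor, 8,
    ElementAccumulator,
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAutoCarveout<
        static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
    cutlass::gemm::collective::KernelScheduleAuto
  >::CollectiveOp;

// The kernel layer: ProblemShape + mainloop + epilogue.
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    cute::Shape<int, int, int, int>,   // ProblemShape (M, N, K, L)
    CollectiveMainloop,
    CollectiveEpilogue>;

// Device-layer adapter wrapping the kernel for host-side launch.
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```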
The release notes also list:

- Exposure of L2 `cache_hint`s in TMA copy atoms.
- Exposure of raster order and tile swizzle extent in the CUTLASS library profiler, and example 48.
- A new `GemmSparseUniversal` API for CUTLASS 2.x Ampere kernels to enable serial and parallel split-k for sparse tensor cores, and new tiny tile sizes to better support LLM inference (a conceptual split-k sketch follows this list).
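Split-k partitions the K dimension of a GEMM across thread blocks so small-M/N problems (common in LLM inference) can still fill the GPU. Below is a minimal conceptual CUDA sketch of the parallel variant, not the CUTLASS implementation: kernel names and the workspace layout are hypothetical, and the serial variant would instead accumulate slices in order using semaphores or atomics.

```cpp
// Conceptual parallel split-k: stage 1 computes per-slice partial sums into a
// workspace; stage 2 reduces the slices into C. A is M x K row-major,
// B is K x N row-major, partials holds split_k_slices planes of M x N floats.
#include <cuda_runtime.h>

__global__ void gemm_splitk_partial(const float* A, const float* B,
                                    float* partials, int M, int N, int K,
                                    int split_k_slices) {
  int m = blockIdx.y * blockDim.y + threadIdx.y;
  int n = blockIdx.x * blockDim.x + threadIdx.x;
  int slice = blockIdx.z;  // grid.z enumerates the K slices
  if (m >= M || n >= N) return;

  // Each slice owns a contiguous chunk of the K dimension.
  int k_per_slice = (K + split_k_slices - 1) / split_k_slices;
  int k_begin = slice * k_per_slice;
  int k_end = min(k_begin + k_per_slice, K);

  float acc = 0.f;
  for (int k = k_begin; k < k_end; ++k)
    acc += A[m * K + k] * B[k * N + n];

  // Write the partial sum to this slice's plane of the workspace.
  partials[slice * M * N + m * N + n] = acc;
}

__global__ void gemm_splitk_reduce(const float* partials, float* C,
                                   int M, int N, int split_k_slices) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= M * N) return;
  float acc = 0.f;
  for (int s = 0; s < split_k_slices; ++s)
    acc += partials[s * M * N + idx];
  C[idx] = acc;
}
```

Stage 1 is launched with `gridDim.z = split_k_slices`; the extra grid dimension is what restores occupancy when M x N alone yields too few CTAs.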
This discussion was created from the release CUTLASS 3.5.1.