CUTLASS 3.5.1 is an update to CUTLASS adding:

- Many performance optimizations, improvements, and bug fixes, including fixes for FlashAttention-2.
- Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1.
NOTICE:

- The upcoming CUTLASS 3.6 release will include a breaking refactor of the CUTLASS 3.x convolution `kernel::ConvUniversal` API to bring it in line with `gemm::GemmUniversal` (see the sketch after this notice). After this, the 3.x convolution API will no longer be considered a beta API.
- The upcoming CUTLASS 3.6 release will include a breaking refactor of the Hopper TMA pointer array batched epilogue in order to support grouped GEMMs.
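For context on what "bring it in line with `gemm::GemmUniversal`" means: in CUTLASS 3.x, the kernel layer composes a problem shape with collective mainloop and epilogue types. Below is a minimal sketch following the Hopper GEMM examples shipped with CUTLASS; the builder arguments (data types, alignments, schedules) are illustrative and vary per kernel, and the ConvUniversal refactor announced above is expected to mirror this structure rather than being shown here.

```cpp
// Sketch of the CUTLASS 3.x GemmUniversal composition (Hopper, builder API).
#include "cutlass/cutlass.h"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"

using ElementA           = cutlass::half_t;  // matrix A element type
using ElementB           = cutlass::half_t;  // matrix B element type
using ElementC           = float;            // matrix C/D element type
using ElementAccumulator = float;            // accumulator type

using TileShape    = cute::Shape<cute::_128, cute::_128, cute::_64>;  // CTA tile (M, N, K)
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;       // thread block cluster

// Epilogue collective, built first so the mainloop can carve out shared memory.
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    ElementAccumulator, ElementAccumulator,
    ElementC, cutlass::layout::ColumnMajor, 4,
    ElementC, cutlass::layout::ColumnMajor, 4,
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;

// Mainloop collective (TMA + WGMMA on Hopper, selected by the builder).
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    ElementA, cutlass::layout::RowMajor, 8,
    ElementB, cutlass::layout::ColumnMajor, 8,
    ElementAccumulator,
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAutoCarveout<
        static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
    cutlass::gemm::collective::KernelScheduleAuto
  >::CollectiveOp;

// The kernel layer: ProblemShape + mainloop + epilogue.
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    cute::Shape<int, int, int, int>,   // ProblemShape (M, N, K, L)
    CollectiveMainloop,
    CollectiveEpilogue>;

// Device-layer adapter wrapping the kernel for host-side launch.
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```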
The release notes also list:

- Exposure of L2 `cache_hint`s in TMA copy atoms.
- Exposure of raster order and tile swizzle extent in the CUTLASS library profiler, and example 48.
- A new `GemmSparseUniversal` API for CUTLASS 2.x Ampere kernels to enable serial and parallel split-k for sparse tensor cores, and new tiny tile sizes to better support LLM inference (a conceptual split-k sketch follows this list).
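Split-k partitions the K dimension of a GEMM across thread blocks so small-M/N problems (common in LLM inference) can still fill the GPU. Below is a minimal conceptual CUDA sketch of the parallel variant, not the CUTLASS implementation: kernel names and the workspace layout are hypothetical, and the serial variant would instead accumulate slices in order using semaphores or atomics.

```cpp
// Conceptual parallel split-k: stage 1 computes per-slice partial sums into a
// workspace; stage 2 reduces the slices into C. A is M x K row-major,
// B is K x N row-major, partials holds split_k_slices planes of M x N floats.
#include <cuda_runtime.h>

__global__ void gemm_splitk_partial(const float* A, const float* B,
                                    float* partials, int M, int N, int K,
                                    int split_k_slices) {
  int m = blockIdx.y * blockDim.y + threadIdx.y;
  int n = blockIdx.x * blockDim.x + threadIdx.x;
  int slice = blockIdx.z;  // grid.z enumerates the K slices
  if (m >= M || n >= N) return;

  // Each slice owns a contiguous chunk of the K dimension.
  int k_per_slice = (K + split_k_slices - 1) / split_k_slices;
  int k_begin = slice * k_per_slice;
  int k_end = min(k_begin + k_per_slice, K);

  float acc = 0.f;
  for (int k = k_begin; k < k_end; ++k)
    acc += A[m * K + k] * B[k * N + n];

  // Write the partial sum to this slice's plane of the workspace.
  partials[slice * M * N + m * N + n] = acc;
}

__global__ void gemm_splitk_reduce(const float* partials, float* C,
                                   int M, int N, int split_k_slices) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= M * N) return;
  float acc = 0.f;
  for (int s = 0; s < split_k_slices; ++s)
    acc += partials[s * M * N + idx];
  C[idx] = acc;
}
```

Stage 1 is launched with `gridDim.z = split_k_slices`; the extra grid dimension is what restores occupancy when M x N alone yields too few CTAs.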
This discussion was created from the release CUTLASS 3.5.1.