All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Pre-built binary wheels are available for PyTorch 2.6.0. Following PyTorch, we build wheels for CUDA 11.8, 12.4, and 12.6 only (we no longer build for CUDA 12.1). xFormers now requires PyTorch >= 2.6
- [fMHA] Creating a `LowerTriangularMask` no longer creates a CUDA tensor
- [fMHA] Updated Flash-Attention to `v2.7.2.post1`
- [fMHA] Flash-Attention v3 will now be used by `memory_efficient_attention` by default when available, unless the operator is enforced with the `op` keyword-argument (see the sketch below). Switching from Flash2 to Flash3 can make transformer trainings ~10% faster end-to-end on H100s
- [fMHA] Fixed a performance regression with the `cutlass` backend for the backward pass (#1176) - mostly used on older GPUs (e.g. V100)
- Fixed swiglu operator compatibility with torch-compile with PyTorch 2.6
- Fix activation checkpointing of SwiGLU when AMP is enabled (#1152)
- Following PyTorch, xFormers no longer builds binaries for conda. Pip is now the only recommended way to get xFormers
- Removed unmaintained/deprecated components in `xformers.components.*` (see #848)
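
A minimal sketch of enforcing a specific backend through the `op` keyword-argument mentioned above; the shapes, dtype and CUDA device below are illustrative assumptions, not part of the release notes.

```python
import torch
import xformers.ops as xops
from xformers.ops import fmha

# Illustrative shapes: [batch, seq_len, heads, head_dim]
q, k, v = (
    torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
    for _ in range(3)
)

# Default: the best available operator is picked automatically
# (Flash-Attention v3 when it is available).
out = xops.memory_efficient_attention(q, k, v)

# Enforce Flash2 explicitly via the `op` keyword-argument.
out_flash2 = xops.memory_efficient_attention(
    q, k, v, op=(fmha.flash.FwOp, fmha.flash.BwOp)
)
```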
Pre-built binary wheels require PyTorch 2.5.1
Pre-built binary wheels require PyTorch 2.5.0
Properly upload wheels for cuda 12.4
Pre-built binary wheels require PyTorch 2.4.1
- Added wheels for cuda 12.4
- Added conda builds for python 3.11
- Added wheels for rocm 6.1
- Profiler: Fix computation of FLOPS for the attention when using xFormers
- Profiler: Fix MFU/HFU calculation when multiple dtypes are used
- Profiler: Trace analysis to compute MFU & HFU is now much faster
- fMHA/splitK: Fixed `nan` in the output when using a `torch.Tensor` bias where a lot of consecutive keys are masked with `-inf`
- Update Flash-Attention version to `v2.6.3` when building from scratch
- When using the most recent version of Flash-Attention, it is no longer possible to mix it with the cutlass backend. In other words, it is no longer possible to use the cutlass Fw with the flash Bw.
- fMHA: Removed `decoder` and `small_k` backends
- profiler: Removed `DetectSlowOpsProfiler` profiler
- Removed compatibility with PyTorch < 2.4
- Removed conda builds for python 3.11
- Removed windows pip wheels for cuda 12.1 and 11.8
Pre-built binary wheels require PyTorch 2.4.0
Pre-built binary wheels require PyTorch 2.3.1
- fMHA: `PagedBlockDiagonalGappyKeysMask`
- fMHA: heterogeneous queries in `triton_splitk`
- fMHA: support for paged attention in flash
- fMHA: Added backwards pass for `merge_attentions`
- fMHA: Added `torch.compile` support for 3 biases (`LowerTriangularMask`, `LowerTriangularMaskWithTensorBias` and `BlockDiagonalMask`) - some might require PyTorch 2.4
- fMHA: Added `torch.compile` support in `memory_efficient_attention` when passing the flash operator explicitly (e.g. `memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp))`) - see the sketch below
- fMHA: `memory_efficient_attention` now expects its `attn_bias` argument to be on the same device as the other input tensors. Previously, it would convert the bias to the right device.
- fMHA: `AttentionBias` subclasses are now constructed by default on the `cuda` device if available - they used to be created on the CPU device
- 2:4 sparsity: Added `xformers.ops.sp24.sparsify24_ste` for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values
- fMHA: Fixed out-of-bounds reading for Split-K triton implementation
- Profiler: fix bug with modules that take a single tuple as argument
- Profiler: Added manual trigger for a profiling step, by creating a `trigger` file in the profiling directory
- Removed support for PyTorch version older than 2.2
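
A hedged sketch of the `torch.compile` support and device expectations described above (flash operator passed explicitly, the attention bias created on the CUDA device); shapes, dtype and the wrapper function are illustrative assumptions, and compiling with this bias may require PyTorch 2.4.

```python
import torch
import xformers.ops as xops
from xformers.ops import fmha

def attention(q, k, v):
    # The causal bias is constructed on the CUDA device by default and must be
    # on the same device as q/k/v; the flash operator is passed explicitly so
    # that the call can be traced by torch.compile.
    return xops.memory_efficient_attention(
        q, k, v,
        attn_bias=fmha.attn_bias.LowerTriangularMask(),
        op=(fmha.flash.FwOp, fmha.flash.BwOp),
    )

compiled_attention = torch.compile(attention)

q, k, v = (
    torch.randn(2, 2048, 16, 128, device="cuda", dtype=torch.bfloat16, requires_grad=True)
    for _ in range(3)
)
out = compiled_attention(q, k, v)
out.sum().backward()
```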
Pre-built binary wheels require PyTorch 2.3.0
- [2:4 sparsity] Added support for Straight-Through Estimator for `sparsify24` gradient (`GRADIENT_STE`)
- [2:4 sparsity] `sparsify24_like` now supports the cuSparseLt backend, and the STE gradient
- Basic support for `torch.compile` for the `memory_efficient_attention` operator. Currently only supports Flash-Attention, and without any bias provided. We want to expand this coverage progressively.
- merge_attentions no longer needs inputs to be stacked.
- fMHA: triton_splitk now supports additive bias
- fMHA: benchmark cleanup
Pre-built binary wheels require PyTorch 2.2.2
Pre-built binary wheels require PyTorch 2.2.1
- New `merge_attentions` function
- fMHA: New gappy attention biases.
- fMHA: Updated Flash-Attention to v2.5.6: this has a performance improvement for multiquery.
- fMHA: triton_splitk changed and expanded. Now amalgamates using LSE. Can autotune, supports causal with a small number of queries - not just 1. Experimental support for paged attention.
- `rope_padded`: Fixed CUDA error with many queries (more than 65k)
- `rmsnorm`: Fixed CUDA error with large inputs (enables 512k+ sequence length on Llama2 70B)
- fMHA: Removed triton operator (`fmha.triton.*`, `xformers.ops.MemoryEfficientAttentionTritonFwdFlashBwOp`, `xformers.ops.TritonFlashAttentionOp`), as it has correctness issues under some conditions, and is slower than other implementations.
Pre-built binary wheels require PyTorch 2.2.0
- Added components for model/sequence parallelism, as near-drop-in replacements for FairScale/Megatron Column&RowParallelLinear modules. They support fusing communication and computation for sequence parallelism, thus making the communication effectively free.
- Added kernels for training models with 2:4-sparsity. We introduced a very fast kernel for converting a matrix A into 2:4-sparse format, which can be used during training to dynamically sparsify weights, activations, etc. xFormers also provides an API that is compatible with torch-compile, see `xformers.ops.sparsify24` and the sketch below.
- Make selective activation checkpointing be compatible with torch.compile.
- Triton kernels now require a GPU with compute capability 8.0 at least (A100 or newer). This is due to newer versions of triton not supporting older GPUs correctly
- Removed support for PyTorch version older than 2.1.0
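
A minimal, hedged sketch of sparsifying weights dynamically during training with `xformers.ops.sparsify24` named above; the shapes, dtype, and the reliance on `F.linear` dispatching to the sparse kernels are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F
import xformers.ops as xops

# Dense parameter that gets re-sparsified to 2:4 format on the fly each step.
w = torch.nn.Parameter(torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16))
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

w_sparse = xops.sparsify24(w)  # fast dense -> 2:4-sparse conversion
y = F.linear(x, w_sparse)      # assumed to dispatch to the 2:4-sparse kernels
y.sum().backward()             # gradients flow back to the dense parameter
```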
Pre-built binary wheels require PyTorch 2.1.1 (xFormers `0.0.23`) or PyTorch 2.1.2 (xFormers `0.0.23.post1`).
- fMHA: Fixed a bug in cutlass backend forward pass where the logsumexp was not correctly calculated, resulting in wrong results in the BW pass. This would happen with MQA when one sequence has a query with `length % 64 == 1`
- fMHA: Updated Flash-Attention to v2.3.6 - this fixes a performance regression in causal backward passes, and now supports `BlockDiagonalCausalWithOffsetPaddedKeysMask`
- fMHA: Added `LocalAttentionFromBottomRightMask` (local)
- fMHA: Added `LowerTriangularFromBottomRightMask` (causal)
- fMHA: Added `LowerTriangularFromBottomRightLocalAttentionMask` (local + causal)
- Removed `xformers.triton.sum_strided`
- fMHA: Backward pass now works in PyTorch deterministic mode (although slower)
- fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to `memory_efficient_attention`; see the documentation for more details and the sketch below
- fMHA: Added experimental support for Local Attention biases to `memory_efficient_attention`
- Added an example of efficient LLaMa decoding using xformers operators
- Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding - up to 50x faster for long sequences (token decoding up to 8x faster end-to-end)
- Added an efficient rope implementation in triton, to be used in LLM decoding
- Added selective activation checkpointing, which gives fine-grained control of which activations to keep and which activations to recompute
- `xformers.info` now indicates the Flash-Attention version used
- fMHA: Removed `smallK` backend support for CPU. `memory_efficient_attention` only works for CUDA/GPU tensors now
- DEPRECATION: Many classes in `xformers.factory`, `xformers.triton` and `xformers.components` have been or will be deprecated soon (see tracking issue #848)
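
A hedged sketch of the 5-dimensional input layout for Grouped-Query Attention mentioned above; the group/head split, shapes and dtype are illustrative assumptions.

```python
import torch
import xformers.ops as xops

B, M, K = 2, 1024, 128
G, Hq = 2, 4  # G key/value groups, Hq query heads per group

q = torch.randn(B, M, G, Hq, K, device="cuda", dtype=torch.float16)
# One key/value head per group, expanded (stride 0) across the Hq query heads.
k = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16).expand(B, M, G, Hq, K)
v = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16).expand(B, M, G, Hq, K)

out = xops.memory_efficient_attention(q, k, v)  # shape [B, M, G, Hq, K]
```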
- fMHA: Updated flash-attention to v2, with massive performance improvements for both the forward pass and backward pass. This implementation is now used by default when it's available
- fMHA/cutlass: Fix potential race condition in the FW/BW passes
- fMHA/cutlass: Fix `attn_bias` stride overflow for very long sequences (>32k)
- `LowerTriangularMask` is now backward compatible with older xformers versions
- `memory_efficient_attention` now expects the `attn_bias` argument to have a head dimension
- `memory_efficient_attention` no longer broadcasts the batch/head dimensions of `attn_bias`. Please use `.expand` if you need to broadcast the bias (see the sketch below)
- Remove `causal_diagonal` argument from `BlockDiagonalCausalWithOffsetPaddedKeysMask`
- Binary wheels on pypi/conda now contain H100 kernels
- fMHA: Added backend specialized for decoding that does not use TensorCores - useful when not using multiquery
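
A minimal sketch of the new `attn_bias` expectations above: the bias carries an explicit head dimension and is broadcast with `.expand` rather than implicitly; shapes and dtype are illustrative assumptions.

```python
import torch
import xformers.ops as xops

B, H, M, N, K = 2, 8, 1024, 1024, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, N, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, N, H, K, device="cuda", dtype=torch.float16)

# Bias shaped [B, H, M, N]; a single-head bias is broadcast explicitly.
bias = torch.randn(B, 1, M, N, device="cuda", dtype=torch.float16).expand(B, H, M, N)
out = xops.memory_efficient_attention(q, k, v, attn_bias=bias)
```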
NOTE: Binary wheels are now provided only for PyTorch 2 with cuda 11.8. It is still possible to use xFormers with older versions of PyTorch by building from source or using conda.
- fMHA/cutlass (backward): Massive performance improvements when `batch_size * num_heads` is low (10x+)
- fMHA/cutlass: Further performance improvements for both the forward & backward kernels
- fMHA (backward): Now dispatching to cutlass when `embed_dim > 64`
- fMHA: Updated Flash-Attention to `v1.0.5`
- fMHA now runs on H100 (support is experimental)
- Display `nvcc` version used to compile `xformers` in `python -m xformers.info`
- Fixed performance regression with `nvcc>11.6` (#712)
- fMHA/cutlass: Fixed `nan` in the output when using a `torch.Tensor` with `-inf` prefixes as `attn_bias` (#722)
- fMHA/cutlass: Fixed `nan` in the output when the sequence length is larger than `2 ** 15` (#719)
- fMHA/cutlass: Significant performance improvements (up to 2x) for both the forward pass and backward pass
- fMHA/cutlass: The kernels are now deterministic
- fMHA/cutlass: Fixed backward pass correctness when using dropout (#724)
- Added `xformers.ops.index_select_cat` and `xformers.ops.scaled_index_add` - those are experimental functions that only work with a few shapes, and can be used to write efficient stochastic depth in transformer architectures for instance
- fMHA: `memory_efficient_attention` now accepts `torch.Tensor` as attention bias for any seqlen, although there are still requirements on the alignment of the bias tensor (see #683)
- fMHA: Fixed BW pass on Sm86/Sm89 GPUs when `K > 64` (RTX 3090, RTX 4090, A6000, ...) [#631]
- fMHA/CUTLASS: Added tensor attn bias support [#587] - contribution from @jfc4050
- fMHA/CUTLASS: Added tensor attn bias grad support [#587] - contribution from @jfc4050
- fMHA/CUTLASS: Added dropout support [#587] - contribution from @jfc4050
- fMHA: Added support for varying sequence lengths [#500]
- Updated triton dependency [#418]
- Strip lineinfo from binaries, reducing the binary size [#549]
- Added support for pip wheels [#588, #573, #534, #523, ...] big thanks to @AbdBarho!
- Fixed compatibility with Python 3.7 [#541] - thanks to @susumuota
- fMHA: Fixed strides for QKV gradients for cutlass attention [#535]
- fMHA: Stricter inputs validation to avoid CUDA errors for unsupported inputs [#592]
- fMHA/Flash-Attention: Updated to https://github.com/HazyResearch/flash-attention/commit/a1f49a2b92b6fa022379bbebafed9d7f5e96a675 with multiple changes from @TriDao that make the operator up to 20% faster
- fMHA/Flash-Attention: Fixed backward pass wrapper, where non-contiguous gradients could give the wrong result [#548]
- fMHA: Separate each operator into forward and backward operators. It's now possible to use any combination of forward+backward (for instance Triton forward and Flash-Attention backward) [#560]
- fMHA: Added Triton operator for forward pass from Flash-Attention authored by @TriDao, will be automatically used on A100 when compatible
- fMHA: Added `xformers.ops.memory_efficient_attention_forward`, `xformers.ops.memory_efficient_attention_forward_requires_grad`, `xformers.ops.memory_efficient_attention_backward` for power-users who write custom autograd functions [#560] - see the sketch below
- fMHA: Support for custom scaling for the CUTLASS-based kernel [#530] - contribution from @comaniac
- fMHA/CUTLASS: The current CUDA stream is now used by the kernel [#491]
- fMHA/CUTLASS: Improve overall performance
- SwiGLU: Added `xformers.ops.SwiGLU` and its functional counterpart (`xformers.ops.swiglu`) [#490]
- fMHA: Possible to combine CUTLASS's forward with flash-attention's backward pass [#469] - improves performance on A100 for K = 128
- fMHA: Add custom `xformers.ops.unbind` operator to avoid a cat in the attention block [#458]
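
A hedged, forward-only sketch of the power-user entry points listed above, as they might be used inside a custom autograd function or for inference; shapes and dtype are illustrative assumptions.

```python
import torch
import xformers.ops as xops

q, k, v = (
    torch.randn(1, 512, 8, 64, device="cuda", dtype=torch.float16)
    for _ in range(3)
)

# Forward-only call: no autograd graph is recorded; the backward (if any) is
# handled separately, e.g. via xformers.ops.memory_efficient_attention_backward.
out = xops.memory_efficient_attention_forward(q, k, v)
```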
- fMHA: Added CUTLASS-based kernel for `xformers.ops.memory_efficient_attention`. This kernel is automatically selected depending on the inputs, and works on any GPU after P100 [#362]
- Removed duplicated biases in the FusedMLP layers [#317]
- Rotary embeddings respecting input types [#326]
- Poolformer style instantiating useless projection layers [#349]
- Fix layer position not being properly tracked, causing extra layernorms for programmatic xformers [#348]
- Pass use_triton flag to LayerNorm module [#336]
- Four blocksparsity layouts from DeepSpeed [#320]
- Support several initialization options [#312]
- Conv2DFeedforward feedforward part [#321]
- VisualAttention [#329]
- Automatic blocksparse for causal attention [#334]
- Better hierarchical transformer generation [#345]
- Fused operations with AOTAutograd/NVFuser, integration into MLP [#357]
- Refactor LRA code to use Pytorch Lightning [#343]
- Fix some torchscriptability [#246]
- Fix FourierMix being compatible with AMP [#258]
- Better asserts on QKV dimensions [#264]
- Better perfs for FusedMLP and FusedLinearLayer [#283]
- Deepnorm init missing self-attention [#284]
- Simplicial Embeddings [#259]
- Mem efficient attention, FW pass [#267]
- MHA benchmark
- MLP benchmark
- Move all triton kernels to triton v2 [#272]
- Mem efficient attention, BW pass [#281]
- Metaformer support [#294]
- Expose bias flag for feedforwards, same default as Timm [#220]
- Update eps value for layernorm, same default as torch [#221]
- PreNorm bugfix, only one input was normalized [#233]
- Fix bug where embedding dimensions that did not match model dim would lead to a crash [#244]
- Add DeepNet (DeepNorm) residual path and init [#227]
- Compositional Attention [#41]
- Experimental Ragged attention [#189]
- Mixture of Experts [#181]
- BlockSparseTensor [#202]
- Nd-tensor support for triton softmax [#210]
- Bugfix Favor, single feature map [#183]
- Sanity check blocksparse settings [#207]
- Fixed some picklability [#204]
- Embedding weight tying option [#172]
- Dropout setting not properly passed in many attentions [#123]
- Fix self attention optimization not being triggered, broken residual path [#119]
- Improve speed by not using contiguous Tensors when not needed [#119]
- Homogenizing the masks, additive or bool [#79][#85][#86]
- Fix causality flag not being respected [#103]
- Enabling FusedLayerNorm by default in the factory if Triton is available
- Fixing Favor with fp16
- Fixing Favor trainability
- Nystrom causal attention [#75]
- More robust blocksparse [#24]