🚀 Feature Request
TransformerEngine has advanced Attention kernels, including support for FlashAttention-3 and low-precision kernels.
Motivation
Having TransformerEngine's Attention as an `attn_impl` option would be super nice due to the additional features for H100 users.

[Optional] Implementation
Would require some changes in MPT configuration and adding that new Attention layer.
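As a rough illustration only, a new attention layer could wrap `transformer_engine.pytorch.DotProductAttention` behind an MPT-style module and be exposed as an extra `attn_impl` value. The class name, constructor arguments, and integration points below are assumptions for the sketch, not the actual llm-foundry interface:

```python
# Hypothetical sketch: wiring TransformerEngine's attention into an MPT-style
# attention block. Names/arguments here are assumptions, not llm-foundry API.
import torch
import torch.nn as nn
import transformer_engine.pytorch as te


class TEAttention(nn.Module):
    """Attention block backed by TransformerEngine's fused attention kernels."""

    def __init__(self, d_model: int, n_heads: int, attn_pdrop: float = 0.0):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.Wqkv = nn.Linear(d_model, 3 * d_model)
        # DotProductAttention dispatches to the fastest backend TransformerEngine
        # supports on the current GPU (e.g. FlashAttention or cuDNN fused attention).
        self.core_attn = te.DotProductAttention(
            num_attention_heads=n_heads,
            kv_channels=self.head_dim,
            attention_dropout=attn_pdrop,
            attn_mask_type="causal",
            qkv_format="bshd",
        )
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        q, k, v = self.Wqkv(x).chunk(3, dim=-1)
        # Reshape to (batch, seq, heads, head_dim) to match qkv_format="bshd".
        q = q.view(b, s, self.n_heads, self.head_dim)
        k = k.view(b, s, self.n_heads, self.head_dim)
        v = v.view(b, s, self.n_heads, self.head_dim)
        ctx = self.core_attn(q, k, v)  # (batch, seq, d_model)
        return self.out_proj(ctx)
```

Backend selection (e.g. which fused/flash kernel TransformerEngine uses) would then largely be handled by TransformerEngine itself rather than by llm-foundry, which is part of the appeal.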
Additional context
Not yet sure if I am available for the implementation, but wanted to get the request and discussion out there for now. :)
There was a previous PR with a similar proposal here: #803