📚FFPA: Yet another Faster Flash Prefill Attention, with O(1)⚡️SRAM complexity for headdim > 256 and a 1.8x~3x↑🎉 speedup vs SDPA EA.
cuda attention sdpa mla mlsys tensor-cores flash-attention deepseek deepseek-v3 deepseek-r1 fused-mla
Updated Feb 8, 2025 · Cuda
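
For context on the "vs SDPA EA" claim: PyTorch's `scaled_dot_product_attention` typically falls back to its memory-efficient (EA) backend when the head dimension exceeds what the flash-attention backend supports, which is the regime (headdim > 256) this repo targets. Below is a minimal, hypothetical baseline sketch that times SDPA pinned to the EA backend; shape values, the `bench_sdpa_ea` name, and the iteration counts are illustrative assumptions, not from this repo, and `torch.nn.attention.sdpa_kernel` requires a recent PyTorch (≥ 2.3) with a CUDA device.

```python
# Hypothetical baseline benchmark (not from this repo): times PyTorch
# SDPA restricted to the memory-efficient (EA) backend at headdim > 256,
# i.e. the configuration the FFPA speedup claim is measured against.
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend  # PyTorch >= 2.3

def bench_sdpa_ea(batch=1, heads=8, seqlen=4096, head_dim=512, iters=20):
    # Illustrative prefill-style shapes; head_dim > 256 is the key setting.
    q, k, v = (torch.randn(batch, heads, seqlen, head_dim,
                           device="cuda", dtype=torch.half) for _ in range(3))
    # Pin SDPA to the memory-efficient attention backend ("SDPA EA").
    with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
        for _ in range(5):  # warmup
            F.scaled_dot_product_attention(q, k, v)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            F.scaled_dot_product_attention(q, k, v)
        end.record()
        torch.cuda.synchronize()
    print(f"SDPA EA, head_dim={head_dim}: "
          f"{start.elapsed_time(end) / iters:.3f} ms/iter")

if __name__ == "__main__":
    bench_sdpa_ea()
```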