[feat] support ffpa-l1 registers double buffers (#70)

* Update README.md
* Update README.md
* Update env.py
* Update prefill.cuh
* Update ffpa_attn_templates_L1.cuh
* Update launch_templates.cuh
* Update README.md
DefTruth · Feb 4, 2025 · 6a85c42
1 parent 8aade41
Showing 5 changed files with 214 additions and 56 deletions.
diff --git a/README.md b/README.md
@@ -68,9 +68,9 @@ By leveraging this approach, we can achieve better performance for large headdim
 
 |📚Feature |📚Feature |📚Feature |📚Feature|
 |:---:|:---:|:---:|:---:|
-|✔️Tensor Cores|✔️Loop over N/D |✔️Tile Block(Br, Bc) |✔️**MMA(m16n8k16)**|
+|✔️Tensor Cores |✔️**MMA(m16n8k16)** |✔️Tile Block(Br, Bc) |✔️Tile MMA/Warp |
 |✔️**Split Q**(FA-2)|✔️Pack LDST(128 bits)|✔️SMEM **Swizzle/Pad** |✔️Copy Async |
-|✔️Tile MMA/Warp |✔️QKV Multi-Stages(1~4) |✔️Collective Store(**Shfl**)|✔️**Prefetch QKV** g2s |
+|✔️**Reg Double Buffers** |✔️QKV **Multi-Stages(1~4)** |✔️Collective Store(**Shfl**)|✔️**Prefetch QKV** g2s |
 |✔️**QKV Fine-grained Tiling**|✔️**Shared QKV** SMEM|✔️Mixed MMA Acc|✔️**Persist Q** s2r/g2s|
 
 - 📚 case: FFPA `L1` kernel template signature: [ffpa_attn_templates_L1.cuh](csrc/cuffpa/ffpa_attn_templates_L1.cuh)