FA3 forward performance regression on H200 #1438

Open

complexfilter opened this issue Jan 10, 2025 · 7 comments

@complexfilter
I ran some benchmark tests on H200 at bf16 and fp8 precision.

I found that in the forward pass, H200 is slightly slower than H100 (4% on average), while in the backward pass H200 is slightly faster (3.5% on average).

I was wondering whether the slower forward pass is expected, given that H200 is a higher-end part than H100. Do we need something like an FA3.5 that adapts to and exploits H200?

@tridao
Member

tridao commented Jan 11, 2025

What TFLOPS do you get?
Which version of the code (e.g. which commit) did you use?

@complexfilter
Author

What TFLOPS do you get? Which version of the code (e.g. which commit) did you use?

I'm having some difficulty using ncu to compute the FLOPs right now, but I have the runtime results:
when (batch_size, num_heads, seq_len, dim) = (1, 8, 32768, 64),

  • for H100, forward bf16 (full attention): 4.27728 ms.
  • for H200, forward bf16 (full attention): 4.51709 ms.

I used commit 3cea2fb.
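
As a rough cross-check without ncu, the TFLOPS can be estimated analytically from the runtimes above; here is a minimal sketch, assuming the standard 4 · batch · heads · seqlen² · headdim FLOP count for a non-causal attention forward pass:

def attention_fwd_tflops(batch, heads, seqlen, headdim, time_ms):
    # Two large matmuls (Q @ K^T and P @ V), each ~2 * seqlen^2 * headdim
    # FLOPs per head, so ~4 * seqlen^2 * headdim per head in total.
    flops = 4 * batch * heads * seqlen**2 * headdim
    return flops / (time_ms * 1e-3) / 1e12

# Shapes and timings from the measurements above.
print(attention_fwd_tflops(1, 8, 32768, 64, 4.27728))  # H100: ~514 TFLOPS
print(attention_fwd_tflops(1, 8, 32768, 64, 4.51709))  # H200: ~487 TFLOPS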

@tridao
Member

tridao commented Jan 11, 2025

Can you try the latest commit?

@complexfilter
Author

complexfilter commented Jan 14, 2025

Hi @tridao, I tried the latest commit, but when running python setup.py install in the hopper folder I encountered the error message below.

[71/102] /mnt/localssd/flash-attention/hopper/../third_party/nvidia/backend/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/localssd/flash-attention/hopper/build/temp.linux-x86_64-cpython-310/instantiations/flash_fwd_hdimall_bf16_packgqa_sm90.o.d -I/mnt/localssd/flash-attention/hopper -I/mnt/localssd/flash-attention/csrc/cutlass/include -I/home/colligo/fa3/lib/python3.10/site-packages/torch/include -I/home/colligo/fa3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/colligo/fa3/lib/python3.10/site-packages/torch/include/TH -I/home/colligo/fa3/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/colligo/fa3/include -I/usr/include/python3.10 -c -c /mnt/localssd/flash-attention/hopper/instantiations/flash_fwd_hdimall_bf16_packgqa_sm90.cu -o /mnt/localssd/flash-attention/hopper/build/temp.linux-x86_64-cpython-310/instantiations/flash_fwd_hdimall_bf16_packgqa_sm90.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' --threads 4 -O3 -std=c++17 --ftemplate-backtrace-limit=0 --use_fast_math --resource-usage -lineinfo -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_DEBUG_TRACE_LEVEL=0 -DNDEBUG -gencode arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_3_cuda -D_GLIBCXX_USE_CXX11_ABI=0 FAILED: /mnt/localssd/flash-attention/hopper/build/temp.linux-x86_64-cpython-310/instantiations/flash_fwd_hdimall_bf16_packgqa_sm90.o

@complexfilter
Author

I have posted the error message as a separate issue.

@complexfilter
Author

Hi @tridao, after fixing the setup issue I started benchmarking the latest build.
Compared with the old commit 3cea2fb, the latest commit offers:

  • on average a 5% forward speedup on H100 and a 6% forward speedup on H200 at bf16.
  • on average an 8% forward speedup on both H100 and H200 at fp8.
  • on average a 25% backward speedup on H100 and a 19% backward speedup on H200 at bf16.

However, on the latest commit, H200 is still consistently slower than H100 for bf16 forward, bf16 backward, and fp8 forward. The performance drop on H200 ranges from 1% to 3%.

@tridao
Member

tridao commented Jan 15, 2025

That's great to hear!
I'm guessing the tile sizes are tuned for H100. You can try tuning them for H200 (these are the fwd pass tile sizes):
https://github.com/Dao-AILab/flash-attention/blob/main/hopper/tile_size.h
Tbh I don't expect H200 to be much faster than H100 for this setting (training), since it's pretty much FLOPS-bound, and H200 and H100 have the same max FLOPS. The case where I'd expect a difference is inference (decode), where it's memory-bound and the higher memory bandwidth of H200 will help.
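
A back-of-envelope roofline makes the compute-bound point concrete. The sketch below uses approximate spec-sheet numbers (assumed here, not measured): roughly the same ~989 dense BF16 TFLOPS on both GPUs, with ~3.35 TB/s HBM bandwidth on H100 versus ~4.8 TB/s on H200.

def machine_balance(tflops, tb_per_s):
    # FLOPs the GPU can execute per byte of HBM traffic at peak.
    return (tflops * 1e12) / (tb_per_s * 1e12)

def attn_fwd_intensity(seqlen, headdim, bytes_per_elt=2):
    # FLOPs ~= 4 * seqlen^2 * headdim per head; HBM traffic ~= reading
    # Q, K, V and writing O once each, i.e. 4 * seqlen * headdim elements.
    flops = 4 * seqlen**2 * headdim
    bytes_moved = 4 * seqlen * headdim * bytes_per_elt
    return flops / bytes_moved

print(machine_balance(989, 3.35))     # H100: ~295 FLOPs/byte
print(machine_balance(989, 4.8))      # H200: ~206 FLOPs/byte
print(attn_fwd_intensity(32768, 64))  # ~16384 FLOPs/byte, far above both

Since the kernel's arithmetic intensity is orders of magnitude above both machine balances, the forward pass sits on the compute roof on both GPUs; decode processes one query at a time, which drops the intensity to order one, and that is where H200's extra bandwidth pays off.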

Btw you can set these env variables to make compilation faster (by disabling features):

FLASH_ATTENTION_DISABLE_BACKWARD=FALSE
FLASH_ATTENTION_DISABLE_SPLIT=TRUE
FLASH_ATTENTION_DISABLE_LOCAL=TRUE
FLASH_ATTENTION_DISABLE_PAGEDKV=TRUE
FLASH_ATTENTION_DISABLE_FP16=TRUE
FLASH_ATTENTION_DISABLE_FP8=TRUE
FLASH_ATTENTION_DISABLE_APPENDKV=TRUE
FLASH_ATTENTION_DISABLE_VARLEN=TRUE
FLASH_ATTENTION_DISABLE_CLUSTER=FALSE
FLASH_ATTENTION_DISABLE_PACKGQA=TRUE
FLASH_ATTENTION_DISABLE_SOFTCAP=TRUE
FLASH_ATTENTION_DISABLE_HDIM64=TRUE
FLASH_ATTENTION_DISABLE_HDIM96=TRUE
FLASH_ATTENTION_DISABLE_HDIM128=FALSE
FLASH_ATTENTION_DISABLE_HDIM192=TRUE
FLASH_ATTENTION_DISABLE_HDIM256=TRUE
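
One way to apply them (a sketch, assuming hopper/setup.py reads these flags from the environment at build time) is to set them before invoking the build, for example from a small Python wrapper:

import os
import subprocess

env = dict(os.environ)
env.update({
    "FLASH_ATTENTION_DISABLE_SPLIT": "TRUE",
    "FLASH_ATTENTION_DISABLE_FP8": "TRUE",
    # ... set the remaining FLASH_ATTENTION_DISABLE_* flags listed above ...
})

# Run the build in the hopper folder with the flags applied.
subprocess.run(["python", "setup.py", "install"], cwd="hopper", env=env, check=True)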
