FA3 forward performance regression on H200 #1438

Open

complexfilter opened this issue Jan 10, 2025 · 7 comments

@complexfilter
I ran some benchmark tests on H200 at bf16 and fp8 precision.

I found that in the forward pass, H200 is slightly slower than H100 (4% on average), while in the backward pass H200 is slightly faster (3.5% on average).

I was wondering whether the slower forward pass is expected, given that H200 is a higher-end part than H100. Do we need something like an FA3.5 that adapts to and exploits H200?

@tridao
Member

tridao commented Jan 11, 2025

What TFLOPS do you get?
Which version of the code (e.g. which commit) did you use?

@complexfilter
Author

What TFLOPS do you get? Which version of the code (e.g. which commit) did you use?

I'm having some difficulty using ncu to compute the FLOPs right now, but I have the runtime results:
when (batch_size, num_heads, seq_len, dim) = (1, 8, 32768, 64),

  • for H100, forward bf16 (full attention): 4.27728 ms.
  • for H200, forward bf16 (full attention): 4.51709 ms.

I used commit 3cea2fb.
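
As a rough cross-check without ncu, the TFLOPS can be estimated analytically from the runtimes above; here is a minimal sketch, assuming the standard 4 · batch · heads · seqlen² · headdim FLOP count for a non-causal attention forward pass:

def attention_fwd_tflops(batch, heads, seqlen, headdim, time_ms):
    # Two large matmuls (Q @ K^T and P @ V), each ~2 * seqlen^2 * headdim
    # FLOPs per head, so ~4 * seqlen^2 * headdim per head in total.
    flops = 4 * batch * heads * seqlen**2 * headdim
    return flops / (time_ms * 1e-3) / 1e12

# Shapes and timings from the measurements above.
print(attention_fwd_tflops(1, 8, 32768, 64, 4.27728))  # H100: ~514 TFLOPS
print(attention_fwd_tflops(1, 8, 32768, 64, 4.51709))  # H200: ~487 TFLOPS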

@tridao
Member

tridao commented Jan 11, 2025

Can you try the latest commit?

@complexfilter
Author

complexfilter commented Jan 14, 2025

Hi @tridao, I tried the latest commit, but when running python setup.py install in the hopper folder I encountered the error message below.

[71/102] /mnt/localssd/flash-attention/hopper/../third_party/nvidia/backend/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/localssd/flash-attention/hopper/build/temp.linux-x86_64-cpython-310/instantiations/flash_fwd_hdimall_bf16_packgqa_sm90.o.d -I/mnt/localssd/flash-attention/hopper -I/mnt/localssd/flash-attention/csrc/cutlass/include -I/home/colligo/fa3/lib/python3.10/site-packages/torch/include -I/home/colligo/fa3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/colligo/fa3/lib/python3.10/site-packages/torch/include/TH -I/home/colligo/fa3/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/colligo/fa3/include -I/usr/include/python3.10 -c -c /mnt/localssd/flash-attention/hopper/instantiations/flash_fwd_hdimall_bf16_packgqa_sm90.cu -o /mnt/localssd/flash-attention/hopper/build/temp.linux-x86_64-cpython-310/instantiations/flash_fwd_hdimall_bf16_packgqa_sm90.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' --threads 4 -O3 -std=c++17 --ftemplate-backtrace-limit=0 --use_fast_math --resource-usage -lineinfo -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_DEBUG_TRACE_LEVEL=0 -DNDEBUG -gencode arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_3_cuda -D_GLIBCXX_USE_CXX11_ABI=0 FAILED: /mnt/localssd/flash-attention/hopper/build/temp.linux-x86_64-cpython-310/instantiations/flash_fwd_hdimall_bf16_packgqa_sm90.o

@complexfilter
Author

I have posted the error message as a separate issue.

@complexfilter
Author

Hi @tridao, after fixing the setup issue I started benchmarking the latest build.
Compared with the old commit 3cea2fb, the latest commit offers:

  • on average a 5% forward speedup on H100 and a 6% forward speedup on H200 at bf16.
  • on average an 8% forward speedup on both H100 and H200 at fp8.
  • on average a 25% backward speedup on H100 and a 19% backward speedup on H200 at bf16.

However, on the latest commit, H200 is still consistently slower than H100 for bf16 forward, bf16 backward, and fp8 forward. The performance drop on H200 ranges from 1% to 3%.

@tridao
Member

tridao commented Jan 15, 2025

That's great to hear!
I'm guessing the tile sizes are tuned for H100. You can try tuning them for H200 (these are the fwd pass tile sizes):
https://github.com/Dao-AILab/flash-attention/blob/main/hopper/tile_size.h
Tbh I don't expect H200 to be much faster than H100 for this setting (training), since it's pretty much FLOPS-bound, and H200 and H100 have the same max FLOPS. The case where I'd expect a difference is inference (decode), where it's memory-bound and the higher memory bandwidth of H200 will help.
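
A back-of-envelope roofline makes the compute-bound point concrete. The sketch below uses approximate spec-sheet numbers (assumed here, not measured): roughly the same ~989 dense BF16 TFLOPS on both GPUs, with ~3.35 TB/s HBM bandwidth on H100 versus ~4.8 TB/s on H200.

def machine_balance(tflops, tb_per_s):
    # FLOPs the GPU can execute per byte of HBM traffic at peak.
    return (tflops * 1e12) / (tb_per_s * 1e12)

def attn_fwd_intensity(seqlen, headdim, bytes_per_elt=2):
    # FLOPs ~= 4 * seqlen^2 * headdim per head; HBM traffic ~= reading
    # Q, K, V and writing O once each, i.e. 4 * seqlen * headdim elements.
    flops = 4 * seqlen**2 * headdim
    bytes_moved = 4 * seqlen * headdim * bytes_per_elt
    return flops / bytes_moved

print(machine_balance(989, 3.35))     # H100: ~295 FLOPs/byte
print(machine_balance(989, 4.8))      # H200: ~206 FLOPs/byte
print(attn_fwd_intensity(32768, 64))  # ~16384 FLOPs/byte, far above both

Since the kernel's arithmetic intensity is orders of magnitude above both machine balances, the forward pass sits on the compute roof on both GPUs; decode processes one query at a time, which drops the intensity to order one, and that is where H200's extra bandwidth pays off.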

Btw you can set these env variables to make compilation faster (by disabling features):

FLASH_ATTENTION_DISABLE_BACKWARD=FALSE
FLASH_ATTENTION_DISABLE_SPLIT=TRUE
FLASH_ATTENTION_DISABLE_LOCAL=TRUE
FLASH_ATTENTION_DISABLE_PAGEDKV=TRUE
FLASH_ATTENTION_DISABLE_FP16=TRUE
FLASH_ATTENTION_DISABLE_FP8=TRUE
FLASH_ATTENTION_DISABLE_APPENDKV=TRUE
FLASH_ATTENTION_DISABLE_VARLEN=TRUE
FLASH_ATTENTION_DISABLE_CLUSTER=FALSE
FLASH_ATTENTION_DISABLE_PACKGQA=TRUE
FLASH_ATTENTION_DISABLE_SOFTCAP=TRUE
FLASH_ATTENTION_DISABLE_HDIM64=TRUE
FLASH_ATTENTION_DISABLE_HDIM96=TRUE
FLASH_ATTENTION_DISABLE_HDIM128=FALSE
FLASH_ATTENTION_DISABLE_HDIM192=TRUE
FLASH_ATTENTION_DISABLE_HDIM256=TRUE
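
One way to apply them (a sketch, assuming hopper/setup.py reads these flags from the environment at build time) is to set them before invoking the build, for example from a small Python wrapper:

import os
import subprocess

env = dict(os.environ)
env.update({
    "FLASH_ATTENTION_DISABLE_SPLIT": "TRUE",
    "FLASH_ATTENTION_DISABLE_FP8": "TRUE",
    # ... set the remaining FLASH_ATTENTION_DISABLE_* flags listed above ...
})

# Run the build in the hopper folder with the flags applied.
subprocess.run(["python", "setup.py", "install"], cwd="hopper", env=env, check=True)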
