
Tiny upstream pr #1094

Merged
merged 704 commits into facebookresearch:main from qianfengz:tiny_upstream_pr on Sep 10, 2024

Conversation

qianfengz
Contributor

This PR provides:

  1. A sync to the latest Composable Kernel commit, which adds an inline-asm implementation of the fp32-to-bf16 RTN (round-to-nearest) conversion. Using the inline-asm RTN conversion improves performance when BF16+RTN is used; a reference sketch of the RTN semantics follows this list.
  2. Compiler options for building the C++ extension on ROCm/HIP, which improve the performance of the HIP FMHA BWD on ROCm 6.2; see the setup.py sketch after the benchmark results below.
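
The conversion in Composable Kernel is inline asm; for reference only, here is a minimal NumPy sketch of the same round-to-nearest-even fp32-to-bf16 semantics (the standard bias-and-truncate trick; NaN handling omitted — an illustration, not the CK implementation):

```python
import numpy as np

def fp32_to_bf16_rtn(x: np.ndarray) -> np.ndarray:
    """Round fp32 to bf16 with round-to-nearest-even; returns raw uint16 bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    lsb = (bits >> 16) & np.uint32(1)   # last bit that survives truncation
    bias = np.uint32(0x7FFF) + lsb      # ties round toward even mantissas
    return ((bits + bias) >> 16).astype(np.uint16)
```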
  • The following are benchmark results compared with Triton when using RTN, with the compile options added, on ROCm 6.2:
Run reference fwd:
Reference fwd time: 28.90159034729004
Run reference bwd:
Reference bwd time: 48.68329620361328
Run triton fwd:
Triton fwd time: 2.0252671241760254
Run triton bwd:
Triton bwd time: 6.977703094482422
Run CK fwd:
xformers fwd time: 1.8350895643234253
Run CK bwd:
xformers bwd time: 7.089707374572754
(triton_dq - ref_dq).abs().mean()=tensor(0.0002, device='cuda:0', dtype=torch.bfloat16)
(triton_dk - ref_dk).abs().mean()=tensor(0.0001, device='cuda:0', dtype=torch.bfloat16)
(triton_dv - ref_dv).abs().mean()=tensor(0.0004, device='cuda:0', dtype=torch.bfloat16)
(xformer_dq - ref_dq).abs().mean()=tensor(0.0002, device='cuda:0', dtype=torch.bfloat16, grad_fn=<MeanBackward0>)
(xformer_dk - ref_dk).abs().mean()=tensor(0.0002, device='cuda:0', dtype=torch.bfloat16, grad_fn=<MeanBackward0>)
(xformer_dv - ref_dv).abs().mean()=tensor(6.7234e-05, device='cuda:0', dtype=torch.bfloat16, grad_fn=<MeanBackward0>)
  • The following are benchmark results compared with Triton when using RTN, without the compile options added, on ROCm 6.2:
Run reference fwd:
Reference fwd time: 28.867050170898438
Run reference bwd:
Reference bwd time: 48.91793441772461
Run triton fwd:
Triton fwd time: 2.056668996810913
Run triton bwd:
Triton bwd time: 6.982858180999756
Run CK fwd:
xformers fwd time: 1.8234171867370605
Run CK bwd:
xformers bwd time: 7.428786754608154
(triton_dq - ref_dq).abs().mean()=tensor(0.0002, device='cuda:0', dtype=torch.bfloat16)
(triton_dk - ref_dk).abs().mean()=tensor(0.0001, device='cuda:0', dtype=torch.bfloat16)
(triton_dv - ref_dv).abs().mean()=tensor(0.0004, device='cuda:0', dtype=torch.bfloat16)
(xformer_dq - ref_dq).abs().mean()=tensor(0.0002, device='cuda:0', dtype=torch.bfloat16, grad_fn=<MeanBackward0>)
(xformer_dk - ref_dk).abs().mean()=tensor(0.0002, device='cuda:0', dtype=torch.bfloat16, grad_fn=<MeanBackward0>)
(xformer_dv - ref_dv).abs().mean()=tensor(8.7738e-05, device='cuda:0', dtype=torch.bfloat16, grad_fn=<MeanBackward0>)
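
For context on item 2: compile options for a torch C++/HIP extension are passed through extra_compile_args in setup.py. Below is a hedged sketch of that wiring only — the flag, extension name, and source path are placeholders, not the options this PR actually adds (those are in the setup.py diff):

```python
# Hedged sketch: flag, name, and source path below are placeholders.
import torch
from torch.utils.cpp_extension import CUDAExtension

extra_compile_args = {"cxx": ["-O3"], "nvcc": ["-O3"]}
if torch.version.hip is not None:
    # Under ROCm, torch's build system feeds the "nvcc" flag list to hipcc,
    # so HIP-specific tuning options are appended here.
    extra_compile_args["nvcc"].append("-DPLACEHOLDER_HIP_TUNING_FLAG")

ext_module = CUDAExtension(
    name="example_fmha_ext",                # placeholder extension name
    sources=["csrc/example_fmha_hip.cpp"],  # placeholder source
    extra_compile_args=extra_compile_args,
)
```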

qianfengz and others added 30 commits starting February 5, 2024, including:
  • ensure ck_decoder does not dispatch in test_attn_bias_padded
  • Apply the existing linters (1/n)
@facebook-github-bot added the CLA Signed and module: rocm labels on Sep 5, 2024.
@danthe3rd
Contributor

Thanks! Can you fix the formatting of setup.py though? (See the linter CI.)

@qianfengz
Contributor Author

Is any further layout change needed?

@danthe3rd
Contributor

Sorry, forgot about that PR :)
Let me merge it

@danthe3rd danthe3rd merged commit 0004c67 into facebookresearch:main Sep 10, 2024
22 of 27 checks passed
@qianfengz qianfengz deleted the tiny_upstream_pr branch September 20, 2024 08:10