[CK_TILE] Shrink lse_accum distributed tensor size in fmha splitkv combine kernel #1577
base: develop
Conversation
constexpr index_t MThreads = kBlockSize / NThreads;
constexpr index_t MPerThread = kMPerBlock / MThreads;
constexpr index_t MThreadPerWarp = get_warp_size() / NThreads;
constexpr index_t MWarps = kMPerBlock / MThreadPerWarp;
MWarps = kMPerBlock / MThreadPerWarp? This is confusing. Why not MWarps = kBlockSize / get_warp_size()?
Because I'm using a hard-coded kBlockSize = 256, kBlockSize / get_warp_size() (= 4) would exceed the real number of warps needed here if kMPerBlock is less than 64.
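
A minimal sketch of that arithmetic, assuming a wave64 GPU (so get_warp_size() returns 64) and hypothetical tile parameters NThreads = 4 and kMPerBlock = 32; the stand-in get_warp_size() below is for illustration only, not the kernel's actual helper:

#include <cstdio>

using index_t = int;

// Illustrative stand-in for the real warp-size query; assumes a wave64 GPU.
constexpr index_t get_warp_size() { return 64; }

constexpr index_t kBlockSize = 256; // hard-coded block size from the discussion above
constexpr index_t NThreads   = 4;   // hypothetical thread count along N
constexpr index_t kMPerBlock = 32;  // hypothetical rows per block (< 64)

constexpr index_t MThreadPerWarp = get_warp_size() / NThreads;  // 64 / 4 = 16
constexpr index_t MWarps         = kMPerBlock / MThreadPerWarp; // 32 / 16 = 2

// Naive count: every warp in the 256-thread block.
constexpr index_t MWarpsNaive = kBlockSize / get_warp_size();   // 256 / 64 = 4

static_assert(MWarps < MWarpsNaive,
              "only 2 of the block's 4 warps hold rows of the M x N tile");

int main() {
    std::printf("MWarps = %d, kBlockSize / get_warp_size() = %d\n",
                MWarps, MWarpsNaive);
    return 0;
}

With these values only 2 warps actually cover the kMPerBlock rows, so sizing the distributed tensor with kBlockSize / get_warp_size() = 4 warps would allocate twice the lse_accum storage the tile needs.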
…ROCm/composable_kernel into feature/optimize-splitkv-combine-kernel
No description provided.