[CK_TILE] Shrink lse_accum distributed tensor size in fmha splitkv combine kernel #1577

Draft · poyenc wants to merge 14 commits into develop
Conversation

@poyenc (Contributor) commented Oct 16, 2024

No description provided.

constexpr index_t MThreads       = kBlockSize / NThreads;       // threads along M in the whole block
constexpr index_t MPerThread     = kMPerBlock / MThreads;       // M elements handled by each thread
constexpr index_t MThreadPerWarp = get_warp_size() / NThreads;  // threads along M within one warp
constexpr index_t MWarps         = kMPerBlock / MThreadPerWarp; // warps needed to cover kMPerBlock rows
Contributor commented:

MWarps = kMPerBlock / MThreadPerWarp? This is confusing; why not MWarps = kBlockSize / get_warp_size()?

poyenc (Author) replied:

Because I'm using a hard-coded kBlockSize = 256, kBlockSize / get_warp_size() (= 4) would exceed the number of warps actually needed here whenever kMPerBlock is less than 64.
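
The arithmetic is easier to see with concrete numbers. Below is a minimal, standalone C++ sketch, not the kernel code itself; the values NThreads = 4 and kMPerBlock = 32 are illustrative assumptions, and the warp size of 64 is the AMD wavefront size assumed throughout this thread.

#include <cstdio>

using index_t = int;

// assumed warp (wavefront) size of 64, as on the AMD GPUs CK targets
constexpr index_t get_warp_size() { return 64; }

int main()
{
    constexpr index_t kBlockSize = 256; // hard-coded block size from the kernel
    constexpr index_t NThreads   = 4;   // threads along N (illustrative value)
    constexpr index_t kMPerBlock = 32;  // M rows per block, fewer than 64 (illustrative value)

    // reviewer's suggestion: size by the total number of warps launched
    constexpr index_t MWarpsNaive = kBlockSize / get_warp_size(); // = 4

    // author's formula: size by the warps that actually cover kMPerBlock rows
    constexpr index_t MThreadPerWarp = get_warp_size() / NThreads;  // = 16 threads along M per warp
    constexpr index_t MWarpsActual   = kMPerBlock / MThreadPerWarp; // = 2

    std::printf("warps launched:        %d\n", MWarpsNaive);
    std::printf("warps covering M rows: %d\n", MWarpsActual);
    // With only 32 rows and 16 M-threads per warp, 2 of the 4 launched warps
    // hold useful rows; sizing the tile by kBlockSize / get_warp_size() would
    // therefore over-allocate the distributed tensor.
}

In this configuration the author's formula halves the allocation, which is the shrinking of the lse_accum distributed tensor that the PR title refers to.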

qianfengz previously approved these changes Oct 17, 2024