Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed deadlock in sgmv_shrink kernel caused by skewed segments #35

Merged
merged 3 commits into from
Jan 9, 2024

Conversation

tgaddair
Copy link
Contributor

@tgaddair tgaddair commented Jan 4, 2024

When there is a large imbalance (>= 65 elements in the batch) in the size of two or more segments in a batch, it can lead to deadlocks in the sgmv_shrink kernel.

The crux of the issue was that each grid block can execute a dynamic number of steps depending on the size of its segment (s_end - s_start). However, during each step the block will call grid.sync(). If one block executes more steps than another, it will call grid.sync() a different number of times, leading to a deadlock.

The solution presented here is to compute the max number of steps from the largest segment, and then call grid.sync() at the end of the kernel for the difference between the max steps and the current block's steps.

Because the length of the s vector is generally very small (< batch size), the loop here should not introduce noticeable latency. However, it may be worth exploring more optimized solutions to this problem in a follow-up.

Note that this issue only occurs when using cooperative groups.

Related:

@tgaddair tgaddair changed the title Fix sgmv deadlock Fixed deadlock in sgmv_shrink kernel caused by skewed segments Jan 4, 2024
@tgaddair
Copy link
Contributor Author

tgaddair commented Jan 4, 2024

cc @abcdabcd987

Copy link

codecov bot commented Jan 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (07a40b9) 43.27% compared to head (2a32749) 43.27%.
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master      #35   +/-   ##
=======================================
  Coverage   43.27%   43.27%           
=======================================
  Files          10       10           
  Lines         647      647           
=======================================
  Hits          280      280           
  Misses        367      367           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@abcdabcd987
Copy link
Contributor

Thanks!

@yzh119 Can you take a look?

Copy link
Contributor

@yzh119 yzh119 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tgaddair for the patch, yes I do think you are correct.

@abcdabcd987 abcdabcd987 merged commit 591b598 into punica-ai:master Jan 9, 2024
6 checks passed
@tgaddair tgaddair deleted the fix-sgmv-deadlock branch January 12, 2024 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants