Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit test_hipcub_device_radix_sort memory usage #454

Open
wants to merge 2 commits into
base: release-staging/rocm-rel-6.4
Choose a base branch
from

Conversation

umfranzw
Copy link
Collaborator

On Windows, HipcubDeviceRadixSort.SortKeysLargeSizes fails due to an out of memory error on some devices. Limiting the data sizes that used are used for this test fixes this problem.

@stanleytsang-amd
Copy link
Collaborator

Could we not implement a run-time check on the current's GPU's total available memory via hipDeviceProp_t instead, and skip the test if the available memory is not sufficient?

@stanleytsang-amd
Copy link
Collaborator

We already have a check at

#define HIP_CHECK_MEMORY(condition) \
and its being used in that test case - so if the test is still executing and failing due to out of memory, we should look into the root cause for that.

On Windows, HipcubDeviceRadixSort.SortKeysLargeSizes fails due to an
out of memory error on some devices. This happens because of an issue
that sometimes causes hipMalloc to return hipSuccess for some allocation
requests that are too large. This prevents us from being able to
reliably detect whether a data size is too large for the test.

This change works around the problem for now by limiting the data sizes
that used are used for the test.
@umfranzw umfranzw force-pushed the limit_test_radix_sort_mem_usage branch from df2b24c to 1b94b50 Compare January 17, 2025 21:38
@umfranzw umfranzw requested a review from a team as a code owner January 17, 2025 21:38
@umfranzw
Copy link
Collaborator Author

We already have a check at

#define HIP_CHECK_MEMORY(condition) \

and its being used in that test case - so if the test is still executing and failing due to out of memory, we should look into the root cause for that.

Good idea - I took a closer look at this, and the problem seems to be that in the latest Windows hipSDK build, hipMalloc returns hipSuccess with some sizes that returned hipErrorOutOfMemory in previous hipSDK builds. I'll log a bug with the hipSDK team. We can use this change as a workaround for now if you like. I've added a note about this problem to the "known issues" section of the changelog.

Copy link
Contributor

@spolifroni-amd spolifroni-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changelog wording changes.

CHANGELOG.md Outdated Show resolved Hide resolved
Co-authored-by: spolifroni-amd <[email protected]>
Copy link
Member

@Naraenda Naraenda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of working around SDK bugs. Is that bug still to be fixed before it releases? If so, then we wouldn't need this workaround.

@@ -1215,7 +1215,7 @@ inline void sort_keys_large_sizes()

hipStream_t stream = 0;

const std::vector<size_t> sizes = test_utils::get_large_sizes(seeds[0]);
const std::vector<size_t> sizes = test_utils::get_large_sizes<34>(seeds[0]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to document workarounds in the source as well and not impact Linux builds.

Suggested change
const std::vector<size_t> sizes = test_utils::get_large_sizes<34>(seeds[0]);
// Workaround: `hipMalloc` always returns `hipSuccess` even when allocation fails.
// We limit the maximum size so this bug doesn't occur.
#ifdef _WIN32
const std::vector<size_t> sizes = test_utils::get_large_sizes<34>(seeds[0]);
#else
const std::vector<size_t> sizes = test_utils::get_large_sizes(seeds[0]);
#endif

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants