Fix synchronization issues in radix sort and histogram #2054
Taking into account the discussion from #1679
else
    __dpl_sycl::__group_barrier(__it, __dpl_sycl::__fence_space_global_and_local{});
};
This is just a thought, but could we use a single lambda here and pass the boolean constant as a parameter to it? Something like:
auto __mem_adjusted_barrier = [__it](auto __is_slm) {
    if constexpr (decltype(__is_slm)::value)
        __dpl_sycl::__group_barrier(__it);
    else
        __dpl_sycl::__group_barrier(__it, __dpl_sycl::__fence_space_global_and_local{});
};
And when used:
__mem_adjusted_barrier(_SLM_tag_val{});
__mem_adjusted_barrier(_SLM_counter{});
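The single-lambda idea can be tried outside of SYCL. The following is only a minimal sketch under stated assumptions: `barrier_local`, `barrier_global_and_local` and `mem_adjusted_barrier` are hypothetical stand-ins that merely report which fence would be requested, not oneDPL code. It shows that a `std::bool_constant` tag passed to a generic lambda selects the branch at compile time.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-ins for the two barrier flavours; they only report
// which fence would be requested instead of synchronizing anything.
inline std::string barrier_local() { return "local"; }
inline std::string barrier_global_and_local() { return "global_and_local"; }

// One generic callable. decltype(is_slm)::value is a compile-time constant,
// so `if constexpr` keeps only the selected branch for each tag type.
inline constexpr auto mem_adjusted_barrier = [](auto is_slm) {
    if constexpr (decltype(is_slm)::value)
        return barrier_local();
    else
        return barrier_global_and_local();
};
```

Calling `mem_adjusted_barrier(std::true_type{})` takes the local-fence branch, and `mem_adjusted_barrier(std::false_type{})` takes the global-and-local one, which mirrors passing `_SLM_tag_val{}` or `_SLM_counter{}` in the suggestion above.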
Why do we need a compile-time tag here?
Wouldn't it be simpler to call it with a constant wrapped into __dpl_sycl::name_xx, like this?
__dpl_sycl::__group_barrier(__it, __dpl_sycl::fence_global_and_local);
@MikeDvorskiy, I do not understand how your suggestion is supposed to work. We need to use different fence arguments depending on an SLM tag. Could you be more specific? Perhaps I just do not get your idea.
Meanwhile, I've implemented what Adam suggested.
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_radix_sort_one_wg.h
#if ONEDPL_SYCL121_GROUP_BARRIER
template <sycl::access::fence_space _Space>
struct __fence_space
I would suggest not writing three-story C++ code...
Actually, we already have a simple approach for wrapping SYCL constants. For example, see the __target_device usage:
https://github.com/uxlfoundation/oneDPL/pull/2054/files#diff-521f7db06a55567c1a1dffa855dde585fb6bc5fbe8633d19b528356e3a501ea0R481
The problem is that I need to provide a no-op alternative, in contrast to the case with __target_device. Defining structures helps to achieve that. Could you provide an example in this context?
Perhaps you can do this a little simpler...
#if ONEDPL_SYCL121_GROUP_BARRIER
constexpr sycl::access::fence_space __fence_space_local = sycl::access::fence_space::local_space;
constexpr sycl::access::fence_space __fence_space_global = sycl::access::fence_space::global_space;
constexpr sycl::access::fence_space __fence_space_global_and_local = sycl::access::fence_space::global_and_local;
#else
constexpr int __fence_space_local = 0;
constexpr int __fence_space_global = 0;
constexpr int __fence_space_global_and_local = 0;
#endif // ONEDPL_SYCL121_GROUP_BARRIER
template <typename _Item, typename _Space = decltype(__fence_space_local)>
void
__group_barrier(_Item __item, [[maybe_unused]] _Space __space = __fence_space_local)
{
#if ONEDPL_SYCL121_GROUP_BARRIER
    __item.barrier(__space);
#elif _ONEDPL_SYCL2020_GROUP_BARRIER_PRESENT
    sycl::group_barrier(__item.get_group(), sycl::memory_scope::work_group);
#else
#    error "sycl::group_barrier is not supported, and no alternative is available"
#endif
}
Does this work? Maybe there is a better "dummy" type / value than int, but otherwise I think it would work.
You would just need to remove the {} from __dpl_sycl::__fence_space_global{} above, I think.
I guess if we are following the example of __target_device, we can do a similar trick to create a typename __fence_space_t which is either sycl::access::fence_space or int, depending on ONEDPL_SYCL121_GROUP_BARRIER.
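To make the "constants plus a dummy type" idea concrete, here is a minimal sketch that compiles without SYCL. Everything in it is an assumption made for illustration: `DEMO_LEGACY_BARRIER` stands in for ONEDPL_SYCL121_GROUP_BARRIER, the `fence_space` enum stands in for sycl::access::fence_space, and `group_barrier` only returns a code describing which path was taken. It is not the code that was merged.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical feature macro standing in for ONEDPL_SYCL121_GROUP_BARRIER.
#ifndef DEMO_LEGACY_BARRIER
#    define DEMO_LEGACY_BARRIER 0
#endif

// Stand-in for sycl::access::fence_space so the sketch builds without SYCL.
enum class fence_space { local_space, global_space, global_and_local };

#if DEMO_LEGACY_BARRIER
// Legacy path: the constants carry a real fence value.
using fence_space_t = fence_space;
inline constexpr fence_space_t fence_space_local = fence_space::local_space;
inline constexpr fence_space_t fence_space_global = fence_space::global_space;
inline constexpr fence_space_t fence_space_global_and_local = fence_space::global_and_local;
#else
// Dummy type: the other barrier path takes no fence argument, so the
// constants exist only to keep call sites identical.
struct fence_space_t {};
inline constexpr fence_space_t fence_space_local{};
inline constexpr fence_space_t fence_space_global{};
inline constexpr fence_space_t fence_space_global_and_local{};
#endif

// Returns the fence index on the legacy path and -1 when the argument is ignored.
inline int group_barrier([[maybe_unused]] fence_space_t space = fence_space_local)
{
#if DEMO_LEGACY_BARRIER
    return static_cast<int>(space);
#else
    return -1;
#endif
}
```

Call sites such as `group_barrier(fence_space_global_and_local)` compile unchanged in both configurations; only the legacy configuration actually uses the argument.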
The option with global constants looks good. I've implemented it, adding a dummy type.
My impression is that the changes in this PR are a little bit overcomplicated.
    sycl::group_barrier(__item.get_group(), sycl::memory_scope::work_group);
#else
    __item.barrier(sycl::access::fence_space::local_space);
#    error "sycl::group_barrier is not supported, and no alternative is available"
Could we get potential issues with users' code here when neither ONEDPL_SYCL121_GROUP_BARRIER nor _ONEDPL_SYCL2020_GROUP_BARRIER_PRESENT is defined?
If neither macro is defined, this error will appear. _ONEDPL_SYCL2020_GROUP_BARRIER_PRESENT is defined for any compiler irrespective of its version, and for icpx starting with some old release (some pre-2022 release). Is that an issue? Did I get you right?
008e9a8 to 4834d3e
I briefly checked the other places: they all seem to be using local memory. Hence the current approach should be fine (with this radix sort as an exception). The other case, with the SYCL 2020 barrier, which has stronger guarantees, also looks fine. But let me delve deeper.
Co-authored-by: Dan Hoeflinger <[email protected]>
If somebody is interested in #2055, please reopen it.
It looks like one implementation of histogram may have a similar issue (my fault):
I can come up with a fix if you want; it's also something which could probably wait until after the release.
I've added #2056 in case we want to include it here. The rest of the usages, I believe, should be in the local memory space. If we have any questions about the changes here, let's wait, because this implementation requires a very large number of bins to be used.
I think that since @dmitriy-sobolev introduced three new types here, we probably need to declare that they are device-copyable (
Additional question: in the current state of this PR the value
bb31221 to affaabe
Co-authored-by: Dmitriy Sobolev <[email protected]>
They were indeed unnecessary, I've removed them.
I see a couple of problems here:
We discussed it offline, and the issue was worked around by declaring
7c0a99c to 8430e0c
I agree with the first clang formatting complaint, not the others. Otherwise, LGTM.
Probably makes sense to move the TODO above the line.
LGTM
LGTM
Co-authored-by: Dan Hoeflinger <[email protected]> Co-authored-by: Sergey Kopienko <[email protected]>
Co-authored-by: Dmitriy Sobolev <[email protected]> Co-authored-by: Dan Hoeflinger <[email protected]> Co-authored-by: Sergey Kopienko <[email protected]>
Value (__buf_val) and Count (__buf_count) buffers store data in either local or global memory. Let's use group barriers with proper fences (local or global memory fences) to avoid memory contention issues, which are observed on Xe2 architectures.

The fix changes how __group_barrier is defined, so the PR additionally includes what is done in #1988 to avoid conflicts and fix another issue related to the barriers: we use SYCL 1.2.1 barriers, but oneDPL claims to be SYCL 2020 compatible.