Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix synchronization issues in radix sort and histogram #2054

Merged
merged 20 commits into from
Feb 10, 2025

Conversation

dmitriy-sobolev
Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev commented Feb 7, 2025

Value (__buf_val) and Count (__buf_count) buffers store data in either local or global memory. Let's use group barriers with proper fences (local or global memory fences) to avoid memory contention issues, which are observed on Xe2 architectures.

The fix changes how __group_barrier is defined, so the PR additionally includes what is done in #1988 to avoid conflicts and fix another issue related to the barriers: we use SYCL 1.2.1 barriers, but oneDPL claims to be SYCL 2020 compatible.

@SergeyKopienko
Copy link
Contributor

SergeyKopienko commented Feb 7, 2025

Taking into account the discussion from #1679
I think that probably the switcher which barrier exactly we should use in all oneDPL code - may be incorrect approach...

else
__dpl_sycl::__group_barrier(__it, __dpl_sycl::__fence_space_global_and_local{});
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just a thought, but could we use a single lambda here and pass the boolean constant as a parameter to it? Something like:

    auto __mem_adjusted_barrier = [__it](auto __is_slm) {
        if constexpr (decltype(__is_slm)::value)
            __dpl_sycl::__group_barrier(__it);
        else
            __dpl_sycl::__group_barrier(__it, __dpl_sycl::__fence_space_global_and_local{});
    };

And when used:

   __mem_adjusted_barrier(_SLM_tag_val{});
  __mem_adjusted_barrier(_SLM_counter{});

See https://godbolt.org/z/GMx64dTb5.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a compile time tag here?
Whether is it simpler to call with a constant, wrapped into __dpl_sycl::name_xx, like?

__dpl_sycl::__group_barrier(__it, __dpl_sycl::fence_global_and_local);

Copy link
Contributor Author

@dmitriy-sobolev dmitriy-sobolev Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MikeDvorskiy, I do not understand how your suggestion is supposed to work. We need to use different fence arguments depending on an SLM tag. Could you be more specific? Perhaps I just do not get your idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meanwhile, I've implemented what Adam suggested.

@dmitriy-sobolev dmitriy-sobolev changed the title Fix memory contention in radix sort Fix synchronization issues in radix sort Feb 7, 2025

#if ONEDPL_SYCL121_GROUP_BARRIER
template <sycl::access::fence_space _Space>
struct __fence_space
Copy link
Contributor

@MikeDvorskiy MikeDvorskiy Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@dmitriy-sobolev dmitriy-sobolev Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that I need to provide some no-op alternative in contrast to the case with __target_device. Defining structures helps to achieve it. Could you provide an example in this context?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you can do this a little simpler...

#if ONEDPL_SYCL121_GROUP_BARRIER
constexpr sycl::access::fence_space __fence_space_local = sycl::access::fence_space::local_space;
constexpr sycl::access::fence_space __fence_space_global = sycl::access::fence_space::global_space;
constexpr sycl::access::fence_space __fence_space_global_and_local = sycl::access::fence_space::global_and_local;
#else
constexpr int __fence_space_local = 0;
constexpr int __fence_space_global = 0;
constexpr int __fence_space_global_and_local = 0;
#endif // ONEDPL_SYCL121_GROUP_BARRIER


template <typename _Item, typename _Space = decltype(__fence_space_local)>
void
__group_barrier(_Item __item, [[maybe_unused]] _Space __space = __fence_space_local)
{
#if ONEDPL_SYCL121_GROUP_BARRIER
    __item.barrier(__space);
#elif _ONEDPL_SYCL2020_GROUP_BARRIER_PRESENT
    sycl::group_barrier(__item.get_group(), sycl::memory_scope::work_group);
#else
#    error "sycl::group_barrier is not supported, and no alternative is available"

#endif
}

Does this work? maybe there is a better "dummy" type / value than int, but otherwise I think it would work.

You would just need to remove the {} from __dpl_sycl::__fence_space_global{} above I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess if we are following the example of __target_device we can do a similar trick to create a typename __fence_space_t which is either sycl::access::fence_space or int depending on ONEDPL_SYCL121_GROUP_BARRIER.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The option with global constants looks good. I've implemented it with adding a dummy type.

@SergeyKopienko
Copy link
Contributor

My assumptions that the changes from this PR is a little bit overcomplicated.
Let me propose alternative approach for this fix: #2055

sycl::group_barrier(__item.get_group(), sycl::memory_scope::work_group);
#else
__item.barrier(sycl::access::fence_space::local_space);
# error "sycl::group_barrier is not supported, and no alternative is available"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether do we get potential issues here with user's code in case when neither ONEDPL_SYCL121_GROUP_BARRIER neither _ONEDPL_SYCL2020_GROUP_BARRIER_PRESENT is defined?

Copy link
Contributor Author

@dmitriy-sobolev dmitriy-sobolev Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If neither macro is defined, this error will appear. _ONEDPL_SYCL2020_GROUP_BARRIER_PRESENT is defined for any compiler irrespective to a version, and for icpx starting with some old release (some pre-2022 release). Is that an issue? Did I get you right?

@SergeyKopienko
Copy link
Contributor

About
"The fix changes how __group_barrier is defined, so the PR additionally includes what is done in #1988 to avoid conflicts and fix another issue related to the barriers: we use SYCL 1.2.1 barriers, but oneDPL claims to be SYCL 2020 compatible."

  • I believe it's incorrect moment to change that before release. Let's do this change later in the next release.

@dmitriy-sobolev dmitriy-sobolev force-pushed the dev/dmitriy-sobolev/radix-fix branch 2 times, most recently from 008e9a8 to 4834d3e Compare February 7, 2025 18:37
@dmitriy-sobolev
Copy link
Contributor Author

dmitriy-sobolev commented Feb 7, 2025

Taking into account the discussion from #1679 I think that probably the switcher which barrier exactly we should use in all oneDPL code - may be incorrect approach...

I briefly checked the other places: they all seem to be using local memory. Hence the current approach should be fine (with this radix-sort as an exception). Another state with SYCL 2020 barrier, which has stronger guarantees, also looks grand. But let me delve deeper.

Co-authored-by: Dan Hoeflinger <[email protected]>
@SergeyKopienko SergeyKopienko removed their request for review February 7, 2025 20:38
@SergeyKopienko
Copy link
Contributor

If somebody will have interest for #2055, please reopen it.

@danhoeflinger
Copy link
Contributor

danhoeflinger commented Feb 7, 2025

I briefly checked the other places: they all seem to be using local memory.
But let me delve deeper.

It looks like one implementation of histogram may have a similar issue (my fault):

__clear_wglocal_histograms when called from __histogram_general_private_global_atomics_submitter is operating on global data, also the barrier in the top level parallel for lambda for that same kernel should be a global fence.

I can come up with a fix if you want, its also something which could probably wait until after the release.

@danhoeflinger
Copy link
Contributor

I've added #2056 in case we want to include it here. The rest of the usages I believe should be in the local memory space. If we have any question here with the changes, lets wait because this implementation requires a very large number of bins to be used.

@SergeyKopienko
Copy link
Contributor

SergeyKopienko commented Feb 10, 2025

I think as far as @dmitriy-sobolev introduced three new types here, probably required to declare that they are device-copyable (is_device_copyable):

  • __fence_space_local
  • __fence_space_global
  • __fence_space_global_and_local

@SergeyKopienko
Copy link
Contributor

Additional question: in the current state of this PR the value __fence_space_global_and_local is never used.
So should we really declare this value __fence_space_global_and_local ?

@SergeyKopienko SergeyKopienko self-requested a review February 10, 2025 09:28
@dmitriy-sobolev dmitriy-sobolev force-pushed the dev/dmitriy-sobolev/radix-fix branch from bb31221 to affaabe Compare February 10, 2025 11:01
@dmitriy-sobolev
Copy link
Contributor Author

Additional question: in the current state of this PR the value __fence_space_global_and_local is never used. So should we really declare this value __fence_space_global_and_local ?

They were indeed unnecessary, I've removed them.

@dmitriy-sobolev
Copy link
Contributor Author

dmitriy-sobolev commented Feb 10, 2025

I think as far as @dmitriy-sobolev introduced three new types here, probably required to declare that they are device-copyable (is_device_copyable):

  • __fence_space_local
  • __fence_space_global
  • __fence_space_global_and_local

I see a couple of problems here:

  • We should not copy these object onto a device.
  • sycl::access::fence_space (enum) and __fence_space_dummy are a trivially copyable types, so they should also be device copyable.

We discussed it offline, and the issue was work-arounded by declaring get_fence as static (a function this issue occurred), but I still have no understanding why the lack of static triggered that issue...

@dmitriy-sobolev dmitriy-sobolev force-pushed the dev/dmitriy-sobolev/radix-fix branch from 7c0a99c to 8430e0c Compare February 10, 2025 16:41
danhoeflinger
danhoeflinger previously approved these changes Feb 10, 2025
Copy link
Contributor

@danhoeflinger danhoeflinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the first clang formatting complaint, not the others. Otherwise, LGTM.

Probably makes sense to move the TODO above the line.

@dmitriy-sobolev dmitriy-sobolev changed the title Fix synchronization issues in radix sort Fix synchronization issues in radix sort and histogram Feb 10, 2025
Copy link
Contributor

@danhoeflinger danhoeflinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@mmichel11 mmichel11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dmitriy-sobolev dmitriy-sobolev merged commit 0bc7341 into main Feb 10, 2025
21 of 22 checks passed
@dmitriy-sobolev dmitriy-sobolev deleted the dev/dmitriy-sobolev/radix-fix branch February 10, 2025 20:18
timmiesmith pushed a commit that referenced this pull request Feb 11, 2025
timmiesmith added a commit that referenced this pull request Feb 12, 2025
Co-authored-by: Dmitriy Sobolev <[email protected]>
Co-authored-by: Dan Hoeflinger <[email protected]>
Co-authored-by: Sergey Kopienko <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants