
Adding local work requesting scheduler that is based on message passing internally #5845

Merged
merged 6 commits into master from local_workstealing_scheduler on Aug 30, 2023

Conversation

@hkaiser (Member) commented Apr 11, 2022

This adds a new experimental work-requesting scheduler to the list of existing schedulers (see the sketch after the list below for the general idea).

  • Using uniform_int_distribution with proper bounds
  • Removing queue index from thread_queues as it was unused
  • Renaming workstealing --> workrequesting
  • Adding adaptive work stealing (steal half/steal one)
    • this makes this scheduler consistently (albeit only slightly) faster than the (default) local-priority scheduler
  • Adding LIFO and FIFO variations of local work-stealing scheduler
    • flyby: fixing HPX_WITH_SWAP_CONTEXT_EMULATION
    • flyby: minor changes to fibonacci_local example
  • Adding high- and low-priority queues
    • flyby: cache_line_data no longer generates warnings/errors if padding is not needed
  • Adding bound queues
  • flyby: using cache_line_data for scheduler states
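
For context: in a work-requesting (message-passing) scheme an idle worker does not pop tasks from a victim's queue directly; it sends a steal request, and the victim hands work over when it next checks its mailbox, so each queue is only ever touched by its owner. The snippet below is a minimal, mutex-based sketch of that idea only; all names (`worker`, `steal_request`, `poll_steal_requests`, and so on) are invented for illustration and do not correspond to the HPX scheduler's internals.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>

// Toy illustration of message-passing based work requesting. All names are
// invented for this sketch and do not correspond to the HPX scheduler code.
using task = int;

struct steal_request
{
    std::size_t requester;    // id of the idle worker asking for work
    bool steal_half;          // adaptive stealing: ask for half or for one task
};

struct worker
{
    std::deque<task> queue;               // owner-only queue, never touched by thieves
    std::deque<steal_request> mailbox;    // incoming steal requests
    std::deque<task> incoming;            // tasks handed over by a victim
    std::mutex mtx;                       // protects mailbox and incoming only

    // Called by an idle worker: post a request into the victim's mailbox
    // instead of popping from the victim's queue directly.
    static void request_work(worker& victim, std::size_t requester_id, bool half)
    {
        std::lock_guard<std::mutex> l(victim.mtx);
        victim.mailbox.push_back(steal_request{requester_id, half});
    }

    // Called by the queue's owner in between executing tasks: answer at most
    // one pending steal request by moving tasks into the requester's inbox.
    void poll_steal_requests(worker* workers)    // workers: array of all workers
    {
        std::optional<steal_request> req;
        {
            std::lock_guard<std::mutex> l(mtx);
            if (!mailbox.empty())
            {
                req = mailbox.front();
                mailbox.pop_front();
            }
        }
        if (!req || queue.empty())
            return;

        std::size_t n =
            req->steal_half ? (std::max)(queue.size() / 2, std::size_t(1)) : 1;

        worker& thief = workers[req->requester];
        std::lock_guard<std::mutex> l(thief.mtx);
        for (std::size_t i = 0; i != n && !queue.empty(); ++i)
        {
            thief.incoming.push_back(queue.back());
            queue.pop_back();
        }
    }
};
```

The `steal_half` flag corresponds to the adaptive steal-half/steal-one behavior listed above; everything else (victim selection, termination, waking up the requester) is omitted.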

@hkaiser (Member Author) commented Apr 13, 2022

First performance measurements show an overall improvement of up to 5-10%. Very promising!

@hkaiser force-pushed the local_workstealing_scheduler branch from a98609d to 26bbc03 on April 13, 2022 22:08
@StellarBot

Performance test report

HPX Performance

Comparison

| BENCHMARK | FORK_JOIN_EXECUTOR | PARALLEL_EXECUTOR | SCHEDULER_EXECUTOR |
|---|---|---|---|
| For Each | (=) | (=) | + |

Info

| Property | Before | After |
|---|---|---|
| HPX Commit | a681968 | cbd27da |
| HPX Datetime | 2022-02-09T15:45:06+00:00 | 2022-04-13T22:08:13+00:00 |
| Envfile | | |
| Hostname | nid00729 | nid01193 |
| Compiler | /apps/daint/SSL/HPX/packages/llvm-11.0.0/bin/clang++ 11.0.0 | /apps/daint/SSL/HPX/packages/llvm-11.0.0/bin/clang++ 11.0.0 |
| Clustername | daint | daint |
| Datetime | 2022-02-09T17:03:21.440240+01:00 | 2022-04-14T00:21:45.647786+02:00 |

Comparison

| BENCHMARK | NO-EXECUTOR |
|---|---|
| Future Overhead - Create Thread Hierarchical - Latch | (=) |

Info

| Property | Before | After |
|---|---|---|
| HPX Commit | 96a2e4b | cbd27da |
| HPX Datetime | 2021-11-11T08:14:57+00:00 | 2022-04-13T22:08:13+00:00 |
| Envfile | | |
| Hostname | nid00006 | nid01193 |
| Compiler | /apps/daint/SSL/HPX/packages/llvm-11.0.0/bin/clang++ 11.0.0 | /apps/daint/SSL/HPX/packages/llvm-11.0.0/bin/clang++ 11.0.0 |
| Clustername | daint | daint |
| Datetime | 2021-11-11T09:28:13.071121+01:00 | 2022-04-14T00:22:02.373105+02:00 |

Comparison

| BENCHMARK | FORK_JOIN_EXECUTOR_DEFAULT_FORK_JOIN_POLICY_ALLOCATOR | PARALLEL_EXECUTOR_DEFAULT_PARALLEL_POLICY_ALLOCATOR | SCHEDULER_EXECUTOR_DEFAULT_SCHEDULER_EXECUTOR_ALLOCATOR |
|---|---|---|---|
| Stream Benchmark - Add | - | (=) | (=) |
| Stream Benchmark - Scale | (=) | (=) | (=) |
| Stream Benchmark - Triad | - | (=) | (=) |
| Stream Benchmark - Copy | (=) | (=) | (=) |

Info

| Property | Before | After |
|---|---|---|
| HPX Commit | 71d8dbe | cbd27da |
| HPX Datetime | 2021-11-10T19:14:21+00:00 | 2022-04-13T22:08:13+00:00 |
| Envfile | | |
| Hostname | nid00120 | nid01193 |
| Compiler | /apps/daint/SSL/HPX/packages/llvm-11.0.0/bin/clang++ 11.0.0 | /apps/daint/SSL/HPX/packages/llvm-11.0.0/bin/clang++ 11.0.0 |
| Clustername | daint | daint |
| Datetime | 2021-11-10T20:28:18.266961+01:00 | 2022-04-14T00:22:17.738365+02:00 |

Explanation of Symbols

| Symbol | Meaning |
|---|---|
| = | No performance change (confidence interval within ±1%) |
| (=) | Probably no performance change (confidence interval within ±2%) |
| (+)/(-) | Very small performance improvement/degradation (≤1%) |
| +/- | Small performance improvement/degradation (≤5%) |
| ++/-- | Large performance improvement/degradation (≤10%) |
| +++/--- | Very large performance improvement/degradation (>10%) |
| ? | Probably no change, but quite large uncertainty (confidence interval within ±5%) |
| ?? | Unclear result, very large uncertainty (±10%) |
| ??? | Something unexpected… |

@hkaiser force-pushed the local_workstealing_scheduler branch 5 times, most recently from 2a7d2fc to 753df97 on April 16, 2022 15:54
@Pansysk75 force-pushed the local_workstealing_scheduler branch from a7f7496 to 511b157 on June 28, 2023 15:31
@Pansysk75 (Member) commented Jun 28, 2023

I started looking at the performance of this scheduler a bit (on the single-NUMA-domain case).

Directly replacing our current scheduler, it performs worse in the general case (at least on the algorithm benchmarks, which I had handy). I also tried using the do_not_combine_tasks hint to bypass the index queues we use on bulk_execute, because that is technically another form of stealing, which would skew the comparison. It is still a bit worse.

In the case of a bulk execution of relatively uniform work, stealing is very limited because our scheduling is already quite balanced anyway. So this would explain the performance deficit, as we are possibly taking on a larger overhead (a more complex scheduler) for a small benefit (a few non-cache-disrupting steals).

And then there is the fact that we are only able to poll for steal requests in between thread executions, which introduces some latency in responding to them.

I will try to produce a best-case scenario for this scheduler, though, even if it is just a proof of concept.

Edit: it seems to perform much better on only a few cores, which could suggest a large number of failed stealing attempts when we have many cores. We could experiment with sending the steal request to where there is actual work.
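
To make the latency point above concrete, here is a toy model of the kind of scheduling loop being discussed; it is purely illustrative and none of the names come from the HPX sources. A steal request that arrives while the victim is inside `work()` cannot be answered until that task has finished.

```cpp
#include <deque>
#include <functional>

// Toy model of the scheduling loop discussed above; names are invented and do
// not match the HPX code.
struct toy_worker
{
    std::deque<std::function<void()>> queue;    // local tasks
    bool steal_request_pending = false;         // another core asked for work
};

// One iteration of the owner's loop: steal requests are only looked at
// *between* task executions, so a long-running task delays the reply and the
// requesting core stays idle until the task finishes.
inline void scheduling_step(toy_worker& self)
{
    if (!self.queue.empty())
    {
        auto work = std::move(self.queue.front());
        self.queue.pop_front();
        work();    // no steal request can be answered while this runs

        if (self.steal_request_pending && !self.queue.empty())
        {
            // hand over some tasks to the requester here (omitted), then:
            self.steal_request_pending = false;
        }
    }
    // else: this worker is idle and would itself send a steal request;
    // targeting a victim known to have work (rather than a random one) is the
    // experiment suggested in the comment above.
}
```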

@hkaiser (Member Author) commented Jun 28, 2023

> I started looking at the performance of this scheduler a bit (on the single-NUMA-domain case).
>
> Directly replacing our current scheduler, it performs worse in the general case (at least on the algorithm benchmarks, which I had handy). I also tried using the do_not_combine_tasks hint to bypass the index queues we use on bulk_execute, because that is technically another form of stealing, which would skew the comparison. It is still a bit worse.
>
> In the case of a bulk execution of relatively uniform work, stealing is very limited because our scheduling is already quite balanced anyway. So this would explain the performance deficit, as we are possibly taking on a larger overhead (a more complex scheduler) for a small benefit (a few non-cache-disrupting steals).
>
> And then there is the fact that we are only able to poll for steal requests in between thread executions, which introduces some latency in responding to them.
>
> I will try to produce a best-case scenario for this scheduler, though, even if it is just a proof of concept.
>
> Edit: it seems to perform much better on only a few cores, which could suggest a large number of failed stealing attempts when we have many cores. We could experiment with sending the steal request to where there is actual work.

I did see improvements on tests that ran a large number of separate tasks (like fibonacci). For uniform iterative parallelism the benefit would probably be small.

@hkaiser force-pushed the local_workstealing_scheduler branch 2 times, most recently from f910674 to a282657 on August 5, 2023 20:53
@Pansysk75 (Member)

@hkaiser You'll still have to add the fix in 1d_stencil_8:
451bbd0#diff-126ef56cea5c727b92b9ec59b26d8c429db7f6655405270d63adda430a4a5091
The sem->wait() call is not guaranteed to block, so the lambda can end up with a reference to a deleted object.

@hkaiser (Member Author) commented Aug 5, 2023

> @hkaiser You'll still have to add the fix in 1d_stencil_8: 451bbd0#diff-126ef56cea5c727b92b9ec59b26d8c429db7f6655405270d63adda430a4a5091 The sem->wait() call is not guaranteed to block, so the lambda can end up with a reference to a deleted object.

I thought that was fixed by: #6294

@Pansysk75 (Member) commented Aug 5, 2023

> I thought that was fixed by: #6294

@hkaiser No, I think it cannot be fixed that way (that's why I had gotten a bit confused there). It's a plain old out-of-scope situation.
Just think of what happens if the semaphore doesn't need to wait for the sem.signal() (which will often happen if the semaphore is not "saturated"):
the created lambda (probably not executed yet) holds a reference to the semaphore, which gets deleted when we go out of scope.
There is no way for the semaphore to know it is being referenced this way.

I think that makes sense, unless I have some misconception about the role of sliding_semaphore.
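
For readers following along, the lifetime problem described here can be sketched as follows. This is not the 1d_stencil_8 code, the header paths are assumed for a recent HPX version, and the shared-ownership fix shown is only one common way to avoid the dangling reference, not necessarily what the referenced commit does.

```cpp
#include <cstdint>
#include <memory>

#include <hpx/future.hpp>
#include <hpx/synchronization/sliding_semaphore.hpp>    // header path may vary across HPX versions

// Sketch of the lifetime issue described above (not the 1d_stencil_8 code).
void throttled_loop()
{
    constexpr std::int64_t window = 10;

    // Problematic shape: a stack-allocated semaphore referenced from a
    // continuation. If sem.wait(t) returns without blocking (the sliding
    // window is not yet saturated), the enclosing scope can end before the
    // continuation runs, leaving `sem` dangling:
    //
    //     hpx::sliding_semaphore sem(window);
    //     for (std::int64_t t = 0; t != 100; ++t)
    //     {
    //         hpx::async([] { /* step t */ }).then(
    //             [&sem, t](auto&&) { sem.signal(t); });
    //         sem.wait(t);    // not guaranteed to block
    //     }
    //
    // One way out (assumed here, not necessarily the exact referenced fix):
    // give the continuation shared ownership of the semaphore.
    auto sem = std::make_shared<hpx::sliding_semaphore>(window);
    for (std::int64_t t = 0; t != 100; ++t)
    {
        hpx::async([] { /* step t */ }).then(
            [sem, t](auto&&) { sem->signal(t); });
        sem->wait(t);    // even if this returns immediately, `sem` stays alive
    }
}
```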

@hkaiser force-pushed the local_workstealing_scheduler branch 3 times, most recently from 6b926b4 to 48b364b on August 23, 2023 12:46
@hkaiser (Member Author) commented Aug 23, 2023

@Pansysk75 I have applied the change to the use of the sliding_semaphore everywhere. This, however, doesn't seem to be sufficient to make things work: https://app.circleci.com/pipelines/github/STEllAR-GROUP/hpx/15639/workflows/eb35fe70-6509-4d54-80a1-3a4454e93d06/jobs/360130

@Pansysk75 (Member) commented Aug 27, 2023

> @Pansysk75 I have applied the change to the use of the sliding_semaphore everywhere. This, however, doesn't seem to be sufficient to make things work: https://app.circleci.com/pipelines/github/STEllAR-GROUP/hpx/15639/workflows/eb35fe70-6509-4d54-80a1-3a4454e93d06/jobs/360130

Seems like tasks get stuck in the stealing queue; re-applying this fix solves the issue:
451bbd0

Did you do something else to try to solve that issue, or was this fix accidentally left behind?

@hkaiser (Member Author) commented Aug 27, 2023

> > @Pansysk75 I have applied the change to the use of the sliding_semaphore everywhere. This, however, doesn't seem to be sufficient to make things work: https://app.circleci.com/pipelines/github/STEllAR-GROUP/hpx/15639/workflows/eb35fe70-6509-4d54-80a1-3a4454e93d06/jobs/360130
>
> Seems like tasks get stuck in the stealing queue; re-applying this fix solves the issue: 451bbd0
>
> Did you do something else to try to solve that issue, or was this fix accidentally left behind?

Thanks a lot - not sure how that got lost. Much appreciated!

… internally

- Using uniform_int_distribution with proper bounds
- Removing queue index from thread_queues as it was unused
  - flyby: remove commented out options from .clang-format
- Renaming workstealing --> workrequesting
- Adding adaptive work stealing (steal half/steal one)
  - this makes this scheduler consistently (albeit only slightly) faster than
    the (default) local-priority scheduler
- Adding LIFO and FIFO variations of local work-stealing scheduler
  - flyby: fixing HPX_WITH_SWAP_CONTEXT_EMULATION
  - flyby: minor changes to fibonacci_local example
- Adding high- and low-priority queues
  - flyby: cache_line_data no longer generates warnings/errors if padding is not needed
- Adding bound queues
- flyby: using cache_line_data for scheduler states
@hkaiser force-pushed the local_workstealing_scheduler branch from cb0e449 to 02e2b4b on August 28, 2023 15:35
@StellarBot

Performance test report

HPX Performance

Comparison

| BENCHMARK | FORK_JOIN_EXECUTOR | PARALLEL_EXECUTOR | SCHEDULER_EXECUTOR |
|---|---|---|---|
| For Each | (=) | ?? | (=) |

Info

| Property | Before | After |
|---|---|---|
| HPX Datetime | 2023-05-10T12:07:53+00:00 | 2023-08-28T15:47:36+00:00 |
| HPX Commit | dcb5415 | de95b0f |
| Compiler | /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1 | /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1 |
| Datetime | 2023-05-10T14:50:18.616050-05:00 | 2023-08-28T12:31:56.660155-05:00 |
| Clustername | rostam | rostam |
| Envfile | | |
| Hostname | medusa08.rostam.cct.lsu.edu | medusa08.rostam.cct.lsu.edu |

Comparison

| BENCHMARK | NO-EXECUTOR |
|---|---|
| Future Overhead - Create Thread Hierarchical - Latch | (=) |

Info

| Property | Before | After |
|---|---|---|
| HPX Datetime | 2023-05-10T12:07:53+00:00 | 2023-08-28T15:47:36+00:00 |
| HPX Commit | dcb5415 | de95b0f |
| Compiler | /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1 | /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1 |
| Datetime | 2023-05-10T14:52:35.047119-05:00 | 2023-08-28T12:34:09.253604-05:00 |
| Clustername | rostam | rostam |
| Envfile | | |
| Hostname | medusa08.rostam.cct.lsu.edu | medusa08.rostam.cct.lsu.edu |

Comparison

| BENCHMARK | FORK_JOIN_EXECUTOR_DEFAULT_FORK_JOIN_POLICY_ALLOCATOR | PARALLEL_EXECUTOR_DEFAULT_PARALLEL_POLICY_ALLOCATOR | SCHEDULER_EXECUTOR_DEFAULT_SCHEDULER_EXECUTOR_ALLOCATOR |
|---|---|---|---|
| Stream Benchmark - Add | (=) | (=) | (=) |
| Stream Benchmark - Scale | (=) | (=) | = |
| Stream Benchmark - Triad | (=) | = | (=) |
| Stream Benchmark - Copy | (=) | - | (=) |

Info

| Property | Before | After |
|---|---|---|
| HPX Datetime | 2023-05-10T12:07:53+00:00 | 2023-08-28T15:47:36+00:00 |
| HPX Commit | dcb5415 | de95b0f |
| Compiler | /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1 | /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1 |
| Datetime | 2023-05-10T14:52:52.237641-05:00 | 2023-08-28T12:34:26.236878-05:00 |
| Clustername | rostam | rostam |
| Envfile | | |
| Hostname | medusa08.rostam.cct.lsu.edu | medusa08.rostam.cct.lsu.edu |

Explanation of Symbols

| Symbol | Meaning |
|---|---|
| = | No performance change (confidence interval within ±1%) |
| (=) | Probably no performance change (confidence interval within ±2%) |
| (+)/(-) | Very small performance improvement/degradation (≤1%) |
| +/- | Small performance improvement/degradation (≤5%) |
| ++/-- | Large performance improvement/degradation (≤10%) |
| +++/--- | Very large performance improvement/degradation (>10%) |
| ? | Probably no change, but quite large uncertainty (confidence interval within ±5%) |
| ?? | Unclear result, very large uncertainty (±10%) |
| ??? | Something unexpected… |

@hkaiser marked this pull request as ready for review on August 30, 2023 15:47
@hkaiser (Member Author) commented Aug 30, 2023

I think this is good to go now. Thanks again @Pansysk75!

@hkaiser (Member Author) commented Aug 30, 2023

bors merge

@bors bot commented Aug 30, 2023

Build succeeded!


@bors bot merged commit 7849eef into master on Aug 30, 2023
47 of 53 checks passed
@bors bot deleted the local_workstealing_scheduler branch on August 30, 2023 16:06
@hkaiser added this to the 1.10.0 milestone on May 5, 2024