
Refactor scheduler and implement spinner thread for Partr. #56475

Open · wants to merge 6 commits into base: master

Conversation

@gbaraldi (Member) commented Nov 6, 2024

Also adds an option for child-first scheduling.
I'm splitting this off from the work-stealing PR to make review easier. This part should be much simpler to merge.

The spinner design is roughly based on Go's and mostly reuses the seq-cst barriers we currently have for sleeping. Unlike n_threads_running, though, the new counters rely on the thread actually having been woken up, so they will underreport.
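
To make that concrete, here is a rough Julia sketch of the wakeup protocol (all names below are illustrative stand-ins, not the PR's actual symbols):

# Illustrative counters only; not the PR's actual fields.
const n_spinning = Threads.Atomic{Int}(0)  # threads actively searching for work
const n_idle     = Threads.Atomic{Int}(0)  # threads parked in the sleep loop

# Called on schedule: wake at most one thread, and only when nobody is already
# spinning and at least one thread is actually asleep.
function maybe_wake_spinner(wake_one)
    n_spinning[] > 0 && return nothing  # an awake spinner will pick the task up
    n_idle[] == 0 && return nothing     # everyone is running; nothing to wake
    wake_one()
    return nothing
end

# Run by a thread after it wakes up: it only registers as a spinner once it is
# actually awake, which is why these counters can transiently underreport.
function spinner_loop(find_and_run_task)
    Threads.atomic_add!(n_spinning, 1)
    while find_and_run_task()
    end
    Threads.atomic_sub!(n_spinning, 1)
    return nothing
end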

function schedule(t::Task)
    if ChildFirst
        ct = current_task()
        if ct.sticky || t.sticky
Member

This should actually check whether set_task_tid succeeded, so that this isn't a data race (even though this is dead code right now).
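
For illustration, a sketch of the kind of check being asked for (hypothetical helper; the (Any, Cint) signature and the nonzero-on-success return convention for jl_set_task_tid are assumptions here):

# Hypothetical helper, not the PR's code: pin `t` to the current thread and
# only use the tid if claiming it actually succeeded.
function try_pin_to_current_thread!(t::Task)
    tid = Threads.threadid()
    # Assumed convention: nonzero return means the task's tid was claimed.
    if ccall(:jl_set_task_tid, Cint, (Any, Cint), t, tid - 1) != 0
        push!(Base.workqueue_for(tid), t)  # we own the pin, so tid is safe to use
        return true
    end
    return false  # another thread pinned it first; caller should enqueue normally
end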

Member Author

Actually, any use of yieldto seems to have this problem, so maybe it deserves another look.

src/scheduler.c
@oscardssmith added the multithreading (Base.Threads and related functionality) label on Jan 10, 2025
This also adds a counter for idle/sleeping threads to avoid checking every thread when everyone is running.
@gbaraldi (Member Author)

Using

function fib(n::Int)
    n <= 1 && return n
    t = Threads.@spawn fib(n - 2)
    return fib(n - 1) + fetch(t)::Int
end

as a benchmark this shows a pretty measurable improvement:
nightly

julia> @benchmark fib(20)
BenchmarkTools.Trial: 1571 samples with 1 evaluation.
 Range (min … max):  2.895 ms … 7.492 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.958 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.181 ms ± 544.458 μs  ┊ GC (mean ± σ):  6.30% ± 10.37%

  ▆█▃            ▁▁▁                                           
  ███▇▄▄▁▁▁▁▁▃▅███████▇▆▅▅▅▅▅▇▅▆▅▅▅▇▆▆▄▅▆▅▃▅▆▃▆▅▄▅▅▄▅▃▄▃▃▁▄▁▄ █
  2.9 ms       Histogram: log(frequency) by time      5.45 ms <

 Memory estimate: 3.71 MiB, allocs estimate: 67768

PR

julia> @benchmark fib(20)
BenchmarkTools.Trial: 1882 samples with 1 evaluation.
 Range (min … max):  2.403 ms … 5.918 ms  ┊ GC (min … max): 0.00% … 57.77%
 Time  (median):     2.456 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.655 ms ± 501.850 μs  ┊ GC (mean ± σ):  6.95% ± 11.26%

  ▆█▃            ▁▂▁                                           
  ███▇▃▁▁▁▁▁▁▁▅▆█████▇▇▇▄▅▅▃▅▆▆▅▃▅▆▅▇▅▅▅▆▆▅▅▅▅▅▅▆▄▅▅▅▃▆▄▅▃▅▅▅ █
  2.4 ms       Histogram: log(frequency) by time      4.75 ms <

 Memory estimate: 3.51 MiB, allocs estimate: 54732.

While this benchmark isn't super comprehensive, it is pretty much just measuring scheduler latency. Given that this doesn't really change any scheduler decisions and only changes the wakeup logic, it seems like a pretty nice improvement.

@gbaraldi requested a review from @vtjnash on January 20, 2025, 19:57
@oscardssmith added the performance (Must go faster) label on Jan 20, 2025
    return t
end

const ChildFirst = false
Member

Should we not add ChildFirst for now?

Member Author

Yeah, it's not even correct for now.

ccall(:jl_wakeup_thread, Cvoid, (Int16,), (tid - 1) % Int16)

if (tid == 0)
Core.Intrinsics.atomic_fence(:sequentially_consistent)
Member

Has the codegen for this been fixed on AMD?

Member Author

We have this in our branch, but llvm/llvm-project#106555 is still there

Comment on lines +429 to +433
JL_DLLEXPORT void jl_add_spinner(void)
{
    jl_task_t *ct = jl_current_task;
    add_spinner(ct);
}
Member

We could probably pass current_task in from Julia? It's cheaper to get there.
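
Concretely, the suggestion on the Julia side would look something like this, assuming jl_add_spinner were changed to accept the task (hypothetical signature):

# Hypothetical: only valid if the C entry point is changed to take the task.
ccall(:jl_add_spinner, Cvoid, (Any,), current_task())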

Comment on lines +157 to +174
# This task is stuck to a thread that's likely sleeping; move the task to its
# private queue and wake that thread up. We move it out of the queue to avoid
# spinning on it.
tid2 = Threads.threadid(task)
if tid2 != 0
    ntasks = heap.ntasks
    @atomic :monotonic heap.ntasks = ntasks - Int32(1)
    heap.tasks[1] = heap.tasks[ntasks]
    Base._unsetindex!(heap.tasks, Int(ntasks))
    prio1 = typemax(UInt16)
    if ntasks > 1
        multiq_sift_down(heap, Int32(1))
        prio1 = heap.tasks[1].priority
    end
    @atomic :monotonic heap.priority = prio1
    push!(Base.workqueue_for(tid2), task)
    unlock(heap.lock)
    ccall(:jl_wakeup_thread, Cvoid, (Int16,), (tid2 - 1) % Int16)
else
Member

Can you explain why this code is now needed? Maybe it would also make sense to factor it out?

Member Author

I'll add a more comprehensive comment, but the gist is:

  • Thread 2, running Task1, calls wait for some reason, maybe on a lock.
  • Thread 2 tries to find more work, fails, and goes to sleep (while still holding on to Task1).
  • Thread 1 notifies Task1, scheduling it.

This is where the new design might not work as expected: since we no longer wake every thread on every schedule, we can decide to try to run Task1 while it is still stuck to thread 2, which requires waking thread 2 up. To keep other threads from spinning on the stuck task, we push it to thread 2's private queue, since regardless of what happens that thread has to run the task at least once.

Member Author

As for factoring it out: I guess we could factor out a "delete task from the partr heap" helper, but I'm not sure we'd gain much.
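
A rough sketch of what such a helper could look like, based on the inlined code above (the name is made up, and the caller is assumed to hold heap.lock):

# Hypothetical helper: pop the task at the top of a partr heap, restore the
# heap invariant, update the cached priority, and return the removed task.
function multiq_delete_top!(heap)
    ntasks = heap.ntasks
    @atomic :monotonic heap.ntasks = ntasks - Int32(1)
    task = heap.tasks[1]
    heap.tasks[1] = heap.tasks[ntasks]
    Base._unsetindex!(heap.tasks, Int(ntasks))
    prio1 = typemax(UInt16)
    if ntasks > 1
        multiq_sift_down(heap, Int32(1))
        prio1 = heap.tasks[1].priority
    end
    @atomic :monotonic heap.priority = prio1
    return task
end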
