Refactor scheduler and implement spinner thread for Partr. #56475
Conversation
```julia
function schedule(t::Task)
    if ChildFirst
        ct = current_task()
        if ct.sticky || t.sticky
```
This should actually check whether set_task_tid succeeded, so that this isn't a data race (even though this is dead code right now).
Actually, any use of yieldto seems to have this problem, so maybe it deserves another look.
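For context, this is the pattern Base already uses for sticky scheduling: check the return value of jl_set_task_tid before enqueueing. A minimal sketch of a checked version of the path above (`schedule_sticky` is a hypothetical name; the ccall signatures are the ones Base uses):

```julia
# Hypothetical sketch of a race-free sticky scheduling path.
# jl_set_task_tid returns 0 if the task is already bound to a different
# thread, so checking it closes the data race mentioned above.
function schedule_sticky(t::Task, tid::Int)
    if ccall(:jl_set_task_tid, Cint, (Any, Cint), t, tid - 1) == 0
        error("cannot set task thread")  # lost the race for this task
    end
    push!(Base.workqueue_for(tid), t)  # thread-private queue, no partr heap
    ccall(:jl_wakeup_thread, Cvoid, (Int16,), (tid - 1) % Int16)
    return t
end
```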
Force-pushed from 2593629 to 0d72173.
This also adds a counter for idle/sleeping threads to avoid checking every thread when everyone is running.
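A minimal toy model of that counter idea (names hypothetical; the real bookkeeping lives in the C scheduler and reuses the existing seq-cst sleep barriers):

```julia
# Toy model of the idle-thread counter: the wakeup path is skipped
# entirely while every thread is known to be running.
const n_idle = Threads.Atomic{Int}(0)

mark_sleeping() = Threads.atomic_add!(n_idle, 1)  # thread about to park
mark_awake()    = Threads.atomic_sub!(n_idle, 1)  # thread resumed

function maybe_wakeup(wake::Function)
    n_idle[] == 0 && return false  # fast path: nobody is sleeping
    wake()                         # otherwise, actually signal a thread
    return true
end
```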
Force-pushed from 0d72173 to 9ecf779.
Using

```julia
function fib(n::Int)
    n <= 1 && return n
    t = Threads.@spawn fib(n - 2)
    return fib(n - 1) + fetch(t)::Int
end
```

as a benchmark, this shows a pretty measurable improvement.

Master:

```julia
julia> @benchmark fib(20)
BenchmarkTools.Trial: 1571 samples with 1 evaluation.
 Range (min … max):  2.895 ms …   7.492 ms  ┊ GC (min … max): 0.00% …  0.00%
 Time  (median):     2.958 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.181 ms ± 544.458 μs  ┊ GC (mean ± σ):  6.30% ± 10.37%

  ▆█▃          ▁▁▁
  ███▇▄▄▁▁▁▁▁▃▅███████▇▆▅▅▅▅▅▇▅▆▅▅▅▇▆▆▄▅▆▅▃▅▆▃▆▅▄▅▅▄▅▃▄▃▃▁▄▁▄ █
  2.9 ms        Histogram: log(frequency) by time       5.45 ms <

 Memory estimate: 3.71 MiB, allocs estimate: 67768.
```

PR:

```julia
julia> @benchmark fib(20)
BenchmarkTools.Trial: 1882 samples with 1 evaluation.
 Range (min … max):  2.403 ms …   5.918 ms  ┊ GC (min … max): 0.00% … 57.77%
 Time  (median):     2.456 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.655 ms ± 501.850 μs  ┊ GC (mean ± σ):  6.95% ± 11.26%

  ▆█▃          ▁▂▁
  ███▇▃▁▁▁▁▁▁▁▅▆█████▇▇▇▄▅▅▃▅▆▆▅▃▅▆▅▇▅▅▅▆▆▅▅▅▅▅▅▆▄▅▅▅▃▆▄▅▃▅▅▅ █
  2.4 ms        Histogram: log(frequency) by time       4.75 ms <

 Memory estimate: 3.51 MiB, allocs estimate: 54732.
```

While this benchmark isn't super comprehensive, it is pretty much just measuring scheduler latency, and given that this doesn't really change any scheduler decisions and only changes the wakeup logic, it seems like a pretty nice improvement.
```julia
    return t
end
```

```julia
const ChildFirst = false
```
Should we not add ChildFirst for now?
Yeah, it's not even correct for now.
```julia
ccall(:jl_wakeup_thread, Cvoid, (Int16,), (tid - 1) % Int16)
```

```julia
if (tid == 0)
    Core.Intrinsics.atomic_fence(:sequentially_consistent)
```
Did we fix the codegen for this on AMD?
We have this in our branch, but llvm/llvm-project#106555 is still there
```c
JL_DLLEXPORT void jl_add_spinner(void)
{
    jl_task_t *ct = jl_current_task;
    add_spinner(ct);
}
```
We could probably pass current_task in from Julia? It's cheaper to get there.
```julia
# This task is stuck to a thread that's likely sleeping; move the task
# to that thread's private queue and wake it up.
# We move it out of the heap to avoid spinning on it.
tid2 = Threads.threadid(task)
if tid2 != 0
    ntasks = heap.ntasks
    @atomic :monotonic heap.ntasks = ntasks - Int32(1)
    heap.tasks[1] = heap.tasks[ntasks]
    Base._unsetindex!(heap.tasks, Int(ntasks))
    prio1 = typemax(UInt16)
    if ntasks > 1
        multiq_sift_down(heap, Int32(1))
        prio1 = heap.tasks[1].priority
    end
    @atomic :monotonic heap.priority = prio1
    push!(Base.workqueue_for(tid2), task)
    unlock(heap.lock)
    ccall(:jl_wakeup_thread, Cvoid, (Int16,), (tid2 - 1) % Int16)
else
```
Can you explain why this code is now needed? Maybe it would also make sense to factor it out?
I'll add a more comprehensive comment, but the gist is:
- Thread 2, running Task1, calls wait for some reason, maybe on a lock
- Thread 2 tries to find more work, fails, and goes to sleep (while still holding on to Task1)
- Thread 1 notifies Task1, scheduling that task

This is where things might not work as expected in the new design: since we're no longer waking every thread up on every schedule, it can happen that we decide to try to run Task1 while it's still stuck to thread 2. That requires waking up thread 2. To keep other threads from spinning on the stuck task, we push it to thread 2's private queue, since regardless of what happens thread 2 has to run that task at least once.
Regarding factoring out: I guess we could factor out a delete-task-from-partr helper, but I'm not sure we'd gain much.
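For illustration, here is a self-contained toy of what that factored-out delete-min could look like. This is a deliberately simplified sketch: plain integers stand in for tasks and priorities, and the atomics, locking, and cached-priority updates from the real partr code are omitted.

```julia
# Toy binary min-heap delete-min, mirroring the shape of the inline
# partr code above: move the last element to the root, shrink the
# array, then sift down to restore the heap invariant.
function heap_deletemin!(tasks::Vector{Int})
    n = length(tasks)
    top = tasks[1]
    tasks[1] = tasks[n]
    pop!(tasks)
    n -= 1
    i = 1
    while true
        l, r = 2i, 2i + 1
        s = i
        l <= n && tasks[l] < tasks[s] && (s = l)
        r <= n && tasks[r] < tasks[s] && (s = r)
        s == i && break
        tasks[i], tasks[s] = tasks[s], tasks[i]  # swap down one level
        i = s
    end
    return top
end
```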
Also add option for child first
I'm splitting this from the workstealing PR to facilitate the reviews. This part should be much easier to merge.
The spinner design is roughly based on Go's and mostly reuses the seq-cst barriers we currently have for sleeping. Though unlike n_threads_running, the new counters rely on the thread being woken up, so they will underreport.
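Since the description references Go's design: a rough toy model of the spinning-thread accounting it uses (names hypothetical; the idea is that a bounded number of threads spin looking for work, and a spinner that finds work hands off the spinning role so a concurrent submission isn't missed):

```julia
# Toy model of spinner accounting, sketched after Go's scheme.
const n_spinning = Threads.Atomic{Int}(0)

function try_become_spinner(limit::Int)
    while true
        n = n_spinning[]
        n >= limit && return false  # enough threads are already spinning
        # CAS so two threads can't both take the last spinner slot
        Threads.atomic_cas!(n_spinning, n, n + 1) == n && return true
    end
end

function found_work!(wake_another::Function)
    # If we were the last spinner, wake a successor before running the
    # work, so a submission racing with us isn't lost.
    if Threads.atomic_sub!(n_spinning, 1) == 1  # returns the old value
        wake_another()
    end
end
```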