sched_ext: watchdog aware of RT tasks on CPUs' runqueues #1202
The condition can be handled by implementing
Unfortunately, I also don't think we can solve the problem in general, but maybe we can handle some special cases. For example, just earlier today I received a message with the following trace (and I've received many similar reports) raising concerns about a potential scx_bpfland regression, when, in fact, it was just an RT task hogging a CPU and consequently stalling some per-CPU kthreads:
So, I was wondering if we could be more explicit in the trace to better explain this particular (non-sched_ext) issue. There's not much to do on the BPF scheduler side. Also, kicking out the scheduler doesn't really improve the situation, because the CPU will continue to be monopolized even when all the tasks are moved back to SCHED_NORMAL (however, it can solve the problem if the scheduler is running in partial mode and a SCHED_NORMAL task is hogging the CPU). Maybe we can handle the special case of per-CPU tasks starved by RT tasks, when not running in partial mode, something like the following (pseudo-code / not tested):
At that point, wouldn't there already be an RT throttling message in dmesg?
Hm... if I run
@htejun @arighi do you think we can add some logic to the watchdog to check this?

```c
/* Only if a task exceeds the watchdog timeout */
if (unlikely(time_after(jiffies,
			last_runnable + scx_watchdog_timeout))) {
	...
	/* Only if we are not in partial mode (SCX_OPS_SWITCH_PARTIAL not set) */
	const struct cpumask *mask = p->cpus_ptr;
	bool task_stalled = false;
	int cpu;
	...
	for_each_cpu(cpu, mask) {
		struct rq *target_rq = cpu_rq(cpu);

		/*
		 * A usable CPU with no RT tasks queued means the task
		 * had a chance to run: treat this as a real stall.
		 */
		if (target_rq->rt.rt_nr_running == 0) {
			task_stalled = true;
			break;
		}
	}
}
```

This means that if a timed-out task has at least one usable CPU whose runqueue is free of RT tasks, it is certainly stalled. Otherwise, if it was starved by RT tasks, we can choose either to report this condition in the trace (when the scheduler exits) or to mark it as not stalled at all.
I think we may need some locking here to access cpu_rq. Maybe we could just focus on per-CPU tasks for now, which would probably prevent 99% of the false-positive stalls, and do something like the following (totally untested):
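A minimal sketch of the idea (untested; `task_really_stalled()` is a hypothetical helper name, and it assumes the watchdog already holds the lock of the rq it is scanning, so restricting the check to per-CPU tasks needs no extra locking):

```c
/*
 * Hypothetical helper (untested sketch): decide whether a timed-out
 * task is genuinely stalled by the BPF scheduler or merely starved
 * by RT tasks monopolizing its only usable CPU.
 *
 * Assumes the caller already holds @rq's lock, as the watchdog does
 * while scanning each CPU's runnable list, so limiting the check to
 * per-CPU tasks avoids touching any other CPU's runqueue.
 */
static bool task_really_stalled(struct task_struct *p, struct rq *rq)
{
	/* Only special-case per-CPU tasks: their sole usable CPU is @rq's. */
	if (p->nr_cpus_allowed != 1)
		return true;

	/*
	 * If no RT task is queued on the task's only usable CPU, the
	 * task had a chance to run: report a real sched_ext stall.
	 * Otherwise the CPU is monopolized by RT tasks and the stall
	 * is not the BPF scheduler's fault.
	 */
	return rq->rt.rt_nr_running == 0;
}
```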
Okay, I think we just need to copy what fair is doing and create an equivalent dl_server for sched_ext (an scx_server mirroring the fair_server).
I did some tests by bringing exactly what fair does now into scx, starting/stopping the dl_server in the enqueue/dequeue paths (CFS additionally has the throttling part and therefore handles it slightly differently), and I can confirm that the starvation problem has completely disappeared, so it looks like the removal of the default RT bandwidth control is what exposed this issue.

Regarding the implementation, on the deadline scheduler side the logic for handling the dl_server is currently closely coupled to the fair_server (which is assumed to be the sole dl_server), so we can either abstract some parts of the current logic to handle both types of server (fair_server and scx_server), or maybe keep things completely separate.
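For reference, a minimal sketch of what mirroring fair's approach could look like (`scx_server`, `scx_server_has_tasks()`, `scx_server_pick_task()`, and `scx_server_init()` are hypothetical names here; `dl_server_init()`, `dl_server_start()`, and `dl_server_stop()` are the existing hooks the fair_server already uses):

```c
/* Hypothetical scx_server wiring (untested), modeled on rq->fair_server. */

static bool scx_server_has_tasks(struct sched_dl_entity *dl_se)
{
	/* Tell the deadline server whether any sched_ext task is runnable. */
	return !!dl_se->rq->scx.nr_running;
}

static struct task_struct *scx_server_pick_task(struct sched_dl_entity *dl_se)
{
	/* Hand the next sched_ext task to the deadline server. */
	return pick_task_scx(dl_se->rq);
}

void scx_server_init(struct rq *rq)
{
	struct sched_dl_entity *dl_se = &rq->scx_server;

	init_dl_entity(dl_se);
	dl_server_init(dl_se, rq, scx_server_has_tasks, scx_server_pick_task);
}

/*
 * Then, mirroring fair:
 *   - enqueue path, first sched_ext task becomes runnable:
 *         dl_server_start(&rq->scx_server);
 *   - dequeue path, last sched_ext task leaves the rq:
 *         dl_server_stop(&rq->scx_server);
 */
```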
Re. the deadline side, it's difficult to tell without looking at the code. Once you have patches ready to share, post them upstream and possibly describe what the alternative approach would look like?
In scenarios where CPU-intensive RT tasks fully occupy a CPU, SCHED_EXT tasks confined to that CPU may trigger a stall and cause the scheduler to exit. This behavior might be interpreted by end users as a scheduler bug, leading to bug reports. However, it is not a bug: the scheduler exits because runnable tasks cannot proceed while the CPU is monopolized by RT tasks.
Possible solution
To address this, we could implement a mechanism (possibly using a flag) that instructs the watchdog to perform an additional check:
After detecting a task that exceeds the timeout threshold, the watchdog would verify whether RT tasks are running on the runqueues of the CPUs available to the stalled task. If RT tasks are indeed monopolizing those CPUs, the watchdog would mark the task as not stalled.

This additional check would only be performed when we are not in partial mode (i.e., when SCX_OPS_SWITCH_PARTIAL is not set). The soft-lockup mechanism already in place would still prevent runqueues from being stuck indefinitely, even if tasks are not immediately reported as stalled.
Check #1202 (comment)
Example:
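As an illustration (the specific flags and values below are assumptions, not the original command), an RT CPU hog pinned to a single CPU can be created with stress-ng:

```sh
# Pin one SCHED_FIFO CPU stressor to CPU 0 for 60 seconds; any RT
# priority combined with a runtime longer than the sched_ext watchdog
# timeout should reproduce the stall.
sudo stress-ng --cpu 1 --taskset 0 --sched fifo --sched-prio 50 --timeout 60
```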
Results: scx_bpfland stall:

This issue is reproducible with other schedulers as well by increasing the timeout in stress-ng.