sched_ext: watchdog aware of RT tasks on CPUs' runqueues #1202

Open
luigidematteis opened this issue Jan 15, 2025 · 10 comments

@luigidematteis
Contributor

luigidematteis commented Jan 15, 2025

In scenarios where CPU-intensive RT tasks fully occupy a CPU, SCHED_EXT tasks confined to that CPU may trigger a stall and cause the scheduler to exit.

This behavior might be interpreted by end-users as a scheduler bug, leading to bug reports. However, it is not a bug: the scheduler exits because its runnable tasks cannot make progress while the CPU is monopolized by RT tasks.

Possible solution
To address this, we could implement a mechanism (possibly using a flag) that instructs the watchdog to perform an additional check:

After detecting a task that exceeds the timeout threshold, the watchdog would verify whether RT tasks are running on the runqueues of the CPUs available to the stalled task. If RT tasks are indeed monopolizing those CPUs, the watchdog would mark the task as not stalled.

This additional check would only be performed when we are not in partial mode (i.e., when SCX_OPS_SWITCH_PARTIAL is not set).

The soft-lockup mechanism already in place would still prevent runqueues from being stuck indefinitely, even if tasks are not marked as stalled immediately.


Check #1202 (comment)


Example:

# Confine a task to CPU 1 (<pid> is the target task's PID)
taskset -p 0x2 <pid>

# CPU-intensive RT task confined to CPU 1
sudo chrt -f 40 taskset 0x2 stress-ng --cpu 1 --cpu-method all --timeout 10s
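
To double-check the setup (standard ps fields, nothing sched_ext-specific: cls is the scheduling class, psr the CPU the task last ran on):

# List FIFO-class (cls=FF) tasks currently on CPU 1
ps -eo pid,cls,rtprio,psr,comm | awk '$2 == "FF" && $4 == 1'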

Results:

scx_bpfland stall:

19:52:00 [INFO] scx_bpfland 1.0.8-ga878d8be x86_64-unknown-linux-gnu SMT on
19:52:00 [INFO] primary CPU domain = 0xffff
19:52:00 [INFO] cpufreq performance level: auto
19:52:00 [INFO] L2 cache ID 0: sibling CPUs: [0, 1]
19:52:00 [INFO] L2 cache ID 1: sibling CPUs: [2, 3]
19:52:00 [INFO] L2 cache ID 2: sibling CPUs: [4, 5]
19:52:00 [INFO] L2 cache ID 3: sibling CPUs: [6, 7]
19:52:00 [INFO] L2 cache ID 4: sibling CPUs: [8, 9]
19:52:00 [INFO] L2 cache ID 5: sibling CPUs: [10, 11]
19:52:00 [INFO] L2 cache ID 6: sibling CPUs: [12, 13]
19:52:00 [INFO] L2 cache ID 7: sibling CPUs: [14, 15]
19:52:00 [INFO] L3 cache ID 0: sibling CPUs: [0, 1, 2, 3, 4, 5, 6, 7]
19:52:00 [INFO] L3 cache ID 1: sibling CPUs: [10, 11, 12, 13, 14, 15, 8, 9]

DEBUG DUMP
================================================================================

kworker/u66:25[50751] triggered exit kind 1026:
  runnable task stall (kworker/1:0[9746] failed to run for 6.144s)

Backtrace:
  scx_watchdog_workfn+0x16d/0x210
  process_one_work+0x179/0x330
  worker_thread+0x252/0x390
  kthread+0xd2/0x100
  ret_from_fork+0x34/0x50
  ret_from_fork_asm+0x1a/0x30

CPU states
----------

CPU 1   : nr_run=2 flags=0x5 cpu_rel=0 ops_qseq=5497 pnt_seq=7066
          curr=stress-ng-cpu[131293] class=rt_sched_class

  R kworker/1:0[9746] -6144ms
      scx_state/flags=3/0x9 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=66690225236
      cpus=0002

    kthread+0xd2/0x100
    ret_from_fork+0x34/0x50
    ret_from_fork_asm+0x1a/0x30

This issue is reproducible with other schedulers as well by increasing the timeout in stress-ng.

Note: schedulers can have different timeout thresholds for the watchdog, with a maximum of 30 seconds.
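
For reference, this threshold comes from each scheduler's own sched_ext_ops declaration; a minimal fragment sketch (the example_* names are hypothetical), assuming the usual scx/common.bpf.h helpers:

/* Fragment only: the watchdog threshold is whatever the scheduler declares in
 * its struct sched_ext_ops; 0 means the default, and the kernel caps it at 30s. */
SCX_OPS_DEFINE(example_ops,
	       .enqueue    = (void *)example_enqueue,
	       .dispatch   = (void *)example_dispatch,
	       .timeout_ms = 10000U,	/* stall threshold: ~10 seconds */
	       .name       = "example");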

[ 9733.479390] sched_ext: BPF scheduler "bpfland" disabled (runnable task stall)
[ 9733.479398] sched_ext: bpfland: kworker/1:0[9746] failed to run for 5.888s
[ 9733.479402]    scx_watchdog_workfn+0x16d/0x210
[ 9733.479413]    process_one_work+0x179/0x330
[ 9733.479421]    worker_thread+0x252/0x390
[ 9733.479427]    kthread+0xd2/0x100
[ 9733.479432]    ret_from_fork+0x34/0x50
[ 9733.479439]    ret_from_fork_asm+0x1a/0x30

[ 5260.002045] sched_ext: BPF scheduler "lavd" enabled
[ 5260.003573] sched_ext: scx_lavd[34948] has zero slice in pick_task_scx()
[ 5352.043412] sched_ext: BPF scheduler "lavd" disabled (runnable task stall)
[ 5352.043424] sched_ext: lavd: kworker/1:1[33210] failed to run for 33.728s
[ 5352.043429]    scx_watchdog_workfn+0x16d/0x210
[ 5352.043445]    process_one_work+0x179/0x330
[ 5352.043455]    worker_thread+0x252/0x390
[ 5352.043463]    kthread+0xd2/0x100
[ 5352.043469]    ret_from_fork+0x34/0x50
[ 5352.043477]    ret_from_fork_asm+0x1a/0x30

[ 5392.586405] sched_ext: "rusty" does not implement cgroup cpu.weight
[ 5392.602830] sched_ext: BPF scheduler "rusty" enabled
[ 5412.972422] sched_ext: BPF scheduler "rusty" disabled (runnable task stall)
[ 5412.972434] sched_ext: rusty: kworker/1:1[33210] failed to run for 15.104s
[ 5412.972440]    scx_watchdog_workfn+0x16d/0x210
[ 5412.972455]    process_one_work+0x179/0x330
[ 5412.972465]    worker_thread+0x252/0x390
[ 5412.972473]    kthread+0xd2/0x100
[ 5412.972479]    ret_from_fork+0x34/0x50
[ 5412.972487]    ret_from_fork_asm+0x1a/0x30

[ 5447.385386] sched_ext: "flash" does not implement cgroup cpu.weight
[ 5447.394214] sched_ext: BPF scheduler "flash" enabled
[ 5462.763398] sched_ext: BPF scheduler "flash" disabled (runnable task stall)
[ 5462.763410] sched_ext: flash: kworker/1:1[33210] failed to run for 7.296s
[ 5462.763415]    scx_watchdog_workfn+0x16d/0x210
[ 5462.763427]    process_one_work+0x179/0x330
[ 5462.763436]    worker_thread+0x252/0x390
[ 5462.763444]    kthread+0xd2/0x100
[ 5462.763451]    ret_from_fork+0x34/0x50
[ 5462.763458]    ret_from_fork_asm+0x1a/0x30

[ 9401.645243] sched_ext: BPF scheduler "flatcg" enabled
[ 9539.698387] sched_ext: BPF scheduler "flatcg" disabled (runnable task stall)
[ 9539.698399] sched_ext: flatcg: kworker/1:1[33210] failed to run for 34.589s
[ 9539.698405]    scx_watchdog_workfn+0x16d/0x210
[ 9539.698420]    process_one_work+0x179/0x330
[ 9539.698430]    worker_thread+0x252/0x390
[ 9539.698438]    kthread+0xd2/0x100
[ 9539.698445]    ret_from_fork+0x34/0x50
[ 9539.698453]    ret_from_fork_asm+0x1a/0x30

[ 9760.846222] sched_ext: BPF scheduler "qmap" enabled
[ 9773.613378] sched_ext: BPF scheduler "qmap" disabled (runnable task stall)
[ 9773.613389] sched_ext: qmap: kworker/1:1[33210] failed to run for 7.232s
[ 9773.613394]    scx_watchdog_workfn+0x16d/0x210
[ 9773.613410]    process_one_work+0x179/0x330
[ 9773.613421]    worker_thread+0x252/0x390
[ 9773.613430]    kthread+0xd2/0x100
[ 9773.613437]    ret_from_fork+0x34/0x50
[ 9773.613446]    ret_from_fork_asm+0x1a/0x30
@htejun
Contributor

htejun commented Jan 15, 2025

The condition can be handled by implementing ops.cpu_release() which calls scx_bpf_reenqueue_local() and possibly takes other necessary actions. Note that from sched_ext core's POV, it's difficult to tell whether a given task is destined to a particular CPU or not in a general manner. e.g. Only the BPF scheduler itself knows which CPUs will consume from a given DSQ or BPF data structure.
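
A minimal sketch of that (the callback name is illustrative; scx_qmap does something similar), assuming the usual scx/common.bpf.h helpers:

/* Sketch: when a higher-priority sched class (e.g. RT) takes the CPU away from
 * sched_ext, pull the tasks already queued on this CPU's local DSQ back out so
 * that ops.enqueue() gets another chance to place them. */
void BPF_STRUCT_OPS(example_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
{
	/* Returns the number of tasks that were removed from the local DSQ
	 * and re-enqueued through ops.enqueue(). */
	u32 cnt = scx_bpf_reenqueue_local();

	if (cnt)
		bpf_printk("cpu%d released, re-enqueued %u tasks", cpu, cnt);
}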

@arighi
Contributor

arighi commented Jan 15, 2025

> The condition can be handled by implementing ops.cpu_release() which calls scx_bpf_reenqueue_local() and possibly takes other necessary actions. Note that from sched_ext core's POV, it's difficult to tell whether a given task is destined to a particular CPU or not in a general manner. e.g. Only the BPF scheduler itself knows which CPUs will consume from a given DSQ or BPF data structure.

Unfortunately ops.cpu_release() isn't enough in some cases, in particular when an RT task is hogging a CPU and the scx scheduler needs to schedule per-CPU tasks on that CPU (more generally, when all the CPUs usable by a SCHED_EXT task are stolen by higher-class tasks for too long).

I also don't think we can solve the problem in general, but maybe we can handle some special cases.

For example, just earlier today I received a message with the following trace (and I've received many similar reports), raising concerns about a potential scx_bpfland regression, when, in fact, it was just an RT task hogging a CPU and consequently stalling some per-CPU kthreads:

 ramin-linux scx_loader[115457]: kworker/u32:7[113526] triggered exit kind 1026:
Jan 15 03:18:23 ramin-linux scx_loader[115457]:   runnable task stall (kworker/0:0[106377] failed to run for 5.043s)
Jan 15 03:18:23 ramin-linux scx_loader[115457]: Backtrace:
Jan 15 03:18:23 ramin-linux scx_loader[115457]:   scx_watchdog_workfn+0x135/0x1a0
Jan 15 03:18:23 ramin-linux scx_loader[115457]:   worker_thread+0x3c3/0x8b0
Jan 15 03:18:23 ramin-linux scx_loader[115457]:   kthread+0x10b/0x170
Jan 15 03:18:23 ramin-linux scx_loader[115457]:   ret_from_fork+0x37/0x50
Jan 15 03:18:23 ramin-linux scx_loader[115457]:   ret_from_fork_asm+0x1a/0x30
Jan 15 03:18:23 ramin-linux scx_loader[115457]: CPU states
Jan 15 03:18:23 ramin-linux scx_loader[115457]: ----------
Jan 15 03:18:23 ramin-linux scx_loader[115457]: CPU 0   : nr_run=3 flags=0xd cpu_rel=0 ops_qseq=20646200 pnt_seq=45388738
Jan 15 03:18:23 ramin-linux scx_loader[115457]:           curr=sway[994] class=rt_sched_class
Jan 15 03:18:23 ramin-linux scx_loader[115457]:   R kworker/0:0[106377] -5043ms
Jan 15 03:18:23 ramin-linux scx_loader[115457]:       scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0
Jan 15 03:18:23 ramin-linux scx_loader[115457]:       sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=0 slice=20000000
Jan 15 03:18:23 ramin-linux scx_loader[115457]:       cpus=01
Jan 15 03:18:23 ramin-linux scx_loader[115457]:     lrc_unpin+0x0/0x70 [i915]
Jan 15 03:18:23 ramin-linux scx_loader[115457]:     __intel_context_do_unpin+0x26/0xc0 [i915]
Jan 15 03:18:23 ramin-linux scx_loader[115457]:     i915_request_retire+0x1a3/0x270 [i915]
Jan 15 03:18:23 ramin-linux scx_loader[115457]:     engine_retire+0x84/0xe0 [i915]
Jan 15 03:18:23 ramin-linux scx_loader[115457]:     worker_thread+0x3c3/0x8b0
Jan 15 03:18:23 ramin-linux scx_loader[115457]:     kthread+0x10b/0x170
Jan 15 03:18:23 ramin-linux scx_loader[115457]:     ret_from_fork+0x37/0x50
Jan 15 03:18:23 ramin-linux scx_loader[115457]:     ret_from_fork_asm+0x1a/0x30
Jan 15 03:18:23 ramin-linux scx_loader[115457]:   R kworker/0:1H[117] -5044ms
Jan 15 03:18:23 ramin-linux scx_loader[115457]:       scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0
Jan 15 03:18:23 ramin-linux scx_loader[115457]:       sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=0 slice=20000000
Jan 15 03:18:23 ramin-linux scx_loader[115457]:       cpus=01
Jan 15 03:18:23 ramin-linux scx_loader[115457]:     kthread+0x10b/0x170
Jan 15 03:18:23 ramin-linux scx_loader[115457]:     ret_from_fork+0x37/0x50
Jan 15 03:18:23 ramin-linux scx_loader[115457]:     ret_from_fork_asm+0x1a/0x30
Jan 15 03:18:23 ramin-linux scx_loader[115457]:   R ThreadPoolForeg[82260] -5018ms
Jan 15 03:18:23 ramin-linux scx_loader[115457]:       scx_state/flags=3/0x9 dsq_flags=0x0 ops_state/qseq=0/0
Jan 15 03:18:23 ramin-linux scx_loader[115457]:       sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=47245 slice=20000000
Jan 15 03:18:23 ramin-linux scx_loader[115457]:       cpus=ff
Jan 15 03:18:23 ramin-linux scx_loader[115457]:     __x64_sys_futex+0x2e4/0x370
Jan 15 03:18:23 ramin-linux scx_loader[115457]:     do_syscall_64+0x8f/0x170
Jan 15 03:18:23 ramin-linux scx_loader[115457]:     entry_SYSCALL_64_after_hwframe+0x76/0x7e

So, I was wondering if we could be more explicit in the trace to better explain this particular (non-sched_ext) issue. There's not much to do on the BPF scheduler side, because even an scx_bpf_reenqueue_local() would re-enqueue the per-CPU kthreads back onto the same CPU (which still can't be used, unless the RT task decides to release it).

Also kicking out the scheduler doesn't really improve the situation, because the CPU will continue to be monopolized even when all the tasks are moved back to SCHED_NORMAL (however, it can solve the problem if the scheduler is running in partial mode and a SCHED_NORMAL task is hogging the CPU).

Maybe we can handle the special case of per-cpu tasks starved by RT tasks, when not running in partial mode, something like the following (pseudo-code / not tested):

if (p->nr_cpus_allowed == 1 && rq->curr->sched_class == &rt_sched_class)
    continue; // skip stall check

@htejun
Contributor

htejun commented Jan 15, 2025

At that point, wouldn't there already be RT throttling message in dmesg?

@arighi
Contributor

arighi commented Jan 15, 2025

Hm... if I run sudo schedtool -a 4 -F -p 99 -e yes >/dev/null I can trigger the scx stall, but I don't see the RT throttling message.
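
For reference, the classic RT throttling limits are controlled by these sysctls (whether they are still enforced depends on the kernel version):

# Historical defaults: RT tasks may use at most 950000us of every 1000000us;
# a runtime of -1 disables the limit entirely
sysctl kernel.sched_rt_runtime_us kernel.sched_rt_period_us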

@hodgesds added the bpf label Jan 16, 2025
@luigidematteis
Contributor Author

@htejun @arighi do you think we can add some logic to the watchdog to check this:

	/* Only if a task exceeds the timeout */
	if (unlikely(time_after(jiffies,
				last_runnable + scx_watchdog_timeout))) {
		...
		/* Only if we are not in partial mode (SCX_OPS_SWITCH_PARTIAL not set) */
		const struct cpumask *mask = p->cpus_ptr;
		...
		for_each_cpu(cpu, mask) {
			struct rq *cpu_rq = cpu_rq(cpu);

			if (cpu_rq->rt.rt_nr_running == 0) {
				task_stalled = true;
				break;
			}
		}

This means that if at least one of the task's allowed CPUs has a runqueue free of RT tasks, the task is genuinely stalled. Otherwise, if it was only held back by RT tasks, we can choose either to report this condition in the trace (when the scheduler exits) or not to mark it as stalled at all.
In the latter case the scheduler would not exit, and this behavior could be enabled via a flag.

[screenshot: scx_bpfland_1]

@arighi
Contributor

arighi commented Jan 16, 2025

I think we may need some locking here to access cpu_rq. Maybe we could just focus on per-CPU tasks for now, which would probably prevent 99% of the false-positive stalls, and do something like the following (totally untested):

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f1bc7639e730..29a21d4e6e71 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3783,6 +3783,10 @@ static bool check_rq_for_timeouts(struct rq *rq)
 	list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node) {
 		unsigned long last_runnable = p->scx.runnable_at;
 
+		if (p->nr_cpus_allowed == 1 &&
+		    rq->curr->sched_class == &rt_sched_class)
+			continue;
+
 		if (unlikely(time_after(jiffies,
 					last_runnable + scx_watchdog_timeout))) {
 			u32 dur_ms = jiffies_to_msecs(jiffies - last_runnable);

@htejun
Contributor

htejun commented Jan 22, 2025

So, 5f6bd380c7bd ("sched/rt: Remove default bandwidth control") removed the default RT bandwidth control, which is probably why we're seeing these complete stalls. Looking into whether the new mechanism (fair_server) can be applied to SCX.

@htejun
Contributor

htejun commented Jan 22, 2025

Okay, I think we just need to copy what fair is doing and create scx_server. This should also guarantee some bandwidth while in partial mode.

@luigidematteis
Contributor Author

I did some tests bringing into scx exactly what fair does now, starting/stopping the dl_server in the enqueue/dequeue paths (CFS additionally has the throttling part, so it handles this slightly differently), and I can confirm that the starvation problem has completely disappeared. So it looks like the removal of the default RT bandwidth control in 5f6bd380c7bd was the real cause of the recent stalls.
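
Roughly, the pattern being described is the one fair.c uses for rq->fair_server; a pseudo-code sketch only (rq->scx_server is a hypothetical per-rq dl_server entity, initialized elsewhere via dl_server_init(), and the actual fields/conditions may differ):

/* In the scx enqueue path: */
	if (rq->scx.nr_running == 1)		/* first SCX task became runnable */
		dl_server_start(&rq->scx_server);

/* In the scx dequeue path: */
	if (!rq->scx.nr_running)		/* last SCX task left this rq */
		dl_server_stop(&rq->scx_server);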

Regarding the implementation, on the deadline scheduler side, the logic for handling the dl_server is currently closely coupled to the fair_server (which is considered to be the sole dl_server), so we can either abstract some points of the current logic to handle both types of server (fair_server and scx_server) or maybe keep things completely separate.

@htejun @arighi

@htejun
Contributor

htejun commented Jan 27, 2025

re. deadline side, it's difficult to tell without looking at the code. Once you have patches ready to share, post them upstream and possibly describe what the alternative approach would look like?
