scx_rustland: prevent starvation and improve responsiveness #60
Conversation
Signed-off-by: Andrea Righi <[email protected]>
The user-space scheduler maintains an internal hash map of task information (indexed by pid). Tasks are only added to this hash map and never removed, so after running the scheduler for a while we may experience a performance degradation as the hash map keeps growing. Therefore, implement a garbage collection mechanism that periodically removes from the task map the pids that no longer exist. Signed-off-by: Andrea Righi <[email protected]>
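A minimal sketch of the idea on the user-space side (the type names and the /proc-based liveness check are assumptions for illustration, not necessarily the exact implementation in this commit):

use std::collections::HashMap;
use std::path::Path;

// Hypothetical per-task state kept by the user-space scheduler.
struct TaskInfo {
    sum_exec_runtime: u64,
    vruntime: u64,
}

struct TaskInfoMap {
    tasks: HashMap<i32, TaskInfo>,
}

impl TaskInfoMap {
    // Periodically drop entries whose pid no longer exists, so the map
    // cannot grow without bound as tasks come and go.
    fn garbage_collect(&mut self) {
        self.tasks
            .retain(|pid, _| Path::new(&format!("/proc/{}", pid)).exists());
    }
}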
In any case, make sure that we never account more than the maximum slice_ns to a task's vruntime. This helps prevent starving a task for too long in the user-space scheduler. Signed-off-by: Andrea Righi <[email protected]>
With commit 49f2e7c ("scx_rustland: enable SCX_OPS_ENQ_LAST") we have enabled SCX_OPS_ENQ_LAST, which seems to save some unnecessary user-space scheduler activations when the system is mostly idle. We are also checking for SCX_ENQ_LAST in the enqueue flags, but this check is apparently not needed and we can achieve the same behavior by dropping it. Signed-off-by: Andrea Righi <[email protected]>
Never dispatch the user-space scheduler to the global DSQ, while all the other tasks are dispatched to the local per-CPU DSQ. Since tasks are consumed from the local DSQ first and then from the global DSQ, we may end up starving the scheduler if it is the only task dispatched to the global DSQ. In fact, it is really easy to trigger a stall with a workload that generates many context switches in the system, for example (on an 8-core system):

$ stress-ng --cpu 32 --iomix 4 --vm 2 --vm-bytes 128M --fork 4 --timeout 30s
...
09:28:11 [WARN] EXIT: scx_rustland[1455943] failed to run for 5.275s
09:28:11 [INFO] Unregister RustLand scheduler

To prevent this from happening, also dispatch the user-space scheduler to the local DSQ, using the current CPU where .dispatch() is called if possible, or the previously used CPU otherwise. Apply the same logic when the scheduler is congested: dispatch on the previously used CPU using the local DSQ. In this way all tasks always get the same "dispatch priority" and we can prevent the scheduler starvation issue.

Note that with this change in place dispatch_global() is never used and we can get rid of it. Signed-off-by: Andrea Righi <[email protected]>
Prevent newly created short-lived tasks from starving the other tasks sitting in the user-space scheduler. This can be done by setting an initial vruntime of (min_vruntime + 1) for newly scheduled tasks, instead of min_vruntime: this ensures a progressing global vruntime during each scheduler run, providing a priority boost to newer tasks (which is still beneficial for potential short-lived tasks) while also preventing excessive starvation of the other tasks sitting in the user-space scheduler, waiting to be dispatched. Without this change it is really easy to create a stall condition simply by forking a bunch of short-lived tasks in a busy loop; with this change applied the scheduler can properly handle the consistent flow of newly created short-lived tasks without introducing any stall. Signed-off-by: Andrea Righi <[email protected]>
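A rough sketch of the idea in Rust (the function name and shape are hypothetical, not the code actually added by this commit): a task seen for the first time starts just ahead of the global minimum vruntime, so a steady stream of new forks still advances the global vruntime.

fn task_vruntime(known_vruntime: Option<u64>, min_vruntime: u64) -> u64 {
    match known_vruntime {
        // Existing task: never let its vruntime lag behind the global minimum.
        Some(v) => v.max(min_vruntime),
        // Brand new task: start slightly ahead of min_vruntime, so a busy
        // loop forking short-lived tasks cannot pin the global vruntime.
        None => min_vruntime + 1,
    }
}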
The formula used to evaluate the weighted time delta is not correct: it does not consider the weight as a percentage. Fix this by using the proper formula. Moreover, also take the task weight into account when evaluating the maximum time delta to account in vruntime, and make sure that we never charge a task more than slice_ns. This helps to prevent starvation of low priority tasks. Signed-off-by: Andrea Righi <[email protected]>
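As a rough sketch of the intended accounting, assuming the usual convention where weight 100 is the default priority (the exact order of scaling and clamping in the actual patch may differ):

fn weighted_vruntime_delta(delta_exec_ns: u64, weight: u64, slice_ns: u64) -> u64 {
    // Treat the weight as a percentage: a task with weight 200 is charged
    // half the vruntime of a default-weight (100) task for the same runtime.
    let delta = delta_exec_ns * 100 / weight.max(1);
    // Never charge more than one full slice, so a single large delta
    // cannot push a low-priority task arbitrarily far into the future.
    delta.min(slice_ns)
}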
No functional change: make the user-space scheduler code a bit more readable and more idiomatic Rust. Signed-off-by: Andrea Righi <[email protected]>
 * previously used one.
 */
if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
	cpu = scx_bpf_task_cpu(p);
I think this should work fine, but can you please verify that changing CPU affinity for a constantly running task which is bound to one CPU works? i.e. have one thread which busy loops, bind it to one CPU using taskset, and then change it to another CPU.
Yep, it seems to work:
09:10 arighi@gpd$ yes >/dev/null &
[1] 1797739
6.7.0-3-generic ~
09:10 arighi@gpd$ taskset -pc 2 $!
pid 1797739's current affinity list: 0-7
pid 1797739's new affinity list: 2
6.7.0-3-generic ~
09:10 arighi@gpd$ cut -d' ' -f39 /proc/$!/stat
2
6.7.0-3-generic ~
09:11 arighi@gpd$ taskset -pc 5 $!
pid 1797739's current affinity list: 2
pid 1797739's new affinity list: 5
6.7.0-3-generic ~
09:11 arighi@gpd$ cut -d' ' -f39 /proc/$!/stat
5
Of course it changes CPU only if the task is doing something (i.e. a sleep infinity won't change CPU), but the same happens with the default scheduler, so we shouldn't change any behavior here.
scheds/rust/scx_rustland/src/main.rs
let mut delta = sum_exec_runtime - task_info.sum_exec_runtime;
// Never account more than max_slice_ns. This helps to prevent starving a task for too
// long in the scheduler task pool.
delta = delta.min(max_slice_ns);
I'm curious why this is necessary. Maybe this is needed to mask a different issue such as failing to wind current vtime forward when it should be? Or is this necessary because sometimes the scheduler takes a bit too long to kick them off cpu and their execution pattern becomes too chunky?
I think I was just trying to fix what was fixed by 2900b20 ("scx_rustland: evaluate the proper vruntime delta") here, I should have squashed the commits together...

However, now I can see a case where the delta can be bigger than slice_ns, but that's a side effect of the fact that we don't clean entries synchronously from the self.task_info HashMap when the corresponding tasks exit. If a pid is reused we would (incorrectly) get the previous TaskInfo entry with the sum_exec_runtime of the previous task.
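(Purely as an illustration of that hazard, with made-up numbers, not code from the scheduler:)

fn main() {
    let slice_ns: u64 = 20_000_000; // hypothetical 20 ms slice
    // Stale entry left behind by the old task that used to own this pid:
    let stale_sum_exec_runtime: u64 = 50_000_000_000;
    // First sample taken from the new task that reuses the same pid:
    let new_sum_exec_runtime: u64 = 1_000_000;
    // A naive subtraction underflows (panics in debug, wraps in release);
    // wrapping_sub is used here only to show the resulting magnitude.
    let raw_delta = new_sum_exec_runtime.wrapping_sub(stale_sum_exec_runtime);
    // Clamping bounds the damage until the stale entry is actually removed.
    let charged = raw_delta.min(slice_ns);
    println!("raw delta = {raw_delta} ns, charged = {charged} ns");
}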
Do we have a callback that we can use in sched_ext_ops to detect when a task exits in a synchronous way? I was trying to use .disable, but it seems to be called only when the scheduler exits, not when the task exits... is this the expected behavior? Am I missing something?
.disable should be called when a task gets freed, which may take place a bit after the task exits.
OK, I'll try again with the .disable callback to notify the user-space scheduler when tasks are exiting; it's more elegant than running the periodic garbage collector. (Maybe we still need the garbage collector in case we fail to notify the user-space scheduler, but I don't think it's a big problem for now.)
Thanks!
Improvements and fixes to scx_rustland to prevent some cases of starvation and enhance responsiveness:
Small code readability improvement:
With all these changes in place the scheduler can properly handle, on my 8-core laptop, a parallel kernel build (make -j32), a stress test (stress-ng --cpu 32 --iomix 4 --vm 2 --vm-bytes 128M --fork 4), and a YouTube music video playing in the background, all while reading emails and writing the text of this PR.