scx_rustland: prevent starvation and improve responsiveness #60
Conversation
Signed-off-by: Andrea Righi <[email protected]>
The user-space scheduler maintains an internal hash map of task information (indexed by pid). Tasks are only added to this hash map and never removed, so after running the scheduler for a while we may experience a performance degradation as the hash map keeps growing. Therefore, implement a garbage collection mechanism that periodically removes from the task map the pids that no longer exist. Signed-off-by: Andrea Righi <[email protected]>
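A minimal sketch of the idea on the user-space side (the type names and the /proc-based liveness check are assumptions for illustration, not necessarily the exact implementation in this commit):

use std::collections::HashMap;
use std::path::Path;

// Hypothetical per-task state kept by the user-space scheduler.
struct TaskInfo {
    sum_exec_runtime: u64,
    vruntime: u64,
}

struct TaskInfoMap {
    tasks: HashMap<i32, TaskInfo>,
}

impl TaskInfoMap {
    // Periodically drop entries whose pid no longer exists, so the map
    // cannot grow without bound as tasks come and go.
    fn garbage_collect(&mut self) {
        self.tasks
            .retain(|pid, _| Path::new(&format!("/proc/{}", pid)).exists());
    }
}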
In any case, make sure that we never account more than the maximum slice_ns to a task's vruntime. This helps prevent starving a task for too long in the user-space scheduler. Signed-off-by: Andrea Righi <[email protected]>
With commit 49f2e7c ("scx_rustland: enable SCX_OPS_ENQ_LAST") we have enabled SCX_OPS_ENQ_LAST, which seems to save some unnecessary user-space scheduler activations when the system is mostly idle. We are also checking for SCX_ENQ_LAST in the enqueue flags, but this check is apparently not needed and we can achieve the same behavior by dropping it. Signed-off-by: Andrea Righi <[email protected]>
Never dispatch the user-space scheduler to the global DSQ, while all the other tasks are dispatched to the local per-CPU DSQ. Since tasks are consumed from the local DSQ first and then from the global DSQ, we may end up starving the scheduler if it is the only task dispatched to the global DSQ. In fact, it is really easy to trigger a stall with a workload that generates many context switches in the system, for example (on an 8-core system):

$ stress-ng --cpu 32 --iomix 4 --vm 2 --vm-bytes 128M --fork 4 --timeout 30s
...
09:28:11 [WARN] EXIT: scx_rustland[1455943] failed to run for 5.275s
09:28:11 [INFO] Unregister RustLand scheduler

To prevent this from happening, also dispatch the user-space scheduler to the local DSQ, using the current CPU where .dispatch() is called if possible, or the previously used CPU otherwise. Apply the same logic when the scheduler is congested: dispatch on the previously used CPU using the local DSQ. In this way all tasks always get the same "dispatch priority" and we can prevent the scheduler starvation issue.

Note that with this change in place dispatch_global() is never used and we can get rid of it. Signed-off-by: Andrea Righi <[email protected]>
Prevent newly created short-lived tasks from starving the other tasks sitting in the user-space scheduler. This can be done by setting an initial vruntime of (min_vruntime + 1) for newly scheduled tasks, instead of min_vruntime: this ensures a progressing global vruntime during each scheduler run, providing a priority boost to newer tasks (which is still beneficial for potential short-lived tasks) while also preventing excessive starvation of the other tasks sitting in the user-space scheduler, waiting to be dispatched. Without this change it is really easy to create a stall condition simply by forking a bunch of short-lived tasks in a busy loop; with this change applied the scheduler can properly handle the consistent flow of newly created short-lived tasks without introducing any stall. Signed-off-by: Andrea Righi <[email protected]>
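A rough sketch of the idea in Rust (the function name and shape are hypothetical, not the code actually added by this commit): a task seen for the first time starts just ahead of the global minimum vruntime, so a steady stream of new forks still advances the global vruntime.

fn task_vruntime(known_vruntime: Option<u64>, min_vruntime: u64) -> u64 {
    match known_vruntime {
        // Existing task: never let its vruntime lag behind the global minimum.
        Some(v) => v.max(min_vruntime),
        // Brand new task: start slightly ahead of min_vruntime, so a busy
        // loop forking short-lived tasks cannot pin the global vruntime.
        None => min_vruntime + 1,
    }
}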
The formula used to evaluate the weighted time delta is not correct: it does not consider the weight as a percentage. Fix this by using the proper formula. Moreover, also take the task weight into account when evaluating the maximum time delta to account in vruntime, and make sure that we never charge a task more than slice_ns. This helps to prevent starvation of low priority tasks. Signed-off-by: Andrea Righi <[email protected]>
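As a rough sketch of the intended accounting, assuming the usual convention where weight 100 is the default priority (the exact order of scaling and clamping in the actual patch may differ):

fn weighted_vruntime_delta(delta_exec_ns: u64, weight: u64, slice_ns: u64) -> u64 {
    // Treat the weight as a percentage: a task with weight 200 is charged
    // half the vruntime of a default-weight (100) task for the same runtime.
    let delta = delta_exec_ns * 100 / weight.max(1);
    // Never charge more than one full slice, so a single large delta
    // cannot push a low-priority task arbitrarily far into the future.
    delta.min(slice_ns)
}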
No functional change: make the user-space scheduler code a bit more readable and more idiomatic Rust. Signed-off-by: Andrea Righi <[email protected]>
 * previously used one.
 */
if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
	cpu = scx_bpf_task_cpu(p);
I think this should work fine, but can you please verify that changing CPU affinity for a constantly running task which is bound to one CPU works? i.e. have one thread which busy loops, bind it to one CPU using taskset, and then change it to another CPU.
Yep, it seems to work:
09:10 arighi@gpd$ yes >/dev/null &
[1] 1797739
6.7.0-3-generic ~
09:10 arighi@gpd$ taskset -pc 2 $!
pid 1797739's current affinity list: 0-7
pid 1797739's new affinity list: 2
6.7.0-3-generic ~
09:10 arighi@gpd$ cut -d' ' -f39 /proc/$!/stat
2
6.7.0-3-generic ~
09:11 arighi@gpd$ taskset -pc 5 $!
pid 1797739's current affinity list: 2
pid 1797739's new affinity list: 5
6.7.0-3-generic ~
09:11 arighi@gpd$ cut -d' ' -f39 /proc/$!/stat
5
Of course it changes CPU only if the task is doing something (i.e. a sleep infinity won't change CPU), but the same happens with the default scheduler, so we shouldn't change any behavior here.
scheds/rust/scx_rustland/src/main.rs
let mut delta = sum_exec_runtime - task_info.sum_exec_runtime;
// Never account more than max_slice_ns. This helps to prevent starving a task for too
// long in the scheduler task pool.
delta = delta.min(max_slice_ns);
I'm curious why this is necessary. Maybe this is needed to mask a different issue such as failing to wind current vtime forward when it should be? Or is this necessary because sometimes the scheduler takes a bit too long to kick them off cpu and their execution pattern becomes too chunky?
I think I was just trying to fix what was fixed by 2900b20 ("scx_rustland: evaluate the proper vruntime delta") here, I should have squashed the commits together...

However, now I can see a case where the delta can be bigger than slice_ns, but that's a side effect of the fact that we don't clean entries synchronously from the self.task_info HashMap when the corresponding tasks exit. If a pid is reused we would (incorrectly) get the previous TaskInfo entry with the sum_exec_runtime of the previous task.
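(Purely as an illustration of that hazard, with made-up numbers, not code from the scheduler:)

fn main() {
    let slice_ns: u64 = 20_000_000; // hypothetical 20 ms slice
    // Stale entry left behind by the old task that used to own this pid:
    let stale_sum_exec_runtime: u64 = 50_000_000_000;
    // First sample taken from the new task that reuses the same pid:
    let new_sum_exec_runtime: u64 = 1_000_000;
    // A naive subtraction underflows (panics in debug, wraps in release);
    // wrapping_sub is used here only to show the resulting magnitude.
    let raw_delta = new_sum_exec_runtime.wrapping_sub(stale_sum_exec_runtime);
    // Clamping bounds the damage until the stale entry is actually removed.
    let charged = raw_delta.min(slice_ns);
    println!("raw delta = {raw_delta} ns, charged = {charged} ns");
}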
Do we have a callback that we can use in sched_ext_ops to detect when a task exits in a synchronous way? I was trying to use .disable, but it seems to be called only when the scheduler exits, not when the task exits... is this the expected behavior? Am I missing something?
.disable should be called when a task gets freed, which may take place a bit after the task exits.
OK, I'll try again with the .disable callback to notify the user-space scheduler when tasks are exiting; it's more elegant than running the periodic garbage collector. (Maybe we still need the garbage collector in case we fail to notify the user-space scheduler, but I don't think it's a big problem for now.)
Thanks!
Improvements and fixes to scx_rustland to prevent some cases of starvation and enhance responsiveness:
Small code readability improvement:
With all these changes in place the scheduler can properly handle, on my 8-core laptop, a parallel kernel build (make -j32), a stress test (stress-ng --cpu 32 --iomix 4 --vm 2 --vm-bytes 128M --fork 4), and a YouTube music video playing in the background, all while reading emails and writing the text of this PR.