
scx_rustland improvements #47

Merged · 3 commits merged into sched-ext:main on Dec 23, 2023

Conversation

arighi (Contributor) commented Dec 23, 2023

Set of changes for scx_rustland that massively improve the effectiveness of the user-space scheduler, especially for low-latency tasks:

  • provide CPU usage awareness to the user-space scheduler
  • dispatch tasks from the user-space scheduler to the BPF dispatcher in batches, instead of draining the entire task pool all at once
  • distinguish between graceful exit vs non-graceful exit

Andrea Righi added 3 commits December 22, 2023 19:44
Do not report an exit error message if it's empty. Moreover, distinguish
between a graceful exit and a non-graceful exit.

In general, try to follow the behavior of user_exit_info.h for the C
schedulers.

NOTE: in the future the whole exit handling can probably be moved to a
more generic place (scx_utils) to prevent code duplication across
schedulers and to avoid small inconsistencies like this one.

Signed-off-by: Andrea Righi <[email protected]>
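
As a rough illustration of the convention described above, here is a minimal C sketch; the struct layout and field names are assumptions for illustration, not the actual user_exit_info.h interface:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of the exit info filled in by the BPF side; the
 * field names are illustrative, not the real user_exit_info.h layout. */
struct exit_info {
	int kind;	/* assumed: 0 means the scheduler unregistered gracefully */
	char msg[128];	/* optional human-readable exit message */
};

/* Report the exit the way the C schedulers do: skip the message if it is
 * empty and return a non-zero status only for a non-graceful exit. */
static int report_exit(const struct exit_info *ei)
{
	if (ei->msg[0] != '\0')
		fprintf(stderr, "EXIT: %s\n", ei->msg);

	return ei->kind != 0;
}
```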
Provide an interface for the BPF dispatcher and user-space scheduler to
share CPU information. This information can empower the user-space
scheduler to make more informed decisions and enable the implementation
of a broader range of scheduling policies.

With this change the BPF dispatcher provides a CPU map (one entry per
CPU) that stores the pid that is running on each CPU (0 if the CPU is
idle). The CPU map is updated by the BPF dispatcher in the .running()
and .stopping() callbacks.

The dispatcher then sends the user-space scheduler a suggested candidate
CPU for each task that needs to run (always the task's previously used
CPU), along with all the task's information.

The user-space scheduler can decide to confirm the selected CPU or to
choose a different one, using all the shared CPU information.

Lastly, the selected CPU is communicated back to the dispatcher along
with all the task's information, and the BPF dispatcher takes care of
executing the task on the selected CPU, potentially triggering a
migration.

Signed-off-by: Andrea Righi <[email protected]>
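
A minimal BPF-side sketch of the CPU map described above (the map name, MAX_CPUS bound, and helper names are assumptions, not the merged code):

```c
#include <scx/common.bpf.h>	/* include path may differ depending on tree layout */

#define MAX_CPUS 1024		/* assumed upper bound on possible CPUs */

/* One slot per CPU holding the pid currently running there, 0 if idle. */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, MAX_CPUS);
	__type(key, u32);
	__type(value, u32);
} cpu_map SEC(".maps");

static void set_cpu_owner(u32 cpu, u32 pid)
{
	u32 *owner = bpf_map_lookup_elem(&cpu_map, &cpu);

	if (owner)
		*owner = pid;
}

/* .running(): the task starts executing, mark its CPU as busy. */
void BPF_STRUCT_OPS(rustland_running, struct task_struct *p)
{
	set_cpu_owner(scx_bpf_task_cpu(p), p->pid);
}

/* .stopping(): the task releases the CPU, mark it as idle. */
void BPF_STRUCT_OPS(rustland_stopping, struct task_struct *p, bool runnable)
{
	set_cpu_owner(scx_bpf_task_cpu(p), 0);
}
```

The user-space side can then treat a zero entry as "idle" when confirming or overriding the suggested CPU.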
Dispatch tasks in a batch equal to the number of idle CPUs in the
system.

This reduces the pressure on the dispatcher queues, improving the
effectiveness of the scheduler (by having more tasks sitting in the
scheduler's task pool) and mitigating potential priority inversion issues.

Signed-off-by: Andrea Righi <[email protected]>
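
A self-contained C sketch of that batching idea (the real scheduler side lives in Rust; the pool layout and dispatch_pid() helper below are hypothetical stand-ins):

```c
#include <stdio.h>
#include <stddef.h>

#define POOL_SIZE 1024

/* Hypothetical minimal task pool: a FIFO of pids waiting in user space. */
struct task_pool {
	int pids[POOL_SIZE];
	size_t head, tail;
};

/* Stand-in for the real path that hands a task back to the BPF dispatcher. */
static void dispatch_pid(int pid)
{
	printf("dispatching pid %d\n", pid);
}

/* Dispatch at most one task per idle CPU; everything else stays in the
 * user-space pool, keeping pressure off the dispatcher queues. */
static void dispatch_batch(const int *cpu_owner, size_t nr_cpus,
			   struct task_pool *pool)
{
	size_t idle = 0;

	/* A zero entry in the shared CPU map means that CPU is idle. */
	for (size_t i = 0; i < nr_cpus; i++)
		if (cpu_owner[i] == 0)
			idle++;

	while (idle > 0 && pool->head != pool->tail) {
		dispatch_pid(pool->pids[pool->head]);
		pool->head = (pool->head + 1) % POOL_SIZE;
		idle--;
	}
}
```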
arighi (Contributor, Author) commented Dec 23, 2023

Test result: before these changes, running make -j16 in the kernel source dir was tanking my laptop; after these changes I can still watch a YouTube video while make -j16 is running.

htejun (Contributor) commented Dec 23, 2023

I generally like the direction. This likely is less performant but gives userspace a lot more control over what happens on the CPUs, and optimizing from there if necessary seems like the right approach. Generally looks good to me. A few comments:

  • Now that the interaction between the BPF and Rust parts is clearer, it might make sense to define the interlocking more concretely. BPF doesn't have a full memory model yet, but I don't think we need one here anyway. Setting usersched_needed should be a release operation, and reading and clearing it should be an acquire. The setting of usersched_needed in rustland_enqueue() already has a full mb in the preceding nr_enqueues increment, and reading and clearing are followed by multiple spinlocks which provide the needed rmb ordering. Just noting this; as such, it should be sufficient here.
  • I wonder whether rustland_stopping() needs to set usersched_needed, as the CPU might be running out of things to do; otherwise there may be unnecessary gaps of up to 1s, which don't look like they would be too difficult to trigger. An alternative and probably more robust implementation would be adding an .update_idle() method, which is called when a CPU is about to go idle, keeping track of the number of tasks queued in userspace and triggering usersched iff there are pending tasks when a CPU is about to go idle (see the sketch after this list). Strictly speaking, this should be the only hand-over mechanism necessary, and it can avoid e.g. triggering usersched unnecessarily after every enqueue while all CPUs are still busy.
  • Note that implementing .update_idle() will disable the built-in idle tracking and scx_bpf_test_and_clear_cpu_idle() will stop working. You can keep the built-in idle tracking on by setting SCX_OPS_KEEP_BUILTIN_IDLE. However, given that the benefit the code gets from default idle tracking is rather minimal and it is doing duplicate idle tracking anyway, maybe the right thing to do is implementing better "keep running" logic in .select_cpu() based on the number of tasks queued on the Rust side, i.e. if there's nothing to switch to, just keep running?
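
To make the second point above concrete, here is a rough BPF-side sketch of the suggested .update_idle() hand-over; the nr_queued counter and flag names are assumptions, not merged code:

```c
#include <scx/common.bpf.h>	/* include path may differ depending on tree layout */

/* Number of tasks currently parked in the user-space scheduler; assumed to
 * be kept up to date by the enqueue/dispatch paths. */
volatile u64 nr_queued;

/* Flag asking the dispatcher to run the user-space scheduler task. */
static u32 usersched_needed;

static void set_usersched_needed(void)
{
	/* The atomic op gives a full barrier, acting as the release above. */
	__sync_fetch_and_or(&usersched_needed, 1);
}

/* .update_idle(): called when @cpu is about to go idle.  Only poke the
 * user-space scheduler if it actually has pending tasks, so a fully busy
 * system doesn't get a wakeup after every enqueue. */
void BPF_STRUCT_OPS(rustland_update_idle, s32 cpu, bool idle)
{
	if (idle && nr_queued > 0)
		set_usersched_needed();
}
```

As noted in the last point, defining .update_idle() disables the built-in idle tracking unless SCX_OPS_KEEP_BUILTIN_IDLE is set.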

htejun merged commit 8443d8a into sched-ext:main on Dec 23, 2023
1 check passed