
scx_rustland improvements #47

Merged · 3 commits merged into sched-ext:main on Dec 23, 2023

Conversation

arighi (Contributor) commented Dec 23, 2023

Set of changes for scx_rustland that massively improve the effectiveness of the user-space scheduler, especially for low-latency tasks:

  • provide CPU usage awareness to the user-space scheduler
  • dispatch tasks from the user-space scheduler to the BPF dispatcher in batches, instead of draining the entire task pool all at once
  • distinguish between graceful exit vs non-graceful exit

Andrea Righi added 3 commits December 22, 2023 19:44
Do not report an exit error message if it's empty. Moreover, distinguish
between a graceful exit and a non-graceful exit.

In general, try to follow the behavior of user_exit_info.h for the C
schedulers.

NOTE: in the future the whole exit handling can probably be moved to a
more generic place (scx_utils) to prevent code duplication across
schedulers and to avoid small inconsistencies like this one.

Signed-off-by: Andrea Righi <[email protected]>
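
As a rough illustration of the convention described above, here is a minimal C sketch; the struct layout and field names are assumptions for illustration, not the actual user_exit_info.h interface:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of the exit info filled in by the BPF side; the
 * field names are illustrative, not the real user_exit_info.h layout. */
struct exit_info {
	int kind;	/* assumed: 0 means the scheduler unregistered gracefully */
	char msg[128];	/* optional human-readable exit message */
};

/* Report the exit the way the C schedulers do: skip the message if it is
 * empty and return a non-zero status only for a non-graceful exit. */
static int report_exit(const struct exit_info *ei)
{
	if (ei->msg[0] != '\0')
		fprintf(stderr, "EXIT: %s\n", ei->msg);

	return ei->kind != 0;
}
```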
Provide an interface for the BPF dispatcher and user-space scheduler to
share CPU information. This information can empower the user-space
scheduler to make more informed decisions and enable the implementation
of a broader range of scheduling policies.

With this change the BPF dispatcher provides a CPU map (one entry per
CPU) that stores the pid that is running on each CPU (0 if the CPU is
idle). The CPU map is updated by the BPF dispatcher in the .running()
and .stopping() callbacks.

The dispatcher then sends the user-space scheduler a suggested candidate
CPU for each task that needs to run (always the task's previously used
CPU), along with all the task's information.

The user-space scheduler can decide to confirm the selected CPU or to
choose a different one, using all the shared CPU information.

Lastly, the selected CPU is communicated back to the dispatcher along
with all the task's information, and the BPF dispatcher takes care of
executing the task on the selected CPU, potentially triggering a
migration.

Signed-off-by: Andrea Righi <[email protected]>
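
A minimal BPF-side sketch of the CPU map described above (the map name, MAX_CPUS bound, and helper names are assumptions, not the merged code):

```c
#include <scx/common.bpf.h>	/* include path may differ depending on tree layout */

#define MAX_CPUS 1024		/* assumed upper bound on possible CPUs */

/* One slot per CPU holding the pid currently running there, 0 if idle. */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, MAX_CPUS);
	__type(key, u32);
	__type(value, u32);
} cpu_map SEC(".maps");

static void set_cpu_owner(u32 cpu, u32 pid)
{
	u32 *owner = bpf_map_lookup_elem(&cpu_map, &cpu);

	if (owner)
		*owner = pid;
}

/* .running(): the task starts executing, mark its CPU as busy. */
void BPF_STRUCT_OPS(rustland_running, struct task_struct *p)
{
	set_cpu_owner(scx_bpf_task_cpu(p), p->pid);
}

/* .stopping(): the task releases the CPU, mark it as idle. */
void BPF_STRUCT_OPS(rustland_stopping, struct task_struct *p, bool runnable)
{
	set_cpu_owner(scx_bpf_task_cpu(p), 0);
}
```

The user-space side can then treat a zero entry as "idle" when confirming or overriding the suggested CPU.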
Dispatch tasks in a batch equal to the number of idle CPUs in the
system.

This reduces the pressure on the dispatcher queues, improving the
effectiveness of the scheduler (by having more tasks sitting in the
scheduler's task pool) and mitigating potential priority inversion issues.

Signed-off-by: Andrea Righi <[email protected]>
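
A self-contained C sketch of that batching idea (the real scheduler side lives in Rust; the pool layout and dispatch_pid() helper below are hypothetical stand-ins):

```c
#include <stdio.h>
#include <stddef.h>

#define POOL_SIZE 1024

/* Hypothetical minimal task pool: a FIFO of pids waiting in user space. */
struct task_pool {
	int pids[POOL_SIZE];
	size_t head, tail;
};

/* Stand-in for the real path that hands a task back to the BPF dispatcher. */
static void dispatch_pid(int pid)
{
	printf("dispatching pid %d\n", pid);
}

/* Dispatch at most one task per idle CPU; everything else stays in the
 * user-space pool, keeping pressure off the dispatcher queues. */
static void dispatch_batch(const int *cpu_owner, size_t nr_cpus,
			   struct task_pool *pool)
{
	size_t idle = 0;

	/* A zero entry in the shared CPU map means that CPU is idle. */
	for (size_t i = 0; i < nr_cpus; i++)
		if (cpu_owner[i] == 0)
			idle++;

	while (idle > 0 && pool->head != pool->tail) {
		dispatch_pid(pool->pids[pool->head]);
		pool->head = (pool->head + 1) % POOL_SIZE;
		idle--;
	}
}
```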
arighi (Contributor, Author) commented Dec 23, 2023

Test result: before these changes, running make -j16 in the kernel source dir was tanking my laptop; after these changes I can still watch a YouTube video while make -j16 is running.

htejun (Contributor) commented Dec 23, 2023

I generally like the direction. This likely is less performant but gives userspace a lot more control over what happens on the CPUs, and optimizing from there if necessary seems like the right approach. Generally looks good to me. A few comments:

  • Now that the interaction between the BPF and Rust parts is clearer, it might make sense to define the interlocking more concretely. BPF doesn't have a full memory model yet, but I don't think we need one here anyway. Setting usersched_needed should be a release operation, and reading and clearing it should be an acquire. The setting of usersched_needed in rustland_enqueue() already has a full mb in the preceding nr_enqueues increment, and reading and clearing are followed by multiple spinlocks which provide the needed rmb ordering. Just noting this; as such, it should be sufficient here.
  • I wonder whether rustland_stopping() needs to set usersched_needed, as the CPU might be running out of things to do; otherwise there may be unnecessary gaps of up to 1s, which don't look like they would be too difficult to trigger. An alternative and probably more robust implementation would be adding an .update_idle() method, which is called when a CPU is about to go idle, keeping track of the number of tasks queued in userspace and triggering usersched iff there are pending tasks when a CPU is about to go idle (see the sketch after this list). Strictly speaking, this should be the only hand-over mechanism necessary, and it can avoid e.g. triggering usersched unnecessarily after every enqueue while all CPUs are still busy.
  • Note that implementing .update_idle() will disable the built-in idle tracking and scx_bpf_test_and_clear_cpu_idle() will stop working. You can keep the built-in idle tracking on by setting SCX_OPS_KEEP_BUILTIN_IDLE. However, given that the benefit the code gets from default idle tracking is rather minimal and it is doing duplicate idle tracking anyway, maybe the right thing to do is implementing better "keep running" logic in .select_cpu() based on the number of tasks queued on the Rust side, i.e. if there's nothing to switch to, just keep running?
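
To make the second point above concrete, here is a rough BPF-side sketch of the suggested .update_idle() hand-over; the nr_queued counter and flag names are assumptions, not merged code:

```c
#include <scx/common.bpf.h>	/* include path may differ depending on tree layout */

/* Number of tasks currently parked in the user-space scheduler; assumed to
 * be kept up to date by the enqueue/dispatch paths. */
volatile u64 nr_queued;

/* Flag asking the dispatcher to run the user-space scheduler task. */
static u32 usersched_needed;

static void set_usersched_needed(void)
{
	/* The atomic op gives a full barrier, acting as the release above. */
	__sync_fetch_and_or(&usersched_needed, 1);
}

/* .update_idle(): called when @cpu is about to go idle.  Only poke the
 * user-space scheduler if it actually has pending tasks, so a fully busy
 * system doesn't get a wakeup after every enqueue. */
void BPF_STRUCT_OPS(rustland_update_idle, s32 cpu, bool idle)
{
	if (idle && nr_queued > 0)
		set_usersched_needed();
}
```

As noted in the last point, defining .update_idle() disables the built-in idle tracking unless SCX_OPS_KEEP_BUILTIN_IDLE is set.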

htejun merged commit 8443d8a into sched-ext:main on Dec 23, 2023
1 check passed