
scx_rustlite: simple vtime-based scheduler written in Rust #38

Merged: 1 commit merged into sched-ext:main on Dec 21, 2023

Conversation

@arighi (Collaborator) commented Dec 18, 2023:

Provide a simple scheduler with a BPF part and a user-space counterpart, written in Rust, that can be used as a template to implement more complex scheduling policies.

The main goal of this scheduler is to be easy to read and well documented, so that newcomers (e.g., students, researchers, junior devs) can use it to quickly experiment with their scheduling theories.

For this reason, the design of this scheduler focuses mostly on simplicity and code readability.

@sirlucjan (Collaborator) commented:

Builds and packages correctly:

==> Extracting sources...
  -> Creating working copy of scx git repo...
Cloning into 'scx'...
done.
==> Starting prepare()...
Applying patch rustlite.patch...
patching file scheds/rust/meson.build
patching file scheds/rust/scx_rustlite/.gitignore
patching file scheds/rust/scx_rustlite/Cargo.toml
patching symbolic link scheds/rust/scx_rustlite/LICENSE
patching file scheds/rust/scx_rustlite/build.rs
patching file scheds/rust/scx_rustlite/meson.build
patching file scheds/rust/scx_rustlite/rustfmt.toml
patching file scheds/rust/scx_rustlite/src/bpf/intf.h
patching file scheds/rust/scx_rustlite/src/bpf/main.bpf.c
patching file scheds/rust/scx_rustlite/src/bpf_intf.rs
patching file scheds/rust/scx_rustlite/src/bpf_skel.rs
patching file scheds/rust/scx_rustlite/src/main.rs
==> Starting pkgver()...
==> Starting build()...
+ exec meson setup --prefix /usr --libexecdir lib --sbindir bin --buildtype plain --auto-features enabled --wrap-mode nodownload -D b_lto=true -D b_pie=true -D python.bytecompile=1 . build
The Meson build system
Version: 1.3.0
Source dir: /home/lucjan/Pobrane/scx-scheds-git/src/scx
Build dir: /home/lucjan/Pobrane/scx-scheds-git/src/scx/build
Build type: native build
Project name: sched_ext schedulers
Project version: 0.1.2
C compiler for the host machine: cc (gcc 13.2.1 "cc (GCC) 13.2.1 20231216")
C linker for the host machine: cc ld.bfd 2.41.0
Host machine cpu family: x86_64
Host machine cpu: x86_64
Program clang found: YES (/usr/bin/clang)
Program bpftool found: YES (/usr/bin/bpftool)
Program cargo found: YES (/usr/bin/cargo)
Program /home/lucjan/Pobrane/scx-scheds-git/src/scx/meson-scripts/get_clang_ver found: YES (/home/lucjan/Pobrane/scx-scheds-git/src/scx/meson-scripts/get_clang_ver)
Program /home/lucjan/Pobrane/scx-scheds-git/src/scx/meson-scripts/bpftool_build_skel found: YES (/home/lucjan/Pobrane/scx-scheds-git/src/scx/meson-scripts/bpftool_build_skel)
Program /home/lucjan/Pobrane/scx-scheds-git/src/scx/meson-scripts/get_sys_incls found: YES (/home/lucjan/Pobrane/scx-scheds-git/src/scx/meson-scripts/get_sys_incls)
Program /home/lucjan/Pobrane/scx-scheds-git/src/scx/meson-scripts/cargo_fetch found: YES (/home/lucjan/Pobrane/scx-scheds-git/src/scx/meson-scripts/cargo_fetch)
meson.build:33: WARNING: clang >= 17 recommended (/usr/bin/clang ver=16.0.6)
Found pkg-config: YES (/usr/bin/pkg-config) 2.1.0
Run-time dependency libbpf found: YES 1.3.0
Message: cpu=x86_64 bpf_base_cflags=['-g', '-O2', '-Wall', '-Wno-compare-distinct-pointer-types', '-D__TARGET_ARCH_x86', '-mcpu=v3', '-mlittle-endian', '-idirafter /usr/lib/clang/16/include', '-idirafter /usr/local/include', '-idirafter /usr/include']
Build targets in project: 12

sched_ext schedulers 0.1.2

  User defined options
    auto_features     : enabled
    buildtype         : plain
    libexecdir        : lib
    prefix            : /usr
    sbindir           : bin
    wrap_mode         : nodownload
    python.bytecompile: 1
    b_lto             : true
    b_pie             : true

Found ninja-1.11.1 at /usr/bin/ninja
INFO: autodetecting backend as ninja                                                                                                                                                                                                                                           
INFO: calculating backend command to run: /usr/bin/ninja -C /home/lucjan/Pobrane/scx-scheds-git/src/scx/build
ninja: Entering directory `/home/lucjan/Pobrane/scx-scheds-git/src/scx/build'
[32/32] Generating scheds/rust/scx_rusty/scx_rusty with a custom command (wrapped by meson to set env)
==> Entering fakeroot environment...
==> Starting package()...
ninja: Entering directory `/home/lucjan/Pobrane/scx-scheds-git/src/scx/build'
[4/4] Generating scheds/rust/scx_rustlite/scx_rustlite with a custom command (wrapped by meson to set env)
Installing scheds/c/scx_simple to /home/lucjan/Pobrane/scx-scheds-git/pkg/scx-scheds-git/usr/bin
Installing scheds/c/scx_qmap to /home/lucjan/Pobrane/scx-scheds-git/pkg/scx-scheds-git/usr/bin
Installing scheds/c/scx_central to /home/lucjan/Pobrane/scx-scheds-git/pkg/scx-scheds-git/usr/bin
Installing scheds/c/scx_pair to /home/lucjan/Pobrane/scx-scheds-git/pkg/scx-scheds-git/usr/bin
Installing scheds/c/scx_flatcg to /home/lucjan/Pobrane/scx-scheds-git/pkg/scx-scheds-git/usr/bin
Installing scheds/c/scx_userland to /home/lucjan/Pobrane/scx-scheds-git/pkg/scx-scheds-git/usr/bin
Installing scheds/c/scx_nest to /home/lucjan/Pobrane/scx-scheds-git/pkg/scx-scheds-git/usr/bin
Running custom install script '/home/lucjan/Pobrane/scx-scheds-git/src/scx/meson-scripts/install_rust_user_scheds'
==> Tidying install...
  -> Removing libtool files...
  -> Purging unwanted files...
  -> Removing static library files...
  -> Stripping unneeded symbols from binaries and libraries...
  -> Compressing man and info pages...
==> Checking for packaging issues...
==> WARNING: Package contains reference to $srcdir
usr/bin/scx_simple
usr/bin/scx_qmap
usr/bin/scx_central
usr/bin/scx_pair
usr/bin/scx_flatcg
usr/bin/scx_userland
usr/bin/scx_nest
usr/bin/scx_layered
usr/bin/scx_rustlite
usr/bin/scx_rusty
==> Creating package "scx-scheds-git"...
  -> Generating .PKGINFO file...
  -> Generating .BUILDINFO file...
  -> Generating .MTREE file...
  -> Compressing package...
==> Leaving fakeroot environment.
==> Signing package(s)...
  -> Created signature file scx-scheds-git-0.1.2.r17.g239d5d1-2-x86_64_v3.pkg.tar.zst.sig.
==> Finished making: scx-scheds-git 0.1.2.r17.g239d5d1-2 (Mon, 18 Dec 2023, 17:19:41)

Everything indicates that it is working properly:

❯ sudo scx_rustlite                                                 
[sudo] password for lucjan: 
16:22:01 [INFO] RustLite scheduler attached
16:22:02 [INFO] nr_enqueues=1040 nr_user_dispatched=1039 nr_kernel_dispatches=52
16:22:03 [INFO] nr_enqueues=2692 nr_user_dispatched=2691 nr_kernel_dispatches=99
16:22:04 [INFO] nr_enqueues=4654 nr_user_dispatched=4653 nr_kernel_dispatches=142
16:22:05 [INFO] nr_enqueues=6238 nr_user_dispatched=6237 nr_kernel_dispatches=174
16:22:06 [INFO] nr_enqueues=7454 nr_user_dispatched=7453 nr_kernel_dispatches=213
16:22:07 [INFO] nr_enqueues=9877 nr_user_dispatched=9876 nr_kernel_dispatches=673
16:22:08 [INFO] nr_enqueues=13022 nr_user_dispatched=13021 nr_kernel_dispatches=1533
16:22:09 [INFO] nr_enqueues=14985 nr_user_dispatched=14983 nr_kernel_dispatches=1631
16:22:10 [INFO] nr_enqueues=17880 nr_user_dispatched=17878 nr_kernel_dispatches=1853
16:22:11 [INFO] nr_enqueues=19319 nr_user_dispatched=19318 nr_kernel_dispatches=1901
16:22:12 [INFO] nr_enqueues=22550 nr_user_dispatched=22549 nr_kernel_dispatches=2247
16:22:13 [INFO] nr_enqueues=24441 nr_user_dispatched=24440 nr_kernel_dispatches=242

@sirlucjan (Collaborator) left a comment:

As I mentioned, I built and tested it; everything seems to work properly.

@Byte-Lab (Contributor) left a comment:

Looks great, Andrea. I left a few comments, but the overall approach LGTM.

*
* The scheduler is made of a BPF part that implements the basic sched-ext
* functionalities and a user-space counterpart, written in Rust, that
* implement the scheduling policy itself.
Contributor:

s/implement/implements

/* Task @p starts to run on a CPU */
void BPF_STRUCT_OPS(rustlite_running, struct task_struct *p)
{
        bpf_printk("start: pid=%d (%s)", p->pid, p->comm);
Contributor:

Should we hide this (and other bpf_printk() calls) behind a debug flag? I'm fine with leaving it as is, just thought it was worth mentioning.

@arighi (Author):

Indeed! It was something that I was planning to do... but forgot to do. :) I'll add that.

        task.vtime = task_vtime(p);
        bpf_printk("enqueue: pid=%d vtime=%llu", task.pid, task.vtime);
        if (bpf_map_push_elem(&enqueued, &task, 0)) {
                bpf_printk("dispatch (global): pid=%d", task.pid);
Contributor:

Should we bring this into dispatch_task_global()?

 * kernel.
 */
        if (is_kthread(p) && p->nr_cpus_allowed == 1) {
                bpf_printk("dispatch (local): pid=%d", p->pid);
Contributor:

Should we bring this into dispatch_task_local()?

@arighi (Author):

Yes, we can move all the dispatch bpf_printk() calls into dispatch_task_local/global(), and maybe also add a vtime argument to the global variant, so that we can also fold in the bpf_printk() from rustlite_dispatch()?
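
A rough sketch of what that refactor might look like (the helper names come from this thread; the bodies and the vtime argument are illustrative assumptions):

static void dispatch_task_local(struct task_struct *p, u64 slice_ns,
                                u64 enq_flags)
{
        bpf_printk("dispatch (local): pid=%d", p->pid);
        scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, enq_flags);
}

static void dispatch_task_global(struct task_struct *p, u64 vtime,
                                 u64 slice_ns, u64 enq_flags)
{
        bpf_printk("dispatch (global): pid=%d vtime=%llu", p->pid, vtime);
        scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, slice_ns, enq_flags);
}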

 * is idle.
 */
        if (p->nr_cpus_allowed == 1 ||
            scx_bpf_test_and_clear_cpu_idle(prev_cpu))
Contributor:

So here (and below), we're grabbing an idle core, but then not dispatching the task to that idle core when it's enqueued in rustlite_enqueue() if it's not a per-cpu kthread. This is a bit sub-optimal because we don't really gain anything from nabbing that idle core, and we're taking it away from another task that could have come along and grabbed that core and enqueued directly on it.

If you look at scx_userland.bpf.c, we only do anything in the ops.select_cpu() callback if that task is going to be scheduled in the kernel directly, so we may want to do something similar here. Or, alternatively, it might be a reasonable policy to always schedule a task directly in the kernel if you're able to find an idle core.

@arighi (Author):

Ahh ok, this part was not very clear to me. Now, IIUC in ops.select_cpu() we can return a CPU that is where ops.enqueue() will be called for the task, right? That would explain the force_local and the dispatch from ops.enqueue().

If that's right I'll mimic what scx_userland does and I'll try to clarify all of this with some comments.

Contributor:

Yes, correct. The CPU returned by ops.select_cpu() is where the task will be migrated to and enqueued on in ops.enqueue(). So exactly as you're saying, the force_local flag, etc. is how we communicate that the task should be enqueued locally on that DSQ.

@htejun, perhaps it would be useful to always set the SCX_ENQ_LOCAL flag any time a task is migrated to a different CPU from a call to ops.select_cpu()?
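
For readers following along, a rough sketch of the force_local pattern being discussed, loosely modeled on scx_userland (the per-task flag is simplified to a single global here for brevity; a real scheduler would keep it in task local storage):

static bool force_local; /* simplified: real schedulers track this per-task */

s32 BPF_STRUCT_OPS(example_select_cpu, struct task_struct *p,
                   s32 prev_cpu, u64 wake_flags)
{
        /*
         * If prev_cpu is idle, claim it and tell .enqueue() to dispatch
         * the task directly to that CPU's local DSQ.
         */
        if (scx_bpf_test_and_clear_cpu_idle(prev_cpu))
                force_local = true;
        return prev_cpu;
}

void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
{
        if (force_local) {
                force_local = false;
                scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
                return;
        }
        /* ...otherwise hand the task off to the user-space scheduler */
}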

Contributor:

Wouldn't that be a bit too magical? SCX_ENQ_LOCAL itself is a quirky thing which is meaningful only when the default select_cpu is used. It is tempting to expand it to make it more useful but the benefit is pretty trivial & marginal, I think. e.g. your suggested semantics can be really confusing if an implementation switches from default select_cpu implementation to custom one. If the enqueue path is not updated, it'll work as expected sometimes, which is a pretty bad failure mode. The flag name becomes a misnomer too.

I wonder whether what we should do is make it really clear that the flag is only meaningful when the default select_cpu is used.

@Byte-Lab (Contributor) commented Dec 18, 2023:

> SCX_ENQ_LOCAL itself is a quirky thing which is meaningful only when the default select_cpu is used.

The fact that it's only useful when default select_cpu is used is really odd to me and seems circumstantial, considering that it's a part of the "public" API that's consumable and used by schedulers. To me, what it really should mean is, "This is a hint (that you can ignore) that you probably want to enqueue this on the local DSQ". In my opinion we should either document the semantics to make it consistent for any implementation of any scheduler, or we should probably just get rid of it altogether.

> If the enqueue path is not updated, it'll work as expected sometimes, which is a pretty bad failure mode.

I'm not sure I understand this point. What do you mean by "not updated"?

@Byte-Lab (Contributor) commented Dec 20, 2023:

> The default select_cpu() needs a way to communicate it.

Right, but IMO the relevant point is that the flag does communicate something about what CPU was chosen in ops.select_cpu(). The fact that it's only necessary to accommodate the default implementation of ops.select_cpu() is an implementation detail that we shouldn't be leaking. Put in other words, right now SCX_ENQ_LOCAL means: "you didn't implement ops.select_cpu(), and this is an (opaque) hint that you may want to enqueue locally". I think the flag should be made more generic than that. Something like SCX_CPU_CHANGED or SCX_ENQ_TASK_MIGRATED. The default implementation can and needs to use it, but there's no reason we shouldn't make it available to schedulers that implement ops.select_cpu() and want to take advantage of the same semantics for convenience. I expect we'd save a lot more code doing that than having a default implementation for ops.select_cpu().

> The fact that we're providing the default select_cpu implementation is where the quirkiness is coming from. The usefulness to quirkiness factor doesn't seem too bad to me.

Eh, I don't really agree. The existence of that flag adds surface area to the API, which itself adds cognitive load to users trying to understand how things work. IMO it's pretty non-intuitive that the flag is only relevant if you don't implement ops.select_cpu(), and given how infrequent it is likely to be that someone won't implement ops.select_cpu(), it feels like it provides minimal value.

To be honest, IMO we shouldn't even be trying to find idle cores in ops.select_cpu(). The default behavior should probably just be to return prev_cpu, as I'd expect the default implementation to do the minimal, sane thing. Selecting idle cores isn't a good idea for every use case, but prev_cpu is consistent behavior regardless of host topology.

> The meaning of the flag changes between the default select_cpu and non-default ones, which can cause a lot of confusion. We can be strict about it and e.g. have a kfunc interface instead so that there's an explicit query for "did the default select_cpu indicate local enqueue?" which triggers ops_error if the default select_cpu is not enabled, but that seems like overkill to me.

I agree that it's overkill and a poor UX to add a kfunc for this. The enq_flags field feels like it's meant for this exact scenario. I guess I'm just not understanding why we can't change it to be generic and apply to both the default and non-default uses?

Contributor:

I have a hard time understanding what you're suggesting because SCX_ENQ_CPU_CHANGED doesn't mean the same thing as what SCX_ENQ_LOCAL means. How would that work in practice?

As for the default implementation of select_cpu() becoming simpler, I'm not sure. It can be but I'm not sure whether or how much that'd be better. It's nice to have an effective default behavior as long as it's generic, so that even really simple schedulers can be useful and we can further optimize the default behavior to e.g. make it NUMA aware and whatnot.

As for cognitive overhead, yes it does add a bit but it's a small one. Maybe we can rename the flag to be clearer but it's something which is easily addressed by documentation and examples. Of the learning curves that writing a sched_ext scheduler involves, this isn't a noticeable one.

API cleanliness is a factor that we need to consider but not the only one and it's easy to overdo. Practically speaking, the impact here one way or the other is negligible, no?

Contributor:

Sorry, let me try to clarify what I mean a bit.

> I have a hard time understanding what you're suggesting because SCX_ENQ_CPU_CHANGED doesn't mean the same thing as what SCX_ENQ_LOCAL means. How would that work in practice?

So my issue is that SCX_ENQ_LOCAL feels like a weird and leaky part of the API to me because it doesn't actually communicate any concrete state / signal. It's a hint that it's advisable to enqueue the task on the local dsq of @p's rq, but how am I supposed to really interpret that if I'm implementing my own ops.enqueue()? Wouldn't it be more useful to have flags like:

/*
 * The task has changed CPUs since the last time it was enqueued.
 */
SCX_ENQ_CPU_CHANGED = 1LLU << 42,

A BPF prog could then do if ((enq_flags & SCX_ENQ_CPU_CHANGED) && (enq_flags & SCX_ENQ_WAKEUP)) to know that the task was migrated at wakeup time, and could dispatch it locally if it wants to. If a task is migrated to a different CPU on, say, the dispatch path, then (enq_flags & SCX_ENQ_WAKEUP) wouldn't be set. A prog could choose to do something with that information, or it could choose to ignore it. The point is that it actually has a concrete signal on the state of the task, rather than having to decide whether it wants to listen to an opaque hint about whether to enqueue locally.

> As for cognitive overhead, yes it does add a bit but it's a small one. Maybe we can rename the flag to be clearer but it's something which is easily addressed by documentation and examples. Of the learning curves that writing a sched_ext scheduler involves, this isn't a noticeable one.

Yeah I agree it's not horrible cognitive load. If there ended up being a non-negligible complexity tradeoff to implement this, I agree there's an argument to be made against it.

> API cleanliness is a factor that we need to consider but not the only one and it's easy to overdo. Practically speaking, the impact here one way or the other is negligible, no?

I agree that API cleanliness isn't ironclad. I think my confusion is that I'm not quite understanding what the issue is with making the flag more generic. Is there some complexity that I'm overlooking?

Contributor:

What's confusing to me is that SCX_ENQ_LOCAL doesn't indicate that the CPU has changed from @prev_cpu. It is telling .enqueue() that the CPU the task is on at the time of enqueue is ready for the task immediately, and thus the enqueueing operation can be optimized into a direct dispatch on the CPU. SCX_ENQ_LOCAL can be set when the task stayed on @prev_cpu too, so I don't see how a "cpu changed" flag can be a substitute. This is the .force_local flag for the default select_cpu implementation, it means exactly the same thing, and it doesn't have anything to do with whether the task's rq has changed or not.

Contributor:

Ahhh yeah, fair enough, for some reason I completely forgot about reserving and staying on @prev_cpu. We can just keep the semantics the way they are. It is kind of sad to me because it's one of the more awkward parts of the API, but as you said it's pretty minor and can be fixed with documentation.

*
* - tasks are then dispatched from the lowest to the highest vtime
*
* The vtime is evaluated in the BPF part, tasks' then all the information are
Contributor:

I had a bit of trouble parsing this sentence. Did it mean to say, "then all tasks' information are sent to the user-space part and stored in a red-black tree"? (applies to the rust comment header as well)

@arighi (Author):

I think I just fat-fingered it and cut some text, I'll rephrase it. :) But yeah, the logic is pretty much what you said.

@arighi (Author):

I've rephrased it like this; I think it's a bit clearer:

 * The BPF component evaluates the vtime for the tasks that are ready to run.
 * Subsequently, all task details are sent to the user-space component and
 * recorded in a red-black tree (indexed by vtime). The user-space part then
 * returns to the BPF component the list of tasks (ordered by their vtime) to
 * be dispatched following the order specified by the user-space.

/// simplicity and code readability.
///
/// The scheduler is made of a BPF part that implements the basic sched-ext functionalities and a
/// user-space counterpart, written in Rust, that implement the scheduling policy itself.
Contributor:

s/implement/implements

Comment on lines +216 to +401
// Yield to avoid using too much CPU from the scheduler itself.
thread::yield_now();
Contributor:

Another nice thing about the ringbuffer maps is that user space can block on epoll_wait() when waiting for more messages from the kernel. This of course works just fine as well, given that the kernel will only schedule the usersched task when necessary.
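
As a rough illustration of the point above: with a BPF_MAP_TYPE_RINGBUF map, a libbpf C consumer can sleep in epoll_wait() until the BPF side submits records (the map and handler names here are assumptions; the libbpf-rs equivalent would look similar):

#include <bpf/libbpf.h>

static int handle_event(void *ctx, void *data, size_t size)
{
        /* process one record submitted by the BPF side */
        return 0;
}

static int consume_events(int ringbuf_map_fd)
{
        struct ring_buffer *rb;
        int err;

        rb = ring_buffer__new(ringbuf_map_fd, handle_event, NULL, NULL);
        if (!rb)
                return -1;
        /*
         * ring_buffer__poll() blocks in epoll_wait() until new records
         * arrive, so the user-space thread doesn't have to spin.
         */
        while ((err = ring_buffer__poll(rb, -1)) >= 0)
                ;
        ring_buffer__free(rb);
        return err;
}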

Comment on lines 308 to 310
thread::spawn(|| loop {
thread::sleep(Duration::from_secs(1));
});
Contributor:

We could also do this by setting a timer in BPF code, but this works too

@arighi (Author):

Yeah, a bpf_timer is definitely nicer, I'll add that.
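
A rough sketch of the bpf_timer approach, keeping in mind that a BPF timer has to live inside a map value (all names here are illustrative, and the clock constant is assumed to be defined):

struct timer_elem {
        struct bpf_timer timer;
};

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, u32);
        __type(value, struct timer_elem);
} heartbeat SEC(".maps");

static int heartbeat_fn(void *map, u32 *key, struct bpf_timer *timer)
{
        /* ...kick the user-space scheduler here, then re-arm... */
        bpf_timer_start(timer, 1000000000 /* 1s */, 0);
        return 0;
}

static int heartbeat_init(void)
{
        u32 key = 0;
        struct timer_elem *e = bpf_map_lookup_elem(&heartbeat, &key);

        if (!e)
                return -1;
        bpf_timer_init(&e->timer, &heartbeat, CLOCK_MONOTONIC);
        bpf_timer_set_callback(&e->timer, heartbeat_fn);
        return bpf_timer_start(&e->timer, 1000000000 /* 1s */, 0);
}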

@htejun (Contributor) left a comment:

Left several comments but here are two larger points:

  1. I think it'd be useful if the userspace code became stateful, so that it has a per-task context structure. That'd be more code, but I think it'd make the code easier to modify for folks who wanna play with it, as the necessary scaffolding would already be there, and a lot could be done by just updating the Rust code without touching the BPF part.
  2. As noted in the in-line comment, it'd be great if on-demand dispatch could be implemented. It doesn't have to be fully precise and there can be some overdispatching to reduce execution bubbles, but again, it'd make it a lot easier to implement a meaningful scheduling policy by just updating the Rust part.

typedef int s32;
typedef unsigned long long u64;
typedef long long s64;
#endif
Contributor:

Unrelated, but we probably should move the above boilerplate code into scx_utils.

* Exit info (passed to the user-space counterpart).
*/
int exit_kind = SCX_EXIT_NONE;
char exit_msg[SCX_EXIT_MSG_LEN];
Contributor:

Another unrelated todo: Now that we have scx_utils, we should be able to make the C schedulers' user_exit_info available to Rust schedulers to consolidate exit handling.

} enqueued SEC(".maps");

/*
* Tasks enqueued by the user-space for dispatching.
Contributor:

nit: It's a bit confusing to say "enqueued for dispatching" as "enqueue" is an operation name. Maybe use "queued"?

@arighi (Author):

I was actually considering renaming those queue maps to something like to_userspace and from_userspace. Maybe it's a bit naive, but it would really help to understand that they're used for kernel/user-space communication.

/* Return a (wrapping safe) time delta */
static inline u64 time_diff(u64 end, u64 start)
{
        return (s64)end - (s64)start;
Contributor:

I don't think the (s64) casts are doing anything. The result is the same whether the subtraction is done with signeds or unsigneds. The comment makes it a bit confusing because what it's claiming doesn't match the code.

Contributor:

Did you want to do something like this?

s64 delta = end - start;

if (delta < 0)
        scx_bpf_error(...);
return delta;

@arighi (Author) commented Dec 19, 2023:

You are definitely right, we don't need the s64 casting at all, I'll fix that (maybe even get rid of time_diff() since it'd be just end - start).

Edit: about erroring when delta < 0, I'm not sure it's really needed here. This part would work just fine even if the time wrapped (for any reason). Unless we want to explicitly detect this particular case to catch potential kernel bugs (but even in that case the check should be in the kernel, I think).

u64 now = bpf_ktime_get_ns();

ts = bpf_task_storage_get(&task_storage_map, p, 0, 0);
if (!ts)
Contributor:

Would it make sense to trigger scx_bpf_error() on lookup failure?

@arighi (Author):

Yes, we should definitely error here, I don't see any "sane" reason why the entry in task_storage_map should not exist. Will fix that and refactor the code a bit to always error when we can't find a corresponding entry in task_storage_map.
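
At each lookup site that might look something like the following (the struct name task_ctx is a guess based on the quoted code; the map name follows it):

        struct task_ctx *ts = bpf_task_storage_get(&task_storage_map, p, 0, 0);

        if (!ts) {
                scx_bpf_error("task_ctx lookup failed for pid=%d", p->pid);
                return;
        }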

        task.pid = p->pid;
        task.vtime = task_vtime(p);
        bpf_printk("enqueue: pid=%d vtime=%llu", task.pid, task.vtime);
        if (bpf_map_push_elem(&enqueued, &task, 0)) {
Contributor:

Maybe add a stat counter so that users can notice when things go wrong with event handling?
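
For instance, a counter bumped on push failure that the user-space stats loop could report alongside nr_enqueues (the counter and helper names here are assumptions):

volatile u64 nr_failed_enqueues;

static int queue_task_to_userspace(struct queued_task_ctx *task)
{
        int err = bpf_map_push_elem(&enqueued, task, 0);

        if (err)
                /* queue full: record the failure so user space can see it */
                __sync_fetch_and_add(&nr_failed_enqueues, 1);
        return err;
}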

 * Otherwise migrate to the next idle CPU, if available (this strategy
 * is not ideal, since it assumes a migration cost of zero).
 */
        cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
Contributor:

So, there is no point in picking an idle CPU here. .enqueue() is still going to run on the current waking CPU. The only thing that's being determined by select_cpu() is which rq the task is going to belong to and whether an rq migration needs to happen. Here, the task is either being locally dispatched or bounced to userland. Idle CPU selection and migration don't really do anything. AFAICS, none of this is doing anything but adding unnecessary overhead. The function can always return prev_cpu.

@arighi (Author):

I'll probably mimic what scx_userland is doing, as suggested by @Decave.
Longer term, I would also like to offload the logic of ops.select_cpu() to userspace as well.

@htejun (Contributor) commented Dec 19, 2023:

This probably needs better documentation, but .select_cpu() is purely an optimization. .enqueue() and .dispatch() have complete control over which CPU a given task runs on. The only thing .select_cpu() brings to the table is that if its verdict matches the CPU that .enqueue() or .dispatch() ends up choosing for the task, we can avoid a locking dance to hot-migrate the task for execution, thus reducing the overhead by a small amount.

Here, because all non-trivial scheduling decisions are punted to userspace, .select_cpu() doesn't have any way of predicting what the eventual CPU selection is going to be, and thus whatever work it does is wasted. Also, .select_cpu() can't be offloaded to userland because it's in the synchronous wakeup path. Besides, it doesn't make a whole lot of sense anyway. The overhead it saves when it successfully predicts the CPU the task is going to run on is fairly small. It's likely impossible to offset the travel to userspace even if we could offload it to userland.

@arighi (Author):

Hm... so maybe we can just rely on select_cpu_dfl() and get rid of ops.select_cpu() in this scheduler.

Or what about offloading cpu selection to user-space without using ops.select_cpu()?

Basically add some simple logic to the Rust code that would update .target_cpu in the task context and then always use scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | target_cpu, slice_ns, 0) in ops.dispatch(). Would that work?

Contributor:

I think the right thing to do is having a really dumb implementation which just returns @prev_cpu. That's cheaper than the default implementation and likely the best behavior anyway.

Contributor:

BTW, adding .target_cpu can work too; however, I'm skeptical it'd be worth the complication. It's not a huge optimization. If we're bouncing through userspace to make a scheduling decision anyway, we aren't going to notice this micro-optimization.

@arighi (Author):

OK, makes sense, let's keep it simple and just return @prev_cpu. Thanks for the review and all the good suggestions @htejun!
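
The agreed-upon endpoint is about as small as an op can get; a sketch (the op name follows the scheduler's rustlite_ prefix):

s32 BPF_STRUCT_OPS(rustlite_select_cpu, struct task_struct *p,
                   s32 prev_cpu, u64 wake_flags)
{
        /*
         * All non-trivial scheduling decisions are made in user space,
         * so just keep the task where it was.
         */
        return prev_cpu;
}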

s32 BPF_STRUCT_OPS_SLEEPABLE(rustlite_init)
{
        if (!switch_partial)
                scx_bpf_switch_all();
Contributor:

Unrelated, but we probably should switch the polarity given that basically all schedulers are calling switch_all.

std::slice::from_raw_parts(
    &task as *const Task as *const u8,
    std::mem::size_of::<Task>(),
)
Contributor:

Oh man, this part is so painful. Improving this is on libbpf-rs's roadmap so hopefully it should become better in the near future.

@arighi (Author):

Yep, that's the hacky part that I mentioned during office hours. 🤣 But I didn't want to spend too much time on that, considering that something nicer will come from libbpf-rs soon (hopefully).


// Dispatch tasks from the task pool in order (sending them to the BPF part).
loop {
    match self.task_pool.pop() {
Contributor:

I know the focus is on being easy but I wonder whether the fact that all tasks are immediately dispatched is too limiting. We want dispatch to be happening when the CPUs are ready to run tasks so that tasks can be queued and ordered while waiting for CPUs to open up. Maybe the interlocking is a bit tricky to figure out but it might not take a lot of code for the BPF part to indicate how many CPUs are currently looking for tasks and userspace part can dispatch accordingly. It doesn't have to be completely accurate either.

@arighi (Author):

Maybe if I move the whole select_cpu() logic to user-space this part should also become easier... need to think a little bit more about this.

@arighi (Author) commented Dec 20, 2023:

@Decave @htejun I think I've addressed everything mentioned in your reviews (thanks again for the nice suggestions!).

Now the scheduler looks more similar to scx_userland and I think that's good, because in this way we have a fairly easy template of a user-space scheduler both in C and Rust (should we call this scx_userlandrust?).

Most noticeable changes:

  • now the vtime is not based on wallclock time anymore, but on cputime, and the whole logic has been moved to the Rust component (so the BPF part is now completely agnostic of the scheduling policy, which lives entirely in Rust)
  • the dispatcher is not called immediately after draining the queued list anymore, but only when there's at least one CPU that is not busy
  • I rewrote the Rust part of the message passing via BPF_MAP_TYPE_QUEUE in a less hacky way (even if it would be nice to have proper abstractions from libbpf-rs)
  • ops.select_cpu() now always returns prev_cpu
  • all the bpf_printk()'s are called only when debug is enabled
  • the ugly "scheduler heartbeat" task has been replaced by a nicer bpf_timer
  • updated documentation, comments, small fixes here and there

Let me know what you think. Thanks!

/* Allow to use bpf_printk() only when @debug is set */
#define dbg_msg(_fmt, ...) do {                         \
        if (unlikely(debug))                            \
                bpf_printk(_fmt, ##__VA_ARGS__);        \
Contributor:

I think this will fail to compile if there are no args in the debug message? What about this?

bpf_printk(_fmt __VA_OPT__(,) __VA_ARGS__);

Contributor:

I usually do #define MACRO(fmt, args...) bpf_printk(fmt, ##args) and that works fine whether there are args or not. What @arighi did works fine too and seems easier on the eyes than __VA_OPT__, no?

@arighi (Author):

#define dbg_msg(_fmt, ...) do {                         \
        if (debug)                                      \
                bpf_printk(_fmt, ##__VA_ARGS__);        \
} while(0)

This seems to work also with no args and it doesn't look too ugly. I'll go with this one.

Contributor:

Ahh, I understand the macro now, it LGTM. Somehow I'd never seen ##__VA_ARGS__ applied that way, but now I'm seeing that it's actually super common.

 */
struct {
        __uint(type, BPF_MAP_TYPE_QUEUE);
        __uint(key, 0);
Contributor:

I think you can remove the __uint(key, 0); line from the queued and dispatched maps.
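
I.e., since BPF_MAP_TYPE_QUEUE has no keys, the definition only needs the value type and the number of entries, along these lines (the value type and size constant below are illustrative):

struct {
        __uint(type, BPF_MAP_TYPE_QUEUE);
        __uint(max_entries, MAX_ENQUEUED_TASKS);
        __type(value, struct queued_task_ctx);
} queued SEC(".maps");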

 * is idle.
 */
        if (p->nr_cpus_allowed == 1 ||
            scx_bpf_test_and_clear_cpu_idle(bpf_get_smp_processor_id()))
@Byte-Lab (Contributor) commented Dec 20, 2023:

So the whole workflow with which actual physical core ops.enqueue() is called on is pretty confusing and probably one of the leakier parts of the API, so I understand where the confusion is coming from here. Taking a step back, let me try to clarify some of the operations going on here:

ops.select_cpu() and ops.enqueue() are called from the CPU where try_to_wake_up() is being invoked in the kernel. That does not need to be the same CPU that the task being woken up was previously running on, however. For example, you could have something like this:

CPU 0                 CPU 1
task_x blocks

                       try_to_wake_up(task_x)
                       cpu = ops.select_cpu(task_x, 0); // returns 0, i.e. prev_cpu
                       if (cpu != prev_cpu)
                              /*
                               * Even if the cpu returned by ops.select_cpu()
                               * was different than prev_cpu, we're not changing
                               * which CPU we're invoking the callbacks on here.
                               * We're just updating @p's CPU and rq pointers, which
                               * as we'll see below, is how we determine which
                               * SCX_DSQ_LOCAL DSQ to dispatch the task to from
                               * ops.enqueue().
                               */
                               set_task_cpu(p, cpu);
                       activate_task(rq /* rq of CPU 1 */, p, en_flags);
                                ops.enqueue(p, en_flags);
                                        // 1 == bpf_get_smp_processor_id()
                                        scx_bpf_test_and_clear_cpu_idle(1);

So the problem with doing things this way is that bpf_get_smp_processor_id() in scx_bpf_test_and_clear_cpu_idle(bpf_get_smp_processor_id()) will be the CPU that called try_to_wake_up(), not the CPU returned by ops.select_cpu().

It's admittedly very confusing because conceptually ops.enqueue() is called on the CPU that's returned by ops.select_cpu(). We can do this because in scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, enq_flags);, we enqueue the task on the local DSQ according to the task's rq, not the rq of the current core.

So, going back to the code here, I believe what we're trying to do is to directly dispatch the task if its previous CPU is still idle. If that's the case, I think we should be able to just change scx_bpf_test_and_clear_cpu_idle(bpf_get_smp_processor_id()) to scx_bpf_test_and_clear_cpu_idle(p->thread_info.cpu). Or, the more common approach would probably be to call scx_bpf_test_and_clear_cpu_idle(prev_cpu) in ops.select_cpu(), and then use task local storage to mark that the task should enqueue directly on the LOCAL_DSQ.

Note that calling scx_bpf_test_and_clear_cpu_idle(p->thread_info.cpu) from is_task_cpu_available() may confuse some readers, as the core is only marked as idle in the internal idle mask tracking once there are no tasks left to run on the CPU. So even if @p is the only task running on p->thread_info.cpu, scx_bpf_test_and_clear_cpu_idle(p->thread_info.cpu) won't return true after it's gone through its first slice.

@htejun This seems to be a common point of confusion, so I think we should probably try to figure out a way to more clearly document the semantics.

Contributor:

There's the scx_bpf_task_cpu() helper to obtain the CPU associated with a task, and yeah, this can be confusing and needs better documentation. I was wrong before, too, in saying that .select_cpu()'s only role is optimizing hot-migration overhead. It determines the CPU that .enqueue() can directly dispatch to.

As for the single-thread-left-on-the-CPU case: because we put the previous task back after dispatching the next one, and the CPU automatically keeps running the previous task if nothing was dispatched (unless SCX_OPS_ENQ_LAST is set), that'd probably be okay behavior-wise, I think. I.e., the .enqueue() for the current task is only going to run if .dispatch() dispatched another task beforehand.

Also, not a blocker but if the scheduler is going to do direct dispatches from the enqueue path, it may as well do full idle CPU picking in .select_cpu()? If you go that route, as this does represent a non-trivial loss of control on the userspace scheduler's part, it may make sense to make the behavior optional.

@arighi (Author):

Thanks @Decave and @htejun for all the clarifications on .select_cpu(), .enqueue() and the like, you both have been super helpful! Now it makes a lot more sense to me why other schedulers are using the per-task force_local flag.

For this scheduler, in the end I think it would probably make sense to mimic what scx_userland is doing (so full idle CPU picking in .select_cpu()), and add some comments in the code to clarify the difference between the "current CPU" and the task's current CPU.

But looking ahead, making full idle CPU picking in .select_cpu() optional is also an interesting idea: we could have 1) CPU selection performed by the dispatcher vs. 2) CPU selection performed by the scheduler.

The former would do idle picking in .select_cpu() and then dispatch to the local DSQ when force_local is set, as usual. With the latter, .select_cpu() can always return prev_cpu; the user-space scheduler could then access that CPU via struct queued_task_ctx, potentially override the selection based on some internal logic, and send the selected CPU for the task back to the dispatcher via struct dispatched_task. If it's different, the dispatcher can then use scx_bpf_dispatch() with SCX_DSQ_LOCAL_ON | cpu in ops.dispatch().

Does this make sense? Do you think it's doable or there's still something that I'm missing?

Contributor:

In 2), one thing to consider is that LOCAL_ON dispatching is always the highest priority. Taking a step back, this goes into whether the userspace scheduler is involved in deciding which CPU a given task ends up executing on. If we go there, the BPF part should probably also expose the idle mask and the current tasks to the userland scheduler so that it can make a more intelligent decision, and we would also likely want to keep a tight leash on how many tasks are queued in the dispatcher, as it's easier to get ordering inversions otherwise.
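
To make option 2 above concrete, a rough sketch of a dispatch path honoring a CPU picked by user space (struct and field names follow the thread; everything else is an assumption, not the merged code):

void BPF_STRUCT_OPS(rustlite_dispatch, s32 cpu, struct task_struct *prev)
{
        struct dispatched_task task;
        int i;

        /* bounded loop to keep the verifier happy; the batch size is arbitrary */
        for (i = 0; i < 128; i++) {
                struct task_struct *p;

                if (bpf_map_pop_elem(&dispatched, &task))
                        break;
                p = bpf_task_from_pid(task.pid);
                if (!p)
                        continue;
                if (task.cpu >= 0)
                        /* user space picked a CPU: send it to that local DSQ */
                        scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | task.cpu,
                                         slice_ns, 0);
                else
                        scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, slice_ns, 0);
                bpf_task_release(p);
        }
}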

 * Keep using the same CPU if the task can only run there or if the CPU
 * is idle.
 */
        if (p->nr_cpus_allowed == 1 ||
Contributor:

If we're going to stay in the kernel for any percpu thread, I think we can bring it to the top and remove the && on 189?

@arighi (Author) commented Dec 21, 2023:

> If we're going to stay in the kernel for any percpu thread, I think we can bring it to the top and remove the && on 189?

Actually, on this one I think we should still distinguish between "task is a percpu kthread" and "task can run only on 1 CPU". For the former it is fine to always bypass the scheduler and dispatch from .enqueue() directly, but the latter should still go through the scheduler; otherwise regular tasks with their affinity set to a single CPU would always go before the others.

@htejun (Contributor) left a comment:

I left some comments but it generally looks fine to me as the starting point. @Decave, please feel free to merge once things look okay to you.

typedef long long s64;
#endif

#include <scx/ravg.bpf.h>
Contributor:

I don't think ravg is used anywhere. Maybe drop?


//
// We should probably refresh this counter during the normal execution to support cpu
// hotplugging, but for now let's keep it simple and set this only at initialization).
let nr_cpus_online = libbpf_rs::num_possible_cpus().unwrap() as u64;
Contributor:

Yeah, I was thinking about adding hotplug support to rusty / layered too, but I wonder whether an easier and arguably more sensible approach is building a generic mechanism to restart itself after a hotplug event. I.e., have some scx_utils helpers that can be used easily in Rust code to detect hotplug events, unload the scheduler, re-run the init code and reload the BPF blob. Maybe scx_utils could provide a trait with e.g. init and run operations, and Rust sched implementations could use them to get things like exit info reporting and restart-on-hotplug automatically.

let maps = self.skel.maps();
let queued = maps.queued();

loop {
Contributor:

Not a blocker: Ideally, we shouldn't be dumping the whole queue into the dispatcher, because then the userspace scheduler loses the ability to insert tasks that wake up later but at a much higher priority before the waiting low-priority ones. As implemented, how much ordering power the userland scheduler has is rather accidental - it depends on how often it gets to dump the queue content.

This exposes a weakness in userspace scheduling: the more immediately the scheduling decisions are made, the higher the overhead becomes, as we'd then have to wake up the user-level thread whenever a CPU becomes idle, which then may have to stall a bit until the userspace scheduler makes a decision, and so on. Off the top of my head, there are a couple of ways to address this:

  1. Have a reasonable batch size (probably a function of the number of CPUs and slice duration) so that we don't dump the whole queue, which can be pretty big, at once, but also don't have to wake up the userland thread on every context switch.
  2. Have some ordering on the BPF dispatch side too, so that userland can retroactively insert higher-pri tasks after other tasks have been dispatched. E.g., instead of using DSQ_GLOBAL, use a vtime-ordered custom DSQ. It does incur double rbtree queueing overhead, but I doubt that would matter in this case.

Anyways, something to mull over as this gets developed.
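
A small sketch of point 2: a custom vtime-ordered DSQ sitting between the user-space dispatches and the CPUs (the names are illustrative):

#define SHARED_DSQ 0

s32 BPF_STRUCT_OPS_SLEEPABLE(example_init)
{
        /* create one shared DSQ, on any NUMA node */
        return scx_bpf_create_dsq(SHARED_DSQ, -1);
}

/* Called for each task handed back by the user-space scheduler. */
static void dispatch_ordered(struct task_struct *p, u64 vtime, u64 slice_ns)
{
        /*
         * The DSQ keeps tasks sorted by vtime, so a high-priority
         * (low-vtime) task dispatched later can still overtake tasks
         * that were dispatched before it.
         */
        scx_bpf_dispatch_vtime(p, SHARED_DSQ, slice_ns, vtime, 0);
}

void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
{
        /* move the lowest-vtime task to this CPU's local DSQ */
        scx_bpf_consume(SHARED_DSQ);
}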

sum_exec_runtime: task.sum_exec_runtime,
vruntime: self.min_vruntime,
};
self.task_map.insert(task.pid, task_info);
Contributor:

Another non-blocker: We probably wanna make the BPF side send a "task exited" message to clean up the user-side context. A tricky thing here is that the .disable() callback can't wait, so the signalling queue can overflow.

An imperfect solution which works most of the time should be fine here but the canonical solution would be establishing a common mmap'd map entry between user and BPF code during .prep_enable() and indicate task exit by chaining them from BPF side so that there's no memory allocation involved in signalling exit. Might be another thing which would be nice to have in scx_utils.

@arighi (Author):

... or the scheduler could implement its own garbage collector to figure out the pids that are not running anymore, but having a feedback from the BPF part would be nicer. I'll add a TODO note for this in the code.

This scheduler is made of a BPF component (dispatcher) that implements
the low level sched-ext functionalities and a user-space counterpart
(scheduler), written in Rust, that implements the actual scheduling
policy.

The main goal of this scheduler is to be easy to read and well
documented, so that newcomers (e.g., students, researchers, junior
devs) can use this as a template to quickly experiment with scheduling
theory.

For this reason the design of this scheduler is mostly focused on
simplicity and code readability.

Moreover, the BPF dispatcher is completely agnostic of the particular
scheduling policy implemented by the user-space scheduler. For this
reason, developers who want to use this scheduler to experiment with
scheduling policies should be able to simply modify the Rust component,
without having to deal with any internal kernel / BPF details.

Future improvements:

 - Transfer the responsibility of determining the CPU for executing a
   particular task to the user-space scheduler.

   Right now this logic is still fully implemented in the BPF part and
   the user-space scheduler can only decide the order of execution of
the tasks, which significantly restricts the scheduling policies that
   can be implemented in the user-space scheduler.

 - Experiment with the possibility of sending tasks from the user-space
   scheduler to the BPF dispatcher using a batch size, instead of
   draining the task queue completely and sending all the tasks at once
   every single time.

   A batch size should help reduce the overhead, and it should also
   help reduce the wakeups of the user-space scheduler.

Signed-off-by: Andrea Righi <[email protected]>
@arighi (Author) commented Dec 21, 2023:

Alright, another push with more fixes. I'm still not 100% convinced of what I'm doing in .select_cpu(), but from some testing it doesn't look too bad (in particular, I'm not convinced about not setting force_local when I find an idle CPU that is != prev_cpu).

The other open issues are:

  • introduce the dispatch batch size suggested by @htejun
  • figure out a reasonable way to move the select_cpu() logic to the user-space scheduler

I'll work on those over the next few days, but for now I wanted to push a new version to avoid having too many changes piled up. :)

@Byte-Lab Byte-Lab merged commit 4cadb92 into sched-ext:main Dec 21, 2023
1 check passed
@Byte-Lab (Contributor):

Great work on this, thanks @arighi
