
Cut an scx-6.8rc.y branch for 6.8-rc releases #1

Merged 374 commits on Jan 23, 2024
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Aug 1, 2023

  1. scx: Adjust a couple of small things in rusty.bpf.c

    rusty.bpf.c has a few small places where we can improve either the
    formatting of the code or the logic. In rusty_select_cpu(), we declare
    the idle_smtmask as struct cpumask * when it could be const. Also, when
    initializing the pcpu_ctx, we're using an open-coded for-loop instead of
    bpf_for. Let's fix up these small issues.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Aug 1, 2023 (2d87e47)
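
    As a rough sketch (not the actual diff; pcpu_ctx, nr_cpus and the field
    name are assumptions for illustration), the two changes look like:

        /* declare the idle mask pointer const -- it is never written */
        const struct cpumask *idle_smtmask = scx_bpf_get_idle_smtmask();

        /* use the verifier-friendly bpf_for macro rather than an
         * open-coded for-loop for per-CPU context initialization */
        u32 cpu;
        bpf_for(cpu, 0, nr_cpus) {
                pcpu_ctx[cpu].dom_rr_cur = cpu;
        }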
  2. Merge pull request #39 from sched-ext/rusty_bpf_adjustments

    scx: Adjust a couple of small things in rusty.bpf.c
    htejun authored Aug 1, 2023 (a0943ea)

Commits on Aug 2, 2023

  1. scx: Rename "type" -> "exit_type"

    When used from a BPF scheduler that is launched via libbpf-rs, this
    naming runs into issues because "type" is a reserved keyword in Rust.
    
    Signed-off-by: Dan Schatzberg <[email protected]>
    dschatzberg committed Aug 2, 2023 (4c52836)
  2. Merge pull request #41 from dschatzberg/type_rename

    scx: Rename "type" -> "exit_type"
    Byte-Lab authored Aug 2, 2023 (94a5c60)

Commits on Aug 3, 2023

  1. scx: Make cpumask arg to ops.set_cpumask() const

    The struct cpumask * argument to the ops.set_cpumask() op isn't const.
    It doesn't really matter in terms of mutability in a BPF program, but
    let's make it const because the mask really is read-only for the callback.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Aug 3, 2023 (8b8596e)
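
    The signature change itself, shown in isolation from the ops table:

        /* before */
        void (*set_cpumask)(struct task_struct *p, struct cpumask *cpumask);

        /* after: the callback only ever reads the mask, so say so */
        void (*set_cpumask)(struct task_struct *p,
                            const struct cpumask *cpumask);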
  2. Merge pull request #42 from sched-ext/struct_cpumask

    scx: Make cpumask arg to ops.set_cpumask() const
    htejun authored Aug 3, 2023 (aeaacb3)

Commits on Aug 8, 2023

  1. scx: Use unsigned long for rq->scx.pnt_seq instead of u64

    Andrea Righi reports that smp_load_acquire() can't be used on u64s on some
    32bit architectures. pnt_seq is used to close a very short race window and
    32bit should be more than enough. Use unsigned long instead of u64.
    htejun committed Aug 8, 2023 (f0fd99d)
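
    A minimal sketch of the pattern, assuming the field lives in rq->scx:

        /* unsigned long is natively sized, so the acquire/release pair
         * also works on 32bit architectures, unlike a u64 */
        unsigned long pnt_seq;

        /* writer side */
        smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);

        /* reader side: pairs with the store_release above */
        unsigned long seq = smp_load_acquire(&rq->scx.pnt_seq);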
  2. scx: Allow calling some kfuncs from tracepoints

    Some of the sched_ext kfuncs are fine to call from tracepoints. For
    example, we may want to call scx_bpf_error_bstr() if some error
    condition is detected in a tracepoint rather than a sched_ext ops
    callback. This patch therefore separates the scx_kfunc_ids_any kfunc BTF
    set into two sets: one of which includes kfuncs that can only be called
    from struct_ops, and the other which can be called from both struct_ops
    and tracepoint progs.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Aug 8, 2023 (f2625bf)
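
    A sketch of the split using the kernel's BTF kfunc set machinery (the
    set name is from the message; registration details are illustrative):

        BTF_SET8_START(scx_kfunc_ids_any)
        BTF_ID_FLAGS(func, scx_bpf_error_bstr)  /* safe from tracepoints */
        BTF_SET8_END(scx_kfunc_ids_any)

        static const struct btf_kfunc_id_set scx_kfunc_set_any = {
                .owner = THIS_MODULE,
                .set   = &scx_kfunc_ids_any,
        };

        /* register the shared set for both struct_ops and tracing progs */
        register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_any);
        register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &scx_kfunc_set_any);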
  3. Merge pull request #43 from sched-ext/kfunc_tracepoints

    scx: Allow calling some kfuncs from tracepoints
    htejun authored Aug 8, 2023 (36d4880)
  4. scx: Use atomic_long_t for scx_nr_rejected instead of atomic64_t

    atomic64_t can be pretty inefficient on 32bit archs, and the counter
    being 32bit on 32bit archs is fine. Let's use atomic_long_t instead.
    htejun committed Aug 8, 2023 (1d00785)
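
    The change is mechanical; before and after:

        /* before: forces a 64-bit atomic even on 32bit archs */
        static atomic64_t scx_nr_rejected = ATOMIC64_INIT(0);
        atomic64_inc(&scx_nr_rejected);

        /* after: native word size, which is all this counter needs */
        static atomic_long_t scx_nr_rejected = ATOMIC_LONG_INIT(0);
        atomic_long_inc(&scx_nr_rejected);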
  5. scx: Make p->scx.ops_state atomic_long_t instead of atomic64_t

    Some 32bit archs can't do 64bit store_release/load_acquire. Use
    atomic_long_t instead.
    htejun committed Aug 8, 2023 (e453cbb)
  6. Merge pull request #44 from sched-ext/scx-misc-updates

    Use unsigned longs for atomics
    htejun authored Aug 8, 2023 (35aef07)

Commits on Aug 14, 2023

  1. Commit 845aec9
  2. Merge pull request #20 from inwardvessel/resize_percpu_arrays_in_examples

    use resizing of datasec maps in examples
    Byte-Lab authored Aug 14, 2023 (8ade500)

Commits on Aug 30, 2023

  1. scx: bpf_scx_btf_struct_access() should return -EACCES for unknown accesses
    
    The function is currently returning 0 for unknown accesses, which means
    allowing writes to anything. Fix the default return value.
    htejun committed Aug 30, 2023 (cb04f56)
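
    A sketch of the fixed shape (the allowed-field check shown here is
    illustrative):

        static int bpf_scx_btf_struct_access(struct bpf_verifier_log *log,
                                             const struct bpf_reg_state *reg,
                                             int off, int size)
        {
                const struct btf_type *t = btf_type_by_id(reg->btf, reg->btf_id);

                if (t == task_struct_type &&
                    off >= offsetof(struct task_struct, scx.slice) &&
                    off + size <= offsetofend(struct task_struct, scx.slice))
                        return SCALAR_VALUE;

                /* was "return 0", which allowed writes to anything */
                return -EACCES;
        }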
  2. Merge pull request #46 from sched-ext/scx-fix-write-all

    scx: bpf_scx_btf_struct_access() should return -EACCES for unknown accesses
    Byte-Lab authored Aug 30, 2023 (2c5e6d3)

Commits on Sep 19, 2023

  1. debug patches and fix

    htejun committed Sep 19, 2023 (d377f5e)

Commits on Sep 20, 2023

  1. scx: Fix p->scx.flags corruption due to unsynchronized writes of SCX_TASK_ON_DSQ_PRIQ
    
    p->scx.flags is protected by the task's rq lock but one of the flags,
    SCX_TASK_ON_DSQ_PRIQ, is protected by p->dsq->lock, not the rq lock. This
    could lead to corruption of p->scx.flags through RMW races triggering the
    watchdog and other sanity checks. Fix it by moving the flag to its own
    field, p->scx.dsq_flags, which is protected by the dsq lock.
    htejun committed Sep 20, 2023 (21f4c19)
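
    Conceptually, flags written under different locks must live in
    different words so that their read-modify-write updates can't race; a
    sketch:

        struct sched_ext_entity {
                u32 flags;      /* protected by the task's rq lock */
                u32 dsq_flags;  /* protected by p->dsq->lock */
                /* ... other fields omitted ... */
        };

        /* under dsq->lock: no longer shares a word with rq-locked flags */
        p->scx.dsq_flags |= SCX_TASK_ON_DSQ_PRIQ;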
  2. Merge pull request #47 from sched-ext/scx-fix-flags-corruption

    scx: Fix p->scx.flags corruption due to unsynchronized writes of SCX_TASK_ON_DSQ_PRIQ
    Byte-Lab authored Sep 20, 2023 (ee9077a)
  3. xxx

    htejun committed Sep 20, 2023 (8424909)

Commits on Sep 21, 2023

  1. Commit be81498
  2. Merge pull request #48 from sched-ext/rusty-keep-bpf-o

    scx_rusty: keep .bpf.o files for debugging
    htejun authored Sep 21, 2023 (664d650)

Commits on Sep 22, 2023

  1. Revert "Merge pull request #48 from sched-ext/rusty-keep-bpf-o"

    This reverts commit 664d650, reversing
    changes made to ee9077a.
    htejun committed Sep 22, 2023 (997c450)
  2. scx_rusty: Keep .bpf.o files for debugging

    (cherry picked from commit be81498)
    htejun committed Sep 22, 2023 (258510e)
  3. Merge pull request #49 from sched-ext/rusty-keep-bpf-o

    Fix incorrect merge of #48
    htejun authored Sep 22, 2023 (2a3532d)

Commits on Sep 26, 2023

  1. rusty: Don't use bpf_cpumask_full() to set task_ctx->all_cpus

    Instead, collect all per-dom cpumasks into all_cpumask and test whether
    that's a subset of a task's cpumask. bpf_cpumask_full() can incorrectly
    indicate that a task's affinity is restricted when it's not, depending on
    the machine configuration.
    htejun committed Sep 26, 2023 (c70e7d3)
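
    A sketch of the subset test using the bpf_cpumask kfuncs, with
    all_cpumask holding the collected union of per-dom cpumasks:

        /* @p's affinity is unrestricted iff every CPU the scheduler
         * manages is also allowed for @p */
        static bool task_affinity_unrestricted(struct task_struct *p)
        {
                return all_cpumask &&
                       bpf_cpumask_subset((const struct cpumask *)all_cpumask,
                                          p->cpus_ptr);
        }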
  2. Commit b448bbd

Commits on Oct 2, 2023

  1. central: Allow specifying the slice length in the central scheduler

    Researchers at Inria-Paris are experimenting with the central
    scheduler, and want to try setting different slice lengths to see how
    they affect performance for VMs running the NAS benchmarks. Let's make
    this convenient by allowing it to be passed as a parameter from user
    space.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Oct 2, 2023 (f4fd473)
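
    The usual pattern for such a knob is a const volatile rodata variable
    that user space sets before load; a sketch (slice_ns and opt_slice_us
    are assumed names):

        /* BPF side */
        const volatile u64 slice_ns = SCX_SLICE_DFL;

        /* ... at dispatch time ... */
        scx_bpf_dispatch(p, FALLBACK_DSQ_ID, slice_ns, enq_flags);

        /* user-space side, before loading the skeleton */
        skel->rodata->slice_ns = opt_slice_us * 1000;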

Commits on Oct 3, 2023

  1. Commit 8ba6ffd
  2. central: Pin timer callbacks to central CPU

    The scx_central scheduler specifies an infinite slice for all cores
    other than a "central" core where scheduling decisions are made. This
    scheduler currently suffers from the fact that the BPF timer may be
    invoked on a different core than the central scheduler, due to BPF
    timers not supporting being pinned to specific CPUs.
    
    That capability was proposed upstream for BPF in [0]. If and when it
    lands, we would need to invoke bpf_timer_start() from the core that we
    want the timer pinned to, because the API does not support specifying a
    core to have the timer invoked from. To accommodate this, we can
    affinitize the loading thread to the central CPU before loading the
    scheduler, and then pin from there.
    
    [0]: https://lore.kernel.org/bpf/[email protected]/T/
    
    Though the BPF timer pinning feature has not yet landed, we can still
    set the stage for leveraging it by adding the logic to affinitize the
    loading thread to the central CPU. While we won't yet have a guarantee
    that the timer will be pinned to the same core throughout the runtime
    of the scheduler, in practice, it seems that affinitizing in this manner
    does make it very likely regardless. In addition, the user space
    component of the central scheduler doesn't benefit from running on a
    tickless core, so keeping it affinitized to the central CPU keeps it
    from preempting a task on a tickless core that would otherwise benefit
    from less preemption.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Oct 3, 2023 (3e74dbd)
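
    Affinitizing the loading thread is plain sched_setaffinity(); a
    self-contained sketch:

        #define _GNU_SOURCE
        #include <sched.h>

        /* pin the calling thread to @cpu; returns 0 on success */
        static int affinitize_to_cpu(int cpu)
        {
                cpu_set_t set;

                CPU_ZERO(&set);
                CPU_SET(cpu, &set);
                return sched_setaffinity(0, sizeof(set), &set);
        }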
  3. scx: Fix typo in tickless comment

    There's a comment that says can_stop_tick_scx(). The function is
    scx_can_stop_tick().
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Oct 3, 2023 (7b1ca19)
  4. Merge pull request #53 from sched-ext/fix_typo

    scx: Fix typo in tickless comment
    htejun authored Oct 3, 2023 (cdbc1a1)

Commits on Oct 4, 2023

  1. Merge pull request #52 from sched-ext/central_cpu_pin

    central: Pin timer callbacks to central CPU
    htejun authored Oct 4, 2023 (b41ddc3)

Commits on Oct 7, 2023

  1. Commit dec4c12
  2. Merge pull request #54 from sched-ext/bpf-next-merge

    Bpf next merge
    htejun authored Oct 7, 2023 (ff56aee)
  3. scx: Add missing piece

    Forgot to git add a small conflict resolution.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Oct 7, 2023 (101c601)

Commits on Oct 9, 2023

  1. Merge branch 'bpf-master' into bpf-next-merge

    - Includes the latest timer pinning feature
    Byte-Lab committed Oct 9, 2023 (4f595b5)
  2. central: Pin timer to the central CPU

    In commit d6247ec ("bpf: Add ability to pin bpf timer to calling
    CPU"), BPF added the ability to be able to pin a BPF timer to the
    calling CPU. Let's use this capability from the central scheduler.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Oct 9, 2023 (88a818f)
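
    With the pinning flag available, the timer setup looks roughly like
    this (the map and callback names are illustrative):

        u32 key = 0;
        struct bpf_timer *timer = bpf_map_lookup_elem(&central_timer, &key);

        if (timer) {
                bpf_timer_set_callback(timer, central_timerfn);
                /* BPF_F_TIMER_CPU_PIN keeps the callback on this CPU */
                bpf_timer_start(timer, TIMER_INTERVAL_NS,
                                BPF_F_TIMER_CPU_PIN);
        }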
  3. Merge pull request #55 from sched-ext/bpf-next-merge

    Bpf next merge
    Byte-Lab authored Oct 9, 2023 (fbac810)

Commits on Oct 11, 2023

  1. scx: Refactor and clean up build system

    The current scx build system is a bit hacky. We put some build artifacts
    in a tools/ directory, and others (skel files and .bpf.o files) we leave
    in the current directory. This isn't conducive to environments that want
    to package sched_ext schedulers. This patch therefore updates the
    Makefile to have the build put all build artifacts (including the
    compiled scheduler binaries) into a build/ directory (previously
    tools/). All artifacts will be deployed as follows:
    
    build/bin: Compiled binaries (e.g. scx_simple, scx_central, etc)
    build/sbin: Compiled binaries that are used as part of the build
                process, e.g. bpftool
    build/include: Headers that are visible from .c files
    build/obj: Contains object files and libraries that are used as part of
               the build process
    build/obj/bpftool: Build artifacts from compiling bpftool from source
    build/obj/libbpf: Build artifacts from compiling libbpf from source
    build/obj/sched_ext: Build artifacts from compiling and linking BPF
                         programs and their user space counterparts.
    build/release: Build output from Cargo for Rust schedulers
    
    This patch also adds the following enhancement:
    
    - Support for changing the build directory output by specifying the O
      environment variable, as in:
    
    $ O=/tmp/sched_ext make CC=clang LLVM=1 -j
    
    to output all artifacts for that build job to /tmp/sched_ext/build
    
    - Removing code duplication by defining a ccsched make function for
      compiling schedulers, and an $(SCX_COMMON_DEPS) variable for common
      dependencies.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Oct 11, 2023 (3b24175)
  2. scx: Add make install target for installing schedulers

    Another requirement of packaging systems is to be able to install
    compiled schedulers in some reachable PATH location so they can be
    accessed easily. This patch adds a new install target in Make for this,
    which installs the schedulers on the system at /usr/bin. The user also
    has the option of specifying DESTDIR to prepend a prefix to /usr/bin.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Oct 11, 2023 (ac229e5)
  3. scx: Add Make help target for explaining build options

    It's mostly self-evident, but now that we support environment variables
    to dictate build behavior, we should document them in a clean and
    easy-to-consume way.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Oct 11, 2023 (7fc3184)

Commits on Oct 12, 2023

  1. rusty: Support downloading rusty deps in separate build step

    Cargo supports the cargo fetch command to fetch dependencies via the
    network before compiling with cargo build. Let's put it into a separate
    Makefile target so that packaging systems can separate steps that
    require network access from just building.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Oct 12, 2023 (8180b1b)
  2. scx: Don't specify nightly rustup as dependency

    We were previously under the impression that the rustup nightly
    toolchain was required to build the schedulers. Daan pointed out in [0]
    that he was able to build with stable, and I similarly was able to build
    with rust stable 1.70.0. Let's update the README accordingly.
    
    [0]: sched-ext/sched_ext#57
    
    We also update the README to not explicitly require compiling the
    schedulers with
    
    $ make LLVM=1 CC=clang
    
    The BPF schedulers are automatically compiled with clang. If you compile
    without those flags, the user space portions will be compiled with gcc,
    which is fine.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Oct 12, 2023 (8ce9d1e)
  3. Merge pull request #60 from sched-ext/rust_nightly

    scx: Don't specify nightly rustup as dependency
    htejun authored Oct 12, 2023 (e23cb83)
  4. Merge pull request #59 from sched-ext/mkosi

    Update and refactor scheduler build system
    htejun authored Oct 12, 2023 (bac7dab)

Commits on Oct 16, 2023

  1. rusty: Further tweak build system

    We previously separated the scx_rusty build into two steps -- a step to
    download dependencies, and another to build. That mostly works, except
    that the download-dependencies step is always run before the build step
    as it's a dependency. Even when there are no new cargo dependencies to
    download, it still accesses the network.
    
    Let's add a way for builders to pass --offline to cargo via a
    CARGO_OFFLINE make variable so that we don't need scx_rusty_deps to be a
    dependency of scx_rusty.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Oct 16, 2023 (38ad0e8)
  2. Merge pull request #62 from sched-ext/rusty_offline

    rusty: Further tweak build system
    DaanDeMeyer authored Oct 16, 2023 (52911e1)

Commits on Oct 30, 2023

  1. scx: Improve example schedulers README file

    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Oct 30, 2023 (bd8d7d2)
  2. scx: Add missing build/ entry to .gitignore

    We're missing an entry in .gitignore for the build-generated files when
    building the example schedulers.
    Byte-Lab committed Oct 30, 2023 (e77257c)
  3. scx: clean sched_ext example schedulers on root mrproper target

    We've gotten some feedback that it's confusing and/or inconvenient to
    know what needs to be clean built in order to be able to correctly
    compile and run the example schedulers. Let's update the build targets
    to make this simpler by:
    
    1. Always cleaning sched_ext schedulers on make mrproper in the tree
       root
    2. Adding a make fullclean target to the sched_ext tools directory which
       also invokes the root make clean target.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Oct 30, 2023 (3f4b885)
  4. Merge pull request #64 from sched-ext/README

    Update README, and improve build usability
    htejun authored Oct 30, 2023 (9b7423e)

Commits on Oct 31, 2023

  1. sched_ext: Add scx_layered

    htejun committed Oct 31, 2023 (2a5eb98)

Commits on Nov 1, 2023

  1. scx_examples: Address the interaction between yield and slice based runtime calculation
    
    Calculating runtime from the amount consumed from the slice punishes
    yield(2)ers. There's nothing fundamentally wrong with it but it doesn't
    align well with how cfs does it and can have unexpected effects on
    applications.
    
    Note the caveat in the example schedulers and switch scx_rusty to use a
    timestamp-based one.
    htejun committed Nov 1, 2023 (c2f53c8)
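
    The timestamp-based variant charges wall-clock time between running()
    and stopping(), so a task that yields early isn't billed for its whole
    slice; a sketch:

        void BPF_STRUCT_OPS(rusty_running, struct task_struct *p)
        {
                struct task_ctx *taskc = lookup_task_ctx(p);

                if (taskc)
                        taskc->running_at = bpf_ktime_get_ns();
        }

        void BPF_STRUCT_OPS(rusty_stopping, struct task_struct *p, bool runnable)
        {
                struct task_ctx *taskc = lookup_task_ctx(p);

                if (taskc)
                        taskc->runtime += bpf_ktime_get_ns() - taskc->running_at;
        }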
  2. scx_rusty: Introduce lookup_task_ctx() and consistently use @TASKC as task_ctx var name
    htejun committed Nov 1, 2023 (e199c47)
  3. Commit 1b268b0

Commits on Nov 2, 2023

  1. Commit 53f76a9
  2. selftests/bpf: Convert CHECK macros to ASSERT_* macros in bpf_iter

    As pointed out by Yonghong Song [1], in the bpf selftests the use
    of the ASSERT_* series of macros is preferred over the CHECK macro.
    This patch replaces all CHECK calls in bpf_iter with the appropriate
    ASSERT_* macros.
    
    [1] https://lore.kernel.org/lkml/[email protected]
    
    Suggested-by: Yonghong Song <[email protected]>
    Signed-off-by: Yuran Pereira <[email protected]>
    Acked-by: Yonghong Song <[email protected]>
    Acked-by: Kui-Feng Lee <[email protected]>
    Link: https://lore.kernel.org/r/DB3PR10MB6835E9C8DFCA226DD6FEF914E8A3A@DB3PR10MB6835.EURPRD10.PROD.OUTLOOK.COM
    Signed-off-by: Alexei Starovoitov <[email protected]>
    yuranpereira authored and Alexei Starovoitov committed Nov 2, 2023 (ed47cb2)
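
    The conversion is mechanical; for example (names illustrative):

        /* before: CHECK() returns true on failure and needs a message */
        if (CHECK(err, "bpf_iter_attach", "attach failed: %d\n", err))
                goto cleanup;

        /* after: ASSERT_OK() returns true on success */
        if (!ASSERT_OK(err, "bpf_iter_attach"))
                goto cleanup;
        ASSERT_EQ(read_cnt, expected_cnt, "read_cnt");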
  3. selftests/bpf: Add malloc failure checks in bpf_iter

    Since some malloc calls in bpf_iter may at times fail,
    this patch adds the appropriate fail checks, and ensures that
    any previously allocated resource is appropriately destroyed
    before returning from the function.
    
    Signed-off-by: Yuran Pereira <[email protected]>
    Acked-by: Yonghong Song <[email protected]>
    Acked-by: Kui-Feng Lee <[email protected]>
    Link: https://lore.kernel.org/r/DB3PR10MB6835F0ECA792265FA41FC39BE8A3A@DB3PR10MB6835.EURPRD10.PROD.OUTLOOK.COM
    Signed-off-by: Alexei Starovoitov <[email protected]>
    yuranpereira authored and Alexei Starovoitov committed Nov 2, 2023 (cb3c6a5)
  4. selftests/bpf: fix RELEASE=1 build for tc_opts

    The compiler complains about malloc(). We also don't need to dynamically
    allocate anything, so make life easier by using a statically sized
    buffer.
    
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (3cda077)
  5. selftests/bpf: satisfy compiler by having explicit return in btf test

    Some compilers complain about get_pprint_mapv_size() not returning a
    value in some code paths. Fix with an explicit return.
    
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (7bcc07d)
  6. bpf: derive smin/smax from umin/umax bounds

    Add smin/smax derivation from appropriate umin/umax values. Previously the
    logic was surprisingly asymmetric, trying to derive umin/umax from smin/smax
    (if possible), but not trying to do the same in the other direction. A simple
    addition to __reg64_deduce_bounds() fixes this.
    
    Also added a generic comment about u64/s64 ranges and their relationship.
    Hopefully that helps readers understand all the bounds deductions
    a bit better.
    
    Acked-by: Eduard Zingerman <[email protected]>
    Acked-by: Shung-Hsi Yu <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (2e74aef)
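
    The core of the deduction: if [umin; umax] doesn't cross the s64 sign
    boundary, the same interval is also a valid signed range. A sketch of
    the kind of addition made to __reg64_deduce_bounds():

        /* when both u64 bounds reinterpret to s64 values in the same
         * order, the unsigned range is a valid signed range too and can
         * tighten smin/smax */
        if ((s64)reg->umin_value <= (s64)reg->umax_value) {
                reg->smin_value = max_t(s64, reg->smin_value, reg->umin_value);
                reg->smax_value = min_t(s64, reg->smax_value, reg->umax_value);
        }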
  7. bpf: derive smin32/smax32 from umin32/umax32 bounds

    All the logic that applies to u64 vs s64, equally applies for u32 vs s32
    relationships (just taken in a smaller 32-bit numeric space). So do the
    same deduction of smin32/smax32 from umin32/umax32, if we can.
    
    Acked-by: Eduard Zingerman <[email protected]>
    Acked-by: Shung-Hsi Yu <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (f188765)
  8. bpf: derive subreg bounds from full bounds when upper 32 bits are constant
    
    Comments in code try to explain the idea behind why this is correct.
    Please check the code and comments.
    
    Acked-by: Eduard Zingerman <[email protected]>
    Acked-by: Shung-Hsi Yu <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (f404ef3)
  9. bpf: add special smin32/smax32 derivation from 64-bit bounds

    Add a special case where we can derive valid s32 bounds from umin/umax
    or smin/smax by stitching together negative s32 subrange and
    non-negative s32 subrange. That requires upper 32 bits to form a [N, N+1]
    range in u32 domain (taking into account wrap around, so 0xffffffff
    to 0x00000000 is a valid [N, N+1] range in this sense). See code comment
    for concrete examples.
    
    Eduard Zingerman also provided an alternative explanation ([0]) for more
    mathematically inclined readers:
    
    Suppose:
    . there are numbers a, b, c
    . 2**31 <= b < 2**32
    . 0 <= c < 2**31
    . umin = 2**32 * a + b
    . umax = 2**32 * (a + 1) + c
    
    The number of values in the range represented by [umin; umax] is:
    . N = umax - umin + 1 = 2**32 + c - b + 1
    . min(N) = 2**32 + 0 - (2**32-1) + 1 = 2, with b = 2**32-1, c = 0
    . max(N) = 2**32 + (2**31 - 1) - 2**31 + 1 = 2**32, with b = 2**31, c = 2**31-1
    
    Hence [(s32)b; (s32)c] forms a valid range.
    
      [0] https://lore.kernel.org/bpf/[email protected]/
    
    Acked-by: Eduard Zingerman <[email protected]>
    Acked-by: Shung-Hsi Yu <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (6533e0a)
  10. bpf: improve deduction of 64-bit bounds from 32-bit bounds

    Add a few interesting cases in which we can tighten 64-bit bounds based
    on newly learnt information about 32-bit bounds. E.g., when full u64/s64
    registers are used in a BPF program and then eventually compared as
    u32/s32. The latter comparison doesn't change the value of the full
    register, but it does impose new restrictions on the possible lower 32 bits
    of such full registers. And we can use that to derive additional full
    register bounds information.
    
    Acked-by: Eduard Zingerman <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Acked-by: Shung-Hsi Yu <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (3d6940d)
  11. bpf: try harder to deduce register bounds from different numeric domains

    There are cases (caught by subsequent reg_bounds tests in selftests/bpf)
    where performing one round of __reg_deduce_bounds() doesn't propagate
    all the information from, say, s32 to u32 bounds and then from newly
    learned u32 bounds back to u64 and s64. So perform __reg_deduce_bounds()
    twice to make sure such derivations are propagated fully after
    reg_bounds_sync().
    
    One such example is test `(s64)[0xffffffff00000001; 0] (u64)<
    0xffffffff00000000` from the selftest patch in this patch set. It demonstrates an
    intricate dance of u64 -> s64 -> u64 -> u32 bounds adjustments, which requires
    two rounds of __reg_deduce_bounds(). Here are corresponding refinement log from
    selftest, showing evolution of knowledge.
    
    REFINING (FALSE R1) (u64)SRC=[0xffffffff00000000; U64_MAX] (u64)DST_OLD=[0; U64_MAX] (u64)DST_NEW=[0xffffffff00000000; U64_MAX]
    REFINING (FALSE R1) (u64)SRC=[0xffffffff00000000; U64_MAX] (s64)DST_OLD=[0xffffffff00000001; 0] (s64)DST_NEW=[0xffffffff00000001; -1]
    REFINING (FALSE R1) (s64)SRC=[0xffffffff00000001; -1] (u64)DST_OLD=[0xffffffff00000000; U64_MAX] (u64)DST_NEW=[0xffffffff00000001; U64_MAX]
    REFINING (FALSE R1) (u64)SRC=[0xffffffff00000001; U64_MAX] (u32)DST_OLD=[0; U32_MAX] (u32)DST_NEW=[1; U32_MAX]
    
    R1 initially has smin/smax set to [0xffffffff00000001; -1], while umin/umax is
    unknown. After (u64)< comparison, in FALSE branch we gain knowledge that
    umin/umax is [0xffffffff00000000; U64_MAX]. That causes smin/smax to learn that
    zero can't happen and upper bound is -1. Then smin/smax is adjusted from
    umin/umax improving lower bound from 0xffffffff00000000 to 0xffffffff00000001.
    And then eventually umin32/umax32 bounds are derived from umin/umax and become
    [1; U32_MAX].
    
    The selftest in the last patch actually implements multi-round fixed-point
    convergence logic, but so far all the tests are handled by two rounds of
    reg_bounds_sync() on the verifier state, so we keep it simple for now.
    
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (558c06e)
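
    The fix amounts to a second deduction pass in reg_bounds_sync();
    roughly:

        static void reg_bounds_sync(struct bpf_reg_state *reg)
        {
                __update_reg_bounds(reg);
                /* two rounds so chains like u64 -> s64 -> u64 -> u32
                 * settle into a fixed point */
                __reg_deduce_bounds(reg);
                __reg_deduce_bounds(reg);
                __reg_bound_offset(reg);
                __update_reg_bounds(reg);
        }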
  12. bpf: drop knowledge-losing __reg_combine_{32,64}_into_{64,32} logic

    When a 32-bit conditional operation operates on the lower 32 bits of a
    full 64-bit register, the register's full value isn't changed. We just
    potentially gain new knowledge about that register's lower 32 bits.
    
    Unfortunately, __reg_combine_{32,64}_into_{64,32} logic that
    reg_set_min_max() performs as a last step can lose information in some
    cases due to __mark_reg64_unbounded() and __reg_assign_32_into_64().
    That's bad and completely unnecessary. Especially __reg_assign_32_into_64()
    looks completely out of place here, because we are not performing
    zero-extending subregister assignment during conditional jump.
    
    So this patch replaces __reg_combine_* with just a normal
    reg_bounds_sync() which will do a proper job of deriving u64/s64 bounds
    from u32/s32, and vice versa (among all other combinations).
    
    __reg_combine_64_into_32() is also used in one more place,
    coerce_reg_to_size(), while handling 1- and 2-byte register loads.
    Looking into this, it seems like besides marking subregister as
    unbounded before performing reg_bounds_sync(), we were also performing
    deduction of smin32/smax32 and umin32/umax32 bounds from respective
    smin/smax and umin/umax bounds. It's now redundant as reg_bounds_sync()
    performs all the same logic more generically (e.g., without unnecessary
    assumption that upper 32 bits of full register should be zero).
    
    Long story short, we remove __reg_combine_64_into_32() completely, and
    coerce_reg_to_size() now only does resetting subreg to unbounded and then
    performing reg_bounds_sync() to recover as much information as possible
    from 64-bit umin/umax and smin/smax bounds, set explicitly in
    coerce_reg_to_size() earlier.
    
    Acked-by: Eduard Zingerman <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Acked-by: Shung-Hsi Yu <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (b929d49)
  13. bpf: rename is_branch_taken reg arguments to prepare for the second one

    Just taking mundane refactoring bits out into a separate patch. No
    functional changes.
    
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Acked-by: Shung-Hsi Yu <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (cdeb5da)
  14. bpf: generalize is_branch_taken() to work with two registers

    While still assuming that the second register is a constant, generalize
    the is_branch_taken-related code to accept two registers instead of a
    register plus an explicit constant value. As a side effect, this also
    allows us to simplify check_cond_jmp_op() by unifying the BPF_K case
    with the BPF_X case, for which we use a fake register to represent
    BPF_K's imm constant as a register.
    
    Acked-by: Eduard Zingerman <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Acked-by: Shung-Hsi Yu <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (fc3615d)
  15. bpf: move is_branch_taken() down

    Move is_branch_taken() slightly down. In subsequent patches we'll need
    both flip_opcode() and is_pkt_ptr_branch_taken() for is_branch_taken(),
    but instead of sprinkling forward declarations around, it makes more
    sense to move is_branch_taken() lower below is_pkt_ptr_branch_taken(),
    and also keep it closer to very tightly related reg_set_min_max(), as
    they are two critical parts of the same SCALAR range tracking logic.
    
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (dd2a2cc)
  16. bpf: generalize is_branch_taken to handle all conditional jumps in one place
    
    Make is_branch_taken() a single entry point for branch pruning decision
    making, handling both pointer vs pointer, pointer vs scalar, and scalar
    vs scalar cases in one place. This also nicely cleans up check_cond_jmp_op().
    
    Acked-by: Eduard Zingerman <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (171de12)
  17. bpf: unify 32-bit and 64-bit is_branch_taken logic

    Combine 32-bit and 64-bit is_branch_taken logic for SCALAR_VALUE
    registers. It makes it easier to see parallels between two domains
    (32-bit and 64-bit), and makes subsequent refactoring more
    straightforward.
    
    No functional changes.
    
    Acked-by: Eduard Zingerman <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (761a9e5)
  18. bpf: prepare reg_set_min_max for second set of registers

    Similarly to is_branch_taken()-related refactorings, start preparing
    reg_set_min_max() to handle the more generic case of two non-const
    registers. Start with renaming arguments to accommodate later addition
    of second register as an input argument.
    
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (4c61728)
  19. bpf: generalize reg_set_min_max() to handle two sets of two registers

    Change reg_set_min_max() to take FALSE/TRUE sets of two registers each,
    instead of assuming that we are always comparing to a constant. For now
    we still assume that right-hand side registers are constants (and make
    sure that's the case by swapping src/dst regs, if necessary), but
    subsequent patches will remove this limitation.
    
    reg_set_min_max() is now called unconditionally for any register
    comparison, so that might include pointer vs pointer. This makes it
    consistent with is_branch_taken() generality. But we currently only
    support adjustments based on SCALAR vs SCALAR comparisons, so
    reg_set_min_max() has to guard itself against pointers.
    
    Taking registers two by two allows us to further unify and simplify
    check_cond_jmp_op() logic. We utilize a fake register for the BPF_K
    conditional jump case, just like in the is_branch_taken() part.
    
    Acked-by: Eduard Zingerman <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    anakryiko authored and Alexei Starovoitov committed Nov 2, 2023 (9a14d62)
  20. Merge branch 'bpf-register-bounds-logic-and-testing-improvements'

    Andrii Nakryiko says:
    
    ====================
    BPF register bounds logic and testing improvements
    
    This patch set adds a big set of manual and auto-generated test cases
    validating BPF verifier's register bounds tracking and deduction logic. See
    details in the last patch.
    
    We start with building a tester that validates existing <range> vs <scalar>
    verifier logic for range bounds. To make all this work, BPF verifier's logic
    needed a bunch of improvements to handle some cases that previously were not
    covered. This had no implications as to correctness of verifier logic, but it
    was incomplete enough to cause significant disagreements with the
    alternative implementation of register bounds logic that the tests in this
    patch set
    implement. So we need BPF verifier logic improvements to make all the tests
    pass. This is what we do in patches #3 through #9.
    
    The end goal of this work, though, is to extend BPF verifier range state
    tracking so as to allow deriving new range bounds when comparing non-const
    registers. Some more investigation is required to find and fix existing
    potential issues with range tracking as part of ALU/ALU64 operations, so
    the <range> x <range> part of the v5 patch set ([0]) is dropped until
    these issues are sorted out.
    
    For now, we include preparatory refactorings and clean-ups that set up the
    BPF verifier code base for extending the logic to <range> vs <range> in a
    subsequent patch set. Patches #10-#16 perform preliminary refactorings without
    functionally changing anything. But they do clean up check_cond_jmp_op() logic
    and generalize a bunch of other pieces in is_branch_taken() logic.
    
      [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=797178&state=*
    
    v5->v6:
      - dropped <range> vs <range> patches (original patches #18 through #23) to
        add more register range sanity checks and fix preexisting issues;
      - comments improvements, addressing other feedback on first 17 patches
        (Eduard, Alexei);
    v4->v5:
      - added entirety of verifier reg bounds tracking changes, now handling
        <range> vs <range> cases (Alexei);
      - added way more comments trying to explain why deductions added are
        correct, hopefully they are useful and clarify things a bit (Daniel,
        Shung-Hsi);
      - added two preliminary selftests fixes necessary for RELEASE=1 build to
        work again, it keeps breaking.
    v3->v4:
      - improvements to reg_bounds tester (progress report, split 32-bit and
        64-bit ranges, fix various verbosity output issues, etc);
    v2->v3:
      - fix a subtle little-endianness assumption inside parge_reg_state() (CI);
    v1->v2:
      - fix compilation when building selftests with llvm-16 toolchain (CI).
    ====================
    
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Alexei Starovoitov <[email protected]>
    Alexei Starovoitov committed Nov 2, 2023 (e68ed64)
  21. selftests/bpf: Use value with enough-size when updating per-cpu map

    When updating a per-cpu map in the map_percpu_stats test,
    patch_map_thread() only passes a 4-byte value to bpf_map_update_elem().
    The expected size of the value is 8 * num_possible_cpus(), so fix it by
    passing a value that is large enough for the per-cpu map update.
    
    Signed-off-by: Hou Tao <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/bpf/[email protected]
    Hou Tao authored and anakryiko committed Nov 2, 2023 (3f1f234)
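
    The fix amounts to sizing the buffer by the possible-CPU count; a
    sketch using libbpf's helper (map_fd is assumed to be set up elsewhere):

        __u32 key = 0;
        int nr_cpus = libbpf_num_possible_cpus();
        /* one 8-byte value per possible CPU, not a single 4-byte value */
        __u64 *vals = calloc(nr_cpus, sizeof(*vals));
        int err = -1;

        if (vals) {
                err = bpf_map_update_elem(map_fd, &key, vals, 0);
                free(vals);
        }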
  22. selftests/bpf: Export map_update_retriable()

    Export map_update_retriable() to make it usable for other map_test
    cases. These cases may only need to retry for a specific errno, so add
    a new callback parameter to let map_update_retriable() decide whether or
    not the errno is retriable.
    
    Signed-off-by: Hou Tao <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/bpf/[email protected]
    Hou Tao authored and anakryiko committed Nov 2, 2023 (ff38534)
  3. selftests/bpf: Retry map update for non-preallocated per-cpu map

    BPF CI failed due to map_percpu_stats_percpu_hash from time to time [1].
    It seems that the failure is because the per-cpu bpf memory allocator may
    fail to allocate a per-cpu pointer and cannot refill the free llist in
    time, so bpf_map_update_elem() will return -ENOMEM.
    
    So mitigate the problem by retrying the update operation for
    non-preallocated per-cpu map.
    
    [1]: https://github.com/kernel-patches/bpf/actions/runs/6713177520/job/18244865326?pr=5909
    
    Signed-off-by: Hou Tao <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/bpf/[email protected]
    Hou Tao authored and anakryiko committed Nov 2, 2023 (57688b2)
  24. Merge branch 'selftests/bpf: Fixes for map_percpu_stats test'

    Hou Tao says:
    
    ====================
    
    From: Hou Tao <[email protected]>
    
    Hi,
    
    BPF CI failed due to map_percpu_stats_percpu_hash from time to time [1].
    It seems that the failure is because the per-cpu bpf memory allocator may
    fail to allocate a per-cpu pointer and cannot refill the free llist in
    time, so bpf_map_update_elem() will return -ENOMEM.
    
    Patch #1 fixes the size of value passed to per-cpu map update API. The
    problem was found when fixing the ENOMEM problem, so also post it in
    this patchset. Patch #2 & #3 mitigates the ENOMEM problem by retrying
    the update operation for non-preallocated per-cpu map.
    
    Please see individual patches for more details. And comments are always
    welcome.
    
    Regards,
    Tao
    
    [1]: https://github.com/kernel-patches/bpf/actions/runs/6713177520/job/18244865326?pr=5909
    ====================
    
    Signed-off-by: Andrii Nakryiko <[email protected]>
    anakryiko committed Nov 2, 2023 (e869ffc)
  25. selftests/bpf: Consolidate VIRTIO/9P configs in config.vm file

    Those configs are needed to be able to run VM somewhat consistently.
    For instance, ATM, s390x is missing the `CONFIG_VIRTIO_CONSOLE` which
    prevents s390x kernels built in CI from leveraging qemu-guest-agent.
    
    By moving them to `config.vm`, we should have selftest kernels which are
    equal in terms of VM functionality when they include this file.
    
    The set of configs enabled was picked using
    
        grep -h -E '(_9P|_VIRTIO)' config.x86_64 config | sort | uniq
    
    added to `config.vm` and then
        grep -vE '(_9P|_VIRTIO)' config.{x86_64,aarch64,s390x}
    
    As a side effect, some configs may have disappeared from the aarch64 and
    s390x kernels, but they should not be needed. CI will tell.
    
    Signed-off-by: Manu Bretelle <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/bpf/[email protected]
    chantra authored and anakryiko committed Nov 2, 2023 (1a119e2)
  26. Commit cf37d0a
  27. Commit 7dc6a8b

Commits on Nov 3, 2023

  1. scx: Fix skel and .bpf.o Make deps

    With the recent Makefile refactor that puts all build artifacts into a
    build/ directory output, there was a regression in that Make would now
    always rebuild schedulers even if they were unchanged. This is happening
    because when Make looks at a target, it looks to see if that file
    exists. If it doesn't, it executes the target. There are a few targets
    that are improperly tracked:
    
    1. We were taking a dependency on the sched.skel.h target (e.g.
       scx_simple.skel.h). In the old build system this was an actual file,
       but now it's just a target as the target name was never updated to
       point to the full path to the include file output.
    
    2. The same goes for sched.bpf.o, which is a dependency of the skel
       file.
    
    3. The scheduler itself, which now resides in build/bin.
    
    The first two we can fix by updating the targets to include the build
    directories. The latter we'll have to fix with some more complex Make
    magic, which we'll do in the subsequent commit.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Nov 3, 2023 (298bec1)
  2. scx: Don't rebuild schedulers unnecessarily

    Now that the scheduler binaries are written to the build/bin/ directory,
    Make gets confused because it doesn't see the binary file in the same
    directory anymore and tries to rebuild it. This makes things kind of
    tricky, because make will always execute the recipe for the target,
    which is to compile it.
    
    We could add a layer of indirection by instead having the base scheduler
    target be empty, and just take a dependency on the actual binary that's
    created by the compiler. This patch does that, and also cleans up the build
    to avoid copy-pasted scheduler recipes.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Nov 3, 2023 (62e2315)
  3. scx: Aggregate build logic for rust schedulers

    scx_rusty currently defines several build targets and recipes that would
    have to be duplicated by any other rust scheduler we may add. Let's add
    some build scaffolding to avoid people having to copy paste.
    
    Note that we can't fully avoid running any make logic if we take the
    same approach as with the C schedulers. The C schedulers add a layer of
    indirection where the "base" target (e.g. scx_simple) does nothing but
    take a dependency on the binary output file. This doesn't work with rust
    schedulers, because we're relying on Cargo to tell us when it needs to
    be rebuilt.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Nov 3, 2023 (2c76843)
  4. bpftool: Fix prog object type in manpage

    bpftool's man page lists "program" as one of possible values for OBJECT,
    while in fact bpftool accepts "prog" instead.
    
    Reported-by: Jerry Snitselaar <[email protected]>
    Signed-off-by: Artem Savkov <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Acked-by: Yonghong Song <[email protected]>
    Acked-by: Quentin Monnet <[email protected]>
    Link: https://lore.kernel.org/bpf/[email protected]
    sm00th authored and anakryiko committed Nov 3, 2023 (b94df28)
  5. Merge pull request #67 from sched-ext/make_deps

    Fix Makefile dependency tracking
    htejun authored Nov 3, 2023 (41728bb)
  6. Commit 58e2a66
  7. scx_rusty: ravg WIP

    htejun committed Nov 3, 2023
    commit a4fbd6f
  8. commit b24bc9b
  9. commit d401cf1
  10. commit fbf0ccf

Commits on Nov 4, 2023

  1. rusty: Fully switch to ravg

    htejun committed Nov 4, 2023
    commit 8895ddd
  2. ravg: Fix ravg_transfer()

    htejun committed Nov 4, 2023
    commit ca211c6
  3. commit 8111b6e
  4. scx_rusty: Minor cleanup

    htejun committed Nov 4, 2023
    commit 46f07fa
  5. commit f244d5e
  6. Merge pull request #68 from sched-ext/scx-cleanups

    Misc example scheduler cleanups
    Byte-Lab authored Nov 4, 2023
    commit d6a788a

Commits on Nov 5, 2023

  1. sched_ext: Test sched_class directly in scx_task_iter_next_filtered()

    scx_task_iter_next_filtered() is used to iterate all non-idle tasks in the
    init and exit paths. Idle tasks are determined using is_idle_task().
    Unfortunately, cff9b23 ("kernel/sched: Modify initial boot task idle
    setup") changed idle task initialization so that %PF_IDLE is set during CPU
    startup. So, CPUs that are not brought up during boot (such as CPUs which
    can never be online in some AMD processors) don't have the flag set and
    thus fail the is_idle_task() test.
    
    This makes sched_ext incorrectly try to operate on idle tasks in init/exit
    paths leading to oopses. Fix it by directly testing p->sched_class against
    idle_sched_class.
    htejun committed Nov 5, 2023
    commit a60668f

Commits on Nov 6, 2023

  1. Merge pull request #73 from sched-ext/scx-fix-crash

    Fix sched_ext crashes on v6.6
    Byte-Lab authored Nov 6, 2023
    commit 21777bc
  2. selftests/bpf: Disable CONFIG_DEBUG_INFO_REDUCED in config.aarch64

    When building an arm64 kernel and selftests/bpf with defconfig +
    selftests/bpf/config and selftests/bpf/config.aarch64, the fragment
    CONFIG_DEBUG_INFO_REDUCED is enabled by arm64's defconfig. It should be
    disabled in selftests/bpf/config.aarch64, since if it's not disabled,
    CONFIG_DEBUG_INFO_BTF won't be enabled.
    
    Signed-off-by: Anders Roxell <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/bpf/[email protected]
    roxell authored and anakryiko committed Nov 6, 2023
    commit dfee93e
  3. bpf, lpm: Fix check prefixlen before walking trie

    When looking up an element in LPM trie, the condition 'matchlen ==
    trie->max_prefixlen' will never return true, if key->prefixlen is larger
    than trie->max_prefixlen. Consequently all elements in the LPM trie will
    be visited and no element is returned in the end.
    
    To resolve this, check key->prefixlen first before walking the LPM trie.
    
    Fixes: b95a5c4 ("bpf: add a longest prefix match trie map implementation")
    Signed-off-by: Florian Lehner <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Link: https://lore.kernel.org/bpf/[email protected]
    florianl authored and anakryiko committed Nov 6, 2023
    commit 856624f
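    A minimal sketch of the added guard (key->prefixlen and
    trie->max_prefixlen are named in the commit message; the surrounding
    lookup function and types follow kernel/bpf/lpm_trie.c but are
    heavily abbreviated here):

    /* Reject over-long prefixes up front: no stored node can match them. */
    static void *trie_lookup_elem(struct bpf_map *map, void *_key)
    {
            struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
            struct bpf_lpm_trie_key *key = _key;

            if (key->prefixlen > trie->max_prefixlen)
                    return NULL;

            /* ... walk the trie as before ... */
            return NULL;
    }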
  4. commit 53cb301
  5. Merge pull request #70 from sched-ext/scx_rusty-ravg

    scx_rusty: Usage ravg for dom and task loads
    htejun authored Nov 6, 2023
    commit 90e1ad1
  6. commit f599483
  7. Merge pull request #75 from sched-ext/scx-fix-rust_sched_deps

    tools/sched_ext/Makefile: Don't hard code scx_rusty in rust-sched _deps target
    Byte-Lab authored Nov 6, 2023
    commit eff9487
  8. scx_common: Improve MEMBER_VPTR()

    So that it can be used on deref'd pointers to structs.
    htejun committed Nov 6, 2023
    commit 1dff6ea
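    For illustration, a hedged usage sketch (struct task_ctx and its field
    are hypothetical; the deref'd-base form is the point here):

    struct task_ctx {
            u64 runtime;
    };

    void example(struct task_ctx *tctx)
    {
            /* With the improvement, @base can be a deref'd struct pointer. */
            u64 *rt = MEMBER_VPTR(*tctx, .runtime);

            if (rt)
                    *rt += 1;
    }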
  9. Merge pull request #76 from sched-ext/scx-update-MEMBER_VPTR

    scx_common: Improve MEMBER_VPTR()
    Byte-Lab authored Nov 6, 2023
    commit a281119

Commits on Nov 7, 2023

  1. scx: Fix !CONFIG_SCHED_CLASS_EXT builds

    cpu_local_stat_show() expects CONFIG_SCHED_CLASS_EXT or
    CONFIG_RT_GROUP_SCHED.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Nov 7, 2023
    commit 1d773bd
  2. Merge pull request #77 from sched-ext/fix_notext_build

    scx: Fix !CONFIG_SCHED_CLASS_EXT builds
    htejun authored Nov 7, 2023
    commit fdee025
  3. scx: Print scx info when dumping stack

    It would be useful to see what the sched_ext scheduler state is, and
    what scheduler is running, when we're dumping a task's stack. This patch
    therefore adds a new print_scx_info() function that's called in the same
    context as print_worker_info() and print_stop_info().
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Nov 7, 2023
    commit 54d303d
  4. Merge pull request #66 from sched-ext/panic_msg

    scx: Print scheduler state in panic message
    htejun authored Nov 7, 2023
    commit ee4efa7
  5. scx_common: Add message to _Static_assert in MEMBER_VPTR

    _Static_assert() without a message is a later extension and can fail
    compilation depending on compile flags.
    htejun committed Nov 7, 2023
    commit 5c2f39c
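    The portable two-argument form, as a tiny example (the assertion and
    message here are illustrative):

    /* C11 requires the message argument; the message-less form is a later
     * extension and may not compile under stricter flags. */
    _Static_assert(sizeof(char) == 1, "char must be exactly one byte");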
  6. tools/sched_ext/ravg: Separate out ravg_read.rs.h and update build deps

    We want to use the rust ravg_read() in other implementations too. Separate
    it out into a .h file and include it. Note that it also needs to take the
    inputs in scalar types, as the ravg_data types aren't considered the same
    across different skels. This can also be a module, but for now let's keep it an
    include file so that it can be copied elsewhere together with the BPF header
    files.
    
    While at it, make BPF builds depend on ravg[_impl].bpf.h. cargo does the
    right thing without further instructions.
    htejun committed Nov 7, 2023
    commit a32fa87
  7. scx_rusty: Misc update

    htejun committed Nov 7, 2023
    commit e322e56
  8. commit 8619d7f
  9. commit e00a136
  10. commit d30e64d
  11. commit 687fe29
  12. commit ecbff41
  13. commit 1ad52c7
  14. commit 42a1f1f
  15. commit d70e209
  16. scx_layered: Cleanups

    htejun committed Nov 7, 2023
    commit 9695b05
  17. Merge pull request #78 from sched-ext/scx-misc-updates

    Misc updates to let scx_layered share ravg with rusty
    htejun authored Nov 7, 2023
    commit 5a685d4
  18. Merge pull request #65 from sched-ext/add-scx_layered

    sched_ext: Add scx_layered
    htejun authored Nov 7, 2023
    commit 924c005
  19. commit 665658c
  20. commit d0be8b2
  21. Merge pull request #79 from sched-ext/scx-pull-bpf

    Pull bpf/for-next.
    Byte-Lab authored Nov 7, 2023
    commit 0bd6f76

Commits on Nov 8, 2023

  1. scx: CGROUP_WEIGHT_* should be outside CONFIG_CGROUPS

    sched_ext needs these consts even when !CGROUPS. They got accidentally moved
    back inside CONFIG_CGROUPS through merge resolution.
    htejun committed Nov 8, 2023
    commit 9dae233
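    For reference, the constants in question, with their values as defined
    in include/linux/cgroup.h:

    /* must stay visible to sched_ext even when !CONFIG_CGROUPS */
    #define CGROUP_WEIGHT_MIN       1
    #define CGROUP_WEIGHT_DFL       100
    #define CGROUP_WEIGHT_MAX       10000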
  2. scx: cpu_local_stat_show() doesn't have dependency on RT_GROUP_SCHED or EXT_GROUP_SCHED
    
    This was incorrectly fixed after an errant merge resolution. Fix it back.
    htejun committed Nov 8, 2023
    commit 7410ecc
  3. scx: Kill stray check_preempt_cur() prototype

    Merge artifact.
    htejun committed Nov 8, 2023
    commit 08e09f3
  4. scx: s/scx_exit_type/scx_exit_kind/ s/scx_exit_info\.type/scx_exit_info\.kind/
    
    These are accessed from userspace and "type" is a reserved token in many
    modern languages. Let's use "kind" instead.
    htejun committed Nov 8, 2023
    commit 7a001f5
  5. scx: tools/sched_ext/Makefile updates

    * Remove duplicate target lists. c-sched-targets and rust-sched-targets are
      the source of truth now.
    
    * Drop fullclean target. It's unexpected and unnecessary to have a target
      which steps up and cleans.
    
    * Minor formatting updates.
    htejun committed Nov 8, 2023
    commit 2c21348
  6. scx: Reorder tools/sched_ext/README.md

    To match patch / Makefile order.
    htejun committed Nov 8, 2023
    commit dde311c
  7. commit 2e58977
  8. commit 0b2403f
  9. scx: whitespace update

    htejun committed Nov 8, 2023
    commit 39b906e
  10. Merge pull request #80 from sched-ext/scx-cleanups-from-split

    Scx cleanups from split
    htejun authored Nov 8, 2023
    commit 607afb6
  11. scx_rusty: doc comment update

    htejun committed Nov 8, 2023
    commit 725cfa3
  12. Merge pull request #81 from sched-ext/scx-cleanups-from-split

    scx_rusty: doc comment update
    htejun authored Nov 8, 2023
    commit c818dc5
  13. commit ea98edf
  14. Merge pull request #82 from sched-ext/scx-cleanups-from-split

    scx: Update print_scx_info() comment
    htejun authored Nov 8, 2023
    commit 9a64d87

Commits on Nov 10, 2023

  1. scx: Update print_scx_info()

    - p->scx.runnable_at is in jiffies and rq->clock is in ktime ns. Subtracting
      the two doesn't yield anything useful. Also, it's more intuitive for
      negative delta to represent past. Fix delta calculation.
    
    - ops_state is always 0 for running tasks. Let's skip it for now.
    
    - Use return value from copy_from_kernel_nofault() to determine whether the
      read was successful and clearly report read failures.
    
    - scx_enabled() is always nested inside scx_ops_enable_state() != DISABLED.
      Let's just test the latter.
    htejun committed Nov 10, 2023
    commit f23fbab
  2. Merge pull request #83 from sched-ext/scx_print_info-updates

    scx: Update print_scx_info()
    htejun authored Nov 10, 2023
    commit b0d2ae0

Commits on Nov 14, 2023

  1. commit b7e1419
  2. Merge pull request #84 from sched-ext/rusty-doc-update

    rusty: Improve overview documentation as suggested by Josh Don
    htejun authored Nov 14, 2023
    commit 1d88c4a
  3. scx: Move scx_ops_enable_state_str[] outside CONFIG_SCHED_DEBUG

    The new print_scx_info() uses scx_ops_enable_state_str[] outside
    CONFIG_SCHED_DEBUG. Let's relocate it outside of CONFIG_SCHED_DEBUG and to
    the top.
    
    Reported-by: Changwoo Min <[email protected]>
    Reported-by: Andrea Righi <[email protected]>
    Signed-off-by: Tejun Heo <[email protected]>
    htejun committed Nov 14, 2023
    commit ca712f8
  4. Merge pull request #85 from sched-ext/misc-fixes

    scx: Move scx_ops_enable_state_str[] outside CONFIG_SCHED_DEBUG
    htejun authored Nov 14, 2023
    commit e69323c

Commits on Nov 25, 2023

  1. commit 6b245e8
  2. Merge pull request #87 from sched-ext/atomic_long-fix

    scx: Fix a straggling atomic64_set
    htejun authored Nov 25, 2023
    commit df9ef4e

Commits on Nov 28, 2023

  1. scx: Use .bpf.[sub]skel.h suffix instead of .[sub]skel.h when building schedulers
    
    This is to make life easier for the user sched/tools repo which uses meson
    to build.
    htejun committed Nov 28, 2023
    commit 70331a6
  2. scx: Add s/uSIZE typedefs in scx_common.h

    The availability of s/uSIZE types is hit and miss. Let's always define them
    in terms of stdint types. This makes life easier for the scx user repo.
    htejun committed Nov 28, 2023
    commit 7a1c90f
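    A sketch of what such typedefs look like (the exact set in scx_common.h
    may differ):

    #include <stdint.h>

    typedef uint8_t  u8;
    typedef uint16_t u16;
    typedef uint32_t u32;
    typedef uint64_t u64;
    typedef int8_t   s8;
    typedef int16_t  s16;
    typedef int32_t  s32;
    typedef int64_t  s64;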
  3. Merge pull request #88 from sched-ext/misc-updates

    Misc updates for example schedulers to make life easier for user sched repo
    Byte-Lab authored Nov 28, 2023
    commit 48b4554
  4. scx_{rusty|layered}: Generate skel file in $OUT_DIR

    Currently, skel files are put in src/bpf/.output. Place them inside $OUT_DIR
    where build artifacts belong.
    htejun committed Nov 28, 2023
    commit bc7c2af
  5. commit 1d9acf6

Commits on Nov 29, 2023

  1. scx_{rusty|layered}: Make naming and build consistent between the two rust userland schedulers
    
    - NAME_sys and NAME were used to refer to the rust wrapper of the
      bindgen-generated header file and the bpf skeleton, respectively. The NAME
      part is self-referential and thus doesn't really signify anything, and the
      _sys suffix is arbitrary too. Let's use bpf_intf and bpf_skel instead.
    
    - The env vars that are used during build are a bit unusual and the
      SCX_RUST_CLANG name is a bit confusing as it doesn't indicate it's for
      compiling BPF. Let's use the names BPF_CLANG and BPF_CFLAGS instead.
    
    - build.rs is now identical between the two schedulers.
    htejun committed Nov 29, 2023
    commit 2e2daa7
  2. scx_{rusty|layered}: Run bindgen's clang with CLANG_CFLAGS and remove explicit paths from includes
    
    So that build env can decide where to put these headers.
    htejun committed Nov 29, 2023
    commit 2d46bf9
  3. scx_{rusty|layered}: Factor out build.rs's into scx_utils::build_helpers

    This greatly simplifies build.rs and allows building more common logic into
    build_helpers such as discovering BPF_CFLAGS on its own without depending on
    upper level Makefile. Some caveats:
    
    - Dropped static libbpf-sys dep. scx_utils is out of kernel tree and pulls
      in libbpf-sys through libbpf-cargo which conflicts with the explicit
      libbpf-sys dependency. This means that we use packaged version of
      libbpf-cargo for skel generation. Should be fine.
    
    - Path dependency for scx_utils is temporary during development. Should be
      dropped later.
    htejun committed Nov 29, 2023
    commit 65d1b96

Commits on Nov 30, 2023

  1. commit df7ea88

Commits on Dec 3, 2023

  1. commit 5f200bb
  2. commit 47c9356

Commits on Dec 4, 2023

  1. commit d6bd20a
  2. Merge pull request #89 from sched-ext/misc-updates

    scx: Common include files relocated and more build updates
    Byte-Lab authored Dec 4, 2023
    commit f0566ba
  3. commit 234eb2c
  4. Merge pull request #91 from sched-ext/scx-sync

    scx_sync: Sync scheduler changes from https://github.com/sched-ext/scx
    htejun authored Dec 4, 2023
    commit 61ce4fe
  5. scx: Disable vtime ordering for internal DSQs

    Internal DSQs, i.e. SCX_DSQ_LOCAL and SCX_DSQ_GLOBAL, have somewhat
    special behavior in that they're automatically consumed by the internal
    ext.c logic. A user could therefore accidentally starve tasks on either
    of the DSQs if they dispatch to both the vtime and FIFO queues, as
    they're consumed in a specific order by the internal logic. It likely
    doesn't make sense to ever use both FIFO and PRIQ ordering in the same
    DSQ, so let's explicitly disable it for the internal DSQs. In a
    follow-on change, we'll error out a scheduler if a user dispatches to
    both FIFO and vtime for any DSQ.
    
    Reported-by: Changwoo Min <[email protected]>
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Dec 4, 2023
    commit 25a5d10
  6. scx: Enforce either/or usage of DSQ FIFO/PRIQ dispatching

    Currently, a user can do both FIFO and PRIQ dispatching to a DSQ. This
    can result in non-intuitive behavior. For example, if a user
    PRIQ-dispatches to a DSQ, and then subsequently FIFO dispatches, an
    scx_bpf_consume() operation will always favor the FIFO-dispatched task.
    While we could add something like an scx_bpf_consume_vtime() kfunc,
    given that there's not a clear use-case for doing both types of
    dispatching in a single DSQ, for now we'll elect to just enforce that
    only a single type is being used at any given time.
    
    Reported-by: Changwoo Min <[email protected]>
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Dec 4, 2023
    commit 346fd9d
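    In scheduler code, the either/or rule means a given DSQ should only ever
    see one of the two dispatch flavors. A sketch (MY_DSQ and the function
    names are illustrative; the kfuncs are real):

    void example_enqueue(struct task_struct *p, u64 enq_flags)
    {
            /* FIFO dispatch: tasks on MY_DSQ are consumed in dispatch order. */
            scx_bpf_dispatch(p, MY_DSQ, SCX_SLICE_DFL, enq_flags);
    }

    void example_enqueue_vtime(struct task_struct *p, u64 enq_flags)
    {
            /* PRIQ dispatch: tasks on MY_DSQ are consumed in vtime order.
             * Mixing this with the FIFO flavor on the same DSQ now errors
             * out the scheduler. */
            scx_bpf_dispatch_vtime(p, MY_DSQ, SCX_SLICE_DFL,
                                   p->scx.dsq_vtime, enq_flags);
    }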

Commits on Dec 5, 2023

  1. Merge pull request #92 from sched-ext/internal_priq

    Change semantics of FIFO/PRIQ dispatching
    Byte-Lab authored Dec 5, 2023
    commit 4d61801
  2. commit 03b9a1f
  3. Merge pull request #93 from sched-ext/scx-sync

    scx_sync: Sync scheduler changes from https://github.com/sched-ext/scx
    htejun authored Dec 5, 2023
    commit e5078c1

Commits on Dec 6, 2023

  1. commit 782f273
  2. commit 2f6ba98

Commits on Dec 8, 2023

  1. commit 9c18e3d
  2. Merge pull request #95 from sched-ext/sync

    scx_sync: Sync scheduler changes from https://github.com/sched-ext/scx
    Byte-Lab authored Dec 8, 2023
    commit 5bb3614
  3. scx: Add missing ) to $(error) invocation in Makefile

    We're missing a closing ) on a branch that we never take. Let's close it
    just for correctness.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Dec 8, 2023
    commit 36d3838
  4. Merge pull request #96 from sched-ext/makefile_fix

    scx: Add missing ) to $(error) invocation in Makefile
    Byte-Lab authored Dec 8, 2023
    commit 963fc30
  5. scx: Add skeleton for scx testing framework

    We should build a selftest suite to do some basic sanity testing of scx.
    Some elements are going to be borrowed from tools/testing/selftests/bpf,
    as we're going to be building and loading BPF progs, and sometimes
    verifying that BPF progs fail to load.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Dec 8, 2023
    commit d3f9558
  6. Merge pull request #97 from sched-ext/scx_selftests

    scx: Add skeleton for scx testing framework
    htejun authored Dec 8, 2023
    commit 177edd6

Commits on Dec 28, 2023

  1. kernfs: convert kernfs_idr_lock to an irq safe raw spinlock

    bpf_cgroup_from_id() (provided by sched-ext) needs to acquire
    kernfs_idr_lock and it can be used in the scheduler dispatch path with
    rq->__lock held.
    
    But any kernfs function that is acquiring kernfs_idr_lock can be
    interrupted by a scheduler tick, which would try to acquire rq->__lock,
    triggering the following deadlock scenario:
    
            CPU0                    CPU1
            ----                    ----
       lock(kernfs_idr_lock);
                                    lock(rq->__lock);
                                    lock(kernfs_idr_lock);
       <Interrupt>
        lock(rq->__lock);
    
    More generally, considering that bpf_cgroup_from_id() is provided as a
    kfunc, similar deadlock conditions can potentially be triggered from any
    kprobe/tracepoint/fentry.
    
    For this reason, in order to prevent any potential deadlock scenario,
    convert kernfs_idr_lock to a raw irq safe spinlock.
    
    Signed-off-by: Andrea Righi <[email protected]>
    Andrea Righi committed Dec 28, 2023
    commit dad3fb6
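    The shape of the conversion, as a hedged sketch (the real lock sites
    live in fs/kernfs/; the function below is illustrative):

    /* Raw and irq-safe: with interrupts disabled across the critical
     * section, a scheduler tick can't nest rq->__lock inside it. */
    static DEFINE_RAW_SPINLOCK(kernfs_idr_lock);

    static void kernfs_idr_example(void)
    {
            unsigned long flags;

            raw_spin_lock_irqsave(&kernfs_idr_lock, flags);
            /* ... idr lookup/insert under the lock ... */
            raw_spin_unlock_irqrestore(&kernfs_idr_lock, flags);
    }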
  2. sched_ext: fix race in scx_move_task() with exiting tasks

    There is a race with exiting tasks in scx_move_task() where we may fail
    to check for autogroup tasks, leading to the following oops:
    
     WARNING: CPU: 2 PID: 100 at kernel/sched/ext.c:2571 scx_move_task+0x9f/0xb0
     ...
     Sched_ext: flatcg (enabled+all), task: runnable_at=-5ms
     RIP: 0010:scx_move_task+0x9f/0xb0
     Call Trace:
      <TASK>
      ? scx_move_task+0x9f/0xb0
      ? __warn+0x85/0x170
      ? scx_move_task+0x9f/0xb0
      ? report_bug+0x171/0x1a0
      ? handle_bug+0x3b/0x70
      ? exc_invalid_op+0x17/0x70
      ? asm_exc_invalid_op+0x1a/0x20
      ? scx_move_task+0x9f/0xb0
      sched_move_task+0x104/0x300
      do_exit+0x37d/0xb70
      ? lock_release+0xbe/0x270
      do_group_exit+0x37/0xa0
      __x64_sys_exit_group+0x18/0x20
      do_syscall_64+0x44/0xf0
      entry_SYSCALL_64_after_hwframe+0x6f/0x77
    
    And a related NULL pointer dereference afterwards:
    
     BUG: kernel NULL pointer dereference, address: 0000000000000148
    
    Prevent this by skipping scx_move_task() actions for exiting tasks.
    
    Moreover, make scx_move_task() more reliable by triggering only the
    WARN_ON_ONCE() and returning, instead of also triggering the bug
    afterwards.
    
    Signed-off-by: Andrea Righi <[email protected]>
    Andrea Righi committed Dec 28, 2023
    commit 6b747e0
  3. Merge pull request #101 from arighi/fix-move-task-race

    sched_ext: fix race in scx_move_task() with exiting tasks
    htejun authored Dec 28, 2023
    commit 79d694e

Commits on Jan 3, 2024

  1. scx: Support direct dispatching from ops.select_cpu()

    A common pattern in schedulers is to find and reserve an idle core in
    ops.select_cpu(), and to then use a task local storage map to specify
    that the task should be enqueued in SCX_DSQ_LOCAL on the ops.enqueue()
    path. At the same time, we also have a special SCX_TASK_ENQ_LOCAL
    enqueue flag which is used by scx_select_cpu_dfl() to notify
    ops.enqueue() that it may want to do a local enqueue.
    
    Taking a step back, direct dispatch is something that should be
    supported from the ops.select_cpu() path as well. The contract is that
    doing a direct dispatch to SCX_DSQ_LOCAL will dispatch the task to the
    local CPU of whatever is returned by ops.select_cpu(). With that in
    mind, let's just extend the API a bit to support direct dispatch from
    ops.select_cpu().
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 3, 2024
    commit 07acdca
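    A hedged sketch of the new pattern (scx_bpf_test_and_clear_cpu_idle() is
    an existing kfunc; the ops name is illustrative):

    s32 BPF_STRUCT_OPS(example_select_cpu, struct task_struct *p,
                       s32 prev_cpu, u64 wake_flags)
    {
            /* If prev_cpu is idle, claim it and dispatch directly; no
             * task-local-storage round trip through ops.enqueue() needed. */
            if (scx_bpf_test_and_clear_cpu_idle(prev_cpu))
                    scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

            return prev_cpu;
    }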

Commits on Jan 4, 2024

  1. scx: Remove SCX_ENQ_LOCAL flag

    Now that we support dispatching directly from ops.select_cpu(), the
    SCX_ENQ_LOCAL flag isn't needed. The last place it was used was on the
    SCX_ENQ_LAST path to control whether a task would be dispatched locally
    if ops.enqueue() wasn't defined. It doesn't really make sense to define
    SCX_OPS_ENQ_LAST but not ops.enqueue(), so let's remove SCX_ENQ_LOCAL
    and validate that SCX_OPS_ENQ_LAST is never passed if ops.enqueue()
    isn't defined.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 4, 2024
    commit 08fc865
  2. scx: Add scx_bpf_select_cpu_dfl() kfunc

    Some scheduler implementations may want to have ops.enqueue() invoked
    even if scx_select_cpu_dfl() finds an idle core for the enqueuing task
    to run on. In order to enable this, we can add a new
    scx_bpf_select_cpu_dfl() kfunc which allows a BPF scheduler to get the
    same behavior as the default ops.select_cpu() implementation, and then
    decide whether they want to dispatch directly from ops.select_cpu().
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 4, 2024
    commit fadfa2f
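    Usage might look like the following sketch (the out-parameter reports
    whether an idle CPU was found and reserved; the ops name is illustrative):

    s32 BPF_STRUCT_OPS(example_select_cpu, struct task_struct *p,
                       s32 prev_cpu, u64 wake_flags)
    {
            bool is_idle = false;
            s32 cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);

            /* The scheduler decides: dispatch now, or defer to ops.enqueue(). */
            if (is_idle)
                    scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

            return cpu;
    }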
  3. scx: Add selftests for new select_cpu dispatch semantics

    Let's test the new semantics for being able to do direct dispatch from
    ops.select_cpu(), including testing when SCX_OPS_ENQ_DFL_NO_DISPATCH is
    specified. Also adds a testcase validating that we can't load a
    scheduler with SCX_OPS_ENQ_LAST if ops.enqueue() is not defined.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 4, 2024
    commit 9fd2c3b
  4. Merge pull request #104 from sched-ext/select_cpu_dfl

    Allow dispatching from ops.select_cpu()
    Byte-Lab authored Jan 4, 2024
    commit d788214
  5. scx: Error for a priq builtin DSQ in dispatch_enqueue()

    We're currently checking whether a builtin DSQ is being used with priq
    in scx_bpf_dispatch_vtime(). This neglects the fact that we could end up
    falling back to scx_dsq_global if there's an error. If we error out with
    SCX_ENQ_DSQ_PRIQ set in enqueue flags, we would trigger a warning in
    dispatch_enqueue(). Let's instead just move the check to inside of
    dispatch_enqueue().
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 4, 2024
    commit 2638aff
  6. scx: Add testcases for vtime-dispatching to builtin DSQs

    Let's verify that we're disallowing builtin DSQs from being dispatched
    to.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 4, 2024
    commit d5b84a4
  7. Merge pull request #105 from sched-ext/fix_fallback

    Fix fallback
    Byte-Lab authored Jan 4, 2024
    commit 902d364

Commits on Jan 5, 2024

  1. scx: Always set task scx weight before enable

    We were previously only calling it on the fork path, but we need to be
    calling it on the enable path as well.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 5, 2024
    commit 56b2ec9
  2. scx: Call enable / disable on entry / exit to scx

    Currently, the ops.enable() and ops.disable() callbacks are invoked a
    single time for every task on the system. ops.enable() is invoked
    shortly after a task succeeds in ops.prep_enable(), and ops.disable() is
    invoked when a task exits, or when the BPF scheduler is unloaded.
    
    This API is a bit odd because ops.enable() can be invoked well before a
    task actually starts running in the BPF scheduler, so it's not
    necessarily useful as a way to bootstrap a process. For example,
    scx_simple does the following:
    
    void BPF_STRUCT_OPS(simple_enable, struct task_struct *p,
                        struct scx_enable_args *args)
    {
            p->scx.dsq_vtime = vtime_now;
    }
    
    If the task later switches to sched_ext, the value will of course be
    stale. While it ends up balancing out due to logic elsewhere in the
    scheduler, it's indicative of a somewhat awkward component of the API
    that can be improved.
    
    Instead, this patch has ops.enable() be invoked when a task is entering
    the scheduler for the first time, and ops.disable() be invoked
    whenever a task is leaving the scheduler; be it because of exiting, the
    scheduler being unloaded, or the task manually switching sched policies.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 5, 2024
    commit 9604441
  3. scx: Rename prep_enable() and cancel_enable(), add exit_task()

    ops.prep_enable() and ops.cancel_enable() have arguably become misnomers
    in that ops.enable() and ops.disable() may be called multiple times while
    a BPF prog is loaded, but ops.prep_enable() and ops.cancel_enable() will
    be called at most once. ops.prep_enable() is really more akin to
    initializing the task rather than preparing for ops.enable(), so let's
    rename them to ops.init_task() and ops.cancel_init() to reflect this.
    
    In addition, some schedulers are currently using ops.disable() to clean
    up whatever was initialized in (what was previously) ops.prep_enable().
    This doesn't work now that ops.disable() can be called multiple times,
    so we also need to add a new callback called exit_task() which is called
    exactly once when a task is exiting (if it was previously successfully
    initialized).
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 5, 2024
    commit 81e1051
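    A hedged sketch of the renamed callback pair (the args struct name and
    exact signatures are assumptions, not taken from the commit):

    s32 BPF_STRUCT_OPS(example_init_task, struct task_struct *p,
                       struct scx_init_task_args *args /* assumed name */)
    {
            /* one-time per-task setup, formerly ops.prep_enable() */
            return 0;
    }

    void BPF_STRUCT_OPS(example_exit_task, struct task_struct *p)
    {
            /* runs exactly once per task that was successfully initialized */
    }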
  4. scx: Add init_enable_count testcase

    We expect to have some sched_ext_ops callbacks be called differently
    depending on the scheduler, and the tasks running on the system.  Let's
    add a testcase that verifies that the init_task(), exit_task(),
    enable(), and disable() callbacks are all invoked correctly and as
    expected.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 5, 2024
    commit aa60d9e
  5. scx: Move sched_ext_entity.ddsq_id out of modifiable fields

    When we added support for dispatching from ops.select_cpu(), I
    accidentally put the sched_ext_entity.ddsq_id field into the "modifiable
    fields" part of struct sched_ext_entity. It should be harmless, but
    there shouldn't be any reason for a scheduler to muck with it, so let's
    move it up.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 5, 2024
    commit 6b8ccfd
  6. scx: Add missing DSQ fallback test files

    I forgot to include these in the patch set that fixes, and adds tests for,
    gracefully falling back to the global DSQ.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 5, 2024
    commit 367eab2
  7. Merge pull request #100 from sched-ext/fix_enable

    Fix and update semantics for ops.enable() and ops.disable()
    Byte-Lab authored Jan 5, 2024
    commit 88568ae
  8. scx: Claim idle core in scx_select_cpu_dfl for nr_cpus_allowed == 1

    In scx_select_cpu_dfl(), we're currently returning prev_cpu if
    p->nr_cpus_allowed == 1. It makes sense to return prev_cpu if the task
    can't run on any other cores, but we might as well also try to claim the
    core as idle so that:
    
    1. scx_select_cpu_dfl() will directly dispatch it
    2. To prevent another core from incorrectly assuming that core will be
       idle when in reality that task will be enqueued to it. The mask will
       eventually be updated in __scx_update_idle(), but this seems more
       efficient.
    3. To have the idle cpumask bit be unset when the task is enqueued in
       ops.enqueue() (if the core scheduler is using
       scx_bpf_select_cpu_dfl()).
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 5, 2024
    commit 2cf297c
  9. scx: Make select_cpu_dfl test a bit less brittle

    select_cpu_dfl checks whether a task that's successfully dispatched from
    the default select_cpu implementation isn't subsequently enqueued. It's
    only doing the check for non-pcpu threads, but that's not really the
    condition we want to look for. We don't want to do the check for any
    task that's being enqueued on the enable path, because it won't have
    gone through the select_cpu path. Instead, let's just check the task
    name to verify it's the test task.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 5, 2024
    commit e6cb892

Commits on Jan 6, 2024

  1. Merge pull request #106 from sched-ext/prev_cpu_idle_reserve

    Claim idle core in scx_select_cpu_dfl for nr_cpus_allowed == 1
    htejun authored Jan 6, 2024
    commit 8ccf9d7

Commits on Jan 8, 2024

  1. scx: Avoid possible deadlock with cpus_read_lock()

    Han Xing Yi reported a syzbot lockdep error over the weekend:
    
    ======================================================
    WARNING: possible circular locking dependency detected
    6.6.0-g2f6ba98e2d3d #4 Not tainted
    ------------------------------------------------------
    syz-executor.0/2181 is trying to acquire lock:
    ffffffff84772410 (pernet_ops_rwsem){++++}-{3:3}, at: copy_net_ns+0x216/0x590 net/core/net_namespace.c:487
    but task is already holding lock:
    ffffffff8449dc50 (scx_fork_rwsem){++++}-{0:0}, at: sched_fork+0x3b/0x190 kernel/sched/core.c:4810
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3 (scx_fork_rwsem){++++}-{0:0}:
           percpu_down_write+0x51/0x210 kernel/locking/percpu-rwsem.c:227
           scx_ops_enable+0x230/0xf90 kernel/sched/ext.c:3271
           bpf_struct_ops_link_create+0x1b9/0x220 kernel/bpf/bpf_struct_ops.c:914
           link_create kernel/bpf/syscall.c:4938 [inline]
           __sys_bpf+0x35af/0x4ac0 kernel/bpf/syscall.c:5453
           __do_sys_bpf kernel/bpf/syscall.c:5487 [inline]
           __se_sys_bpf kernel/bpf/syscall.c:5485 [inline]
           __x64_sys_bpf+0x48/0x60 kernel/bpf/syscall.c:5485
           do_syscall_x64 arch/x86/entry/common.c:51 [inline]
           do_syscall_64+0x46/0x100 arch/x86/entry/common.c:82
           entry_SYSCALL_64_after_hwframe+0x6e/0x76
    -> #2 (cpu_hotplug_lock){++++}-{0:0}:
           percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
           cpus_read_lock+0x42/0x1b0 kernel/cpu.c:489
           flush_all_backlogs net/core/dev.c:5885 [inline]
           unregister_netdevice_many_notify+0x30a/0x1070 net/core/dev.c:10965
           unregister_netdevice_many+0x19/0x20 net/core/dev.c:11039
           sit_exit_batch_net+0x433/0x460 net/ipv6/sit.c:1887
           ops_exit_list+0xc5/0xe0 net/core/net_namespace.c:175
           cleanup_net+0x3e2/0x750 net/core/net_namespace.c:614
           process_one_work+0x50d/0xc20 kernel/workqueue.c:2630
           process_scheduled_works kernel/workqueue.c:2703 [inline]
           worker_thread+0x50b/0x950 kernel/workqueue.c:2784
           kthread+0x1fa/0x250 kernel/kthread.c:388
           ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147
           ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:242
    -> #1 (rtnl_mutex){+.+.}-{3:3}:
           __mutex_lock_common kernel/locking/mutex.c:603 [inline]
           __mutex_lock+0xc1/0xea0 kernel/locking/mutex.c:747
           mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:799
           rtnl_lock+0x17/0x20 net/core/rtnetlink.c:79
           register_netdevice_notifier+0x25/0x1c0 net/core/dev.c:1741
           rtnetlink_init+0x3a/0x6e0 net/core/rtnetlink.c:6657
           netlink_proto_init+0x23d/0x2f0 net/netlink/af_netlink.c:2946
           do_one_initcall+0xb3/0x5f0 init/main.c:1232
           do_initcall_level init/main.c:1294 [inline]
           do_initcalls init/main.c:1310 [inline]
           do_basic_setup init/main.c:1329 [inline]
           kernel_init_freeable+0x40c/0x5d0 init/main.c:1547
           kernel_init+0x1d/0x350 init/main.c:1437
           ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147
           ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:242
    -> #0 (pernet_ops_rwsem){++++}-{3:3}:
           check_prev_add kernel/locking/lockdep.c:3134 [inline]
           check_prevs_add kernel/locking/lockdep.c:3253 [inline]
           validate_chain kernel/locking/lockdep.c:3868 [inline]
           __lock_acquire+0x16b4/0x2b30 kernel/locking/lockdep.c:5136
           lock_acquire kernel/locking/lockdep.c:5753 [inline]
           lock_acquire+0xc1/0x2b0 kernel/locking/lockdep.c:5718
           down_read_killable+0x5d/0x280 kernel/locking/rwsem.c:1549
           copy_net_ns+0x216/0x590 net/core/net_namespace.c:487
           create_new_namespaces+0x2ed/0x770 kernel/nsproxy.c:110
           copy_namespaces+0x488/0x540 kernel/nsproxy.c:179
           copy_process+0x1b52/0x4680 kernel/fork.c:2504
           kernel_clone+0x116/0x660 kernel/fork.c:2914
           __do_sys_clone3+0x192/0x220 kernel/fork.c:3215
           __se_sys_clone3 kernel/fork.c:3199 [inline]
           __x64_sys_clone3+0x30/0x40 kernel/fork.c:3199
           do_syscall_x64 arch/x86/entry/common.c:51 [inline]
           do_syscall_64+0x46/0x100 arch/x86/entry/common.c:82
           entry_SYSCALL_64_after_hwframe+0x6e/0x76
    other info that might help us debug this:
    Chain exists of:
      pernet_ops_rwsem --> cpu_hotplug_lock --> scx_fork_rwsem
     Possible unsafe locking scenario:
           CPU0                    CPU1
           ----                    ----
      rlock(scx_fork_rwsem);
                                   lock(cpu_hotplug_lock);
                                   lock(scx_fork_rwsem);
      rlock(pernet_ops_rwsem);
     *** DEADLOCK ***
    1 lock held by syz-executor.0/2181:
     #0: ffffffff8449dc50 (scx_fork_rwsem){++++}-{0:0}, at: sched_fork+0x3b/0x190 kernel/sched/core.c:4810
    stack backtrace:
    CPU: 0 PID: 2181 Comm: syz-executor.0 Not tainted 6.6.0-g2f6ba98e2d3d #4
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
    Sched_ext: serialise (enabled), task: runnable_at=-6ms
    Call Trace:
     <TASK>
     __dump_stack lib/dump_stack.c:89 [inline]
     dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:107
     dump_stack+0x15/0x20 lib/dump_stack.c:114
     check_noncircular+0x134/0x150 kernel/locking/lockdep.c:2187
     check_prev_add kernel/locking/lockdep.c:3134 [inline]
     check_prevs_add kernel/locking/lockdep.c:3253 [inline]
     validate_chain kernel/locking/lockdep.c:3868 [inline]
     __lock_acquire+0x16b4/0x2b30 kernel/locking/lockdep.c:5136
     lock_acquire kernel/locking/lockdep.c:5753 [inline]
     lock_acquire+0xc1/0x2b0 kernel/locking/lockdep.c:5718
     down_read_killable+0x5d/0x280 kernel/locking/rwsem.c:1549
     copy_net_ns+0x216/0x590 net/core/net_namespace.c:487
     create_new_namespaces+0x2ed/0x770 kernel/nsproxy.c:110
     copy_namespaces+0x488/0x540 kernel/nsproxy.c:179
     copy_process+0x1b52/0x4680 kernel/fork.c:2504
     kernel_clone+0x116/0x660 kernel/fork.c:2914
     __do_sys_clone3+0x192/0x220 kernel/fork.c:3215
     __se_sys_clone3 kernel/fork.c:3199 [inline]
     __x64_sys_clone3+0x30/0x40 kernel/fork.c:3199
     do_syscall_x64 arch/x86/entry/common.c:51 [inline]
     do_syscall_64+0x46/0x100 arch/x86/entry/common.c:82
     entry_SYSCALL_64_after_hwframe+0x6e/0x76
    RIP: 0033:0x7f9f764e240d
    Code: c3 e8 97 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007f9f75851ee8 EFLAGS: 00000246 ORIG_RAX: 00000000000001b3
    RAX: ffffffffffffffda RBX: 00007f9f7661ef80 RCX: 00007f9f764e240d
    RDX: 0000000000000100 RSI: 0000000000000058 RDI: 00007f9f75851f00
    RBP: 00007f9f765434a6 R08: 0000000000000000 R09: 0000000000000058
    R10: 00007f9f75851f00 R11: 0000000000000246 R12: 0000000000000058
    R13: 0000000000000006 R14: 00007f9f7661ef80 R15: 00007f9f75832000
     </TASK>
    
    The issue is that we're acquiring the cpus_read_lock() _before_ we
    acquire scx_fork_rwsem in scx_ops_enable() and scx_ops_disable(), but we
    acquire and hold scx_fork_rwsem around basically the whole fork() path.
    I don't see how a deadlock could actually occur in practice, but it
    should be safe to acquire the scx_fork_rwsem and scx_cgroup_rwsem
    semaphores before the hotplug lock, so let's do that.
    
    Reported-by: Han Xing Yi <[email protected]>
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 8, 2024
    commit c3c7041
  2. scx: Set default slice for default select_cpu dispatch

    If ops.select_cpu() isn't defined, scx_select_cpu_dfl() will be called,
    and a task will be dispatched directly to a core if one is found. I
    neglected to also set the task slice, so we see the following warning if
    we use the direct dispatch:
    
    [root@arch scx]# ./select_cpu_dfl
    [   23.184426] sched_ext: select_cpu_dfl[356] has zero slice in pick_next_task_scx()
    
    I'm not sure why this wasn't being printed when I tested this before,
    but let's fix it.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 8, 2024
    commit 8bbe0db
  3. Merge pull request #109 from sched-ext/dfl_slice

    scx: Set default slice for default select_cpu dispatch
    htejun authored Jan 8, 2024
    commit 15f2f4f
  4. Merge pull request #108 from sched-ext/avoid_deadlock

    scx: Avoid possible deadlock with cpus_read_lock()
    htejun authored Jan 8, 2024
    commit dd92f1a
  5. scx: Use READ/WRITE_ONCE() for scx_watchdog_timeout/timestamp

    They're accessed without any locking and check_rq_for_timeouts() seems to
    assume that last_runnable doesn't get fetched multiple times, which isn't
    true without READ_ONCE().
    
    Signed-off-by: Tejun Heo <[email protected]>
    htejun committed Jan 8, 2024
    commit 4164e16
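    The pattern being applied, as a sketch (names from the commit title and
    the surrounding listing; the check's exact shape is assumed):

    static void check_task_timeout(struct task_struct *p)
    {
            /* Fetch shared values once; without READ_ONCE() the compiler may
             * reload them, letting the check see two different snapshots. */
            unsigned long timeout = READ_ONCE(scx_watchdog_timeout);
            unsigned long last_runnable = READ_ONCE(p->scx.runnable_at);

            if (time_after(jiffies, last_runnable + timeout)) {
                    /* ... report the stall ... */
            }
    }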
  6. scx: Rename rq->scx.watchdog_list and friends to runnable_list and counterparts
    
    The list will be used for another purpose too. Rename to indicate the
    generic nature.
    
    Signed-off-by: Tejun Heo <[email protected]>
    htejun committed Jan 8, 2024
    commit 9c0a799
  7. scx: Factor out scx_ops_bypass() and s/scx_ops_disabling()/scx_ops_bypassing()/g
    
    Guaranteeing forward progress by forcing global FIFO behavior is currently
    used only in the disabling path. This will be used for something else too.
    Let's factor it out and rename accordingly.
    
    No functional change intended.
    
    Signed-off-by: Tejun Heo <[email protected]>
    htejun committed Jan 8, 2024
    commit 215f0ff
  8. scx: Implement bypass depth and always bypass while disabling

    Implement bypass depth so that multiple users can request bypassing without
    conflicts. This decouples bypass on/off from ops state so that bypassing can
    be used in combination with any ops state. The unbypassing path isn't used
    yet and is to be implemented.
    
    Note that task_should_scx() needs to test whether we're DISABLING rather
    than bypassing, and is thus updated to test scx_ops_enable_state() explicitly.
    
    The disable path now always uses bypassing to balance bypass depth. This
    also leads to simpler code.
    
    Signed-off-by: Tejun Heo <[email protected]>
    htejun committed Jan 8, 2024
    commit f4c4ef2
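    The depth-counter idea, in a hedged sketch (the helper shape and flow
    are illustrative, not the kernel code):

    static atomic_t scx_ops_bypass_depth = ATOMIC_INIT(0);

    static void scx_ops_bypass(bool bypass)
    {
            if (bypass) {
                    /* only the 0 -> 1 transition flips behavior */
                    if (atomic_inc_return(&scx_ops_bypass_depth) != 1)
                            return;
                    /* ... force global FIFO behavior ... */
            } else {
                    /* symmetrically, only the 1 -> 0 transition restores */
                    if (atomic_dec_return(&scx_ops_bypass_depth) != 0)
                            return;
                    /* ... restore normal ops behavior ... */
            }
    }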
  9. scx: Implement turning off bypassing

    Bypassing overrides ops.enqueue() and .dispatch() to force global FIFO
    behavior. However, this was an irreversible action making it impossible to
    turn off bypassing. Instead, add behaviors conditional on
    scx_ops_bypassing() to implement global FIFO behavior while bypassing. This
    adds two condition checks to hot paths but they're easily predictable and
    shouldn't add noticeable overhead.
    
    Signed-off-by: Tejun Heo <[email protected]>
    htejun committed Jan 8, 2024
    commit a00ac85
  10. scx: Optimize scx_ops_bypass()

    scx_ops_bypass() involves scanning all tasks in the system and can thus
    become pretty expensive, which limits its utility. scx_ops_bypass() isn't
    making any persistent changes to tasks. It just wants to dequeue and
    re-enqueue runnable tasks so that they're queued according to the current
    bypass state. As such, it can iterate the runnable tasks rather than all.
    
    This patch makes scx_ops_bypass() iterate each CPU's rq->scx.runnable_list.
    There are subtle complications due to the inability to trust the scheduler
    and each task going off and getting back on the runnable_list as they get
    cycled. See the comments for details.
    
    After this optimization, [un]bypassing should be pretty cheap in most
    circumstances.
    
    Signed-off-by: Tejun Heo <[email protected]>
    htejun committed Jan 8, 2024
    commit 8583a03
  11. scx: Expose bypassing state to userland

    Signed-off-by: Tejun Heo <[email protected]>
    htejun committed Jan 8, 2024
    commit 303c346
  12. scx: s/register_ext_kfuncs()/scx_init()/

    We need more stuff to do in the init function. Give it a more generic name.
    
    Signed-off-by: Tejun Heo <[email protected]>
    htejun committed Jan 8, 2024
    commit a37ef8e
  13. scx: Bypass while PM operations are in progress

    SCX schedulers often have userspace components which are sometimes involved
    in critical scheduling paths. PM operations involve freezing userspace which
    can lead to scheduling misbehaviors including stalls. Let's bypass while PM
    operations are in progress.
    
    Signed-off-by: Tejun Heo <[email protected]>
    Reported-by: Andrea Righi <[email protected]>
    htejun committed Jan 8, 2024
    commit df28190
  14. scx: Disabling scx_bpf_kick_cpu() while bypassing

    scx_bpf_kick_cpu() uses irq_work. However, if called while e.g. suspending,
    IRQ handling may already be offline and scheduling irq_work can hang
    indefinitely. There's no need for kicking while bypassing anyway, so let's
    suppress scx_bpf_kick_cpu() while bypassing.
    
    Signed-off-by: Tejun Heo <[email protected]>
    htejun committed Jan 8, 2024
    a62d59c
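
    The suppression amounts to an early exit in the kfunc; a sketch:

        void scx_bpf_kick_cpu(s32 cpu, u64 flags)
        {
                /*
                 * While bypassing, IRQ handling may be offline (e.g. during
                 * suspend) and queueing irq_work could hang indefinitely.
                 * Kicks aren't needed in bypass mode, so drop them.
                 */
                if (scx_ops_bypassing())
                        return;

                /* normal path: validate @cpu and schedule the kick irq_work */
        }
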
  15. Merge pull request #103 from sched-ext/htejun

    Implement generic bypass mode and use it while PM operations are in progress
    htejun authored Jan 8, 2024
    a7150a9
  16. Revert "scx: Avoid possible deadlock with cpus_read_lock()"

    This reverts commit c3c7041.
    
    We hit a locking ordering issue in the other direction. Let's revert for
    now.
    
    [    9.378773] ======================================================
    [    9.379476] WARNING: possible circular locking dependency detected
    [    9.379532] 6.6.0-work-10442-ga7150a9168f8-dirty #134 Not tainted
    [    9.379532] ------------------------------------------------------
    [    9.379532] scx_rustland/1622 is trying to acquire lock:
    [    9.379532] ffffffff8325f828 (cpu_hotplug_lock){++++}-{0:0}, at: bpf_scx_reg+0xe4/0xcf0
    [    9.379532]
    [    9.379532] but task is already holding lock:
    [    9.379532] ffffffff83271be8 (scx_cgroup_rwsem){++++}-{0:0}, at: bpf_scx_reg+0xdf/0xcf0
    [    9.379532]
    [    9.379532] which lock already depends on the new lock.
    [    9.379532]
    [    9.379532]
    [    9.379532] the existing dependency chain (in reverse order) is:
    [    9.379532]
    [    9.379532] -> #2 (scx_cgroup_rwsem){++++}-{0:0}:
    [    9.379532]        percpu_down_read+0x2e/0xb0
    [    9.379532]        scx_cgroup_can_attach+0x25/0x200
    [    9.379532]        cpu_cgroup_can_attach+0xe/0x10
    [    9.379532]        cgroup_migrate_execute+0xaf/0x450
    [    9.379532]        cgroup_apply_control+0x227/0x2a0
    [    9.379532]        cgroup_subtree_control_write+0x425/0x4b0
    [    9.379532]        cgroup_file_write+0x82/0x260
    [    9.379532]        kernfs_fop_write_iter+0x131/0x1c0
    [    9.379532]        vfs_write+0x1f9/0x270
    [    9.379532]        ksys_write+0x62/0xc0
    [    9.379532]        __x64_sys_write+0x1b/0x20
    [    9.379532]        do_syscall_64+0x40/0xe0
    [    9.379532]        entry_SYSCALL_64_after_hwframe+0x46/0x4e
    [    9.379532]
    [    9.379532] -> #1 (cgroup_threadgroup_rwsem){++++}-{0:0}:
    [    9.379532]        percpu_down_write+0x35/0x1e0
    [    9.379532]        cgroup_procs_write_start+0x8a/0x210
    [    9.379532]        __cgroup_procs_write+0x4c/0x160
    [    9.379532]        cgroup_procs_write+0x17/0x30
    [    9.379532]        cgroup_file_write+0x82/0x260
    [    9.379532]        kernfs_fop_write_iter+0x131/0x1c0
    [    9.379532]        vfs_write+0x1f9/0x270
    [    9.379532]        ksys_write+0x62/0xc0
    [    9.379532]        __x64_sys_write+0x1b/0x20
    [    9.379532]        do_syscall_64+0x40/0xe0
    [    9.379532]        entry_SYSCALL_64_after_hwframe+0x46/0x4e
    [    9.379532]
    [    9.379532] -> #0 (cpu_hotplug_lock){++++}-{0:0}:
    [    9.379532]        __lock_acquire+0x142d/0x2a30
    [    9.379532]        lock_acquire+0xbf/0x1f0
    [    9.379532]        cpus_read_lock+0x2f/0xc0
    [    9.379532]        bpf_scx_reg+0xe4/0xcf0
    [    9.379532]        bpf_struct_ops_link_create+0xb6/0x100
    [    9.379532]        link_create+0x49/0x200
    [    9.379532]        __sys_bpf+0x351/0x3e0
    [    9.379532]        __x64_sys_bpf+0x1c/0x20
    [    9.379532]        do_syscall_64+0x40/0xe0
    [    9.379532]        entry_SYSCALL_64_after_hwframe+0x46/0x4e
    [    9.379532]
    [    9.379532] other info that might help us debug this:
    [    9.379532]
    [    9.379532] Chain exists of:
    [    9.379532]   cpu_hotplug_lock --> cgroup_threadgroup_rwsem --> scx_cgroup_rwsem
    [    9.379532]
    [    9.379532]  Possible unsafe locking scenario:
    [    9.379532]
    [    9.379532]        CPU0                    CPU1
    [    9.379532]        ----                    ----
    [    9.379532]   lock(scx_cgroup_rwsem);
    [    9.379532]                                lock(cgroup_threadgroup_rwsem);
    [    9.379532]                                lock(scx_cgroup_rwsem);
    [    9.379532]   rlock(cpu_hotplug_lock);
    [    9.379532]
    [    9.379532]  *** DEADLOCK ***
    [    9.379532]
    [    9.379532] 3 locks held by scx_rustland/1622:
    [    9.379532]  #0: ffffffff83272708 (scx_ops_enable_mutex){+.+.}-{3:3}, at: bpf_scx_reg+0x2a/0xcf0
    [    9.379532]  #1: ffffffff83271aa0 (scx_fork_rwsem){++++}-{0:0}, at: bpf_scx_reg+0xd3/0xcf0
    [    9.379532]  #2: ffffffff83271be8 (scx_cgroup_rwsem){++++}-{0:0}, at: bpf_scx_reg+0xdf/0xcf0
    [    9.379532]
    [    9.379532] stack backtrace:
    [    9.379532] CPU: 7 PID: 1622 Comm: scx_rustland Not tainted 6.6.0-work-10442-ga7150a9168f8-dirty #134
    [    9.379532] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 2/2/2022
    [    9.379532] Sched_ext: rustland (prepping)
    [    9.379532] Call Trace:
    [    9.379532]  <TASK>
    [    9.379532]  dump_stack_lvl+0x55/0x70
    [    9.379532]  dump_stack+0x10/0x20
    [    9.379532]  print_circular_bug+0x2ea/0x2f0
    [    9.379532]  check_noncircular+0xe2/0x100
    [    9.379532]  __lock_acquire+0x142d/0x2a30
    [    9.379532]  ? lock_acquire+0xbf/0x1f0
    [    9.379532]  ? rcu_sync_func+0x2c/0xa0
    [    9.379532]  lock_acquire+0xbf/0x1f0
    [    9.379532]  ? bpf_scx_reg+0xe4/0xcf0
    [    9.379532]  cpus_read_lock+0x2f/0xc0
    [    9.379532]  ? bpf_scx_reg+0xe4/0xcf0
    [    9.379532]  bpf_scx_reg+0xe4/0xcf0
    [    9.379532]  ? alloc_file+0xa4/0x160
    [    9.379532]  ? alloc_file_pseudo+0x99/0xd0
    [    9.379532]  ? anon_inode_getfile+0x79/0xc0
    [    9.379532]  ? bpf_link_prime+0xe2/0x1a0
    [    9.379532]  bpf_struct_ops_link_create+0xb6/0x100
    [    9.379532]  link_create+0x49/0x200
    [    9.379532]  __sys_bpf+0x351/0x3e0
    [    9.379532]  __x64_sys_bpf+0x1c/0x20
    [    9.379532]  do_syscall_64+0x40/0xe0
    [    9.379532]  ? sysvec_apic_timer_interrupt+0x44/0x80
    [    9.379532]  entry_SYSCALL_64_after_hwframe+0x46/0x4e
    [    9.379532] RIP: 0033:0x7fc391f7473d
    [    9.379532] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 95 0c 00 f7 d8 64 89 01 48
    [    9.379532] RSP: 002b:00007ffeb4fe4108 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
    [    9.379532] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc391f7473d
    [    9.379532] RDX: 0000000000000030 RSI: 00007ffeb4fe4120 RDI: 000000000000001c
    [    9.379532] RBP: 000000000000000c R08: 000000000000000c R09: 000055d0a75b1a10
    [    9.379532] R10: 0000000000000050 R11: 0000000000000246 R12: 000000000000002c
    [    9.379532] R13: 00007ffeb4fe4628 R14: 0000000000000000 R15: 00007ffeb4fe4328
    [    9.379532]  </TASK>
    htejun committed Jan 8, 2024
    8588d4f
  17. Merge pull request #110 from sched-ext/lockdep-revert

    Revert "scx: Avoid possible deadlock with cpus_read_lock()"
    htejun authored Jan 8, 2024
    ca86e0d
  18. scx: Make scx_task_state handling more idiomatic

    Functionally equivalent. Just a bit more idiomatic.
    
    Signed-off-by: Tejun Heo <[email protected]>
    htejun committed Jan 8, 2024
    22c3627
  19. Merge tag 'v6.7' into scx-sync-upstream

    Linux 6.7
    htejun committed Jan 8, 2024
    b7858a0
  20. 5445296
  21. 8c7f9b2

Commits on Jan 9, 2024

  1. 88e7560
  2. Merge pull request #113 from sched-ext/htejun

    scx: Sync schedulers from SCX v0.1.5 (74923c6cdbc3)
    htejun authored Jan 9, 2024
    f4dc571
  3. scx: Fix direct dispatch for non-builtin DSQs

    If we've done a direct dispatch from ops.select_cpu(), we can currently
    hang the host if we dispatch to a non-local DSQ. This is because we
    circumvent some important checks, such as whether we should be bypassing
    ops.enqueue() and dispatching directly to the local or global DSQ.
    
    A local dispatch doesn't hang the host today only because we happen to be
    dispatching to a safe, builtin DSQ. Let's instead update the logic to
    perform the direct dispatch only after these critical checks.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 9, 2024
    9ad5535
  4. scx: Keep track of enq flags in direct dispatch

    We're currently not remembering enq flags during direct dispatch. Let's
    record them in case someone wants to pass e.g. SCX_ENQ_PREEMPT from
    ops.select_cpu().
    
    Let's also reset ddsq_id and ddsq_enq_flags before calling
    dispatch_enqueue() to ensure there are no races with the task being
    consumed from another core (see the sketch below).
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 9, 2024
    4b56f6e
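
    Conceptually, the mark/consume pair looks like the following sketch.
    ddsq_id and ddsq_enq_flags are the fields named above; find_dsq() stands
    in for the actual DSQ lookup:

        /* in the ops.select_cpu() direct-dispatch path */
        static void mark_direct_dispatch(struct task_struct *p,
                                         u64 dsq_id, u64 enq_flags)
        {
                p->scx.ddsq_id = dsq_id;
                p->scx.ddsq_enq_flags = enq_flags;
        }

        /* later, on the enqueue path */
        static void direct_dispatch(struct task_struct *p, u64 enq_flags)
        {
                u64 dsq_id = p->scx.ddsq_id;

                enq_flags |= p->scx.ddsq_enq_flags;

                /*
                 * Reset the stashed state before dispatch_enqueue() so that
                 * a CPU consuming the task right away can't observe stale
                 * direct-dispatch markings.
                 */
                p->scx.ddsq_id = SCX_DSQ_INVALID;
                p->scx.ddsq_enq_flags = 0;

                dispatch_enqueue(find_dsq(p, dsq_id), p, enq_flags);
        }
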
  5. scx: Test vtime dispatching from ops.select_cpu()

    Let's test that we properly stash enq flags by doing vtime dispatching
    from ops.select_cpu().
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 9, 2024
    59ad5bd
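
    On the BPF side, such a testcase boils down to something like this
    sketch, where VTIME_DSQ is an assumed user-created DSQ and
    scx_bpf_dispatch_vtime() is the vtime-ordered dispatch kfunc:

        s32 BPF_STRUCT_OPS(vtime_select_cpu, struct task_struct *p,
                           s32 prev_cpu, u64 wake_flags)
        {
                /*
                 * Dispatch directly from ops.select_cpu() with vtime
                 * ordering; the recorded enq_flags must survive until the
                 * task is actually enqueued on the DSQ.
                 */
                scx_bpf_dispatch_vtime(p, VTIME_DSQ, SCX_SLICE_DFL,
                                       p->scx.dsq_vtime, 0);
                return prev_cpu;
        }
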
  6. Merge pull request #115 from sched-ext/enq_flags

    Stash enq_flags when marking for direct dispatch
    Byte-Lab authored Jan 9, 2024
    7909b33

Commits on Jan 10, 2024

  1. scx: Implement scx selftests framework

    We want to make it as easy as possible both to run tests and to
    implement them. This means we ideally want a single test-runner binary
    that can run the testcases, while also making it trivial to add a
    testcase without having to update the runner itself.
    
    To accomplish this, this patch adds a new declarative mechanism for
    defining scx tests by implementing a struct scx_test object. Tests
    simply define such a struct and then register it with the test runner
    using the REGISTER_SCX_TEST macro (see the sketch below). The build
    system will automatically compile the testcase and add machinery to have
    it auto-registered into the runner binary. The runner binary then
    outputs test results in ktap [0] format so they can be consumed by CI
    systems.
    
    [0]: https://docs.kernel.org/dev-tools/ktap.html
    
    This patch simply implements the framework, adds a test_example.c file
    that illustrates how to add a testcase, and converts a few existing
    testcases to use the framework. If the framework is acceptable, we can
    convert the rest.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 10, 2024
    c64a804
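
    With the framework in place, a testcase reduces to roughly the following
    sketch, in the spirit of the test_example.c mentioned above (the exact
    struct scx_test fields may differ):

        #include "scx_test.h"

        static enum scx_test_status run(void *ctx)
        {
                /* exercise the scheduler here and report the outcome */
                return SCX_TEST_PASS;
        }

        static struct scx_test example = {
                .name = "example",
                .description = "Sanity check that the runner invokes tests",
                .run = run,
        };
        REGISTER_SCX_TEST(&example)
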
  2. Merge pull request #117 from sched-ext/refactor_tests

    scx: Implement scx selftests framework
    htejun authored Jan 10, 2024
    228db9d
  3. scx: Convert remaining testcases to use new framework

    Now that the framework has been merged, let's update the remaining
    testcases to use it.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 10, 2024
    d5061f9
  4. scx: Update ddsp testcases to check for error exits

    We're checking that we don't crash when we encounter these error
    conditions, but let's also test that we exit with the expected error
    code. The next patch will build this check into the test framework.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 10, 2024
    1fa672f
  5. scx: Copy scx_exit_kind to scx_test.h

    Rather than define the error value in each test, let's just define it in
    scx_test.h.
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 10, 2024
    8d7a79e
  6. Merge pull request #118 from sched-ext/refactor_tests

    scx: Convert remaining testcases to use new framework
    htejun authored Jan 10, 2024
    7592388
  7. scx: Narrow cpus_read_lock() critical section in scx_ops_enable()

    cpus_read_lock() is needed for two purposes in scx_ops_enable(). First, to
    keep CPUs stable between ops.init() and enabling of ops.cpu_on/offline().
    Second, to work around the locking order issue between scx_cgroup_rwsem and
    cpu_hotplug_lock caused by static_branch_*().
    
    Currently, scx_ops_enable() acquires cpus_read_lock() and holds it through
    most of ops enabling, covering both use cases. This makes it difficult to
    understand which lock is held where, and to resolve locking order issues
    among these system-wide locks.
    
    Let's separate out the two sections so that ops.init() and
    ops.cpu_on/offline() enabling are contained in their own critical section,
    and cpus_read_lock() is dropped and then reacquired for the second use
    case, as sketched below.
    
    Signed-off-by: Tejun Heo <[email protected]>
    htejun committed Jan 10, 2024
    4bbb07c
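
    Schematically, the enable path goes from one long critical section to
    two narrow ones. A sketch; the helper and static key names here are
    approximations, not the exact kernel symbols:

        /* section 1: CPUs stable from ops.init() until hotplug ops are live */
        cpus_read_lock();
        ret = scx_ops_init_and_enable_hotplug_ops();
        cpus_read_unlock();

        /* ... other enabling work proceeds with CPUs unlocked ... */

        /* section 2: static_branch_*() needs cpu_hotplug_lock held */
        cpus_read_lock();
        static_branch_enable_cpuslocked(&__scx_ops_enabled);
        cpus_read_unlock();
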
  8. scx: Reorder scx_fork_rwsem, cpu_hotplug_lock and scx_cgroup_rwsem

    scx_cgroup_rwsem and scx_fork_rwsem, respectively, are in the following
    locking dependency chain.
    
      cpu_hotplug_lock --> cgroup_threadgroup_rwsem --> scx_cgroup_rwsem
      scx_fork_rwsem --> pernet_ops_rwsem --> cpu_hotplug_lock
    
    And we need to flip a static_key, which requires CPUs to be stable. The
    only locking order which satisfies all three requirements is
    
      scx_fork_rwsem --> cpu_hotplug_lock --> scx_cgroup_rwsem
    
    Reorder locking in scx_ops_enable() and scx_ops_disable_workfn().
    
    Signed-off-by: Tejun Heo <[email protected]>
    htejun committed Jan 10, 2024
    1225a90
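
    In other words, both the enable and disable paths now take the three
    system-wide locks in this order (sketch):

        percpu_down_write(&scx_fork_rwsem);     /* scx_fork_rwsem first */
        cpus_read_lock();                       /* then cpu_hotplug_lock */
        percpu_down_write(&scx_cgroup_rwsem);   /* scx_cgroup_rwsem last */

        /* ... flip static keys, walk tasks and cgroups ... */

        percpu_up_write(&scx_cgroup_rwsem);
        cpus_read_unlock();
        percpu_up_write(&scx_fork_rwsem);
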

Commits on Jan 11, 2024

  1. Merge pull request #119 from sched-ext/htejun

    scx: Fix locking order
    Byte-Lab authored Jan 11, 2024
    4361d23
  2. scx: Sync from scx repo

    b32d73ae4e19 ("Merge pull request #82 from sched-ext/htejun")
    htejun committed Jan 11, 2024
    dfb1210
  3. Merge pull request #120 from sched-ext/htejun

    scx: Sync from scx repo
    htejun authored Jan 11, 2024
    6eb6c92

Commits on Jan 18, 2024

  1. ci: add github workflow to test the sched-ext kernel

    Add a GitHub action to test the sched-ext kernel with all the shipped
    schedulers.
    
    The test uses a similar approach to the scx workflow [1], using
    virtme-ng to run each scheduler inside a sched-ext enabled kernel for a
    certain amount of time (30 seconds), checking for potential stall, oops,
    or bug conditions.
    
    In this case we can use `virtme-ng --build` to build a kernel with the
    bare minimum support needed to run inside virtme-ng itself, instead of
    generating a fully featured kernel, which expedites the testing process.
    
    The mandatory .config options required by sched-ext are stored in
    `.github/workflows/sched-ext.config` and they are passed to virtme-ng
    via the `--config` option.
    
    The test itself is defined in `.github/workflows/run-schedulers`: the
    script looks for all the binaries in `tools/sched_ext/build/bin` and
    runs each one in a separate virtme-ng instance, to ensure that each run
    does not impact the others.
    
    [1] https://github.com/sched-ext/scx/blob/main/.github/workflows/build-scheds.yml
    
    Signed-off-by: Andrea Righi <[email protected]>
    Andrea Righi committed Jan 18, 2024
    74cdbb0
  2. Merge pull request #116 from arighi/github-ci

    ci: add github workflow to test the sched-ext kernel
    Andrea Righi authored Jan 18, 2024
    4e59c90
  3. scx: Make the pointer passed to .dispatch MAYBE_NULL

    The struct task_struct pointer passed to .dispatch can be NULL. However,
    we assume that pointers passed to struct_ops programs are always trusted
    (PTR_TRUSTED), meaning they are always valid (non-NULL). This means the
    verifier fails to properly validate such programs, which may cause a
    kernel crash when they run.
    
    This patch marks the second argument of .dispatch with
    PTR_MAYBE_NULL | PTR_TO_BTF_ID | PTR_TRUSTED in
    bpf_scx_is_valid_access(). The verifier will then ensure that programs
    always check whether the argument is NULL before reading the pointed-to
    memory (see the sketch below).
    ThinkerYzu1 committed Jan 18, 2024
    7d420b5
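
    On the BPF side, this means a .dispatch implementation must NULL-check
    its second argument before dereferencing it. A sketch; the
    SCX_TASK_QUEUED test mirrors the pattern used by the example schedulers:

        void BPF_STRUCT_OPS(sample_dispatch, s32 cpu, struct task_struct *prev)
        {
                /*
                 * prev is now PTR_MAYBE_NULL: the verifier rejects loads
                 * through it unless they're guarded by a NULL check.
                 */
                if (prev && (prev->scx.flags & SCX_TASK_QUEUED))
                        scx_bpf_dispatch(prev, SCX_DSQ_LOCAL,
                                         SCX_SLICE_DFL, 0);
        }
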
  4. selftests/scx: Check if MAYBE_NULL works for the 2nd argument of .dispatch
    
    Check that the verifier catches the invalid access when a .dispatch
    program doesn't check the 2nd argument before accessing the pointed-to
    memory. Also check that the verifier allows a program which checks the
    2nd argument before accessing the pointed-to memory.
    ThinkerYzu1 committed Jan 18, 2024
    b21b258
  5. scx: Add /sys/kernel/sched_ext interface

    /sys/kernel/debug/sched/ext is the current interface file which can be used
    to determine the current state of scx. This is problematic in that it's
    dependent on CONFIG_SCHED_DEBUG. On kernels which don't have the option
    enabled, there is no easy way to tell whether scx is currently in use.
    
    Let's add a new kobject-based interface which is created under
    /sys/kernel/sched_ext. The directory contains:
    
    - System-level interface files. As it's now a non-debug interface, confine
      the exposed files to "state", "switch_all" and "nr_rejected".
    
    - Per-scheduler directory which currently only contains "ops". The
      directory is always named "root" for now, in preparation for a future
      where multiple schedulers can be loaded in a system. Loading and
      unloading a scheduler also generates a uevent with an SCXOPS attribute.
    
    Signed-off-by: Tejun Heo <[email protected]>
    htejun committed Jan 18, 2024
    e7a7781
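
    The kernel-side plumbing follows the usual kobject pattern. A generic
    sketch, not the exact patch; scx_ops_enable_state_str() is an
    illustrative stand-in for however the state string is produced:

        static ssize_t state_show(struct kobject *kobj,
                                  struct kobj_attribute *ka, char *buf)
        {
                return sysfs_emit(buf, "%s\n", scx_ops_enable_state_str());
        }
        static struct kobj_attribute state_attr = __ATTR_RO(state);

        /* during init */
        struct kobject *scx_kobj = kobject_create_and_add("sched_ext",
                                                          kernel_kobj);
        if (!scx_kobj || sysfs_create_file(scx_kobj, &state_attr.attr))
                pr_warn("sched_ext: failed to create sysfs interface\n");
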
  6. scx: Replace /sys/kernel/debug/sched/ext with tools/sched_ext/scx_show_state.py
    
    Now that the state is visible through /sys/kernel/sched_ext,
    /sys/kernel/debug/sched/ext isn't needed to determine the current state of
    scx. However, /sys/kernel/sched_ext shows only a subset of the information
    that was available in the debug interface, and it can be useful to have
    access to the rest for debugging. Remove /sys/kernel/debug/sched/ext and
    add the drgn script, tools/sched_ext/scx_show_state.py, which shows the
    same information.
    
    Signed-off-by: Tejun Heo <[email protected]>
    htejun committed Jan 18, 2024
    a1392ed
  7. Merge pull request #122 from sched-ext/htejun

    scx: Replace /sys/kernel/debug/sched/ext with /sys/kernel/sched_ext
    Byte-Lab authored Jan 18, 2024
    cdcdf18

Commits on Jan 19, 2024

  1. Merge pull request #121 from ThinkerYzu/maybe_null

    Make the pointer passed to .dispatch MAYBE_NULL
    htejun authored Jan 19, 2024
    b1a0f3e

Commits on Jan 20, 2024

  1. scx: Fix a couple of follow-ups to recent struct_ops changes

    - Fix a few typos and some comment formatting in ext.c
    - Generalize the rule for compiling a "fail" testcase variant in
      selftests
    - Update copyrights to 2024
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 20, 2024
    a141212
  2. Merge pull request #123 from sched-ext/structops_follow_ups

    scx: Fix a couple of follow-ups to recent struct_ops changes
    Byte-Lab authored Jan 20, 2024
    30b6fa8

Commits on Jan 23, 2024

  1. Merge remote-tracking branch 'sched-ext/sched_ext' into scx_merge

    Conflicts:
    	include/linux/sched.h
    	kernel/bpf/verifier.c
    	kernel/cgroup/cgroup.c
    	kernel/sched/core.c
    
    Also had to add CFI stubs and kfunc annotations to ext.c, as well as
    remove the use of strlcpy().
    
    Signed-off-by: David Vernet <[email protected]>
    Byte-Lab committed Jan 23, 2024
    50f8db4