Exploring sched_ext: BPF-Powered CPU Schedulers in the Linux Kernel
The Linux kernel's scheduler is one of its most critical components, determining which tasks run on which CPUs and for how long.
But in today's complex computing environments, the default Linux scheduler doesn't always provide optimal performance for specialized workloads. This is where sched_ext (SCX) comes in: a framework that allows custom CPU schedulers to be implemented in BPF (Berkeley Packet Filter) and loaded dynamically. In this technical analysis, I'll examine the architecture and implementation of SCX schedulers, with a particular focus on scx_rustland and scx_bpfland, and compare them to traditional schedulers.
What is sched_ext?
Linux kernel 6.12 introduced sched_ext ("extensible scheduler") as a new scheduling class that allows pluggable CPU schedulers via eBPF. Because schedulers written in BPF can be implemented and loaded dynamically, there is no need for the kernel recompilation and rebooting that traditional scheduler modifications require. sched_ext defines a set of hook points (operations) that an eBPF-based scheduler can implement (such as picking the next task, enqueuing/dequeuing tasks, etc.), built on three core mechanisms (a minimal registration sketch follows the list):
- BPF struct_ops: Used to define a scheduling policy through callback functions
- Dispatch queues (DSQs): Used for task queuing and execution
- Safety mechanisms: Prevent system crashes from buggy schedulers
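The callbacks that make up a policy are registered through a BPF struct_ops map. The following minimal sketch follows the pattern used by the scx example schedulers (SCX_OPS_DEFINE is a helper macro from the scx tree's common BPF headers); the minimal_* names are placeholders, and each referenced callback would be defined elsewhere in the same BPF object with BPF_STRUCT_OPS():
/*
 * Minimal sketch of registering a sched_ext policy via BPF struct_ops,
 * following the pattern of the scx example schedulers.
 */
SCX_OPS_DEFINE(minimal_ops,
	       .select_cpu	= (void *)minimal_select_cpu,	/* pick a CPU on wakeup */
	       .enqueue		= (void *)minimal_enqueue,	/* queue a runnable task */
	       .dispatch	= (void *)minimal_dispatch,	/* feed tasks to a CPU that needs work */
	       .init		= (void *)minimal_init,		/* create DSQs, set up global state */
	       .exit		= (void *)minimal_exit,		/* cleanup and error reporting */
	       .name		= "minimal");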
The scx project is a collection of sched_ext schedulers and tools. Schedulers in scx range from simple demonstrative policies to production-oriented ones tailored for specific use cases:
* scx_simple: uses a basic FIFO or least-run-time policy
* scx_nest: places tasks on high-frequency cores
* scx_lavd: is optimized for gaming workloads
* scx_rusty: partitions CPUs by last-level cache to improve locality
* scx_bpfland: threads that block frequently (i.e. perform many voluntary context switches per second) are assumed to be interactive, and thus prioritized
Each scheduler in SCX implements the required sched_ext hooks (via eBPF programs) and can be selected at runtime. The default Linux scheduler can always be restored if needed.
End-to-End Task Lifecycle in sched_ext
(Task lifecycle diagram source: https://www.ebpf.top/post/bpf_sched_ext_dive_into/)
I'll provide a deep dive into the end-to-end task flow in sched_ext, specifically examining how tasks move through the scheduling cycle. This will cover:
- The reception of a task
- How it is enqueued and dequeued
- The scheduling decisions made
Task Entry into sched_ext
- Once a BPF scheduler is loaded, all tasks with policy SCHED_EXT are switched to the new sched_ext scheduling class.
- Once a task is under sched_ext management (either by being created with SCHED_EXT or through a global switch), it is integrated into the BPF scheduler's queues.
- The kernel's sched_ext core calls the BPF scheduler's ops.init_task() callback for each task joining sched_ext, giving the BPF code a chance to initialize per-task state (e.g. tracking virtual runtime).
- At this point, the task is "received" by sched_ext – it's now subject to the BPF scheduling logic rather than the default CFS rules.
Enqueuing and Dequeuing Mechanisms in sched_ext
- Dispatch Queues (DSQs): sched_ext uses dispatch queues (DSQs) as intermediate run queues between the BPF logic and actual CPU execution. By default, there is one global FIFO queue (SCX_DSQ_GLOBAL) and one local DSQ per CPU (SCX_DSQ_LOCAL). Tasks are dispatched into DSQs by the BPF code, and CPUs consume from DSQs to get their next runnable task.
- Enqueueing a task: When a task becomes runnable, the sched_ext core invokes the BPF scheduler's ops.select_cpu() followed by ops.enqueue() callbacks. The enqueue callback may call scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags) to place the task directly into the target CPU's local queue, or it may dispatch to the global queue (SCX_DSQ_GLOBAL) or a custom DSQ (custom DSQs are created by the scheduler itself, as shown in the snippet below).
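Custom DSQs have to be created by the scheduler itself, typically from its init callback. The snippet below mirrors what scx_simple does for its shared queue (SHARED_DSQ is a scheduler-chosen numeric id; the second argument is the NUMA node, -1 meaning no preference):
#define SHARED_DSQ 0	/* scheduler-defined DSQ id */

s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
{
	/* Create one custom dispatch queue shared by all CPUs. */
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}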
Scheduling Logic and Decision-Making Process
The actual decision of which task runs next on each CPU is made by the interplay of the ops.dispatch() callback and the DSQ consumption logic, in conjunction with the dispatch queues. The scheduling cycle on a CPU, from a high-level perspective, is as follows:
- Selecting a target CPU on wakeup – select_cpu(): this callback lets the scheduler choose an optimal CPU for the task before it is enqueued. For example, a scheduler might want to put a waking task on the same CPU it ran on last (for cache affinity) or find an idle CPU to improve latency (see the sketch after this list).
- Enqueue decision – enqueue(): as described above, the BPF scheduler's enqueue() either dispatches the task to a DSQ (local, global, or custom) or holds it in an internal queue.
- CPU picks next task – consumption and dispatch: when a CPU needs something to run, the scheduler core asks sched_ext for the next task. The CPU first checks its local DSQ for waiting tasks; if one is found, it is dequeued and runs immediately. If the local DSQ is empty, the CPU can try to consume from a global shared queue.
- Dispatching tasks from the BPF scheduler – dispatch(): if, after consuming from the global queue, the CPU still has no task, the core invokes the BPF scheduler's ops.dispatch() callback, which can call scx_bpf_dispatch() to dispatch tasks to the requesting CPU's local DSQ, the global DSQ, or a custom DSQ.
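To make step 1 concrete, here is a select_cpu() callback in the style of scx_simple. It relies on scx_bpf_select_cpu_dfl(), the built-in default idle-CPU selection helper; when the chosen CPU is idle, the task is placed straight into that CPU's local DSQ, in which case the enqueue() callback is skipped. This is a sketch of the common pattern rather than a complete scheduler:
s32 BPF_STRUCT_OPS(minimal_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu;

	/* Default idle-CPU selection provided by the sched_ext core. */
	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
	if (is_idle)
		/* CPU is idle: dispatch directly to its local DSQ with the
		 * default time slice, bypassing enqueue(). */
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}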
Here's a summary of key functions used in SCX scheduling:
Function | Purpose | Called By |
---|---|---|
select_cpu() | Choose target CPU hint | Kernel on task wakeup |
enqueue() | Place task in queue or DSQ | Kernel after CPU selection |
dispatch() | Find next task to run on CPU | Kernel when CPU needs work |
scx_bpf_dsq_insert() | Add task to FIFO DSQ | BPF scheduler |
scx_bpf_dsq_insert_vtime() | Add task to priority DSQ | BPF scheduler |
scx_bpf_dsq_move_to_local() | Move task from DSQ to CPU's local DSQ | BPF in dispatch() |
scx_bpf_kick_cpu() | Wake up idle CPU | BPF scheduler |
running() | Track task starting on CPU | Kernel when task runs |
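The running() callback listed above (together with its counterpart stopping()) is where a vtime-based policy does its bookkeeping. The snippet below follows the scx_simple example discussed later in this post: a global vtime_now tracks scheduler-wide virtual time, and each task is charged for the CPU time it actually consumed, scaled by the inverse of its weight (vtime_now and fifo_sched are globals defined in that scheduler):
void BPF_STRUCT_OPS(simple_running, struct task_struct *p)
{
	if (fifo_sched)
		return;
	/* Global virtual time advances as tasks start executing. */
	if (time_before(vtime_now, p->scx.dsq_vtime))
		vtime_now = p->scx.dsq_vtime;
}

void BPF_STRUCT_OPS(simple_stopping, struct task_struct *p, bool runnable)
{
	if (fifo_sched)
		return;
	/* Charge the consumed part of the slice, scaled by the inverse of
	 * the task's weight: heavier tasks accumulate vtime more slowly. */
	p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
}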
Build and run sched_ext
I use the workflow outlined in a blog post by Andrea Righi, which is a great way to test sched_ext without modifying the existing system.
Run the virtual environment and test a scheduler
First, start the virtual environment with the sched_ext kernel:
vng -vr ../linux

Once inside the virtual environment, you can run one of the schedulers with the helper function:
scx ./build/scheds/c/scx_simple
C-Based Schedulers
scx_simple
scx_simple is a minimal scheduler that functions either as a global weighted vtime scheduler (similar to the Completely Fair Scheduler) or as a FIFO scheduler. It's designed primarily to demonstrate basic scheduling concepts.
In the code below, from scx_simple.bpf.c, the .enqueue callback handles a task that needs to be scheduled:
1. If the scheduler is in FIFO mode (fifo_sched == true), it simply inserts the task into the shared dispatch queue without any priority sorting.
2. If in normal (weighted vtime) mode, it retrieves the task's current virtual time (p->scx.dsq_vtime), adjusts it so that no task gains more than one slice worth of idle credit, and inserts the task into the shared queue with that virtual time as the key:
void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags) {
stat_inc(1); // increment global queue count
if (fifo_sched) {
// FIFO scheduling: enqueue to shared queue with default slice
scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
} else {
u64 vtime = p->scx.dsq_vtime;
// Cap the vtime lag to one slice to prevent too much credit
if (time_before(vtime, vtime_now - SCX_SLICE_DFL))
vtime = vtime_now - SCX_SLICE_DFL;
// Enqueue with a specific virtual time for fairness
scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime, enq_flags);
}
}
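The dispatch side of scx_simple is essentially a one-liner: when a CPU needs work, it moves the next task from the shared DSQ onto that CPU's local DSQ (FIFO order in FIFO mode, lowest vtime first in weighted mode):
void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Pull the next task from the shared DSQ into this CPU's local DSQ. */
	scx_bpf_dsq_move_to_local(SHARED_DSQ);
}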
scx_rustland
scx_rustland is designed to prioritize interactive workloads over CPU-intensive background workloads.
In practical terms, this likely means it keeps track of each task’s recent CPU usage or blocking behavior and assigns higher priority (sooner scheduling, more CPU time) to tasks that have shorter CPU bursts.
scx_rustland overview
scx_rustland uses a hybrid approach that splits functionality between kernel space and user space:
User Space: Complex scheduling logic in Rust
- Task prioritization
- CPU selection algorithms
- Complex data structures
|
| Ring buffer communication
v
Kernel Space: Minimal BPF component
- Task state tracking
- Dispatch queue management
- Safety mechanisms
BPF Dispatcher (scx_rustland_core): The BPF part of scx_rustland implements minimal logic required to interface with the kernel. Its enqueue hook, for example, does not directly decide a run queue as a normal scheduler would. Instead, it may place the task into a BPF queue map that represents “tasks waiting for user-space decision.”
User-Space Scheduler (Rust): On the user side, the Rust scheduler process uses libraries (like libbpf or Aya in Rust) to interact with the eBPF program. It attaches to the maps exposed by BPF; typically it uses a ring buffer to receive each task that needs a scheduling decision, then runs its algorithm to decide where and when that task should run.
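The sketch below shows what the two communication channels could look like in the BPF object. The map names queued and dispatched follow scx_rustland_core, but the sizes are illustrative: a kernel-to-user ring buffer carries tasks that need a decision, and a user-to-kernel ring buffer carries the decisions back.
/* Sketch: kernel -> user, runnable tasks awaiting a scheduling decision. */
struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} queued SEC(".maps");

/* Sketch: user -> kernel, dispatch decisions from the Rust scheduler. */
struct {
	__uint(type, BPF_MAP_TYPE_USER_RINGBUF);
	__uint(max_entries, 256 * 1024);
} dispatched SEC(".maps");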
scx_rustland Code Structure
scheds/rust/scx_rustland/
├── Cargo.toml # Rust package definition
├── src/
│ ├── main.rs # Main userspace implementation
│ ├── scheduler.rs # Scheduler logic
│ ├── stats.rs # Statistics collection
│ ├── topology.rs # CPU topology handling
│ └── bpf/ # BPF skeleton code
└── src/bpf/
├── main.bpf.c # Minimal BPF implementation
└── intf.h # Interface definitions
scx_rustland has two main components:
- BPF Component (kernel space):
void BPF_STRUCT_OPS(rustland_enqueue, struct task_struct *p, u64 enq_flags)
{
// Skip scheduling the scheduler itself
if (is_usersched_task(p))
return;
// Pass task information to userspace for scheduling decision
struct queued_task_ctx *task = bpf_ringbuf_reserve(&queued, sizeof(*task), 0);
if (!task) {
// Fallback: direct dispatch if userspace communication fails
scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, 0, enq_flags);
return;
}
// Send task info to userspace
populate_task_info(task, p, enq_flags);
bpf_ringbuf_submit(task, 0);
}
- Rust Component (user space):
impl Scheduler {
fn dispatch_next_task(&mut self) {
if let Some(task) = self.priority_queue.pop_min() {
// Complex scheduling logic in Rust
let target_cpu = self.find_best_cpu_for_task(&task);
let time_slice = self.calculate_time_slice(&task);
// Send decision back to kernel
let mut dispatched_task = DispatchedTask::new(&task);
dispatched_task.cpu = target_cpu;
dispatched_task.slice_ns = time_slice;
self.bpf.dispatch_task(&dispatched_task).unwrap();
}
}
fn find_best_cpu_for_task(&self, task: &QueuedTask) -> i32 {
// Can use complex data structures and algorithms here
// without BPF verifier constraints
// ...
}
}
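The BPF side that applies these user-space decisions is not shown above. Below is a hypothetical sketch of how it can look: the dispatch callback drains a BPF_MAP_TYPE_USER_RINGBUF (the dispatched map from the earlier sketch) and inserts each task where user space decided. The record layout and the handle_dispatched_task name are illustrative, not scx_rustland's actual code.
/* Hypothetical sketch: record layout and callback names are illustrative. */
struct dispatched_task_ctx {
	s32 pid;	/* task to run */
	s32 cpu;	/* CPU chosen by the user-space scheduler */
	u64 slice_ns;	/* time slice chosen by the user-space scheduler */
};

static long handle_dispatched_task(struct bpf_dynptr *dynptr, void *ctx)
{
	struct dispatched_task_ctx d;
	struct task_struct *p;

	/* Copy one decision record out of the user ring buffer. */
	if (bpf_dynptr_read(&d, sizeof(d), dynptr, 0, 0))
		return 1;	/* malformed record: stop draining */

	p = bpf_task_from_pid(d.pid);
	if (!p)
		return 0;	/* task already exited: skip */

	/* Honor the user-space decision: target CPU and time slice. */
	scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | d.cpu, d.slice_ns, 0);
	bpf_task_release(p);
	return 0;
}

void BPF_STRUCT_OPS(rustland_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Apply all pending decisions queued by the user-space scheduler. */
	bpf_user_ringbuf_drain(&dispatched, handle_dispatched_task, NULL, 0);
}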
Advantages
- Better abstraction capabilities
- Rich error handling
- More readable algorithms
- Unit testing support
- Modern language features
Scheduling Logic
The scheduling logic in scx_rustland is split across kernel (BPF) and user space:
// Main scheduler object
struct Scheduler<'a> {
bpf: BpfScheduler<'a>, // BPF connector
stats_server: StatsServer<(), Metrics>, // statistics
task_pool: TaskTree, // tasks ordered by deadline
task_map: TaskInfoMap, // map pids to the corresponding task information
min_vruntime: u64, // Keep track of the minimum vruntime across all tasks
init_page_faults: u64, // Initial page faults counter
slice_ns: u64, // Default time slice (in ns)
slice_ns_min: u64, // Minimum time slice (in ns)
}
In user space, scx_rustland maintains its own runqueue data structure – specifically a BTreeSet of tasks ordered by weighted vruntime (essentially mimicking CFS in user space). It also monitors each task's behavior to detect interactive tasks: if a task consistently yields the CPU voluntarily (releasing the CPU before its time slice is fully used), it's considered interactive.
scx_bpfland
scx_bpfland implements its scheduling logic almost entirely in BPF code that runs in kernel space. It follows a more traditional approach where all scheduling decisions happen within the kernel:
scx_bpfland overview
User Space: Minimal (monitoring only)
|
| (Minimal interaction)
v
Kernel Space: Full scheduler implementation in BPF
- Task selection
- CPU assignment
- Priority decisions
- All scheduling algorithms
scx_bpfland Code Structure
scheds/c/scx_bpfland.bpf.c # Main BPF scheduler implementation
scheds/c/scx_bpfland.c # Userspace loader and monitoring
scheds/include/scx_common.bpf.h # Common BPF utilities
scx_bpfland implements all scheduling callbacks directly in BPF:
void BPF_STRUCT_OPS(bpfland_enqueue, struct task_struct *p, u64 enq_flags)
{
// Implementation directly in BPF
struct task_ctx *task_ctx;
u64 vruntime = 0;
// Skip enqueuing the scheduler itself
if (p->pid == bpfland_pid)
return;
// Get or create task context
task_ctx = get_task_ctx(p);
if (!task_ctx)
return;
// Calculate vruntime based on scheduling policy
vruntime = calc_vruntime(p, task_ctx, enq_flags);
// Direct enqueue to DSQ with calculated vruntime
scx_bpf_dsq_insert_vtime(p, GLOBAL_DSQ, task_slice(p), vruntime, enq_flags);
}
Scheduling Logic
scx_bpfland's internal logic is very similar in spirit to scx_rustland's algorithm, but it executes entirely within the BPF program (in the kernel). It effectively merges the two-tier logic into one.
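For illustration, a stripped-down version of the corresponding dispatch path could look like the snippet below; the real scx_bpfland dispatch logic is more elaborate (per-CPU kthread handling, interactive-task queues, etc.). GLOBAL_DSQ matches the queue used in the enqueue snippet above.
void BPF_STRUCT_OPS(bpfland_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Move the task with the lowest vruntime from the shared
	 * priority-ordered DSQ onto this CPU's local DSQ. */
	scx_bpf_dsq_move_to_local(GLOBAL_DSQ);
}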
Advantages Over Traditional Schedulers
The sched_ext + eBPF approach (exemplified by SCX schedulers) brings several key advantages over traditional in-kernel schedulers:
* Dynamic Policy Changes: New schedulers can be loaded and unloaded at runtime. There’s no need to patch or reboot the kernel to try a different scheduling policy. This is invaluable for testing and tuning in production environments – administrators can switch between, say, a latency-focused scheduler during interactive sessions and a throughput-focused one for batch processing periods, with a simple command.
* Workload-Specific Optimizations: The biggest advantage is the ability to tailor scheduling to a specific workload or environment. Instead of one-size-fits-all, users can pick or write a scheduler that, for example, never migrates certain tasks off a preferred CPU, or that implements strict priority levels, or that optimizes for particular patterns (like the “block frequently == interactive” rule). SCX even allows combining policies, as shown by scx_layered which applies different scheduling rules to different groups of processes.
Future: Integrating SCX_GOLAND with Free5GC for Data Plane Optimization
SCX_GOLAND
Architecture Overview
The scx_goland_core project represents another approach to implementing a user-space scheduler for Linux using the sched_ext framework. This project follows the architectural patterns established by scx_rustland, adapting them to the Go ecosystem.
Go Implementation
The Go implementation consists of several key packages and types:
Core Package:
- Sched: Main scheduler struct that holds references to BPF maps and methods for communication
- QueuedTask: Represents a task queued from the kernel
- DispatchedTask: Represents a task to be dispatched back to the kernel
- BssData: Data structure for accessing BPF's BSS section
Constraints and Limitations
Memory Management:
In scx_goland_core, tasks from the kernel's ring buffers are received as byte slices and then decoded (using binary.Read) into Go structs. This decoding and copying process is relatively expensive compared to Rust's approach, where you can often work in a more zero-copy manner. In contrast, scx_rustland benefits from Rust's zero-cost abstractions, minimal runtime overhead, and more direct access to low-level memory without a garbage collector.
In Go:
func (s *Sched) DequeueTask(task *QueuedTask) {
select {
case t := <-s.queue:
buff := bytes.NewBuffer(t)
err := binary.Read(buff, binary.LittleEndian, task)
if err != nil {
task.Pid = -1
return
}
err = s.SubNrQueued()
if err != nil {
task.Pid = -1
log.Printf("SubNrQueued err: %v", err)
return
}
return
default:
task.Pid = -1
return
}
}
In Rust:
fn dequeue_task(&mut self) -> Result<Option<QueuedTask>, i32> {
match self.queued.consume_raw() {
0 => {
self.skel.maps.bss_data.nr_queued = 0;
Ok(None)
}
LIBBPF_STOP => {
// A valid task is received, convert data to a proper task struct.
let task = unsafe { EnqueuedMessage::from_bytes(&BUF.0).to_queued_task() };
let _ = self.skel.maps.bss_data.nr_queued.saturating_sub(1);
Ok(Some(task))
}
res if res < 0 => Err(res),
res => panic!(
"Unexpected return value from libbpf-rs::consume_raw(): {}",
res
),
}
}
Garbage Collection:
Go is a managed language with a garbage collector and its own scheduler for goroutines. Even though Go is very efficient, those additional layers (GC, goroutine scheduling, channel operations) add overhead compared to Rust’s zero‑cost abstractions where most things are determined at compile time.
SCX_GOLAND Summary
It shows how Go's concurrency model and ease of use can be applied to system programming tasks that were traditionally the domain of C or Rust.
While it's not as optimized or mature as scx_rustland, it provides a valuable alternative for developers more comfortable with Go. It also serves as a good example of how the sched_ext framework enables experimentation with different scheduling policies and implementations without requiring deep kernel expertise.
Data Plane and CPU Scheduling Challenges
Free5GC is an open-source implementation of the 5G Core network, written largely in Go. One of its components is the UPF (User Plane Function), which handles the data plane – i.e., packet forwarding for user traffic (GTP-U tunneling, routing packets between RAN and data network). The UPF and other network functions in Free5GC are user-space processes that can be CPU-intensive, especially under high load (many subscribers or high packet rates). Ensuring low latency and high throughput in the data plane is critical.
Future Design of SCX_GOLAND Scheduler
Identifying Target Tasks: The scheduler must reliably identify the Free5GC data plane threads/processes. This could be done by process name matching, by cgroup (if Free5GC runs in a container or a specific slice, the scheduler can detect that cgroup and apply a policy), or by explicit configuration (the operator could pass PIDs or process names to scx_goland at startup to tell it which tasks to prioritize). A sketch of the name-matching approach is shown after this list.
Prioritizing specific goroutines: Free5GC could be run in a mode where the UPF uses dedicated pinned threads for its packet RX/TX loops (ensuring those OS threads only do packet work). Then scx_goland can target those threads precisely. Keep in mind, however, that Free5GC's internal scheduling of goroutines onto OS threads is not directly visible to the OS scheduler.
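As a rough illustration of the name-matching idea on the BPF side of such a scheduler, the sketch below boosts tasks whose comm begins with a configured prefix. Everything here is a placeholder: the "upf" prefix, the DSQ ids, and the boosted slice would come from configuration in a real implementation, and both DSQs are assumed to be created in ops.init() with scx_bpf_create_dsq().
#define PRIO_DSQ	1	/* data plane tasks (placeholder id) */
#define NORMAL_DSQ	2	/* everything else (placeholder id) */

static bool is_dataplane_task(const struct task_struct *p)
{
	/* Compare the first 3 bytes of the task's comm against "upf". */
	return bpf_strncmp(p->comm, 3, "upf") == 0;
}

void BPF_STRUCT_OPS(goland_enqueue, struct task_struct *p, u64 enq_flags)
{
	if (is_dataplane_task(p))
		/* Data plane threads: separate queue and a longer slice. */
		scx_bpf_dsq_insert(p, PRIO_DSQ, SCX_SLICE_DFL * 2, enq_flags);
	else
		scx_bpf_dsq_insert(p, NORMAL_DSQ, SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(goland_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Always drain the data plane queue before the normal one. */
	if (scx_bpf_dsq_move_to_local(PRIO_DSQ))
		return;
	scx_bpf_dsq_move_to_local(NORMAL_DSQ);
}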
Conclusion
The advantages of SCX (sched_ext) – runtime pluggability, rapid development, workload-specific optimizations, and crash resilience – make it very attractive for specialized domains. One such domain is the 5G core network. We explored scx_goland, a Go-based scheduler concept, illustrating how a custom scheduler could be integrated with Free5GC to optimize its performance.
References
- Re-implementing my Linux Rust scheduler in eBPF
- 内核调度客制化利器：SCHED_EXT (SCHED_EXT: a powerful tool for customizing the kernel scheduler)
- BPF 赋能调度器：万字详解 sched_ext 实现机制与工作流程 (BPF-powered scheduling: a detailed look at the sched_ext mechanism and workflow)
- Pluggable CPU schedulers
- sched_ext: scheduler architecture and interfaces
- eBPF 隨筆（七）：sched_ext (eBPF notes, part 7: sched_ext)
- scx_goland_core
About
Hello, I'm William Lin. I'd like to share my excitement about being a member of the free5gc project, which is a part of the Linux Foundation. I'm always eager to discuss any aspects of core network development or related technologies.
Connect with Me
- GitHub: williamlin0518
- LinkedIn: Cheng Wei Lin