Exploring sched_ext: BPF-Powered CPU Schedulers in the Linux Kernel
The Linux kernel's scheduler is one of its most critical components, determining which tasks run on which CPUs and for how long.
But in today's complex computing environments, the default Linux scheduler doesn't always provide optimal performance for specialized workloads. This is where sched_ext (SCX) comes in: a framework that allows custom CPU schedulers to be implemented in BPF (Berkeley Packet Filter) and loaded dynamically. In this technical analysis, I'll examine the architecture and implementation of SCX schedulers, with a particular focus on scx_rustland and scx_bpfland, and compare them to traditional schedulers.
What is sched_ext?
Linux kernel 6.12 introduced sched_ext ("extensible scheduler") as a new scheduling class that allows pluggable CPU schedulers via eBPF. Because schedulers written in BPF can be implemented and loaded dynamically, there is no need for the kernel recompilation and rebooting that traditional scheduler modifications require. sched_ext defines a set of hook points (operations) that an eBPF-based scheduler can implement (such as picking the next task, enqueuing/dequeuing tasks, etc.), built on three core mechanisms (a minimal registration sketch follows the list):
- BPF struct_ops: Used to define a scheduling policy through callback functions
- Dispatch queues (DSQs): Used for task queuing and execution
- Safety mechanisms: Prevent system crashes from buggy schedulers
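The callbacks that make up a policy are registered through a BPF struct_ops map. The following minimal sketch follows the pattern used by the scx example schedulers (SCX_OPS_DEFINE is a helper macro from the scx tree's common BPF headers); the minimal_* names are placeholders, and each referenced callback would be defined elsewhere in the same BPF object with BPF_STRUCT_OPS():
/*
 * Minimal sketch of registering a sched_ext policy via BPF struct_ops,
 * following the pattern of the scx example schedulers.
 */
SCX_OPS_DEFINE(minimal_ops,
	       .select_cpu	= (void *)minimal_select_cpu,	/* pick a CPU on wakeup */
	       .enqueue		= (void *)minimal_enqueue,	/* queue a runnable task */
	       .dispatch	= (void *)minimal_dispatch,	/* feed tasks to a CPU that needs work */
	       .init		= (void *)minimal_init,		/* create DSQs, set up global state */
	       .exit		= (void *)minimal_exit,		/* cleanup and error reporting */
	       .name		= "minimal");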
The scx project is a collection of sched_ext schedulers and tools. Schedulers in scx range from simple demonstrative policies to production-oriented ones tailored for specific use cases:
* scx_simple: uses a basic FIFO or least-run-time policy
* scx_nest: places tasks on high-frequency cores
* scx_lavd: is optimized for gaming workloads
* scx_rusty: partitions CPUs by last-level cache to improve locality
* scx_bpfland: threads that block frequently (i.e. perform many voluntary context switches per second) are assumed to be interactive, and thus prioritized
Each scheduler in SCX implements the required sched_ext hooks (via eBPF programs) and can be selected at runtime. The default Linux scheduler can always be restored if needed.
End-to-End Task Lifecycle in sched_ext
(Task lifecycle diagram source: https://www.ebpf.top/post/bpf_sched_ext_dive_into/)
I'll provide a deep dive into the end-to-end task flow in sched_ext, specifically examining how tasks move through the scheduling cycle. This will cover:
- The reception of a task
- How it is enqueued and dequeued
- The scheduling decisions made
Task Entry into sched_ext
- Once a BPF scheduler is loaded, all tasks with policy SCHED_EXT are switched to the new sched_ext scheduling class.
- Once a task is under sched_ext management (either by being created with SCHED_EXT or through a global switch), it is integrated into the BPF scheduler's queues.
- The kernel's sched_ext core calls the BPF scheduler's ops.init_task() callback for each task joining sched_ext, giving the BPF code a chance to initialize per-task state (e.g. tracking virtual runtime).
- At this point, the task is "received" by sched_ext – it's now subject to the BPF scheduling logic rather than the default CFS rules.
Enqueuing and Dequeuing Mechanisms in sched_ext
- Dispatch Queues (DSQs): sched_ext uses dispatch queues (DSQs) as intermediate run queues between the BPF logic and actual CPU execution. By default, there is one global FIFO queue (SCX_DSQ_GLOBAL) and one local DSQ per CPU (SCX_DSQ_LOCAL). Tasks are dispatched into DSQs by the BPF code, and CPUs consume from DSQs to get their next runnable task.
- Enqueueing a task: When a task becomes runnable, the sched_ext core invokes the BPF scheduler's ops.select_cpu() followed by ops.enqueue() callbacks. The enqueue callback may call scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags) to place the task directly into the target CPU's local queue, or it may dispatch to the global queue (SCX_DSQ_GLOBAL) or a custom DSQ (custom DSQs are created by the scheduler itself, as shown in the snippet below).
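Custom DSQs have to be created by the scheduler itself, typically from its init callback. The snippet below mirrors what scx_simple does for its shared queue (SHARED_DSQ is a scheduler-chosen numeric id; the second argument is the NUMA node, -1 meaning no preference):
#define SHARED_DSQ 0	/* scheduler-defined DSQ id */

s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
{
	/* Create one custom dispatch queue shared by all CPUs. */
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}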
Scheduling Logic and Decision-Making Process
The actual decision of which task runs next on each CPU is made by the interplay of the ops.dispatch() callback and the DSQ consumption logic, in conjunction with the dispatch queues. The scheduling cycle on a CPU, from a high-level perspective, is as follows:
- Selecting a target CPU on wakeup – select_cpu(): this callback lets the scheduler choose an optimal CPU for the task before it is enqueued. For example, a scheduler might want to put a waking task on the same CPU it ran on last (for cache affinity) or find an idle CPU to improve latency (see the sketch after this list).
- Enqueue decision – enqueue(): as described above, the BPF scheduler's enqueue() either dispatches the task to a DSQ (local, global, or custom) or holds it in an internal queue.
- CPU picks next task – consumption and dispatch: when a CPU needs something to run, the scheduler core asks sched_ext for the next task. The CPU first checks its local DSQ for waiting tasks; if one is found, it is dequeued and runs immediately. If the local DSQ is empty, the CPU can try to consume from a global shared queue.
- Dispatching tasks from the BPF scheduler – dispatch(): if, after consuming from the global queue, the CPU still has no task, the core invokes the BPF scheduler's ops.dispatch() callback, which can call scx_bpf_dispatch() to dispatch tasks to the requesting CPU's local DSQ, the global DSQ, or a custom DSQ.
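To make step 1 concrete, here is a select_cpu() callback in the style of scx_simple. It relies on scx_bpf_select_cpu_dfl(), the built-in default idle-CPU selection helper; when the chosen CPU is idle, the task is placed straight into that CPU's local DSQ, in which case the enqueue() callback is skipped. This is a sketch of the common pattern rather than a complete scheduler:
s32 BPF_STRUCT_OPS(minimal_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu;

	/* Default idle-CPU selection provided by the sched_ext core. */
	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
	if (is_idle)
		/* CPU is idle: dispatch directly to its local DSQ with the
		 * default time slice, bypassing enqueue(). */
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}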
Here's a summary of key functions used in SCX scheduling:
Function | Purpose | Called By |
---|---|---|
select_cpu() | Choose target CPU hint | Kernel on task wakeup |
enqueue() | Place task in queue or DSQ | Kernel after CPU selection |
dispatch() | Find next task to run on CPU | Kernel when CPU needs work |
scx_bpf_dsq_insert() | Add task to FIFO DSQ | BPF scheduler |
scx_bpf_dsq_insert_vtime() | Add task to priority DSQ | BPF scheduler |
scx_bpf_dsq_move_to_local() | Move task from DSQ to CPU's local DSQ | BPF in dispatch() |
scx_bpf_kick_cpu() | Wake up idle CPU | BPF scheduler |
running() | Track task starting on CPU | Kernel when task runs |
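The running() callback listed above (together with its counterpart stopping()) is where a vtime-based policy does its bookkeeping. The snippet below follows the scx_simple example discussed later in this post: a global vtime_now tracks scheduler-wide virtual time, and each task is charged for the CPU time it actually consumed, scaled by the inverse of its weight (vtime_now and fifo_sched are globals defined in that scheduler):
void BPF_STRUCT_OPS(simple_running, struct task_struct *p)
{
	if (fifo_sched)
		return;
	/* Global virtual time advances as tasks start executing. */
	if (time_before(vtime_now, p->scx.dsq_vtime))
		vtime_now = p->scx.dsq_vtime;
}

void BPF_STRUCT_OPS(simple_stopping, struct task_struct *p, bool runnable)
{
	if (fifo_sched)
		return;
	/* Charge the consumed part of the slice, scaled by the inverse of
	 * the task's weight: heavier tasks accumulate vtime more slowly. */
	p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
}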
Build and run sched_ext
I use the workflow outlined in a blog post by Andrea Righi, which is a great way to test sched_ext without modifying the existing system.
Run the virtual environment and test a scheduler
First, start the virtual environment with the sched_ext kernel:
vng -vr ../linux

Once inside the virtual environment, you can run one of the schedulers with the helper function:
scx ./build/scheds/c/scx_simple
C-Based Schedulers
scx_simple
scx_simple is a minimal scheduler that functions either as a global weighted vtime scheduler (similar to the Completely Fair Scheduler) or as a FIFO scheduler. It's designed primarily to demonstrate basic scheduling concepts.
In the code below, from scx_simple.bpf.c, the .enqueue callback handles a task that needs to be scheduled:
1. If the scheduler is in FIFO mode (fifo_sched == true), it simply inserts the task into the shared dispatch queue without any priority sorting.
2. If in normal (weighted vtime) mode, it retrieves the task's current virtual time (p->scx.dsq_vtime), adjusts it so that no task gains more than one slice worth of idle credit, and inserts the task into the shared queue with that virtual time as the key:
void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags) {
stat_inc(1); // increment global queue count
if (fifo_sched) {
// FIFO scheduling: enqueue to shared queue with default slice
scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
} else {
u64 vtime = p->scx.dsq_vtime;
// Cap the vtime lag to one slice to prevent too much credit
if (time_before(vtime, vtime_now - SCX_SLICE_DFL))
vtime = vtime_now - SCX_SLICE_DFL;
// Enqueue with a specific virtual time for fairness
scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime, enq_flags);
}
}
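The dispatch side of scx_simple is essentially a one-liner: when a CPU needs work, it moves the next task from the shared DSQ onto that CPU's local DSQ (FIFO order in FIFO mode, lowest vtime first in weighted mode):
void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Pull the next task from the shared DSQ into this CPU's local DSQ. */
	scx_bpf_dsq_move_to_local(SHARED_DSQ);
}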
scx_rustland
scx_rustland is designed to prioritize interactive workloads over CPU-intensive background workloads.
In practical terms, this likely means it keeps track of each task’s recent CPU usage or blocking behavior and assigns higher priority (sooner scheduling, more CPU time) to tasks that have shorter CPU bursts.
scx_rustland overview
scx_rustland uses a hybrid approach that splits functionality between kernel space and user space:
User Space: Complex scheduling logic in Rust
- Task prioritization
- CPU selection algorithms
- Complex data structures
|
| Ring buffer communication
v
Kernel Space: Minimal BPF component
- Task state tracking
- Dispatch queue management
- Safety mechanisms
BPF Dispatcher (scx_rustland_core): The BPF part of scx_rustland implements minimal logic required to interface with the kernel. Its enqueue hook, for example, does not directly decide a run queue as a normal scheduler would. Instead, it may place the task into a BPF queue map that represents “tasks waiting for user-space decision.”
User-Space Scheduler (Rust): On the user side, the Rust scheduler process uses libraries (like libbpf or Aya in Rust) to interact with the eBPF program. It attaches to the maps exposed by BPF; typically it uses a ring buffer to receive each task that needs a scheduling decision, then runs its algorithm to decide where and when that task should run.
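The sketch below shows what the two communication channels could look like in the BPF object. The map names queued and dispatched follow scx_rustland_core, but the sizes are illustrative: a kernel-to-user ring buffer carries tasks that need a decision, and a user-to-kernel ring buffer carries the decisions back.
/* Sketch: kernel -> user, runnable tasks awaiting a scheduling decision. */
struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} queued SEC(".maps");

/* Sketch: user -> kernel, dispatch decisions from the Rust scheduler. */
struct {
	__uint(type, BPF_MAP_TYPE_USER_RINGBUF);
	__uint(max_entries, 256 * 1024);
} dispatched SEC(".maps");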
scx_rustland Code Structure
scheds/rust/scx_rustland/
├── Cargo.toml # Rust package definition
├── src/
│ ├── main.rs # Main userspace implementation
│ ├── scheduler.rs # Scheduler logic
│ ├── stats.rs # Statistics collection
│ ├── topology.rs # CPU topology handling
│ └── bpf/ # BPF skeleton code
└── src/bpf/
├── main.bpf.c # Minimal BPF implementation
└── intf.h # Interface definitions
scx_rustland has two main components:
- BPF Component (kernel space):
void BPF_STRUCT_OPS(rustland_enqueue, struct task_struct *p, u64 enq_flags)
{
// Skip scheduling the scheduler itself
if (is_usersched_task(p))
return;
// Pass task information to userspace for scheduling decision
struct queued_task_ctx *task = bpf_ringbuf_reserve(&queued, sizeof(*task), 0);
if (!task) {
// Fallback: direct dispatch if userspace communication fails
scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, 0, enq_flags);
return;
}
// Send task info to userspace
populate_task_info(task, p, enq_flags);
bpf_ringbuf_submit(task, 0);
}
- Rust Component (user space):
impl Scheduler {
fn dispatch_next_task(&mut self) {
if let Some(task) = self.priority_queue.pop_min() {
// Complex scheduling logic in Rust
let target_cpu = self.find_best_cpu_for_task(&task);
let time_slice = self.calculate_time_slice(&task);
// Send decision back to kernel
let mut dispatched_task = DispatchedTask::new(&task);
dispatched_task.cpu = target_cpu;
dispatched_task.slice_ns = time_slice;
self.bpf.dispatch_task(&dispatched_task).unwrap();
}
}
fn find_best_cpu_for_task(&self, task: &QueuedTask) -> i32 {
// Can use complex data structures and algorithms here
// without BPF verifier constraints
// ...
}
}
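The BPF side that applies these user-space decisions is not shown above. Below is a hypothetical sketch of how it can look: the dispatch callback drains a BPF_MAP_TYPE_USER_RINGBUF (the dispatched map from the earlier sketch) and inserts each task where user space decided. The record layout and the handle_dispatched_task name are illustrative, not scx_rustland's actual code.
/* Hypothetical sketch: record layout and callback names are illustrative. */
struct dispatched_task_ctx {
	s32 pid;	/* task to run */
	s32 cpu;	/* CPU chosen by the user-space scheduler */
	u64 slice_ns;	/* time slice chosen by the user-space scheduler */
};

static long handle_dispatched_task(struct bpf_dynptr *dynptr, void *ctx)
{
	struct dispatched_task_ctx d;
	struct task_struct *p;

	/* Copy one decision record out of the user ring buffer. */
	if (bpf_dynptr_read(&d, sizeof(d), dynptr, 0, 0))
		return 1;	/* malformed record: stop draining */

	p = bpf_task_from_pid(d.pid);
	if (!p)
		return 0;	/* task already exited: skip */

	/* Honor the user-space decision: target CPU and time slice. */
	scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | d.cpu, d.slice_ns, 0);
	bpf_task_release(p);
	return 0;
}

void BPF_STRUCT_OPS(rustland_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Apply all pending decisions queued by the user-space scheduler. */
	bpf_user_ringbuf_drain(&dispatched, handle_dispatched_task, NULL, 0);
}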
Advantages
- Better abstraction capabilities
- Rich error handling
- More readable algorithms
- Unit testing support
- Modern language features
Scheduling Logic
The scheduling logic in scx_rustland is split across kernel (BPF) and user space:
// Main scheduler object
struct Scheduler<'a> {
bpf: BpfScheduler<'a>, // BPF connector
stats_server: StatsServer<(), Metrics>, // statistics
task_pool: TaskTree, // tasks ordered by deadline
task_map: TaskInfoMap, // map pids to the corresponding task information
min_vruntime: u64, // Keep track of the minimum vruntime across all tasks
init_page_faults: u64, // Initial page faults counter
slice_ns: u64, // Default time slice (in ns)
slice_ns_min: u64, // Minimum time slice (in ns)
}
In user space, scx_rustland maintains its own runqueue data structure – specifically a BTreeSet of tasks ordered by weighted vruntime (essentially mimicking CFS in user space). It also monitors each task's behavior to detect interactive tasks: if a task consistently yields the CPU voluntarily (releasing the CPU before its time slice is fully used), it's considered interactive.
scx_bpfland
scx_bpfland implements its scheduling logic almost entirely in BPF code that runs in kernel space. It follows a more traditional approach where all scheduling decisions happen within the kernel:
scx_bpfland overview
User Space: Minimal (monitoring only)
|
| (Minimal interaction)
v
Kernel Space: Full scheduler implementation in BPF
- Task selection
- CPU assignment
- Priority decisions
- All scheduling algorithms
scx_bpfland Code Structure
scheds/c/scx_bpfland.bpf.c # Main BPF scheduler implementation
scheds/c/scx_bpfland.c # Userspace loader and monitoring
scheds/include/scx_common.bpf.h # Common BPF utilities
scx_bpfland implements all scheduling callbacks directly in BPF:
void BPF_STRUCT_OPS(bpfland_enqueue, struct task_struct *p, u64 enq_flags)
{
// Implementation directly in BPF
struct task_ctx *task_ctx;
u64 vruntime = 0;
// Skip enqueuing the scheduler itself
if (p->pid == bpfland_pid)
return;
// Get or create task context
task_ctx = get_task_ctx(p);
if (!task_ctx)
return;
// Calculate vruntime based on scheduling policy
vruntime = calc_vruntime(p, task_ctx, enq_flags);
// Direct enqueue to DSQ with calculated vruntime
scx_bpf_dsq_insert_vtime(p, GLOBAL_DSQ, task_slice(p), vruntime, enq_flags);
}
Scheduling Logic
scx_bpfland's internal logic is very similar in spirit to scx_rustland's algorithm, but it executes entirely within the BPF program (in the kernel). It effectively merges the two-tier logic into one.
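For illustration, a stripped-down version of the corresponding dispatch path could look like the snippet below; the real scx_bpfland dispatch logic is more elaborate (per-CPU kthread handling, interactive-task queues, etc.). GLOBAL_DSQ matches the queue used in the enqueue snippet above.
void BPF_STRUCT_OPS(bpfland_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Move the task with the lowest vruntime from the shared
	 * priority-ordered DSQ onto this CPU's local DSQ. */
	scx_bpf_dsq_move_to_local(GLOBAL_DSQ);
}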
Advantages Over Traditional Schedulers
The sched_ext + eBPF approach (exemplified by SCX schedulers) brings several key advantages over traditional in-kernel schedulers:
* Dynamic Policy Changes: New schedulers can be loaded and unloaded at runtime. There’s no need to patch or reboot the kernel to try a different scheduling policy. This is invaluable for testing and tuning in production environments – administrators can switch between, say, a latency-focused scheduler during interactive sessions and a throughput-focused one for batch processing periods, with a simple command.
* Workload-Specific Optimizations: The biggest advantage is the ability to tailor scheduling to a specific workload or environment. Instead of one-size-fits-all, users can pick or write a scheduler that, for example, never migrates certain tasks off a preferred CPU, or that implements strict priority levels, or that optimizes for particular patterns (like the “block frequently == interactive” rule). SCX even allows combining policies, as shown by scx_layered which applies different scheduling rules to different groups of processes.
Future: Integrating SCX_GOLAND with Free5GC for Data Plane Optimization
SCX_GOLAND
Architecture Overview
The scx_goland_core project represents another approach to implementing a user-space scheduler for Linux using the sched_ext framework. This project follows the architectural patterns established by scx_rustland, adapting them to the Go ecosystem.
Go Implementation
The Go implementation consists of several key packages and types:
Core Package:
- Sched: Main scheduler struct that holds references to BPF maps and methods for communication
- QueuedTask: Represents a task queued from the kernel
- DispatchedTask: Represents a task to be dispatched back to the kernel
- BssData: Data structure for accessing BPF's BSS section
Constraints and Limitations
Memory Management:
In scx_goland_core, tasks from the kernel's ring buffers are received as byte slices and then decoded (using binary.Read) into Go structs. This decoding and copying process is relatively expensive compared to Rust's approach, where you can often work in a more zero-copy manner. In contrast, scx_rustland benefits from Rust's zero-cost abstractions, minimal runtime overhead, and more direct access to low-level memory without a garbage collector.
In Go:
func (s *Sched) DequeueTask(task *QueuedTask) {
select {
case t := <-s.queue:
buff := bytes.NewBuffer(t)
err := binary.Read(buff, binary.LittleEndian, task)
if err != nil {
task.Pid = -1
return
}
err = s.SubNrQueued()
if err != nil {
task.Pid = -1
log.Printf("SubNrQueued err: %v", err)
return
}
return
default:
task.Pid = -1
return
}
}
In Rust:
fn dequeue_task(&mut self) -> Result<Option<QueuedTask>, i32> {
match self.queued.consume_raw() {
0 => {
self.skel.maps.bss_data.nr_queued = 0;
Ok(None)
}
LIBBPF_STOP => {
// A valid task is received, convert data to a proper task struct.
let task = unsafe { EnqueuedMessage::from_bytes(&BUF.0).to_queued_task() };
let _ = self.skel.maps.bss_data.nr_queued.saturating_sub(1);
Ok(Some(task))
}
res if res < 0 => Err(res),
res => panic!(
"Unexpected return value from libbpf-rs::consume_raw(): {}",
res
),
}
}
Garbage Collection:
Go is a managed language with a garbage collector and its own scheduler for goroutines. Even though Go is very efficient, those additional layers (GC, goroutine scheduling, channel operations) add overhead compared to Rust’s zero‑cost abstractions where most things are determined at compile time.
SCX_GOLAND Summary
It shows how Go's concurrency model and ease of use can be applied to system programming tasks that were traditionally the domain of C or Rust.
While it's not as optimized or mature as scx_rustland, it provides a valuable alternative for developers more comfortable with Go. It also serves as a good example of how the sched_ext framework enables experimentation with different scheduling policies and implementations without requiring deep kernel expertise.
Data Plane and CPU Scheduling Challenges
Free5GC is an open-source implementation of the 5G Core network, written largely in Go. One of its components is the UPF (User Plane Function), which handles the data plane – i.e., packet forwarding for user traffic (GTP-U tunneling, routing packets between RAN and data network). The UPF and other network functions in Free5GC are user-space processes that can be CPU-intensive, especially under high load (many subscribers or high packet rates). Ensuring low latency and high throughput in the data plane is critical.
Future Design of SCX_GOLAND Scheduler
Identifying Target Tasks: The scheduler must reliably identify the Free5GC data plane threads/processes. This could be done by process name matching, by cgroup (if Free5GC runs in a container or a specific slice, the scheduler can detect that cgroup and apply a policy), or by explicit configuration (the operator could pass PIDs or process names to scx_goland at startup to tell it which tasks to prioritize). A sketch of the name-matching approach is shown after this list.
Prioritizing specific goroutines: Free5GC could be run in a mode where the UPF uses dedicated pinned threads for its packet RX/TX loops (ensuring those OS threads only do packet work). Then scx_goland can target those threads precisely. Keep in mind, however, that Free5GC's internal scheduling of goroutines onto OS threads is not directly visible to the OS scheduler.
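As a rough illustration of the name-matching idea on the BPF side of such a scheduler, the sketch below boosts tasks whose comm begins with a configured prefix. Everything here is a placeholder: the "upf" prefix, the DSQ ids, and the boosted slice would come from configuration in a real implementation, and both DSQs are assumed to be created in ops.init() with scx_bpf_create_dsq().
#define PRIO_DSQ	1	/* data plane tasks (placeholder id) */
#define NORMAL_DSQ	2	/* everything else (placeholder id) */

static bool is_dataplane_task(const struct task_struct *p)
{
	/* Compare the first 3 bytes of the task's comm against "upf". */
	return bpf_strncmp(p->comm, 3, "upf") == 0;
}

void BPF_STRUCT_OPS(goland_enqueue, struct task_struct *p, u64 enq_flags)
{
	if (is_dataplane_task(p))
		/* Data plane threads: separate queue and a longer slice. */
		scx_bpf_dsq_insert(p, PRIO_DSQ, SCX_SLICE_DFL * 2, enq_flags);
	else
		scx_bpf_dsq_insert(p, NORMAL_DSQ, SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(goland_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Always drain the data plane queue before the normal one. */
	if (scx_bpf_dsq_move_to_local(PRIO_DSQ))
		return;
	scx_bpf_dsq_move_to_local(NORMAL_DSQ);
}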
Conclusion
The advantages of SCX (sched_ext) – runtime pluggability, rapid development, workload-specific optimizations, and crash resilience – make it very attractive for specialized domains. One such domain is the 5G core network. We explored scx_goland, a Go-based scheduler concept, illustrating how a custom scheduler could be integrated with Free5GC to optimize its performance.
References
- Re-implementing my Linux Rust scheduler in eBPF
- 内核调度客制化利器：SCHED_EXT (SCHED_EXT: a powerful tool for customizing the kernel scheduler)
- BPF 赋能调度器：万字详解 sched_ext 实现机制与工作流程 (BPF-powered scheduling: a detailed look at the sched_ext mechanism and workflow)
- Pluggable CPU schedulers
- sched_ext: scheduler architecture and interfaces
- eBPF 隨筆（七）：sched_ext (eBPF notes, part 7: sched_ext)
- scx_goland_core
About
Hello, I'm William Lin. I'd like to share my excitement about being a member of the free5gc project, which is a part of the Linux Foundation. I'm always eager to discuss any aspects of core network development or related technologies.
Connect with Me
- GitHub: williamlin0518
- LinkedIn: Cheng Wei Lin