Hands-On with sched_ext: Building Custom eBPF CPU Schedulers
Introduction
In part 1, Exploring sched_ext: BPF-Powered CPU Schedulers in the Linux Kernel, we explored the architecture and concepts behind sched_ext, Linux's framework for implementing custom CPU schedulers using eBPF. We examined how this revolutionary approach allows dynamically loading schedulers without kernel recompilation, and we compared different implementation styles, including C-based (scx_simple), Rust-based (scx_bpfland, scx_rustland), and even a potential Go implementation (scx_goland_core).
Now it's time to get our hands dirty. In this second part, we'll move from theory to practice by building our own custom schedulers with the sched_ext framework. We'll start with a minimal implementation to understand the basics, then gradually add more sophisticated scheduling policies to handle real-world scenarios, such as packet-processing-intensive workloads: by prioritizing packet-handling threads and optimizing CPU allocation, we can reduce latency and improve throughput in these critical systems.
By the end of this hands-on guide, you'll have:
- Set up a proper development environment for sched_ext
- Implemented and loaded your own custom BPF scheduler
- Explored different scheduling policies and their effects on performance
- Learned how to debug and test your scheduler under various workloads
Let's dive into the practical world of CPU scheduling with eBPF!
Setting Up the Development Environment
To start building custom schedulers with sched_ext, we need to set up a proper development environment. Let's go through this process step by step so you can follow along on your own system.
Kernel Requirements (6.12+)
sched_ext requires Linux kernel 6.12 or newer. On Ubuntu, we can use the mainline utility to install a newer kernel:
sudo add-apt-repository ppa:cappelikan/ppa
sudo apt update
sudo apt install -y mainline

Cloning and Building the sched_ext Project
git clone https://github.com/sched-ext/scx.git
cd scx
# Install BPF development tools:
sudo apt install libbpf-dev clang llvm libelf-dev
# Build the schedulers with meson (a Rust toolchain is also required)
cd ~/work/scx
meson setup build --prefix ~
meson compile -C build
meson install -C build
# The sched_ext framework requires these kernel configuration options; verify they are enabled:
for config in BPF SCHED_CLASS_EXT BPF_SYSCALL BPF_JIT BPF_JIT_ALWAYS_ON BPF_JIT_DEFAULT_ON PAHOLE_HAS_BTF_TAG DEBUG_INFO_BTF SCHED_DEBUG DEBUG_INFO DEBUG_INFO_DWARF5 DEBUG_INFO_BTF_MODULES; do
grep -w CONFIG_$config /boot/config-$(uname -r)
done
# You should see:
CONFIG_BPF=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_DWARF5=y
CONFIG_DEBUG_INFO_BTF_MODULES=y
Implementation of an Even-CPU-Only Scheduler with SCX
Goal and Expected Behavior
We'll build a scheduler, scx_packet, that distributes tasks exclusively to even-numbered CPUs (0, 2, 4, 6, 8) while keeping odd-numbered CPUs idle. This demonstrates fine-grained CPU control within the sched_ext framework.
Creating the Initial Scheduler
Our custom scheduler reuses most of scx_simple's functions; I've called it scx_packet. Now, let's modify our scx_packet.bpf.c file to implement our strategy of only using even-numbered CPUs. Here's our initial implementation:
/* Main dispatch function that decides which tasks run on which CPUs */
void BPF_STRUCT_OPS(packet_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Only dispatch tasks to even-numbered CPUs */
	if ((cpu & 1) == 0) {
		scx_bpf_dsq_move_to_local(SHARED_DSQ);
	}

	/* Odd-numbered CPUs remain idle as we don't dispatch tasks to them */
}
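The snippet above references SHARED_DSQ, which has to be created before it can be used. Following scx_simple, we define the DSQ ID and create the queue in the scheduler's init callback; a minimal sketch (packet_init gets wired into the ops struct later):

#define SHARED_DSQ 0

s32 BPF_STRUCT_OPS_SLEEPABLE(packet_init)
{
	/* Create the shared dispatch queue (DSQ) that tasks are enqueued to;
	 * -1 means no NUMA node affinity */
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}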
But it's not that easy!
Encountering and Fixing Stalls
When we tested this initial implementation, we ran into a critical issue:
kworker/u48:3[154254] triggered exit kind 1026:
runnable task stall (kworker/0:1[141497] failed to run for 30.357s)
This means that tasks that can only run on odd-numbered CPUs are stuck in a "runnable" state but never get scheduled to run.
So we need to modify our implementation to ensure that:
- In packet_select_cpu: simplified to use the default selection, as the real control happens in enqueue and dispatch (see the sketch after this list)
- In packet_enqueue:
  - Special-cases kernel threads with single-CPU affinity to respect their requirements
  - Uses scx_bpf_dsq_insert() instead of queue insertion for better control
  - Dispatches regular tasks to the shared queue, then actively kicks an even CPU (0, 2, 4, etc.) to process tasks from the queue
- In packet_dispatch:
  - Only allows even CPUs to consume from the shared queue
  - Odd CPUs will only run tasks that were specifically dispatched to them (kernel threads)
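Here's what the simplified packet_select_cpu looks like; a minimal sketch following scx_simple's structure, where scx_bpf_select_cpu_dfl() is the kernel's built-in default CPU picker (packet_dispatch stays as shown in the initial implementation):

s32 BPF_STRUCT_OPS(packet_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;

	/* Let the kernel pick a CPU; the even-CPU policy is enforced in
	 * enqueue and dispatch, so we deliberately don't direct-dispatch
	 * to the selected CPU's local DSQ here. */
	return scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
}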
The Comprehensive Solution
Let's build a comprehensive solution that respects system requirements while still implementing our even-CPU policy:
void BPF_STRUCT_OPS(packet_enqueue, struct task_struct *p, u64 enq_flags)
{
	stat_inc(1); /* count global queueing */

	/* Handle kernel threads with restricted CPU affinity */
	if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
		return;
	}

	/* For all other tasks, use the shared queue for later dispatch to even CPUs */
	if (fifo_sched) {
		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
	} else {
		u64 vtime = p->scx.dsq_vtime;

		/*
		 * Limit the amount of budget that an idling task can accumulate
		 * to one slice.
		 */
		if (time_before(vtime, vtime_now - SCX_SLICE_DFL))
			vtime = vtime_now - SCX_SLICE_DFL;

		scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime, enq_flags);
	}

	/* We dispatched to the shared queue, so kick an even CPU to process it.
	 * The loop is bounded to stay within BPF loop limits; it kicks the
	 * first idle even CPU and falls back to preempting CPU 0. */
	for (s32 i = 0; i < 5; i++) { /* covers even CPUs 0-8; adjust if needed */
		s32 target_cpu = 2 * i;

		if (target_cpu >= (s32)scx_bpf_nr_cpu_ids())
			break;

		if (scx_bpf_test_and_clear_cpu_idle(target_cpu)) {
			scx_bpf_kick_cpu(target_cpu, SCX_KICK_PREEMPT);
			return;
		}
	}
	scx_bpf_kick_cpu(0, SCX_KICK_PREEMPT);
}
Key aspects of this implementation:

- Task Identification: We use (p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1 to identify kernel threads with strict CPU affinity requirements.
- CPU Selection: The bitwise operation (cpu & 1) == 0 efficiently determines whether a CPU is even-numbered (0, 2, 4...).
- Dispatch Strategy:
  - Regular tasks go to the shared queue
  - CPU-specific kernel threads go directly to their required local CPU queue
  - Only even CPUs pull from the shared queue in the dispatch function (the full wiring into the ops struct is sketched below)
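Finally, the callbacks are registered with the kernel through a struct_ops map. Here's a sketch mirroring scx_simple's registration macro; the running/stopping/enable vtime-tracking callbacks and the exit handler are assumed to be copied from scx_simple unchanged, and the name is what /sys/kernel/sched_ext will report:

SCX_OPS_DEFINE(packet_ops,
	       .select_cpu	= (void *)packet_select_cpu,
	       .enqueue		= (void *)packet_enqueue,
	       .dispatch	= (void *)packet_dispatch,
	       .running		= (void *)packet_running,
	       .stopping	= (void *)packet_stopping,
	       .enable		= (void *)packet_enable,
	       .init		= (void *)packet_init,
	       .exit		= (void *)packet_exit,
	       .name		= "packet");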
Rebuild and Run
In scx/scheds/c/meson.build, add our custom scheduler:
c_scheds = ['scx_simple', 'scx_qmap', 'scx_central', 'scx_userland', 'scx_nest',
'scx_flatcg', 'scx_pair', 'scx_prev', 'scx_packet']
meson setup build --reconfigure
meson compile -C build scx_packet
# If successful, the binary will be available at build/scheds/c/scx_packet.
sudo ./build/scheds/c/scx_packet
# Verify that our scheduler is loaded
vboxuser@sch:~/work/scx$ cat /sys/kernel/sched_ext/state /sys/kernel/sched_ext/*/ops 2>/dev/null
enabled
packet
Testing Our Custom Scheduler
Now that we've implemented our even-CPU-only scheduler, it's time to put it to the test. We'll use two different types of workloads to verify its behavior:
- A CPU-intensive workload using stress-ng to see how tasks are distributed
- A graphics application (glxgears) to observe how our scheduler affects rendering performance
Installing the Testing Tools
First, let's install stress-ng for CPU stress testing, along with MangoHud and mesa-utils for the graphics test:
sudo apt update
sudo apt install -y stress-ng
sudo apt install -y mangohud
sudo apt install -y mesa-utils
Test 1: CPU Stress Testing
This creates 5 worker processes performing intensive matrix multiplication:
sudo stress-ng --cpu 5 --cpu-method matrixprod --timeout 15s
htop should show activity primarily on CPUs 0, 2, 4, 6, 8. But what if we use --cpu 10? Still only 5 CPUs running: the scheduler keeps all work on the even-numbered ones!
Test 2: Graphics Performance Testing
# Run glxgears with MangoHud overlay
MANGOHUD=1 mangohud --dlsym glxgears -info
This will display the FPS counter overlaid on the rotating gears. Note the FPS values and CPU utilization shown in the MangoHud overlay.
(Screenshots: MangoHud FPS overlay under scx_simple vs. scx_packet.)
Performance Metrics Analysis
The stress-ng metrics align with the logical expectations:
| Metric | scx_packet (Even CPUs) | scx_simple (All CPUs) | Difference |
|--------|------------------------|-----------------------|------------|
| Bogo-ops | 5,603,885 | 8,479,124 | ~51% higher with all CPUs |
| Bogo-ops per second | 546,351 | 830,616 | ~52% higher with all CPUs |
| CPU usage per instance | 103.92% | 134.21% | ~29% higher with all CPUs |
| MB received per second | 5.59 | 8.46 | ~51% higher with all CPUs |
Conclusion and Future Work
Our even-CPU scheduler demonstrates the basic principles of CPU control with sched_ext, but real-world applications like free5GC or other networking stacks present more complex scheduling challenges. Let's explore how we might adapt our scheduler for packet-processing workloads.
Optimizing for Network Packet Processing
Packet processing workloads have unique characteristics that require specialized scheduling approaches:
- I/O Bound: Network interfaces generate interrupts when packets arrive, making some tasks I/O bound as they wait for new packets
- CPU Bound: Once packets arrive, processing them (parsing, encrypting/decrypting, routing) can be CPU intensive
Task Classification
The original array-based loop doesn't translate directly to BPF, so here's a sketch that matches task names (p->comm) with bpf_strncmp():

/* Check if a task is a network-related process */
static inline bool is_network_task(struct task_struct *p)
{
	char comm[16];

	/* p->comm holds the task name; copy it to the stack for comparison */
	__builtin_memcpy(comm, p->comm, sizeof(comm));

	/* bpf_strncmp() requires a constant string as its last argument, so
	 * we check each known packet-processing name prefix explicitly
	 * instead of looping over an array */
	return bpf_strncmp(comm, 3, "upf") == 0 ||
	       bpf_strncmp(comm, 4, "dpdk") == 0 ||
	       bpf_strncmp(comm, 3, "ovs") == 0 ||
	       bpf_strncmp(comm, 3, "xdp") == 0;
}
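With this classifier in place, we can act on the I/O-bound nature of packet tasks in the select_cpu path: when a waking network task finds an idle CPU, dispatch it there immediately instead of routing it through the shared queue. A sketch, reusing scx_bpf_select_cpu_dfl() and the direct-dispatch pattern from scx_simple:

s32 BPF_STRUCT_OPS(packet_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);

	/* Wakeup latency matters most for I/O-bound packet tasks: if the
	 * default selection found an idle CPU, bypass the shared queue and
	 * dispatch straight to that CPU's local DSQ */
	if (is_network_task(p) && is_idle)
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}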
Prioritizing Packet Tasks
We can also modify our enqueue function to give packet-processing tasks higher priority:
void BPF_STRUCT_OPS(packet_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Prioritize network packet processing tasks */
	if (is_network_task(p)) {
		/* Use a shorter time slice for responsiveness */
		u64 short_slice = SCX_SLICE_DFL / 2;

		/* Give packet tasks an earlier vtime so they sort ahead of
		 * regular tasks in the shared queue (vtime is unsigned, so
		 * "earlier" rather than "negative") */
		u64 priority_vtime = vtime_now - (SCX_SLICE_DFL * 2);

		scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, short_slice,
					 priority_vtime, enq_flags);
		return;
	}

	/* ... same enqueue path as before for everything else ... */
}

The shorter slice trades a little throughput (more context switches) for lower latency on the packet path.
Next Steps
- Integrating with DPDK or XDP for Zero-Copy Packet Processing
- Benchmarking against standard schedulers with realistic network traffic
References
- sched-ext Tutorial
- 内核调度客制化利器：SCHED_EXT (SCHED_EXT: a powerful tool for kernel scheduling customization)
- BPF 赋能调度器：万字详解 sched_ext 实现机制与工作流程 (BPF-powered schedulers: a deep dive into sched_ext's implementation and workflow)
- Pluggable CPU schedulers
- sched_ext: scheduler architecture and interfaces
- eBPF 隨筆（七）：sched_ext (eBPF notes, part 7: sched_ext)
- scx_goland_core
About
Hello, I'm William Lin. I'm excited to be a member of the free5GC project, which is part of the Linux Foundation, and I'm always eager to discuss any aspect of core network development or related technologies.
Connect with Me
- GitHub: williamlin0518
- LinkedIn: Cheng Wei Lin