Hands-On with sched_ext: Building Custom eBPF CPU Schedulers
Introduction
In part 1, Exploring sched_ext: BPF-Powered CPU Schedulers in the Linux Kernel, we explored the architecture and concepts behind sched_ext, Linux's framework for implementing custom CPU schedulers using eBPF. We examined how this revolutionary approach allows dynamically loading schedulers without kernel recompilation, and we compared different implementation styles, including C-based (scx_simple), Rust-based (scx_bpfland, scx_rustland), and even a potential Go implementation (scx_goland_core).
Now it's time to get our hands dirty. In this second part, we'll move from theory to practice by building our own custom schedulers with the sched_ext framework. We'll start with a minimal implementation to understand the basics, then gradually add more sophisticated scheduling policies to handle real-world scenarios, such as packet-processing-intensive workloads: by prioritizing packet-handling threads and optimizing CPU allocation, we can reduce latency and improve throughput in these critical systems.
By the end of this hands-on guide, you'll have:
- Set up a proper development environment for sched_ext
- Implemented and loaded your own custom BPF scheduler
- Explored different scheduling policies and their effects on performance
- Learned how to debug and test your scheduler under various workloads
Let's dive into the practical world of CPU scheduling with eBPF!
Setting Up the Development Environment
To start building custom schedulers with sched_ext, we need to set up a proper development environment. Let's go through this process step by step so you can follow along on your own system.
Kernel Requirements (6.12+)
sched_ext requires Linux kernel 6.12 or newer. On Ubuntu, we can use the mainline utility to install a newer kernel:
sudo add-apt-repository ppa:cappelikan/ppa
sudo apt update
sudo apt install -y mainline

Cloning and Building the sched_ext Project
git clone https://github.com/sched-ext/scx.git
cd scx
# Install BPF development tools:
sudo apt install libbpf-dev clang llvm libelf-dev
# Build the schedulers with meson (a Rust toolchain is also required)
cd ~/work/scx
meson setup build --prefix ~
meson compile -C build
meson install -C build
# The sched_ext framework requires these kernel configuration options; verify they are enabled:
for config in BPF SCHED_CLASS_EXT BPF_SYSCALL BPF_JIT BPF_JIT_ALWAYS_ON BPF_JIT_DEFAULT_ON PAHOLE_HAS_BTF_TAG DEBUG_INFO_BTF SCHED_DEBUG DEBUG_INFO DEBUG_INFO_DWARF5 DEBUG_INFO_BTF_MODULES; do
grep -w CONFIG_$config /boot/config-$(uname -r)
done
# You should see:
CONFIG_BPF=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_DWARF5=y
CONFIG_DEBUG_INFO_BTF_MODULES=y
Implementation of an Even-CPU-Only Scheduler with SCX
Goal and Expected Behavior
We'll build a scheduler, scx_packet, that distributes tasks exclusively to even-numbered CPUs (0, 2, 4, 6, 8) while keeping odd-numbered CPUs idle. This demonstrates fine-grained CPU control within the sched_ext framework.
Creating the Initial Scheduler
Our custom scheduler reuses most of scx_simple's functions; I've called it scx_packet. Now, let's modify our scx_packet.bpf.c file to implement our strategy of only using even-numbered CPUs. Here's our initial implementation:
/* Main dispatch function that decides which tasks run on which CPUs */
void BPF_STRUCT_OPS(packet_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Only dispatch tasks to even-numbered CPUs */
	if ((cpu & 1) == 0) {
		scx_bpf_dsq_move_to_local(SHARED_DSQ);
	}

	/* Odd-numbered CPUs remain idle as we don't dispatch tasks to them */
}
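The snippet above references SHARED_DSQ, which has to be created before it can be used. Following scx_simple, we define the DSQ ID and create the queue in the scheduler's init callback; a minimal sketch (packet_init gets wired into the ops struct later):

#define SHARED_DSQ 0

s32 BPF_STRUCT_OPS_SLEEPABLE(packet_init)
{
	/* Create the shared dispatch queue (DSQ) that tasks are enqueued to;
	 * -1 means no NUMA node affinity */
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}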
But it's not that easy!
Encountering and Fixing Stalls
When we tested this initial implementation, we ran into a critical issue:
kworker/u48:3[154254] triggered exit kind 1026:
runnable task stall (kworker/0:1[141497] failed to run for 30.357s)
This means that tasks that can only run on odd-numbered CPUs are stuck in a "runnable" state but never get scheduled to run.
So we need to modify our implementation to ensure that:
- In packet_select_cpu: simplified to use the default selection, as the real control happens in enqueue and dispatch (see the sketch after this list)
- In packet_enqueue:
  - Special-cases kernel threads with single-CPU affinity to respect their requirements
  - Uses scx_bpf_dsq_insert() instead of queue insertion for better control
  - Dispatches regular tasks to the shared queue, then actively kicks an even CPU (0, 2, 4, etc.) to process tasks from the queue
- In packet_dispatch:
  - Only allows even CPUs to consume from the shared queue
  - Odd CPUs will only run tasks that were specifically dispatched to them (kernel threads)
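Here's what the simplified packet_select_cpu looks like; a minimal sketch following scx_simple's structure, where scx_bpf_select_cpu_dfl() is the kernel's built-in default CPU picker (packet_dispatch stays as shown in the initial implementation):

s32 BPF_STRUCT_OPS(packet_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;

	/* Let the kernel pick a CPU; the even-CPU policy is enforced in
	 * enqueue and dispatch, so we deliberately don't direct-dispatch
	 * to the selected CPU's local DSQ here. */
	return scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
}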
The Comprehensive Solution
Let's build a comprehensive solution that respects system requirements while still implementing our even-CPU policy:
void BPF_STRUCT_OPS(packet_enqueue, struct task_struct *p, u64 enq_flags)
{
	stat_inc(1); /* count global queueing */

	/* Handle kernel threads with restricted CPU affinity */
	if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
		return;
	}

	/* For all other tasks, use the shared queue for later dispatch to even CPUs */
	if (fifo_sched) {
		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
	} else {
		u64 vtime = p->scx.dsq_vtime;

		/*
		 * Limit the amount of budget that an idling task can accumulate
		 * to one slice.
		 */
		if (time_before(vtime, vtime_now - SCX_SLICE_DFL))
			vtime = vtime_now - SCX_SLICE_DFL;

		scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime, enq_flags);
	}

	/* We dispatched to the shared queue, so kick an even CPU to process it.
	 * The loop is bounded to stay within BPF loop limits; it kicks the
	 * first idle even CPU and falls back to preempting CPU 0. */
	for (s32 i = 0; i < 5; i++) { /* covers even CPUs 0-8; adjust if needed */
		s32 target_cpu = 2 * i;

		if (target_cpu >= (s32)scx_bpf_nr_cpu_ids())
			break;

		if (scx_bpf_test_and_clear_cpu_idle(target_cpu)) {
			scx_bpf_kick_cpu(target_cpu, SCX_KICK_PREEMPT);
			return;
		}
	}
	scx_bpf_kick_cpu(0, SCX_KICK_PREEMPT);
}
Key aspects of this implementation:

- Task Identification: We use (p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1 to identify kernel threads with strict CPU affinity requirements.
- CPU Selection: The bitwise operation (cpu & 1) == 0 efficiently determines whether a CPU is even-numbered (0, 2, 4...).
- Dispatch Strategy:
  - Regular tasks go to the shared queue
  - CPU-specific kernel threads go directly to their required local CPU queue
  - Only even CPUs pull from the shared queue in the dispatch function (the full wiring into the ops struct is sketched below)
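Finally, the callbacks are registered with the kernel through a struct_ops map. Here's a sketch mirroring scx_simple's registration macro; the running/stopping/enable vtime-tracking callbacks and the exit handler are assumed to be copied from scx_simple unchanged, and the name is what /sys/kernel/sched_ext will report:

SCX_OPS_DEFINE(packet_ops,
	       .select_cpu	= (void *)packet_select_cpu,
	       .enqueue		= (void *)packet_enqueue,
	       .dispatch	= (void *)packet_dispatch,
	       .running		= (void *)packet_running,
	       .stopping	= (void *)packet_stopping,
	       .enable		= (void *)packet_enable,
	       .init		= (void *)packet_init,
	       .exit		= (void *)packet_exit,
	       .name		= "packet");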
Rebuild and Run
In scx/scheds/c/meson.build, add our custom scheduler:
c_scheds = ['scx_simple', 'scx_qmap', 'scx_central', 'scx_userland', 'scx_nest',
'scx_flatcg', 'scx_pair', 'scx_prev', 'scx_packet']
meson setup build --reconfigure
meson compile -C build scx_packet
# If successful, the binary will be available at build/scheds/c/scx_packet.
sudo ./build/scheds/c/scx_packet
# Verify that our scheduler is loaded
vboxuser@sch:~/work/scx$ cat /sys/kernel/sched_ext/state /sys/kernel/sched_ext/*/ops 2>/dev/null
enabled
packet
Testing Our Custom Scheduler
Now that we've implemented our even-CPU-only scheduler, it's time to put it to the test. We'll use two different types of workloads to verify its behavior:
- A CPU-intensive workload using stress-ng to see how tasks are distributed
- A graphics application (glxgears) to observe how our scheduler affects rendering performance
Installing the Testing Tools
First, let's install stress-ng for CPU stress testing, along with MangoHud and mesa-utils for the graphics test:
sudo apt update
sudo apt install -y stress-ng
sudo apt install -y mangohud
sudo apt install -y mesa-utils
Test 1: CPU Stress Testing
This creates 5 worker processes performing intensive matrix multiplication:
sudo stress-ng --cpu 5 --cpu-method matrixprod --timeout 15s
htop should show activity primarily on CPUs 0, 2, 4, 6, 8. But what if we use --cpu 10? Still only 5 CPUs running: the scheduler keeps all work on the even-numbered ones!
Test 2: Graphics Performance Testing
# Run glxgears with MangoHud overlay
MANGOHUD=1 mangohud --dlsym glxgears -info
This will display the FPS counter overlaid on the rotating gears. Note the FPS values and CPU utilization shown in the MangoHud overlay.
(Screenshots: MangoHud FPS overlay under scx_simple vs. scx_packet.)
Performance Metrics Analysis
The stress-ng metrics align with the logical expectations:
| Metric | scx_packet (Even CPUs) | scx_simple (All CPUs) | Difference |
|--------|------------------------|-----------------------|------------|
| Bogo-ops | 5,603,885 | 8,479,124 | ~51% higher with all CPUs |
| Bogo-ops per second | 546,351 | 830,616 | ~52% higher with all CPUs |
| CPU usage per instance | 103.92% | 134.21% | ~29% higher with all CPUs |
| MB received per second | 5.59 | 8.46 | ~51% higher with all CPUs |
Conclusion and Future Work
Our even-CPU scheduler demonstrates the basic principles of CPU control with sched_ext, but real-world applications like free5GC or other networking stacks present more complex scheduling challenges. Let's explore how we might adapt our scheduler for packet-processing workloads.
Optimizing for Network Packet Processing
Packet processing workloads have unique characteristics that require specialized scheduling approaches:
- I/O Bound: Network interfaces generate interrupts when packets arrive, making some tasks I/O bound as they wait for new packets
- CPU Bound: Once packets arrive, processing them (parsing, encrypting/decrypting, routing) can be CPU intensive
Task Classification
The original array-based loop doesn't translate directly to BPF, so here's a sketch that matches task names (p->comm) with bpf_strncmp():

/* Check if a task is a network-related process */
static inline bool is_network_task(struct task_struct *p)
{
	char comm[16];

	/* p->comm holds the task name; copy it to the stack for comparison */
	__builtin_memcpy(comm, p->comm, sizeof(comm));

	/* bpf_strncmp() requires a constant string as its last argument, so
	 * we check each known packet-processing name prefix explicitly
	 * instead of looping over an array */
	return bpf_strncmp(comm, 3, "upf") == 0 ||
	       bpf_strncmp(comm, 4, "dpdk") == 0 ||
	       bpf_strncmp(comm, 3, "ovs") == 0 ||
	       bpf_strncmp(comm, 3, "xdp") == 0;
}
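With this classifier in place, we can act on the I/O-bound nature of packet tasks in the select_cpu path: when a waking network task finds an idle CPU, dispatch it there immediately instead of routing it through the shared queue. A sketch, reusing scx_bpf_select_cpu_dfl() and the direct-dispatch pattern from scx_simple:

s32 BPF_STRUCT_OPS(packet_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);

	/* Wakeup latency matters most for I/O-bound packet tasks: if the
	 * default selection found an idle CPU, bypass the shared queue and
	 * dispatch straight to that CPU's local DSQ */
	if (is_network_task(p) && is_idle)
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}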
Prioritizing Packet Tasks
We can also modify our enqueue function to give packet-processing tasks higher priority:
void BPF_STRUCT_OPS(packet_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Prioritize network packet processing tasks */
	if (is_network_task(p)) {
		/* Use a shorter time slice for responsiveness */
		u64 short_slice = SCX_SLICE_DFL / 2;

		/* Give packet tasks an earlier vtime so they sort ahead of
		 * regular tasks in the shared queue (vtime is unsigned, so
		 * "earlier" rather than "negative") */
		u64 priority_vtime = vtime_now - (SCX_SLICE_DFL * 2);

		scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, short_slice,
					 priority_vtime, enq_flags);
		return;
	}

	/* ... same enqueue path as before for everything else ... */
}

The shorter slice trades a little throughput (more context switches) for lower latency on the packet path.
Next Steps
- Integrating with DPDK or XDP for Zero-Copy Packet Processing
- Benchmarking against standard schedulers with realistic network traffic
References
- sched-ext Tutorial
- 内核调度客制化利器：SCHED_EXT (SCHED_EXT: a powerful tool for kernel scheduling customization)
- BPF 赋能调度器：万字详解 sched_ext 实现机制与工作流程 (BPF-powered schedulers: a deep dive into sched_ext's implementation and workflow)
- Pluggable CPU schedulers
- sched_ext: scheduler architecture and interfaces
- eBPF 隨筆（七）：sched_ext (eBPF notes, part 7: sched_ext)
- scx_goland_core
About
Hello, I'm William Lin. I'm excited to be a member of the free5GC project, which is part of the Linux Foundation, and I'm always eager to discuss any aspect of core network development or related technologies.
Connect with Me
- GitHub: williamlin0518
- LinkedIn: Cheng Wei Lin