Linux Penguin LogoIf you’ve been following Linux kernel news then you’ve probably heard about the new Completely Fair Scheduler that has been merged into the upcoming 2.6.23 kernel release. It’s been a while since I’ve done much Linux kernel hacking, so the initial announcement was mostly over my head. After reading about the new scheduler in several places, I decided to do a bit of research into how the current Linux scheduler works, and what makes the new scheduling algorithm so interesting. Here’s what I learned.

What is a scheduler?

Like every multitasking operating system, Linux achieves the illusion of simultaneous execution of multiple processes (or programs) by rapidly cycling through processes that are ready to run, giving each one a short amount of time to execute on the CPU. Determining when to switch between processes, and which process should be allowed to run next is called scheduling. The kernel code that performs process scheduling is called the scheduler.

Implementing a scheduling algorithm is tricky for a couple of reasons. First, an acceptable algorithm has to ration CPU time such that higher priority tasks (e.g., interactive applications like a web browser) are given preference over low priority processes (e.g., non-interactive batch processes like program compilation). At the same time, the scheduler must protect against low priority process starvation. In other words, low priority processes must eventually be allowed to run, regardless of how many high priority processes are vying for CPU time.

Schedulers must also be carefully crafted so that processes appear to be running simultaneously, without having too large an impact on system throughput. For interactive processes like GUIs, the ideal scheduler would give each process a very small amount of time on the CPU and rapidly cycle between processes.

A process can only respond to user input when it has control over the CPU. Because users expect interactive processes to respond immediately to input, the delay between user input and process execution should ideally be imperceptible to a human (somewhere between 50 and 150ms is usually sufficient).

For non-interactive processes the situation is reversed. Switching between processes, or context switching, is a relatively expensive operation. Thus, larger slices of time on the processor, and fewer context switches can improve system performance and throughput. The scheduling algorithm must strike a balance between all of these competing needs.

The O(1) Scheduler

The current Linux scheduler uses a complicated set of heuristics to provide adequate performance to nearly every type of process. The algorithm tries to identify interactive processes by analyzing average sleep time (e.g., the amount of time the process spends waiting for input). Processes that sleep for long periods of time are probably waiting for user input, so the scheduler assumes they’re interactive.

Each process is given a time quantum based on a static priority (set via the nice system call). The time quantum determines how long the process can execute on the CPU before it is preempted and another context switch occurs.

The scheduler maintains a list of active and expired processes. When a new process is spawned, it starts in the active list. Non-interactive processes move from the active list to the expired list once their time quantum is exhausted. Expired processes are forbidden to run until all active processes expire.

Interactive processes generally remain active: after execution, the scheduler refills their time quantum and leaves them in the set of active processes. However, if the oldest expired process has waited for a long time, or if an expired process has a higher static priority than the interactive process, the scheduler moves the interactive process to the expired process list. Thus, the scheduler gives interactive processes priority while preventing process starvation by guaranteeing that all processes will eventually have a chance to run.

The defining characteristics of the current scheduler is the algorithm’s running time. If you’re not familiar with Big O notation, it’s a simple system for describing how the size of the input data set affects the length of time it takes an algorithm to execute. An O(n^2) algorithm would take 1 unit of time to run if the input data contained a single item, 4 units of time for 2 units of input, etc. The current Linux scheduler is an O(1) scheduler. In other words, the algorithm takes the same amount of time to run regardless of the input data size (in this case, the number of processes being run on the system).

If you think all of that sounds way more complicated than it has to be, you’ll be happy to know that the new Completely Fair Scheduler is far simpler, and may prove to be a superior scheduler.

The Completely Fair Scheduler

The Completely Fair Scheduler (CFS) has replaced the current O(1) Scheduler in the upcoming 2.6.23 kernel release. The CFS uses a far simpler scheduling algorithm, completely ignoring sleep time, interactive process identification, time slices, etc. Author Ingo Molnar describes the CFS as “basically [modeling] an ‘ideal, precise, multi-tasking CPU’ on real hardware.” It took a bit of homework to figure out what the heck he was talking about, but it turns out that the concept is actually quite simple.

An “ideal, precise, multi-tasking CPU” is one that can run multiple processes at the same time (which is, of course, impossible), giving each process an equal share of processor power. If a single process is running, that process would be given 100% of the processor’s power. With two processes, each would have 50% of the physical power (in parallel).

Molnar notes that “on real hardware, we can run only a single task at once, so while that one task runs, the other tasks that are waiting for the CPU are at a disadvantage - the current task gets an unfair amount of CPU time.”

With CFS, this “fairness imbalance” is tracked on a per-process basis. As a process waits for the CPU, the scheduler tracks the amount of time it would have used on our ideal processor. This time is calculated by dividing the wait time (in nanoseconds) by the total number of processes waiting. The resulting value is the amount of CPU time the process is entitled to, and is used to rank processes for scheduling and to determine the amount of time the process is allowed to execute before being preempted.

Controversy

Red Hat employee and Linux kernel hacker Ingo Molnar caused quite a stir when he sent the Completely Fair Scheduler patch out over the Linux kernel mailing list. Before Molnar announced CFS, Con Kolivas’ RSDL scheduler was positioned to be merged into the kernel. Another scheduler written by Nick Piggin called nicksched had also been considered.

Some onlookers felt that Molnar co-opted ideas that Con and Nick pioneered, re-implemented them, and called them his own. While NIH syndrome may have contributed to the creation of CFS, Molnar’s ideas are novel enough to discredit any claimed impropriety. In any case, CFS has been merged into the Linux kernel and will be included in the upcoming 2.6.23 release. That said, I doubt the debate is over.