What is the Completely Fair Scheduler?
Linux, Programming August 1st, 2007 - 21,241 views
If you’ve been following Linux kernel news then you’ve probably heard about the new Completely Fair Scheduler that has been merged into the upcoming 2.6.23 kernel release. It’s been a while since I’ve done much Linux kernel hacking, so the initial announcement was mostly over my head. After reading about the new scheduler in several places, I decided to do a bit of research into how the current Linux scheduler works, and what makes the new scheduling algorithm so interesting. Here’s what I learned.
What is a scheduler?
Like every multitasking operating system, Linux achieves the illusion of simultaneous execution of multiple processes (or programs) by rapidly cycling through processes that are ready to run, giving each one a short amount of time to execute on the CPU. Determining when to switch between processes, and which process should be allowed to run next is called scheduling. The kernel code that performs process scheduling is called the scheduler.
Implementing a scheduling algorithm is tricky for a couple of reasons. First, an acceptable algorithm has to ration CPU time such that higher priority tasks (e.g., interactive applications like a web browser) are given preference over low priority processes (e.g., non-interactive batch processes like program compilation). At the same time, the scheduler must protect against low priority process starvation. In other words, low priority processes must eventually be allowed to run, regardless of how many high priority processes are vying for CPU time.
Schedulers must also be carefully crafted so that processes appear to be running simultaneously, without having too large an impact on system throughput. For interactive processes like GUIs, the ideal scheduler would give each process a very small amount of time on the CPU and rapidly cycle between processes.
A process can only respond to user input when it has control over the CPU. Because users expect interactive processes to respond immediately to input, the delay between user input and process execution should ideally be imperceptible to a human (somewhere between 50 and 150ms is usually sufficient).
For non-interactive processes the situation is reversed. Switching between processes, or context switching, is a relatively expensive operation. Thus, larger slices of time on the processor, and fewer context switches can improve system performance and throughput. The scheduling algorithm must strike a balance between all of these competing needs.
The O(1) Scheduler
The current Linux scheduler uses a complicated set of heuristics to provide adequate performance to nearly every type of process. The algorithm tries to identify interactive processes by analyzing average sleep time (e.g., the amount of time the process spends waiting for input). Processes that sleep for long periods of time are probably waiting for user input, so the scheduler assumes they’re interactive.
Each process is given a time quantum based on a static priority (set via the nice system call). The time quantum determines how long the process can execute on the CPU before it is preempted and another context switch occurs.
The scheduler maintains a list of active and expired processes. When a new process is spawned, it starts in the active list. Non-interactive processes move from the active list to the expired list once their time quantum is exhausted. Expired processes are forbidden to run until all active processes expire.
Interactive processes generally remain active: after execution, the scheduler refills their time quantum and leaves them in the set of active processes. However, if the oldest expired process has waited for a long time, or if an expired process has a higher static priority than the interactive process, the scheduler moves the interactive process to the expired process list. Thus, the scheduler gives interactive processes priority while preventing process starvation by guaranteeing that all processes will eventually have a chance to run.
The defining characteristics of the current scheduler is the algorithm’s running time. If you’re not familiar with Big O notation, it’s a simple system for describing how the size of the input data set affects the length of time it takes an algorithm to execute. An O(n^2) algorithm would take 1 unit of time to run if the input data contained a single item, 4 units of time for 2 units of input, etc. The current Linux scheduler is an O(1) scheduler. In other words, the algorithm takes the same amount of time to run regardless of the input data size (in this case, the number of processes being run on the system).
If you think all of that sounds way more complicated than it has to be, you’ll be happy to know that the new Completely Fair Scheduler is far simpler, and may prove to be a superior scheduler.
The Completely Fair Scheduler
The Completely Fair Scheduler (CFS) has replaced the current O(1) Scheduler in the upcoming 2.6.23 kernel release. The CFS uses a far simpler scheduling algorithm, completely ignoring sleep time, interactive process identification, time slices, etc. Author Ingo Molnar describes the CFS as “basically [modeling] an ‘ideal, precise, multi-tasking CPU’ on real hardware.” It took a bit of homework to figure out what the heck he was talking about, but it turns out that the concept is actually quite simple.
An “ideal, precise, multi-tasking CPU” is one that can run multiple processes at the same time (which is, of course, impossible), giving each process an equal share of processor power. If a single process is running, that process would be given 100% of the processor’s power. With two processes, each would have 50% of the physical power (in parallel).
Molnar notes that “on real hardware, we can run only a single task at once, so while that one task runs, the other tasks that are waiting for the CPU are at a disadvantage - the current task gets an unfair amount of CPU time.”
With CFS, this “fairness imbalance” is tracked on a per-process basis. As a process waits for the CPU, the scheduler tracks the amount of time it would have used on our ideal processor. This time is calculated by dividing the wait time (in nanoseconds) by the total number of processes waiting. The resulting value is the amount of CPU time the process is entitled to, and is used to rank processes for scheduling and to determine the amount of time the process is allowed to execute before being preempted.
Controversy
Red Hat employee and Linux kernel hacker Ingo Molnar caused quite a stir when he sent the Completely Fair Scheduler patch out over the Linux kernel mailing list. Before Molnar announced CFS, Con Kolivas’ RSDL scheduler was positioned to be merged into the kernel. Another scheduler written by Nick Piggin called nicksched had also been considered.
Some onlookers felt that Molnar co-opted ideas that Con and Nick pioneered, re-implemented them, and called them his own. While NIH syndrome may have contributed to the creation of CFS, Molnar’s ideas are novel enough to discredit any claimed impropriety. In any case, CFS has been merged into the Linux kernel and will be included in the upcoming 2.6.23 release. That said, I doubt the debate is over.
August 1st, 2007 at 8:14 pm
Nice article Mike. I have done a lot of scheduling WRT computer network and fairness but never have read about CPU process fairness.
If each process is given a equal amount of processing power, couldn’t an app developer create multiple processes for a single app yielding more of the processing power. I know this doesn’t make any sense because there is only so much CPU power and from an application stand point, you will assume that you are virtually allocated a dedicated processor and stealing from yourself is ignorant.
August 1st, 2007 at 8:31 pm
Yea, a developer could fork processes and try to get an unfair share of the CPU, but I don’t know how much good it would do. Either way, I don’t think there would be much point to it. I suppose you could try to launch a DoS attack by forking a billion processes that run some null loop (assuming you have local access to the box), but that’ll work with pretty much any scheduler. And the kernel has other mechanisms of handling that sort of trickery.
I’m sure issues will arise and problems will be discovered with CFS, but I like the simplicity of it. With the current scheduler it’s very hard to visualize what’s going on. CFS just seems more deterministic and predictable.
August 2nd, 2007 at 11:23 am
You certainly can fork a bunch of processes to get yourself all the CPU time, but eventually the kernel will just stop giving you processes. It isn’t necessarily an advantage to have multiple processes, though, since it interprocess communication is a pain in the ass.
I’m intrigued by this new scheduler, though. While the old one is O(1), dependent on the number of priority levels, not the number of processes vying for attention, this is done with a red-black tree, which is O(log n). I believe the smarter people that it’ll be faster, but I’m curious now.
August 2nd, 2007 at 3:28 pm
that was surprisingly interesting. Nice post mike!
August 2nd, 2007 at 7:41 pm
“Molnar’s ideas are novel enough to discredit any claimed impropriety.”
While that’s probably true - it’d be nice if someone could spell them out so the lay-person can understand what they are and see why this one was chosen.
Any chance you could list some of these novel ideas in the comments here?
August 2nd, 2007 at 7:45 pm
[…] read more | digg story 豪杰 @ 12:42 am [filed under Digg […]
August 2nd, 2007 at 7:48 pm
I don’t know all the details, but my understanding is that the similarities between Molnar’s algorithm and the others stopped at the idea of simulating an “ideal processor.” If you check out the description of Kolivas’ RSDL Scheduler, for example, the actual implementation details are very different.
But I think Molnar has pretty much admitted that the initial idea wasn’t his.
August 2nd, 2007 at 8:01 pm
Nice site, I came from digg. How is this blog still up? Usually when a blog running wordpress gets digged, wp can’t handle the stress.
Are you running some optimizations or something?
August 2nd, 2007 at 8:13 pm
Jesus,
Thanks. I’ll keep it short since we’re off topic, but to answer your question: WordPress can handle a lot of traffic if it’s configured properly. It gets a bad rap because a lot of people run it out of the box, completely unoptomized on shared hosts. I’m using WP-Cache (slightly modified), and some custom stuff to host images and other files on Amazon S3. Feel free to contact me via email if you want a more detailed explanation.
August 2nd, 2007 at 8:36 pm
Ingo fought long and hard for the status quo. Complaints against the current O(1) scheduler (author: Ingo) were met with flames and ignorance. Con spent a year developing a new scheduler and proving that it was better than the scheduler that Ingo wrote. Ingo spent the entire year ignoring and denying Con’s scheduler, but one day he saw the light and wrote his own knockoff in “62 hours” and gets it merged the next week. This is bullshit.
The end result is that Con has been driven from Kernel development by asses like Ingo. Whether CFS is better than SD is irrelevant. What has happened is the guy who spawned the entire process of getting an improved scheduler in the kernel and spent a year overseeing and improving it has his concept stolen by Ingo. Ingo did not want a new scheduler, he wanted his name to be on the current scheduler. When he saw that disappearing he had to take drastic action and he “authored” a new, improved scheduler. All for his own glory.
August 2nd, 2007 at 8:49 pm
[…] detailed explanation about whats up with the upcoming 2.6.23 Linux kernel release. kitty X Menread more | digg story RSS feed for comments on this post. TrackBack URI Cartoons Fans Lounge […]
August 2nd, 2007 at 9:02 pm
Thanks for the article! Sometimes I find it a little hard to understand the technolingo surrounding linux stuff; you did a really great job breaking it down into understandable terms…
August 2nd, 2007 at 11:45 pm
From what I’ve been reading, Con pioneered alot of the mindset and ideology behind the CFS and the kernel’s new pluggable scheduling (allowing the user to pick from multiple schedulers rather than just the default)…however, since Con only did his kernel hacking in his free time and Ingo has a full time job doing, it was felt that he would support his better than Con would have time to do. There’s going to be controversy swirling around this for a while, but from what I’ve seen I think Linus made the right choice. That said, it’s a shame that these two great minds couldn’t have worked together on a common goal.
August 2nd, 2007 at 11:53 pm
I’m not sure I understand the benefit CFS proposes. From the description, it sounds like all processes are treated with equal right - whether they be a timing-sensitive app like a video player or a background task such as the system update auto-downloader. Besides reduced overhead of using a simpler scheduler, how will a user benefit if the two example processes mentioned above are given equal weight?
August 3rd, 2007 at 12:10 am
Stephen, I’m not 100% certain about that part =p, but I think the idea is that interactive processes that cycle through sleep/wake/wait phases then use a small amount of CPU time to process IO end up building a lot of “credit,” so to speak. Thus, they’ll be selected more quickly/frequently when they’re runnable.
August 3rd, 2007 at 4:12 am
Yes, I’m also curious to find out how it handles “sleeping” tasks… though I’m sure they have it covered.
Having 2 tasks, one doing heavy math, the other waiting for input, it would make no sense to have 50%/50% time slots…
Hope someone explains how CFS works in those scenarios…
Regarding Wordpress optimization, I sure would like to know more about it - maybe you could post it as an entry in your blog? :)
August 3rd, 2007 at 6:53 am
I have a question.. What if I have a CPU heavy program (a game or compiling a big package for example). Will it only have a share of the CPU? Even if it needs more?
August 3rd, 2007 at 8:05 am
Love your article. Great, interesting work, keep on! Your blog got into my GoogleReader portfolio :)…
August 3rd, 2007 at 8:09 am
@15
Mike, from the half dozen articles i’ve read you’re dead on. It basically makes it so that a process that runs full throttle gets capped at its ’share ratio’ of CPU time. If there are 15 processes it gets capped at 1/15th. Processes that give up control early build up credit and if they need to burst they are able to. They also get preempted more.
But since any process getting preempted ALSO builds ‘credit’ no process is CAPABLE of getting forgotten forever. Thus ‘completely fair’.
August 3rd, 2007 at 9:46 am
Great article Mike
I’ve been playing with /proc/sys/kernel/sched_granularity_ns (low latencies are recommended for desktop, higher for servers). I tried the lowest configurable value as well as the highest one, didn’t notice any performance gain or loss.
Anyone knows how to benchmark this in a quantifiable way? Or the default value should be “good” for most desktop users.
August 3rd, 2007 at 11:30 pm
[…] What is the Completely Fair Scheduler? - I’m Mike After reading about the new scheduler in several places, I decided to do a bit of research into how the current Linux scheduler works, and what makes the new scheduling algorithm so interesting. Here’s what I learned. (tags: kernel linux programming software scheduler) […]
August 4th, 2007 at 3:56 pm
[…] Mike briefly describes the new scheduler, called CFS. […]
August 8th, 2007 at 1:21 pm
[…] ja noch ein Linux/UNIX Cheet-Sheet, immer nützlich gerade für Einsteiger. Und eine Erklärung was denn dieser neue Scheduler in Linux da so macht, dieser […]
August 8th, 2007 at 1:37 pm
Mike-
Good if brief intro. I’d like to know if you would be willing to delve a little deeper into CFS? Some questions I have include:
1. How does CFS fare on multi-core and SMP systems? Is there linear improvement or.. something else?
2. On a more political bent, why the constant refusal to allow for end-user choice in the scheduler? I know that Linus is very adamant about there being One Scheduler, but being able to choose a scheduler on your expected workload (without hacking or patching) would be nice.
I think this gets to the crux of #2: “Schedulers must also be carefully crafted so that processes appear to be running simultaneously, without having too large an impact on system throughput.” That’s not entirely correct (as you know), but I see why it was said here. The point is that there is no requirement that a scheduler trick a user to see the system as truly multitasking (assuming we have a single CPU) for a throughput-intensive workload, such as batch job handling.
Just some thoughts!
–
Dustin Puryear
Author, Best Practices for Managing Linux and UNIX Servers
http://www.puryear-it.com
August 13th, 2007 at 9:47 pm
[…] home page later the same day, and another 10,000 clickthroughs followed. As a result, I’ve been asked by a few people how I managed to keep my site up under that sort of stress. Honestly, I […]