On Quora, someone asked: are there any purportedly objective benchmark tests or comparisons of operating system efficiency?
Last year, when I was in grad school, I took Margo Seltzer’s research-based Operating Systems class. Part of my thesis for my class project was that the accuracy of platform-independent micro-benchmarks suffers from inherent limitations (as opposed to platform-specific micro-benchmarks, which do not suffer from the same limitations).
Micro-benchmarks are benchmarks that measure the performance of primitive operations, such as system calls or page faults.
Platform-independent micro-benchmarks are micro-benchmarks designed to run on multiple platforms.
To prove this part of my thesis I analyzed lmbench, the most popular OS benchmark suite that uses platform-independent micro-benchmarks.
“lmbench is a suite of simple, portable, ANSI/C microbenchmarks for UNIX/POSIX. In general, it measures two key features: latency and bandwidth. lmbench is intended to give system developers insight into basic costs of key operations.” — from the lmbench SourceForge page
True to my thesis, the platform-independent aspect of lmbench’s design caused its benchmarks to suffer from inherent limitations. These limitations can only be fixed by abandoning platform-independence and developing platform-specific micro-benchmarks. The results of my study illustrate why comparative OS benchmarking is hard and why the results are often untrustworthy, which leads to my response to the Quora question…
Are there any purportedly objective benchmark tests or comparisons of operating system efficiency?
It depends on what you want to benchmark.
There are two things (that I can think of) that you might want to benchmark:
- The effect of the OS on application performance, and
- The performance of operating-system primitives such as system calls.
The first is easier to measure because you can re-run the same application benchmarks on multiple OSes and compare results. The second is much more difficult because it requires portable micro-benchmarks, and developing portable micro-benchmarks for OSes is inherently hard.
Lmbench is probably the most popular micro-benchmark suite that benchmarks operating-system primitives. Despite its popularity, lmbench produces inaccurate results on modern operating systems, due to the inherent difficulty of developing portable micro-benchmarks.
Here’s why. Lmbench attempts to achieve portability through ANSI C and by only using POSIX interfaces. Thus any system that supports ANSI C and POSIX may run the lmbench benchmark suite. However, POSIX only specifies interfaces and does not specify many implementation details. The lmbench benchmarks rely on implementation-specific details of the operating system in order to execute specific OS primitives. Thus, lmbench only yields accurate results on operating systems that follow its assumptions.
Example: The lat_pagefault benchmark
I’ll give a concrete example of an inaccurate lmbench benchmark. The lat_pagefault benchmark attempts to measure the average runtime of page faults. I’ve tested it on two operating systems. The benchmark is somewhat accurate on OS X but on Linux it under-reports page-fault runtime by several orders of magnitude.
Here’s how the benchmark works:
- Lmbench sets up the benchmark by mmap-ing a sequence of N pages, then attempts to page out the pages. To cause page outs, lmbench uses the POSIX msync() system call with the MS_INVALIDATE flag set.
- Once the memory is paged out, every memory access to the pages causes a page fault. The benchmark itself simply iterates over the mmaped pages, one by one, where each access is supposed to cause a page fault. Lmbench reports the page-fault latency as the total runtime divided by the number of page faults = runtime / N.
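The two steps above can be sketched roughly as follows. This is not lmbench’s actual source; it is a minimal illustration, and the function name avg_touch_latency_us, its parameters, and the use of clock_gettime() are assumptions made for the example. It maps a file, asks the OS to invalidate the mapping with msync(), touches one byte per page, and divides the elapsed time by N.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not lmbench code): maps the first `npages` pages of
 * the file at `path`, asks the OS to invalidate them, touches one byte per
 * page, and returns the average time per touch in microseconds.
 * Returns -1.0 on error. The file must be at least `npages` pages long. */
double avg_touch_latency_us(const char *path, size_t npages)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);
    if (npages == 0)
        return -1.0;
    size_t len = npages * (size_t)page_size;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1.0;
    volatile unsigned char *base =
        mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (base == MAP_FAILED)
        return -1.0;

    /* The portable "page out" attempt. POSIX does not promise that this
     * evicts anything, so the loop below may never fault at all. */
    if (msync((void *)base, len, MS_SYNC | MS_INVALIDATE) != 0) {
        munmap((void *)base, len);
        return -1.0;
    }

    struct timespec t0, t1;
    unsigned long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < npages; i++)
        sum += base[i * (size_t)page_size]; /* one access per page */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;

    munmap((void *)base, len);

    double elapsed_us = (double)(t1.tv_sec - t0.tv_sec) * 1e6 +
                        (double)(t1.tv_nsec - t0.tv_nsec) / 1e3;
    return elapsed_us / (double)npages; /* runtime / N */
}
```

The returned number is only a page-fault latency if every access really faulted, and nothing in this code checks that.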
The benchmark fails to be accurate for two reasons:
(Problem 1) POSIX does not specify any means to force page outs. Lmbench fallaciously assumes that msync() causes page outs. This behavior is actually a platform-specific implementation detail; the POSIX standard does not specify that invocations of msync() cause page outs. On Linux, for example, the msync() system call does not cause page outs. Therefore the benchmark only measures the latency of normal memory accesses (which are often faster by several orders of magnitude).
You can fix the Linux benchmark in a non-portable way by instructing the kernel to drop its page cache (write the value “1” to the file /proc/sys/vm/drop_caches). Note that by applying this fix, the benchmark becomes platform-specific and loses portability.
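As a rough sketch of that Linux-only step, the helper below writes “1” to a control file. The function name drop_page_cache and its control_path parameter are hypothetical, introduced only so the example can be pointed at an ordinary file when trying it out. Against the real /proc/sys/vm/drop_caches the write requires root privileges, and because the kernel only drops clean pages, the sketch calls sync() first.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: flush dirty pages, then write "1" to the given
 * control file. On Linux the real control file is
 * /proc/sys/vm/drop_caches. Returns 0 on success, -1 on failure. */
int drop_page_cache(const char *control_path)
{
    assert(control_path != NULL);

    sync(); /* dirty pages are not dropped, so write them back first */

    FILE *f = fopen(control_path, "w");
    if (f == NULL)
        return -1;

    int ok = fputs("1\n", f) >= 0;
    if (fclose(f) != 0) /* a rejected write may only surface on close */
        ok = 0;
    return ok ? 0 : -1;
}
```

A benchmark would call drop_page_cache("/proc/sys/vm/drop_caches") between mapping the file and timing the accesses, and should treat a -1 return as "the pages were probably not evicted".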
(Problem 2) You can solve Problem 1 by adding platform-specific code to cause page-outs on Linux. However, there remains another problem: the benchmark still significantly under-reports page-fault latencies.
Many modern operating-system kernels (such as Linux and Darwin) employ anticipatory paging. Rather than just paging in the single faulting page, the kernel pages in multiple pages during a fault — hoping to reduce the chance of future page faults. The kernel uses information about the process workload to heuristically determine which pages should be paged in during a fault. Thus, the memory access patterns of the benchmark directly determine the behavior of the virtual-memory manager (VMM).
Anticipatory paging affects benchmark accuracy because a single page-fault may cause many pages to be paged in, making it less likely that other memory accesses in the benchmark will cause page faults. Thus, the benchmark thinks it is causing N page faults, but in reality it is only causing some fraction of that. As a result, lmbench severely under-reports the average page-fault runtime.
To fix this problem, you can use the mincore() system call, which reports which pages of a mapping are currently resident in memory, to figure out exactly how many page faults your benchmark causes. Then, you can use that information to develop workloads that cause predictable page faults in your benchmark. Unfortunately, this system call is not part of the POSIX standard, so this solution is not portable (though it does happen to work on both Linux and OS X).
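One way to use mincore() for this is sketched below. The helper name count_resident_pages is hypothetical and this is not the code from my project; it simply counts how many pages of a mapping are resident. The vector argument is declared as unsigned char * on Linux and char * on OS X, so the sketch passes it through a void * cast.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns how many pages in [addr, addr + len) are
 * resident in memory, or -1 on error. `addr` must be page-aligned. */
long count_resident_pages(void *addr, size_t len)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);
    if (len == 0)
        return 0;

    /* mincore() fills in one status byte per page of the range. */
    size_t npages = (len + (size_t)page_size - 1) / (size_t)page_size;
    unsigned char *vec = malloc(npages);
    if (vec == NULL)
        return -1;

    if (mincore(addr, len, (void *)vec) != 0) {
        free(vec);
        return -1;
    }

    long resident = 0;
    for (size_t i = 0; i < npages; i++)
        if (vec[i] & 1) /* the low bit means "page is resident" */
            resident++;

    free(vec);
    return resident;
}
```

Calling this before and after a single access shows how many pages that one fault brought in. Counting only the accesses that hit a non-resident page gives the real number of faults to divide the runtime by.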
Correct results
I fixed these problems and ran the benchmark on a MacBook Pro under both OS X and Linux. Here are the results (in microseconds):
The graph shows that Linux is slightly faster at handling page faults than OS X. It’s also interesting to note that on both OS X and Linux, page-fault latency seems to scale linearly at the same rate with regard to the number of page-ins per fault (recall that a page fault usually results in multiple page-ins because of the anticipatory pager).
Note that these results were only possible by scrutinizing the behavior of the benchmark on different operating systems and writing non-portable, platform-specific benchmarks. In general, portable micro-benchmarks cannot safely measure the performance of platform-specific operations (such as page faults).