Benchmarking with Stabilizer: Begone eyeball statistics!
Hello! 🦆 I've been busy working on an international move, so this issue has been quite delayed even though I started working on at the start of May. So apologies if you were expecting it to come out sooner; if it's any consolation, I made the same mistake.
This issue covers Stabilizer: Statistically Sound Performance Evaluation (PDF) (ASPLOS 2013) by Charlie Curtsinger and Emery Berger. You can also check out the code for the paper, although, that has bit-rotted by this point. The code targets LLVM 3, whereas the latest LLVM release is LLVM 14. For context, the paper came out in 2013, so the bit-rot is not surprising. The pace of LLVM development is relentless, and industrial users of LLVM have enough engineering power to keep up with breaking changes. Academics on the other hand, perhaps not so much.
Getting back to the main topic, the paper covers a few things:
- It discusses why accounting for the positions in memory of code and objects is important for robust benchmarking.
- It describes Stabilizer, which is a compiler pass plus a runtime library developed by the authors that randomizes the positions of code and objects in memory.
- It describes how powerful statistical tests can be used to analyze the results after benchmarking code with Stabilizer.
- It delivers a spicy result: the performance benefits of LLVM 3's -O3 optimizations over -O2 on the SPEC CPU2006 benchmarks are indistinguishable from random noise.
While I do try to be entertaining, I cannot possibly match the pizzazz of Emery Berger himself. Unless you'd rather read a bunch of text over watching a video, I highly recommend that you watch Berger's presentation at StrangeLoop 2019: Performance Matters (Stabilizer is discussed till the 26 minute mark, later discussion is about causal profiling using Coz). After that, come back here and scroll to the Evolving the status quo section near the end.
Okay! If you're still reading this, I'm assuming you're interested in learning about the paper before my hot (?) takes. Let's get into it.
Why care about where things are in memory
Turns out, the von Neumann model of computing misses a lot of details about how modern CPUs work. Specifically, modern CPUs have multiple levels of caches; these caches use coherency protocols, because hardware people can't meme about cache invalidation like software people do. Loads and stores from memory happen at the granularity of cache lines (usually 64 bytes), which means that loading data from (storing data to) the same cache line from the same core -- or even two different cores -- has different performance characteristics than loading data from (storing data to) different cache lines.
There's also branch prediction, which is a whole another fun topic. The basic idea is that CPUs predict the results of branches and "execute" code based on the result of the prediction. Increasingly sophisticated branch predictors are one of the reasons behind the speed improvements of modern CPUs after the breakdown of Dennard scaling, but this also means that mispredicts can be quite expensive. One of the things that affects prediction results is the memory address of the branch instruction.
Together, these factors mean that the relative and absolute placement of code and objects in memory can and does affect performance, sometimes dramatically so.
For example, one of the papers cited by the authors, the delightfully titled Producing Wrong Data Without Doing Anything Obviously Wrong! (ASPLOS 2009), mentions that the size of environment variables -- which affects the position of the program stack, and hence the positions of all stack-allocated objects -- can degrade performance by 300% in the worst-case. Have a different PATH than your colleague who has optimal performance on their machine? Well, sucks to be you.
Worst-case aside, even in the typical case, there can be performance variations in the range of single to low double digit percentages on benchmarks on changes to memory layout.
Such large effects mean that establishing a causal relationship between code changes and small improvements to benchmarks made via standard benchmarking tools (which do not account for memory layout) is difficult at best, and really just a different form of "works on my machine 😎" at its worst.
Stabilizer - Or: how I learned to stop worrying and love randomization
Stabilizer solves the problem of accounting for changes in memory layout by randomizing memory layouts for code and objects. This randomization is not done just once at program startup; that would require a lot of runs to gain statistical confidence. Instead, re-randomization happens dynamically on a periodic basis.
I'll cover how each bit works at a high level; you can read the paper for more details.
The compiler pass for Stabilizer does three key things:
- It renames
mainso that the runtime library can run stuff before
- It converts references to global variables and other functions into indirect references loaded from a function-specific relocation table. The relocation table is itself referred to with a PC-relative offset from the function's body.1
- It does a bunch of special handling for floating-point conversions and constants.
The runtime library creates one or more copies of function bodies on the randomized code heap with relocation tables attached at the end of each copy. A relocation table holds pointers to global variables and functions referred to by the function copy, as expected by the codegen.
Additionally, the runtime places trap instructions at the start of function bodies. When the trap is triggered, the runtime copies the function body elsewhere, creates a new relocation table, overwrites the trap instruction with a jump to the new copy of the function, and adds the function to a live set.
A function can have multiple copies which are simultaneously being used (for example, due to mutual recursion); the copies under use are tracked in a separate set. In addition to the earlier mentioned steps, when a trap instruction is hit, the runtime walks the stack, marks function copies being used, and frees other copies.2
The compiler pass inserts a 256 byte padding table before each stack frame and a stack slot for an index into this padding table. On function entry and exit, the compiler pass adjusts the stack pointer based on the byte from the padding table referred to by the index. It also increments the index for each call (the index is interpreted with wraparound).
The runtime library refreshes the padding tables with random values during re-randomization.
Together, these two mean that the "actual stack frames" start at varying offsets over the course of execution.
There is no compiler component for heap randomization;
the compiler inserts calls to
as if nothing funky is happening.
The runtime library intercepts calls to
(for example, you can do this using
LD_PRELOAD on Linux).
It also maintains pre-populated arrays for
different size classes.
At allocation time, it will call
select a random index from the pre-populated array
of the appropriate size class,
array[i] pointer with the newly
pointer, and return the pointer loaded from the array.
There is a similar swap during
Avoiding eyeball statistics
So we've somehow managed to randomize things in memory, and are able to keep re-randomizing them on the fly. Very cool. However, it turns out that running benchmarks isn't sufficient; we also need to interpret their results in a meaningful way.
Unfortunately, many tools only expose averages and variances, making it hard to understand the whole picture. Showing only averages and variances only works under the assumption of normality. Some tools add spice by mentioning outliers, perhaps even discarding some outliers, or discarding some warmup measurements. Tools with extra panache will show the entire distribution, and assume that I know what I'm doing when I'm looking at the distribution.
The problem is, of course, that I don't know what I'm doing. As the saying goes, there are lies, there are damned lies, and there are lies which I have forgotten, such as statistics. Unfortunately, it's not just me. Since doing statistics is hard, many people resort to what Berger jokingly calls "eyeball statistics" where you look at some results and go "hmm, that looks like an improvement."
Okay, maybe you add some p-values in because you're feeling posh and fancy. However, without a precise understanding of what the p-values do and do not convey, they might as well not be there.
So how does Stabilizer help here?
The paper provides an argument via the Central Limit Theorem on why Stabilizer's randomization results in normally distributed execution times. My statistics knowledge is too rusty to figure out if there are any gaps in their argument, so let's assume that it is correct. (Section 5.1 covers how they checked this empirically; you could repeat this for your own benchmarks.)
With the normal distribution confidently in hand, we are in business. 🏋🏽
For individual benchmarks, the paper covers how you can use the Student's t-test to evaluate if there is a statistically significant effect before vs after a change.
For a suite of benchmarks, there is another technique called Analysis of Variance (ANOVA), which correctly aggregates differences across benchmarks, unlike pairwise t-tests which have a high false-positive rate. For example, if you have a few dozen benchmarks, and 1-2 of them show statistically significant improvements and others don't, you may be tempted to consider the change as a "small win." However, it's possible that the overall result is statistically insignificant. ANOVA can correctly account for such situations.
LLVM -O3 under the microscope
Combining all of the above, the authors apply ANOVA to the SPEC CPU2006 benchmarks, compiling with LLVM and comparing the results of -O2 vs -O3. (They also apply it to -O1 vs -O2, but the results aren't as interesting.)
The results show that -O2 optimizations are significant at a 90% confidence level, but not at the 95% level [...] [with] a p-value of 0.264. We fail to reject the null hypothesis and must conclude that compared to -O2, -O3 optimizations are not statistically significant.
Now, of course, there are a bunch of caveats to these results, preventing "never use -O3" from becoming a guideline.
First, the results are specific to LLVM 3. Given that LLVM has undergone many changes since then, it's not clear if that same findings apply to recent versions of LLVM. On a related note, a small but growing number of codebases are using newer features like profile-guided optimization (PGO) and link-time optimization (LTO); it's not clear how these combine with -O2 and -O3. For example, is it the case that with PGO, there is a statistically significant difference between -O2 and -O3 that is robust across codebases?
Second, the analysis is done on the SPEC CPU2006 benchmarks; it's not obvious how well the results carry over to specific libraries and applications, as well as code written in languages other than C and C++. For the latter, it might not even matter, because the default list of passes in LLVM is tuned for C and C++ code, and different languages may be using different default passes at different optimization levels.
That said, anecdotally, I've seen multiple performance experts recommend avoiding -O3 because of overly aggressive optimizations, or at least only using -O3 if there is are workload-specific benchmarks that demonstrate improvements.
Evolving the status quo
Randomizing memory layout seems essential
My naive brain is a little surprised that no one has gone ahead and re-implemented Stabilizer to work with a more recent LLVM version. Upstreaming the code was probably considered an impossibility because of licensing (Stabilizer was licensed under GPLv2 until 2021), but that didn't preclude someone from doing a clean room re-implementation.
If I were to speculate, here are a few possible reasons why nobody has re-implemented Stabilizer yet:
- It requires both compiler and runtime work, which means that fewer engineers have the necessary expertise to drive such a project.
- Making effective use of a Stabilizer-like tool over time requires decent continuous automated benchmarking infrastructure, which isn't very common.
- From a results point-of-view, developing a Stabilizer-like tool seems likely to come up with a bunch of negative results, demonstrating that people's "performance improvements" were actually not statistically significant.3 It's less likely to come up with positive results ("oh, that change you thought was not a performance improvement was actually a performance improvement" or even "hey, that performance degradation you thought you made is not actually statistically significant") because people are unlikely to manually trigger benchmark runs for changes which they didn't think make performance improvements.
- Companies seem to consistently under-invest in benchmarking anyways (and benchmarking is often [unfortunately] primarily viewed as a marketing tool), so this seems consistent with that pattern.
That said, it's not all doom and gloom! It seems like the Cranelift project is interested in implementing the functionality of Stabilizer for benchmarking, see this accepted Cranelift RFC for the details. I did find a PR to randomize heap allocations in benchmarks; I don't know about the status of other sub-items.
Stabilizer for languages using tracing GCs
As I discussed before, one of the key parts of Stabilizer is how it intercepts calls to malloc and free (or equivalent). This means that anyone using a custom allocator, or more generally a custom memory management subsystem, will not get proper randomization out-of-the-box when using Stabilizer.
For example, consider an arena allocator, which is commonly used in high-performance C and C++ code. The minor heap for language implementations using tracing GCs is also often a (thread-local) arena allocator.
For an arena allocator, one could do a similar initialization step
where a bunch of values are allocated up front (per size class)
so that randomization can work.
However, partitioning by size classes may lead to very different
application performance characteristics due to different locality.
For example, if the application alternates between allocating
8 byte objects and 16 byte objects (assume 8 byte alignment for both),
an arena allocator could lay these out contiguously,
whereas randomization would mean that consecutive (in time)
allocations are not consecutive (in memory).
So not only is allocation code performance potentially significantly affected (which GCed languages often rely on with tons of allocations), non-allocation application performance may be affected too, increasing overhead.
The other problem with tracing GCs is that, depending on the implementation strategy, they may move objects at runtime. So the "moving allocator" also needs randomization. If it was trying to compact stuff, then you're reducing the locality of data due to randomization. Yet more overhead!
Some languages with tracing GCs also use a JIT; the JIT's code memory management subsystem (if distinct from the one for data) needs support for randomization too.
So it seems like each language with a tracing GC (and potentially a JIT) needs separate integration. In contrast, if a Stabilizer-like tool is developed for a native code toolchain once, code written in multiple systems-y languages targeting that toolchain can benefit from it.
One potential "solution" to this is developing and evangelizing a a composable memory management library that is flexible enough for a variety of use cases. There has been a bunch of work in this area such as the Memory Management Toolkit (MMtk), but it's not very mature from what I can tell.
There's also HeapLayers, which Stabilizer itself is built on top of. I haven't looked into whether HeapLayers is flexible enough for implementing complex memory management strategies such as those used by modern tracing GCs.
Stabilizer in production
The authors note that:
STABILIZER’s low overhead means that it could be used at deployment time to reduce the risk of performance outliers, although we do not explore that use case here
Elsewhere they say:
Stabilizer is useful for performance evaluation, but its ability to dynamically change layout could also be used to improve program performance.
I feel like I'm missing something here. The Stabilizer paper came out in 2013. As far as I know, there weren't really any mature languages targeting LLVM at that time other than C and C++.
The code randomization part of Stabilizer seems to rely on having memory that is both writable and executable. However, such memory seems like catnip for attackers, given the pervasiveness and exploitability of memory safety issues in C and C++ code. Maybe the authors were thinking only about the stack and heap randomization aspects for production deployments, and didn't note this caveat because they weren't going into details about the suggestion.
That nitpick aside, perhaps oddly enough, it might be the case that any future versions of Stabilizer may be more useful in production for memory-safe languages like Rust and Swift, rather than C and C++.
Debugging an instrumented binary
The invasiveness of Stabilizer's runtime makes it seem like a debugger would need to be made aware of Stabilizer's shenanigans, or have to explicitly cooperate. For examples, function addresses are no longer fixed, which means that it isn't safe for a debugger to cache them once-per-execution.
Similarly, Stabilizer overrides the SIGTRAP handler; it seems like a debugger would need to do the same. If that's right, I'm not sure how it could be decided whether Stabilizer or the debugger should handle the signal; maybe Stabilizer could a check that it placed a trap at that function's start, and if not, it could pass the signal to the debugger? I don't know.
As you can probably tell, I'm not very knowledgeable about debugger internals, so perhaps the concerns here are overblown. But it would be a shame if using a Stabilizer-like tool meant that you couldn't debug a binary properly.
Incremental stack processing for fibers
Stabilizer needs to garbage collect code references. The implementation described in the paper performs a full stack walk when a trap is hit.
Once a trapped function is called, Stabilizer walks the stack, marks all functions with return addresses on the stack, and frees the rest
What happens when the system has a lot of "stacks" in the form of fibers/green threads? Walking all of them for each trap may be prohibitively slow (and even if isn't, it may affect tail latency poorly for server-like applications running an instrumented binary in production).
So there probably needs to be some level of incrementality for processing these. It's not unheard of for GCs to process memory incrementally, so this doesn't seem like an impossible thing to implement. However, it would take some engineering effort, as well as possibly some cooperation with the runtime executor to properly ensure fiber discovery.
Timings vs instruction counts
Instruction counts for benchmarks are a lot more stable compared to timing measurements, since they are not subject to the vagaries of outside factors like other processes, thermals, CPU clock rate as well as CPU design factors such as cache hierarchies and branch prediction. This makes them appealing as a target to optimize for benchmarking, under the assumption that on average, the frequency of instructions in different parts of the code is largely the same.
Of course, this doesn't quite work for certain things like SIMD; a scalar loop could be much slower than the vectorized equivalent. But instruction counts are sufficiently useful that several projects serious about benchmarking having adopted them.
Here's an often-cited example from SQLite mailing list post, which announces that the 3.8.7 release is 50% faster than the 3.7.17 release.
We have achieved this by incorporating hundreds of micro-optimizations. Each micro-optimization might improve the performance by as little as 0.05%. If we get one that improves performance by 0.25%, that is considered a huge win. Each of these optimizations is unmeasurable on a real-world system (we have to use cachegrind to get repeatable run-times) but if you do enough of them, they add up.
Similarly, Nicholas Nethercote's series on improving the performance of the Rust compiler in incremental steps focuses a lot on instruction counts.
While this isn't exactly related to Stabilizer directly, one wrinkle is that in the presence of Stabilizer, timing measurements become more reliable. So the question becomes: under what circumstances does it make sense to focus on more stable proxies (like instruction counts), and under what circumstances should one rely on timings? Should timing measurements aggregated across a large set of changes be used to cross-check changes to instruction counts? Are there alternate ways of making use of both pieces of data together?
I haven't given this too much thought, and I don't have a clear answer.
I'm a little sad that Stabilizer isn't available in wide use today. On the bright side, I'm looking forward to seeing how Cranelift adopts different aspects of Stabilizer. Hopefully, that work inspires similar work in more toolchains. Apart from Cranelift, I don't know of any other ongoing work in this space. If you do, please let me know via Twitter DM (@typesanitizer) or by replying to this email; I'd be happy to link it here.
That’s it for now! This issue was sent to 183 subscribers. If you’re interested in getting the next issue in your inbox directly, enter your email below. :) In the next issue, I will be covering Inferring Scope through Syntactic Sugar (ICFP 2017) by Justin Pombrio, Shriram Krishnamurthi and Mitchell Wand.
I didn't understand this part of the paper fully. Based on the description, the relocation table is attached at the end of the function. So knowing the offset to the relocation table seems like it would require knowing the number of bytes between the "current instruction" and the end of the function. However, that number is needed at the point of assembling, whereas it should only be available after assembly is complete. Maybe there is some assembly cleverness going on here which allows computing the offset at the point where it is needed? ↩
In GC terminology, this is a simple mark-sweep GC with trap instructions acting as GC safepoints. ↩
Consider you're a compiler engineer who developed a Stabilizer-like tool. Running this tool reveals that a bunch of optimizations shipped by the folks working in your adjacent offices do not give statistically significant improvements for codebases and benchmarks that your org's director cares about. Someone senior working on the optimizer leaves, and there isn't replacement headcount available. You used to have lunch with the optimizer folks. They seem cold these days. Awkward doesn't begin to describe it. ↩