Benchmarking with Stabilizer: Begone eyeball statistics!

Stabilizer: Statistically Sound Performance Evaluation

        June 5, 2022

        Hello! 🦆 I've been busy working on an international move,
so this issue has been quite delayed even though I started
working on at the start of May.
So apologies if you were expecting it to come out sooner;
if it's any consolation, I made the same mistake.
This issue covers Stabilizer: Statistically Sound Performance
Evaluation (PDF) (ASPLOS 2013) by Charlie Curtsinger and Emery Berger.
You can also check out the code for the paper,
although, that has bit-rotted by this point.
The code targets LLVM 3, whereas the latest
LLVM release is LLVM 14. For context, the paper came out in 2013,
so the bit-rot is not surprising.
The pace of LLVM development is relentless,
and industrial users of LLVM have enough engineering power
to keep up with breaking changes.
Academics on the other hand, perhaps not so much.
Getting back to the main topic, the paper covers a few things:

It discusses why accounting for the positions in memory
  of code and objects is important for robust benchmarking.
It describes Stabilizer, which is a compiler pass plus a runtime library
  developed by the authors that randomizes the positions of code and objects in memory.
It describes how powerful statistical tests can be
  used to analyze the results after benchmarking code with Stabilizer.
It delivers a spicy result:
  the performance benefits of LLVM 3's -O3 optimizations
  over -O2 on the SPEC CPU2006 benchmarks are indistinguishable
  from random noise.

While I do try to be entertaining,
I cannot possibly match the pizzazz of Emery Berger himself.
Unless you'd rather read a bunch of text over watching a video,
I highly recommend that you watch Berger's presentation
at StrangeLoop 2019: Performance Matters (Stabilizer is discussed till the 26 minute mark, later discussion is about causal profiling using Coz).
After that, come back here and scroll to the Evolving the status quo section near the end.
Okay! If you're still reading this, I'm assuming
you're interested in learning about the paper
before my hot (?) takes. Let's get into it.
Why care about where things are in memory
Turns out, the von Neumann model of computing misses a lot of details
about how modern CPUs work.
Specifically, modern CPUs have multiple levels of caches;
these caches use coherency protocols,
because hardware people can't meme about cache invalidation
like software people do.
Loads and stores from memory happen
at the granularity of cache lines (usually 64 bytes),
which means that loading data from (storing data to) the same cache line
from the same core -- or even two different cores --
has different performance characteristics
than loading data from (storing data to) different cache lines.
There's also branch prediction,
which is a whole another fun topic.
The basic idea is that CPUs predict
the results of branches and "execute"
code based on the result of the prediction.
Increasingly sophisticated branch
predictors are one of the reasons
behind the speed improvements of modern
CPUs after the breakdown of Dennard scaling,
but this also means that mispredicts
can be quite expensive. One of the things that
affects prediction results is the memory address
of the branch instruction.
Together, these factors mean that the relative
and absolute placement of code and objects in memory
can and does affect performance,
sometimes dramatically so.
For example, one of the papers cited by the authors,
the delightfully titled
Producing Wrong Data Without Doing Anything Obviously Wrong!
(ASPLOS 2009),
mentions that the size of environment variables
-- which affects the position of the program stack,
and hence the positions of all stack-allocated objects
-- can degrade performance by 300% in the worst-case.
Have a different PATH than your colleague who has
optimal performance on their machine?
Well, sucks to be you.
Worst-case aside, even in the typical case, there
can be performance variations in the range of
single to low double digit percentages on benchmarks
on changes to memory layout.
Such large effects mean that establishing a causal relationship
between code changes and small improvements to benchmarks
made via standard benchmarking tools
(which do not account for memory layout)
is difficult at best,
and really just a different form of "works on my machine 😎"
at its worst.
Stabilizer - Or: how I learned to stop worrying and love randomization
Stabilizer solves the problem of accounting for changes in memory
layout by randomizing memory layouts for code and objects.
This randomization is not done just once at program startup;
that would require a lot of runs to gain statistical confidence.
Instead, re-randomization happens dynamically on a periodic basis.
I'll cover how each bit works at a high level; you can read the paper for more details.
Code randomization
The compiler pass for Stabilizer does three key things:

It renames main so that the runtime library can run stuff before main.
It converts references to global variables and other functions
  into indirect references loaded from a function-specific relocation table.
  The relocation table is itself referred to with a PC-relative offset from the function's body.¹
It does a bunch of special handling for floating-point conversions
  and constants.

The runtime library creates one or more copies of function bodies
on the randomized code heap with relocation tables attached
at the end of each copy. A relocation table holds
pointers to global variables and functions referred to
by the function copy, as expected by the codegen.
Additionally, the runtime places trap instructions
at the start of function bodies.
When the trap is triggered, the runtime copies the function body
elsewhere, creates a new relocation table,
overwrites the trap instruction with a jump to
the new copy of the function,
and adds the function to a live set.

A function can have multiple copies which are simultaneously
being used (for example, due to mutual recursion);
the copies under use are tracked in a separate set.
In addition to the earlier mentioned steps, when a trap instruction
is hit, the runtime walks the stack, marks function copies being used,
and frees other copies.²
Stack randomization
The compiler pass inserts a 256 byte padding table before each
stack frame and a stack slot for an index into this padding
table. On function entry and exit, the compiler pass adjusts the stack
pointer based on the byte from the padding table
referred to by the index.
It also increments the index for each call
(the index is interpreted with wraparound).
The runtime library refreshes the padding tables with random values during re-randomization.
Together, these two mean that the "actual stack frames"
start at varying offsets over the course of execution.
Heap randomization
There is no compiler component for heap randomization;
the compiler inserts calls to malloc and free
as if nothing funky is happening.
The runtime library intercepts calls to malloc and free
(for example, you can do this using LD_PRELOAD on Linux).
It also maintains pre-populated arrays for
different size classes.
At allocation time, it will call malloc,
select a random index from the pre-populated array
of the appropriate size class,
swap the array[i] pointer with the newly malloc-ed
pointer, and return the pointer loaded from the array.
There is a similar swap during free.
Avoiding eyeball statistics
So we've somehow managed to randomize things in memory, and are able to keep re-randomizing them on the fly. Very cool. However, it turns out that running benchmarks isn't sufficient; we also need to interpret their results in a meaningful way.
Unfortunately, many tools only expose averages and variances,
making it hard to understand the whole picture.
Showing only averages and variances only works under
the assumption of normality.
Some tools add spice by mentioning outliers,
perhaps even discarding some outliers,
or discarding some warmup measurements.
Tools with extra panache will show the entire distribution,
and assume that I know what I'm doing when I'm looking at the distribution.
The problem is, of course, that I don't know what I'm doing.
As the saying goes, there are lies,
there are damned lies, and there are lies which I have forgotten,
such as statistics.
Unfortunately, it's not just me. Since doing statistics is hard,
many people resort to
what Berger jokingly calls "eyeball statistics"
where you look at some results and go
"hmm, that looks like an improvement."

Okay, maybe you add some p-values in because you're feeling
posh and fancy. However, without a precise understanding of
what the p-values do and do not convey,
they might as well not be there.
So how does Stabilizer help here?
The paper provides an argument via the Central Limit Theorem
on why Stabilizer's randomization results in normally distributed
execution times. My statistics knowledge is too rusty to figure out
if there are any gaps in their argument, so let's assume
that it is correct.
(Section 5.1 covers how they checked this empirically;
you could repeat this for your own benchmarks.)
With the normal distribution confidently in hand, we are in business. 🏋🏽
For individual benchmarks, the paper covers how you can use
the Student's t-test to evaluate if there is a statistically significant
effect before vs after a change.
For a suite of benchmarks, there is another technique
called Analysis of Variance (ANOVA), which correctly aggregates
differences across benchmarks, unlike pairwise t-tests
which have a high false-positive rate.
For example, if you have a few dozen benchmarks,
and 1-2 of them show statistically significant improvements
and others don't, you may be tempted to consider
the change as a "small win."
However, it's possible that the overall result
is statistically insignificant.
ANOVA can correctly account for such situations.
LLVM -O3 under the microscope
Combining all of the above,
the authors apply ANOVA to the SPEC CPU2006 benchmarks,
compiling with LLVM and comparing the results of -O2 vs -O3.
(They also apply it to -O1 vs -O2, but the results aren't as interesting.)

The results show that -O2 optimizations are significant at a 90% confidence level, but not at the 95% level [...] [with] a p-value of 0.264. We fail to reject the null hypothesis and must conclude that compared to -O2, -O3 optimizations are not statistically significant.

Now, of course, there are a bunch of caveats to these results,
preventing "never use -O3" from becoming a guideline.
First, the results are specific to LLVM 3.
Given that LLVM has undergone many changes since then,
it's not clear if that same findings apply to recent versions of LLVM.
On a related note, a small but growing number of codebases
are using newer features like profile-guided optimization (PGO)
and link-time optimization (LTO); it's not clear how these combine
with -O2 and -O3. For example, is it the case that with PGO,
there is a statistically significant difference between -O2 and -O3 that is robust across codebases?
Second, the analysis is done on the SPEC CPU2006 benchmarks;
it's not obvious how well the results carry over to specific libraries and applications,
as well as code written in languages other than C and C++.
For the latter, it might not even matter,
because the default list of passes in LLVM is tuned for C and C++ code,
and different languages may be using different default passes at different optimization levels.
That said, anecdotally, I've seen multiple performance experts
recommend avoiding -O3 because of overly aggressive optimizations,
or at least only using -O3 if there is are workload-specific
benchmarks that demonstrate improvements.
Evolving the status quo
Randomizing memory layout seems essential
My naive brain is a little surprised that no one has gone ahead and re-implemented
Stabilizer to work with a more recent LLVM version.
Upstreaming the code was probably considered an impossibility
because of licensing (Stabilizer was licensed under GPLv2 until 2021),
but that didn't preclude someone
from doing a clean room re-implementation.
If I were to speculate, here are a few possible reasons why nobody
has re-implemented Stabilizer yet:

It requires both compiler and runtime work, which means that
  fewer engineers have the necessary expertise to drive such a project.
Making effective use of a Stabilizer-like tool over time requires decent continuous automated benchmarking infrastructure, which isn't very common.
From a results point-of-view, developing a Stabilizer-like tool seems likely to come up with a bunch of
  negative results, demonstrating that people's "performance improvements"
  were actually not statistically significant.³
  It's less likely to come up with positive results
  ("oh, that change you thought was not a performance improvement
  was actually a performance improvement" or even "hey, that performance degradation you thought you made is not actually statistically significant")
  because people are unlikely to manually trigger benchmark runs
  for changes which they didn't think make performance improvements.
Companies seem to consistently under-invest in benchmarking anyways (and benchmarking is often [unfortunately] primarily viewed as a marketing tool),
  so this seems consistent with that pattern.

That said, it's not all doom and gloom!
It seems like the Cranelift project is interested in implementing
the functionality of Stabilizer for benchmarking,
see this accepted Cranelift RFC for the details.
I did find a PR to randomize heap allocations in benchmarks; I don't know about the status of other sub-items.
Stabilizer for languages using tracing GCs
As I discussed before, one of the key parts of Stabilizer
is how it intercepts calls to malloc and free (or equivalent).
This means that anyone using a custom allocator,
or more generally a custom memory management subsystem,
will not get proper randomization out-of-the-box
when using Stabilizer.
For example, consider an arena allocator,
which is commonly used in high-performance C and C++ code.
The minor heap for language implementations using tracing GCs
is also often a (thread-local) arena allocator.
For an arena allocator, one could do a similar initialization step
as with malloc/free,
where a bunch of values are allocated up front (per size class)
so that randomization can work.
However, partitioning by size classes may lead to very different
application performance characteristics due to different locality.
For example, if the application alternates between allocating
8 byte objects and 16 byte objects (assume 8 byte alignment for both),
an arena allocator could lay these out contiguously,
whereas randomization would mean that consecutive (in time)
allocations are not consecutive (in memory).
So not only is allocation code performance potentially significantly
affected (which GCed languages often rely on with tons of allocations),
non-allocation application performance may be affected too,
increasing overhead.
The other problem with tracing GCs is that,
depending on the implementation strategy,
they may move objects at runtime.
So the "moving allocator" also needs randomization.
If it was trying to compact stuff, then you're reducing
the locality of data due to randomization.
Yet more overhead!
Some languages with tracing GCs also use a JIT;
the JIT's code memory management subsystem
(if distinct from the one for data)
needs support for randomization too.
So it seems like each language with a tracing GC (and potentially a JIT)
needs separate integration.
In contrast, if a Stabilizer-like tool is developed for a native code toolchain once, code written in multiple systems-y languages targeting that toolchain can benefit from it.
One potential "solution" to this is developing and evangelizing a
a composable memory management library that is flexible
enough for a variety of use cases.
There has been a bunch of work in this area such as
the Memory Management Toolkit (MMtk),
but it's not very mature from what I can tell.
There's also HeapLayers,
which Stabilizer itself is built on top of.
I haven't looked into whether HeapLayers is flexible
enough for implementing complex memory management strategies
such as those used by modern tracing GCs.
Stabilizer in production
The authors note that:

STABILIZER’s low overhead means that it could be used at deployment time to reduce the risk of performance outliers, although we do not explore that use case here

Elsewhere they say:

Stabilizer is useful for performance evaluation, but its ability to dynamically change layout could also be used to improve program performance.

I feel like I'm missing something here. The Stabilizer paper came out in 2013. As far as I know,
there weren't really any mature languages targeting LLVM at that time
other than C and C++.
The code randomization part of Stabilizer seems to rely
on having memory that is both writable and executable.
However, such memory seems like catnip for attackers,
given the pervasiveness and exploitability
of memory safety issues in C and C++ code.
Maybe the authors were thinking only about the stack and heap randomization
aspects for production deployments, and didn't note this caveat because they
weren't going into details about the suggestion.
That nitpick aside, perhaps oddly enough, it might be the case that any future versions
of Stabilizer may be more useful in production for memory-safe languages like
Rust and Swift, rather than C and C++.
Debugging an instrumented binary
The invasiveness of Stabilizer's runtime makes it seem like a debugger
would need to be made aware of Stabilizer's shenanigans, or have to explicitly cooperate. For examples, function addresses are no longer fixed, which means that it isn't safe for a debugger to cache them once-per-execution.
Similarly, Stabilizer overrides the SIGTRAP handler;
it seems like a debugger would need to do the same.
If that's right, I'm not sure how it could be decided
whether Stabilizer or the debugger should handle
the signal; maybe Stabilizer could a check that it
placed a trap at that function's start,
and if not, it could pass the signal to the debugger?
I don't know.
As you can probably tell,
I'm not very knowledgeable about debugger internals,
so perhaps the concerns
here are overblown.
But it would be a shame if using a Stabilizer-like tool
meant that you couldn't debug
a binary properly.
Incremental stack processing for fibers
Stabilizer needs to garbage collect code references.
The implementation described in the paper performs
a full stack walk when a trap is hit.

Once a trapped function is called, Stabilizer walks the stack,
marks all functions with return addresses on the stack, and frees the rest

What happens when the system has a lot of "stacks" in the form of fibers/green threads? Walking all of them for each trap may be prohibitively slow (and even if isn't, it may affect tail latency poorly for server-like applications running an instrumented binary in production).
So there probably needs to be some level of incrementality for processing these.
It's not unheard of for GCs to process memory incrementally, so this doesn't seem like an impossible thing to implement. However, it would take some engineering effort, as well as possibly some cooperation with the runtime executor to properly ensure fiber discovery.
Timings vs instruction counts
Instruction counts for benchmarks
are a lot more stable compared to timing measurements,
since they are not subject to the vagaries
of outside factors like other processes,
thermals, CPU clock rate
as well as CPU design factors such as
cache hierarchies and branch prediction.
This makes them appealing as a target to optimize
for benchmarking, under the assumption
that on average, the frequency of instructions
in different parts of the code is largely the same.
Of course, this doesn't quite work for certain things like SIMD;
a scalar loop could be much slower than the vectorized equivalent.
But instruction counts are sufficiently useful that several projects serious about benchmarking
having adopted them.
Here's an often-cited example from SQLite mailing list post, which announces that the 3.8.7 release is 50% faster than the 3.7.17 release.

We have achieved this by incorporating hundreds of
micro-optimizations.  Each micro-optimization might improve the performance
by as little as 0.05%.  If we get one that improves performance by 0.25%,
that is considered a huge win.  Each of these optimizations is unmeasurable
on a real-world system (we have to use cachegrind to get repeatable
run-times) but if you do enough of them, they add up.

Similarly, Nicholas Nethercote's series on improving the performance
of the Rust compiler in incremental steps focuses a lot on instruction counts.
While this isn't exactly related to Stabilizer directly,
one wrinkle is that in the presence of Stabilizer,
timing measurements become more reliable.
So the question becomes:
under what circumstances does it make sense
to focus on more stable proxies (like instruction counts),
and under what circumstances should one rely on timings?
Should timing measurements aggregated
across a large set of changes be used to cross-check
changes to instruction counts?
Are there alternate ways of making use of both pieces of data together?
I haven't given this too much thought, and I don't have a clear answer.
Closing thoughts
I'm a little sad that Stabilizer isn't available in wide use today. On the bright side, I'm looking forward to seeing how Cranelift adopts different aspects of Stabilizer. Hopefully, that work inspires similar work in more toolchains. Apart from Cranelift, I don't know of any other ongoing work in this space. If you do, please let me know via Twitter DM (@typesanitizer) or by replying to this email; I'd be happy to link it here.
That’s it for now! This issue was sent to 183 subscribers. If you’re interested in getting the next issue in your inbox directly, enter your email below. :) In the next issue, I will be covering Inferring Scope through Syntactic Sugar (ICFP 2017) by Justin Pombrio, Shriram Krishnamurthi and Mitchell Wand.

I didn't understand this part of the paper fully.
  Based on the description, the relocation table is attached
  at the end of the function. So knowing the offset to the relocation
  table seems like it would require knowing the number of bytes between the "current instruction" and the end of the function. However, that number is needed
  at the point of assembling, whereas it should only be available
  after assembly is complete. Maybe there is some assembly cleverness going on here which allows computing the offset
  at the point where it is needed? ↩

In GC terminology, this is a simple mark-sweep GC with trap instructions acting as GC safepoints. ↩

Consider you're a compiler engineer who developed a Stabilizer-like tool. Running this tool reveals that a bunch of optimizations shipped by the folks working in your adjacent offices do not give statistically significant improvements for codebases and benchmarks that your org's director cares about. Someone senior working on the optimizer leaves, and there isn't replacement headcount available. You used to have lunch with the optimizer folks. They seem cold these days. Awkward doesn't begin to describe it. ↩

                                Don't miss what's next. Subscribe to PL Papers You Might Love:

            Email address (required)

                    Older →

                Fearless concurrency at a discount?