Baby’s first memory allocator – background
Let me start off by saying that one really isn’t supposed to write
custom memory allocators. Berger, Zorn, and McKinley thoroughly
debunked the need for them for the vast majority of the population in
their OOPSLA 2002 paper, “Reconsidering Custom Memory
Allocation”. If you’re looking for a speedy allocator, projects like
jemalloc and dlmalloc have done such
a great job improving on glibc without changing the
interface that you shouldn’t even have to think about your memory
allocator unless you’re doing something hugely performance
sensitive. Just throw jemalloc on the case and get a free performance
improvement, then don’t bother with it again. That was our theory when
we moved to jemalloc at AppNexus in August of 2011, and it continues
to be the theory. But then we built our own allocator anyway. This is
the story of why and how.
Unfortunately for our sanity, we’ve come a long way since August 2011. At this point, I don’t even have request rate graphs that go back that far, but our traffic has multiplied by more than 20x since 2013. Given the extremely transient and complex nature of many of our operations (auctions), we do a lot of allocations, which makes both leaks and allocator-related performance problems likely.
Broadly speaking, our large systems have two major categories of
memory allocations: extremely long-lived (on the order of hours to
days) and extremely short-lived (on the order of milliseconds). The
long-lived allocations make up the vast majority of space allocated,
but short-lived allocations make up a huge majority of the number of
allocations, and are the cause of most of our memory leaks, since
they’re allocated so often. Additionally, allocation calls (and
their associated subroutines) can easily make up 10% of the
performance profile of our applications, even when running
under jemalloc. At 4-5 million requests per second, the amount of
hardware necessary to support 10% of our performance profile is
substantial. Jemalloc does a great job, but we’d really need a slab allocator to get great performance.

So to reiterate: one should not write one’s own allocator. The only real exception is for region-based allocators, which Gay and Aiken showed to be effective. However, region allocators make a lot of assumptions about interface design which production systems may not be able to live up to. Most require that you know which region to put an allocation into for a given transaction, and that you free that entire region at the end of the transaction. This post presents a performant, safe region allocator without those interface tradeoffs.
pool_party was designed both to prevent memory leaks on the request
path (the short-lived allocations) and to lower the amount of
resources our applications spend just allocating memory. The traditional
way to stop leaks is to write a garbage collector, but I just said we
were going for a speed increase as well, so that’s out. Another option
is to allocate and free in only a couple of places, and make very
sure that your frees are correct. As implied by the title, this is
what we went with: a pool allocator that’s tied to the concept of a
transaction, which in most cases is an impression request on our
adserving frontend (impbus).
Ideally, a transaction-based pool allocator would allocate a large
amount of memory exactly once at the beginning of a request, and free
that memory when the request is over. We could associate every request
with a pool, but how would each call to our malloc know which pool
to put the allocation into? We’d have to change the malloc interface,
and that’s not something we’re currently interested in. Changing to a
typed malloc (an_malloc(type, size)) wasn’t exactly fun, and our
codebase has grown enough since that transition that we definitely
don’t want to go through that pain again.
So, the requirements for our new allocator are:
1. at least as fast as vanilla jemalloc,
2. prevents leaks by allocating into pools and freeing those pools at well-known, well-tested points, and
3. doesn’t change the malloc interface that we have now.
1) and 2) are easy enough, especially when combined: allocating into a
pool is generally just an increment of a bump pointer into the pool’s region.
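To make that concrete, here’s a minimal sketch of bump allocation into a pool, in the same C/Python pseudo-code spirit as the listings below (the names and 32MB size are illustrative):

```python
POOL_SIZE = 32 * 1024 * 1024  # 32MB regions, as discussed later


class Pool:
    def __init__(self, size=POOL_SIZE):
        self.memory = bytearray(size)  # one large region, allocated once
        self.offset = 0                # the bump pointer

    def alloc(self, size, align=16):
        # Round up to the 16-byte alignment POSIX requires.
        start = (self.offset + align - 1) & ~(align - 1)
        if start + size > len(self.memory):
            return None  # pool is full; caller opens a fresh pool
        self.offset = start + size
        return start     # an offset standing in for a pointer


pool = Pool()
a = pool.alloc(24)   # -> 0
b = pool.alloc(100)  # -> 32 (24 rounded up to the next 16-byte boundary)
```

The entire allocation fast path is an add and a compare, which is why 1) is easy: no free lists, no size classes, no metadata per object.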
However, 3) really throws a wrench in the works, since we don’t know
offhand which pool to allocate into. Assuming that all allocations are
made into the most recent pool works when you have a pool per request
and requests are started and finished without preemption, but that’s
not the case. A request on impbus is made up of multiple phases with
asynchronous calls to other network elements (bidders and data
providers, for instance). Requests are shared-nothing, so everything
is thread-local, but they multiplex onto a single thread’s event
loop, so at any given point the same pool might be used by multiple
requests.
The Hypothetical First Cut
Here’s the easiest implementation of a transaction-based pool allocator, as a base to understand some of the complexity to follow. This is my favored brand of C/Python pseudo-code, so please bear with me.
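A Python sketch of this first cut (the names and details here are mine, not the production code): every transaction remembers the pools it allocated into, each pool is refcounted, and closing a transaction drops its references.

```python
POOL_SIZE = 32 * 1024 * 1024  # 32MB regions


class Pool:
    def __init__(self):
        self.refcount = 0  # open transactions referencing this pool
        self.used = 0      # bump offset into the region


class Transaction:
    def __init__(self):
        self.pools = []    # every pool this transaction allocated into


current_pool = Pool()      # per-thread in the real system


def pool_malloc(transaction, size):
    global current_pool
    if current_pool.used + size > POOL_SIZE:
        current_pool = Pool()          # current pool is full; open a new one
    pool = current_pool
    if pool not in transaction.pools:  # first touch: take a reference
        pool.refcount += 1
        transaction.pools.append(pool)
    offset = pool.used
    pool.used += size
    return (pool, offset)              # stand-in for a real pointer


def transaction_close(transaction):
    for pool in transaction.pools:
        pool.refcount -= 1
        # A refcount of zero means no open transaction can still touch
        # this pool, so its whole region can be returned to the system.
    transaction.pools = []
```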
That’s really it. This is a workable pool allocator with a many-to-many relationship between transactions and pools. Each pool is reference-counted to make sure no requests have open references to it before it’s destroyed. When we close a transaction, we decrement the refcount on its pools and free them if the refcount has become zero.
There is at least one thing we can fix, however: each transaction
needs to remember which pools it has allocated into (which is why
malloc has a transaction argument), and to make that happen we
either need to always know what transaction we’re working on, or we
need to change the malloc interface (as seen above), which violates
3). Let’s fix that.
Again, with feeling
The only time we absolutely know what transaction is current is when
we create the transaction. Other than that, we’d need some magic to
figure out which transaction to add the current pool to in
malloc(). Clearly, we need to get rid of the pool list in each
transaction.
Since we’re representing the set of pools as a queue ordered by
creation time, we can use time to our advantage. Age in this system is
defined as relative position in a single-threaded linked list, not by
any clock functions. We know that each transaction must allocate
memory into either the pool that was the youngest when the transaction
was created, or pools that were created after that moment. So all
pools that we might have to free upon transaction destruction are no
older than the first pool that was seen by the transaction. Let’s
call that pool pool_created_into.
Additionally, if we don’t bother incrementing a refcount on any pool
other than pool_created_into, then we don’t have to keep track of
which pools we have to decrement refcounts for. This seems like it
defeats the purpose of a reference count, since we can use objects
without denoting it with an increment to their refcount. We can solve
the potential issues with this with a simple property change: we
never destroy a pool unless all older pools in the queue have a
refcount of zero.
This works because we now only use refcounts to mark pools which were the youngest pool when a transaction was created. This essentially makes them into “barriers” beyond which we won’t destroy any memory, since transactions might still be using those pools. This hack in particular is, I think, what made my boss start referring to this project as the “magic memory allocator”. This also makes this a form of proxy collection–the last transaction which holds a refcount on a pool helps clean up for the other transactions which used that pool, once all are complete.
Here’s what we have now:
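A Python sketch of the refined scheme (again, names are mine): pools live in a thread-local queue ordered by creation, a transaction only pins the pool that was youngest at its creation, and pools are reclaimed from the oldest end until we hit the first refcount “barrier”.

```python
import collections

POOL_SIZE = 32 * 1024 * 1024


class Pool:
    def __init__(self):
        self.refcount = 0  # transactions pinned to this pool as a barrier
        self.used = 0


pools = collections.deque([Pool()])  # thread-local queue, oldest first


def transaction_open():
    # Pin the youngest pool: nothing at least this young may be
    # destroyed while this transaction is live.
    pool = pools[-1]
    pool.refcount += 1
    return pool  # the transaction only remembers pool_created_into


def pool_malloc(size):
    # Always allocate from the youngest pool; no transaction argument.
    if pools[-1].used + size > POOL_SIZE:
        pools.append(Pool())
    pool = pools[-1]
    pool.used += size
    return pool


def transaction_close(pool_created_into):
    pool_created_into.refcount -= 1
    # Reclaim from the oldest end, stopping at the first pool that is
    # still pinned: we never destroy a pool unless all older pools
    # have a refcount of zero.
    while len(pools) > 1 and pools[0].refcount == 0:
        pools.popleft()  # free the region back to the system
```

Note that malloc() no longer takes a transaction argument, which is exactly what requirement 3) demanded.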
First off, the easy one: since we memset every block of every pool
before we return it as a pointer, we’ve amortized zero-fill (calloc
semantics) across objects. This amortization comes with significant
performance gains–most of our objects should be memset at allocation
if they aren’t already. N.B. the only way we can afford to memset 32MB
regions for allocations is that we do it incrementally, when we give
out an address which passes a page boundary. Automatic zero-fill
eliminates an entire class of bugs, the nastiest of which involve
failed hash table lookups due to uninitialized padding in compound
keys. I don’t even want to remember how many times that’s bitten me.
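Here’s a rough sketch of that incremental zeroing (page size and names are illustrative): we only clear up to the next page boundary past the highest address handed out, so most allocations pay no memset at all.

```python
PAGE_SIZE = 4096


class ZeroingPool:
    def __init__(self, size):
        self.memory = bytearray(b"\xff" * size)  # pretend the region holds stale data
        self.offset = 0
        self.zeroed = 0  # everything below this offset has been cleared

    def alloc(self, size):
        start, end = self.offset, self.offset + size
        if end > self.zeroed:
            # Clear forward to the next page boundary past this
            # allocation; later allocations within the page are free.
            limit = min(len(self.memory), -(-end // PAGE_SIZE) * PAGE_SIZE)
            self.memory[self.zeroed:limit] = bytes(limit - self.zeroed)
            self.zeroed = limit
        self.offset = end
        return start


pool = ZeroingPool(2 * PAGE_SIZE)
a = pool.alloc(100)  # crosses into the first page: one memset of 4096 bytes
b = pool.alloc(200)  # entirely within the already-zeroed page: no memset
```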
The proponents of region allocators (and especially those which attach regions to work units) have always given spatial locality as a reason to use them, and it’s easy to see why: beyond the 16-byte alignment required by POSIX, all of our allocations for a given piece of a transaction likely fall within one pool, and are as tightly packed as possible. Measurements from a running system, with a 32MB default pool (very large allocations get their own region), bear this out.
We’re only actually using ~12MB on average of our 32MB pools, and we can fit more than 50 requests into the same pool. In practice, we almost always have exactly one live pool per thread at any given point.
Finally, we have multiple requests in the same pool. This produces low external fragmentation, since the working set of a thread is likely to be within the same region. As a result, we don’t have to think nearly as hard about how large to make our pools–bigger is almost always better for us.
How to actually shoehorn this into a production environment
Now that we have this fancy new allocator, how do we actually retrofit our codebase to use it?
The new class of bug for a pool allocator isn’t use-after-free, it’s
use-after-transaction-close. We actually became more tolerant of
use-after-frees when I deployed pool_party, simply because free is a
no-op, and thus a request can access all of the objects it’s ever
allocated until the entire request is destroyed. Another class of
bugs gone. In fact, we don’t even have to call free anymore, which
has started a spirited debate about whether we should or not.
Sadly, there are parts of our system (logging, mostly) that use parts of a transaction after the request it corresponds to has been replied to and destroyed. We’re gradually moving to a model in which that doesn’t happen, but in the meantime we needed to get the allocator out, and we couldn’t risk crashes or corrupt data. For memory which is created on a request thread and then used outside of a transaction, we had to support non-pool (normal, standalone) allocations.
First, we made pool_party opt-in by type. A type, here, is a concept which pre-registers allocation tokens with classes of objects which will use them. We can attach metadata to the registration, including whether allocations are fixed-size or variable, and now whether the type should use our favorite new allocator. Making the allocator opt-out would have required finding each type of allocation in our system, figuring out if it’s ever used off of the request path on a request thread, and then opting it (or parts of it) out. That didn’t sound like fun, especially since we have at least a hundred allocation types, many of them with different semantics. So only the types we selected and knew were used exclusively on the request path go to the new allocator, and this allows us to iteratively expand its coverage as we clean up code and add new types.
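The registration might look something like the following sketch (the field names and an_type_register are invented for illustration; only the an_malloc(type, size) shape comes from the text above):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AllocType:
    name: str
    fixed_size: Optional[int]  # None means variable-size allocations
    use_pool: bool             # opted in to the transaction pool allocator?

REGISTRY = {}

def an_type_register(name, fixed_size=None, use_pool=False):
    token = AllocType(name, fixed_size, use_pool)
    REGISTRY[name] = token
    return token

# A request-path type opts in; a long-lived config type stays on jemalloc.
BID = an_type_register("bid", fixed_size=128, use_pool=True)
CONFIG = an_type_register("config")

def an_malloc(token, size):
    # Dispatch on the registered metadata, not on the call site.
    backend = "pool" if token.use_pool else "jemalloc"
    return (backend, size)
```

The point is that the opt-in lives in one place, the registration, rather than at every allocation site.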
There’s another problem: how do we know which allocator to use when returning from a callback after a network request? There’s a whole dynamic closure hack that would take a little too long to detail here, but suffice it to say that all callbacks restore the allocator context of the callee seamlessly.
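The gist of that hack can be sketched like this (purely illustrative, not the actual impbus mechanism): every callback registration captures the allocator context active at registration time and reinstates it when the callback fires.

```python
current_context = "jemalloc"  # thread-local in the real system

def wrap_callback(fn):
    saved = current_context  # capture context at registration time
    def wrapped(*args):
        global current_context
        previous = current_context
        current_context = saved  # reinstate the registering context
        try:
            return fn(*args)
        finally:
            current_context = previous
    return wrapped

observed = []
def on_bid_response(price):
    observed.append((current_context, price))

current_context = "pool:request-42"
cb = wrap_callback(on_bid_response)   # registered on the request path
current_context = "jemalloc"          # the event loop later runs the callback
cb(1.25)
```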
We also cache standard-size chunks on a stack, up to a bounded stack depth, ready to be reused; this is another basically free mechanism for boosting locality and lowering cache misses.
Finally, we noticed that there were some things in our system which
just couldn’t use pool_party, either because the type was too
dangerous to migrate or for other, more ridiculous reasons. But we
still want the benefits of automated cleanup! So we came up with a way
for transactions to use our proxy-collection-like mechanism to run
cleanups on things that don’t use pool_party. Each pool has a list
of cleanup functions to run when the pool is destroyed, and one can
add to that list with a function called an_pool_adopt. The list to add
to is the one that belongs to the youngest pool, same as
allocations. This makes cleaning up complicated types easy, but
prevents us from actually calling free() on the object, so it’s
slightly less error tolerant than normal allocations.
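A sketch of that adoption mechanism (the shape of an_pool_adopt is assumed here, not taken from the real API):

```python
class Pool:
    def __init__(self):
        self.cleanups = []  # (function, object) pairs to run at destruction


pools = [Pool()]  # thread-local queue; pools[-1] is the youngest


def an_pool_adopt(cleanup_fn, obj):
    # Attach to the youngest pool, the same one allocations would go to.
    pools[-1].cleanups.append((cleanup_fn, obj))


def pool_destroy(pool):
    # Run adopted cleanups before the pool's memory is released.
    for fn, obj in pool.cleanups:
        fn(obj)
    pool.cleanups = []


destroyed = []
an_pool_adopt(destroyed.append, "legacy_connection")
pool_destroy(pools[0])
```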
All told, our new allocator:
- Results in ~40% less time spent in jemalloc code, which we can spend on making every allocation a calloc and still have performance gains left over,
- Eliminates leaks on the request path for objects which opt into it and,
- Doesn’t change our existing allocation interface.
Pool party has already seen one re-implementation since v1 (thanks @pkhuong!), and I’m sure it will see another as we migrate more apps to use it and discover new and pathological use cases. I think it’s possible to adapt it better to a multithreaded environment and make it more generally usable. More to come.
 Emery D. Berger, Benjamin G. Zorn, and Kathryn S. McKinley. 2013. OOPSLA 2002: Reconsidering custom memory allocation. SIGPLAN Not. 48, 4S (July 2013), 46-57. DOI=10.1145/2502508.2502522 http://doi.acm.org/10.1145/2502508.2502522
 David Gay and Alex Aiken. 1998. Memory management with explicit regions. SIGPLAN Not. 33, 5 (May 1998), 313-323. DOI=10.1145/277652.277748 http://doi.acm.org/10.1145/277652.277748