# tech blog

AppNexus is today’s most powerful, open, and customizable ad tech platform. Advertising’s largest and most innovative companies build their businesses on AppNexus.

## Baby’s first memory allocator – background

Let me start off by saying that one really isn’t supposed to write custom memory allocators. Berger, Zorn, and McKinley thorougly debunked the need for them by the vast majority of the population in [1]. If you’re looking for a speedy allocator, projects like jemalloc, tcmalloc, and dlmalloc have done such a great job improving on glibc without changing the malloc/free interface that you shouldn’t even have to think about your memory allocator unless you’re doing something hugely performance sensitive. Just throw jemalloc on the case and get a free performance improvement, then don’t bother with it again. That was our theory when we moved to jemalloc at AppNexus in August of 2011, and it continues to be the theory. But then we built our own allocator anyway. This is the story of why and how.

Unfortunately for our sanity, we’ve come a long way since August 2011. At this point, I don’t even have request rate graphs that go back that far, but our traffic has multiplied by more than 20x since 2013. Given the extremely transient and complex nature of many operations (auctions), we do a lot of allocations. There is a high probability for both leaks and for performance impact.

Broadly speaking, our large systems have two major categories of memory allocations: extremely long-lived (on the order of hours to days) and extremely short-lived (on the order of milliseconds). The long-lived allocations make up the vast majority of space allocated, but short-lived allocations make up a huge majority of the number of allocations, and are the cause of most of our memory leaks, since they’re allocated so often. Additionally, malloc and free (and their associated subroutines) can easily make up 10% of the performance profile of our applications, even when running under jemalloc. At 4-5 million requests per second, the amount of hardware necessary to support 10% of our performance profile is substantial.

Jemalloc does a great job, but we’d really need a slab allocator to get great performance. So to reiterate: one should not write one’s own allocator. The only real exception is for region-based allocators which [2] were shown to be effective. However, region allocators make a lot of assumptions about interface design which production systems may not be able to live up to. Most require that you know which region to put an allocation into for a given transaction, and that you free that entire region at the end of the transaction. This post presents a performant, safe region allocator without interface tradeoffs.

## Requirements

pool_party was designed to both prevent memory leaks on the request path (the short-lived allocations) and lower the number of resources required to just allocate memory to our applications. The traditional way to stop leaks is to write a garbage collector, but I just said we were going for a speed increase as well, so that’s out. Another option is to only allocate and free in a couple of places, and make very sure that your frees are correct. As implied by the title, this is what we went with: a pool allocator that’s tied to the concept of a transaction, which in most cases is an impression request on our adserving frontend (impbus).

Ideally, a transaction-based pool allocator would allocate a large amount of memory exactly once at the beginning of a request, and free that memory when the request is over. We could associate every request with a pool, but how would each call to our malloc know which pool to put the allocation into? We’d have to change the malloc interface, and that’s not something we’re curently interested in. Changing to a typed malloc (an_malloc(type, size)) wasn’t exactly fun, and our codebase has grown enough since that transition that we definitely don’t want to go through that pain again.

So, the requirements for our new allocator are:

1. at least as fast as vanilla jemalloc
2. prevents leaks by allocating into pools and freeing those pools at well known, well-tested points.
3. doesn’t change the malloc interface that we have now.

1) and 2) are easy enough, especially when combined: allocating into a pool is generally just an increment of something like pool->used. However, 3) really throws a wrench in the works, since we don’t know offhand which pool to allocate into. Assuming that all allocations are made into the most recent pool works when you have a pool per request and requests are started and finished without preemption, but that’s not the case. A request on impbus is made up of multiple phases with asynchronous calls to other network elements (bidders and data providers, for instance). These requests are shared-nothing, so everything is thread-local. These requests multiplex onto a single thread’s event loop, so at any given point the same pool might be used by multiple requests.

Shoot.

## The Hypothetical First Cut

Here’s the easiest implementation of a transaction-based pool allocator, as a base to understand some of the complexity to follow. This is my favored brand of C/Python pseudo-code, so please bear with me.

That’s really it. This is a workable pool allocator with a many-to-many relationship between transactions and pools. Each pool is reference-counted to make sure no requests have open references to it before it’s destroyed. When we close a transaction, we decrement the refcount on its pools and free them if the refcount has become zero.

There is at least one thing we can fix, however: each transaction needs to remember which pools it has allocated into (which is why malloc has a transaction argument), and to make that happen we either need to always know what transaction we’re working on, or we need to change the malloc interface (as seen above), which violates 3). Let’s fix that.

## Again, with feeling

The only time we absolutely know what transaction is current is when we create the transaction. Other than that, we’d need some magic to figure out which transaction to add the current pool to in malloc(). Clearly, we need to get rid of the pool list in each transaction.

Since we’re representing the set of pools as a queue ordered by creation time, we can use time to our advantage. Age in this system is defined as relative position in a single-threaded linked list, not any clock functions. We know that each transaction must allocate memory into either the pool that was the youngest when the transaction was created, or pools that were created after that moment. So all pools that we might have to free upon transaction destruction are younger than the first pool that was seen by the transaction. Let’s call that pool pool_created_into.

Additionally, if we don’t bother incrementing a refcount on any pool other than pool_created_into, then we don’t have to keep track of which pools we have to decrement ref counts for. This seems like it defeats the purpose of a reference count, since we can use objects without denoting it with an increment to their refcount. We can solve the potential issues with this with a simple property change: we never destroy a pool unless all older pools in the queue have a refcount of zero.

This works because we now only use refcounts to mark pools which were the youngest pool when a transaction was created. This essentially makes them into “barriers” beyond which we won’t destroy any memory, since transactions might still be using those pools. This hack in particular is I think what made my boss start referring to this project as the “magic memory allocator”. This also makes this a form of proxy collection–the last transaction which has a refcount on a pool will help clean up for the other transctions which used that pool when all are complete.

Here’s what we have now:

## Properties

First off, the easy one: since we can memset every block of every pool when we return it as a pointer, we’ve amortized zero-fill (memset) across objects. This amortization comes with significant performance gains–most of our objects should be memset’d at allocation if they aren’t already. N.B. the only way we can afford to memset 32MB regions for allocations is that we do it incrementally, when we give out an address which passes a page boundary. Automatic zero-fill eliminates an entire class of bugs, the nastiest of which involve failed hash table lookups due to uninitialized padding in compound keys. I don’t even want to remember how many times that’s bitten me.

The proponents of region allocators (and especially those which attach regions to work units) have always given spatial locality as a reason to use them, and it’s easy to see why: beyond the 16 byte alignment required by POSIX, all of our allocations for a given piece of a transaction likely fall within a pool, and are as tightly packed as possible. On average with a 32MB pool as the default (very large allocations get their own region), we’re seeing the following (this is taken from a running system):

So we’re only actually using 12MB on average of our 32MB pools, and we can fit > 50 requests into the same pool. On average, I’ve noticed that we have almost exclusively one pool per thread at any point.

Finally, we have multiple requests in the same pool. This produces low external fragmentation, since the working set of a thread is likely to be within the same region. As a result, we don’t have to think nearly as hard about how large to make our pool size–bigger is almost always better for us.

## How to actually shoehorn this into a production environment

Now that we have this fancy new allocator, how do we actually retrofit our codebase to use it?

The new class of bug for a pool allocator isn’t use-after-free, it’s use-after-transaction-close. We actually became more tolerant of use after frees when I deployed pool_party, simply because free is a nop and thus a request can access all of the objects its ever allocated until the entire request is destroyed. Another class of bugs gone. In fact, we don’t even have to call free anymore, which has started a spirited debate about whether we should or not.

Sadly, there are parts of our system (logging, mostly) that use parts of a transaction after the request it corresponds to has been replied to and destroyed. We’re gradually moving to a model in which that doesn’t happen, but in the meantime we needed to get the allocator out, and we couldn’t be susceptible to crashing or corrupt data. For memory which is created on a request thread and then used outside of a transaction, we had to support non-pool (normal, standalone) allocations.

First, we made pool_party opt-in by type, which is a concept which pre-registers allocation tokens with classes of objects which will use them. We can attach metadata to the registration, including whether allocations are fixed-size or variable, and now whether the type should use our favorite new allocator. Making the allocator opt-out would require finding each type of allocation in our system, figuring out if it’s ever used off of the request path on a request thread, and then opting it out, or opting out parts of it. That didn’t sound like fun, especially since we have at least a hundred allocation types, many of them with different semantics. So only the types that we selected and knew were only used on the request path are used for the new allocator, and this allows us to iteratively expand the coverage of the new allocator as we clean up code and add new types.

There’s another problem: how do we know which allocator to use when returning from a callback after a network request? There’s a whole dynamic closure hack that would take a little too long to detail here, but suffice it to say that all callbacks restore the allocator context of the callee seamlessly.

We also cache standard size chunks on a stack in preparation for getting used again, up to a given size of the stack, which is another basically free mechanism for boosting locality and lowering cache misses.

Finally, we noticed that there were some things in our system which just couldn’t use pool party, either because the type was too dangerous to migrate or for other, more ridiculous reasons. But we still want the benefits of automated cleanup! So came up with a way for transactions to use our proxy collection-like mechanism to run cleanups on things that don’t use pool party. Each pool has a list of cleanup functions to run when the pool is destroyed, and one can add to that list with a function called an_pool_adopt. The list to add to is the one that belongs to the youngest pool, same as allocations. This makes cleaning up complicated types easy, but prevents us from actually calling free() on the object, so it’s slightly less error tolerant than normal allocations.

# The End

All told, our new allocator:

1. Results in ~40% less time spent in jemalloc code, which we can spend on making every allocation a calloc and still have performance gains left over,
2. Eliminates leaks on the request path for objects which opt into it and,
3. Doesn’t change our existing allocation interface.

Pool party has already seen one re-implementation since v1 (thanks @pkhuong!), and I’m sure it will see another as we migrate more apps to use it and discover new and pathological use cases. I think it’s possible to adapt it better to a multithreaded environment and make it more generally usable. More to come.

# References

[1] Emery D. Berger, Benjamin G. Zorn, and Kathryn S. McKinley. 2013. OOPSLA 2002: Reconsidering custom memory allocation. SIGPLAN Not. 48, 4S (July 2013), 46-57. DOI=10.1145/2502508.2502522 http://doi.acm.org/10.1145/2502508.2502522

[2] David Gay and Alex Aiken. 1998. Memory management with explicit regions. SIGPLAN Not. 33, 5 (May 1998), 313-323. DOI=10.1145/277652.277748 http://doi.acm.org/10.1145/277652.277748