tech blog

AppNexus is today’s most powerful, open, and customizable ad tech platform. Advertising’s largest and most innovative companies build their businesses on AppNexus.

x Eighty Swift

| Comments

Understanding the low level behavior of our applications is one of the most useful skills for working on high throughput systems at AppNexus. We currently process around 4 million requests per second with extremely strict latency requirements, so being able to identify and correct inefficiencies at the instruction level can yield significant performance gains. More importantly, being able to work in assembly engenders feelings of supreme confidence and consummate pride (citation needed).

Here are some of our team’s favorite (and/or least favorite) x86 instructions described through comparisons to Taylor Swift songs.


The Story Of CLFLUSH

#### Looks a lot like a tragedy now #### CLFLUSH evicts any data associated with an address from caches; i.e., it lets programs tell the CPU that it doesn’t want that data anymore. It’s an interesting instruction, in that it’s not clear how a program would use it… except exploiting RAM to break systems. It could be useful for things like self-modifying code, except x86 doesn’t need explicit instruction cache flushes. It might also be interesting when interfacing with memory mapped devices or non-standard memory types, but that’s what memory fences are for. It could also ensure a consistent initial state for microbenchmarks, but, in our experience, filling the caches by reading clean data is more representative of real performance. In the end, we really don’t know why that’s not a privileged instruction: it’s bound to be nothing but trouble, with attacks like Rowhammer or covert communication channels via cache timings. Would not try again. Just like us. Sniff

All You Had To Do Was Stay In CPU Cache

->How can I PREFETCH you back?<-

PREFETCH is the opposite of CLFLUSH in functionality and is sometimes useful. The instruction lets a program signal to the CPU that it’ll need a piece of data soonish, and also whether it will likely need that data for a short or long amount of time. Similarly to how it’s easy to imagine that all your ex had to do was stay for the relationship to work, it is tempting to believe that “just” PREFETCHing will reduce access times. However even with known access patterns, such as walking down an array of pointers, it can be surprisingly difficult to use PREFETCH correctly. There are obvious challenges like ensuring that you prefetch with enough lead time for the data to hit cache, but not so much that it leaves cache again. The hardware might already be prefetching correctly, in which case sprinkling prefetches around only creates more instructions to decode. There are also non-obvious issues, like the fact that prefetching NULL or another unmapped address won’t result in a fault, but will still cause a pipeline stall due to the TLB miss. Of course, Jonathan Corbet has a fantastic writeup on “The Problem with Prefetch:” prefetching instructions can improve performance, but using them indiscriminately is a simple pessimisation.

That said, PREFETCHW, prefetch with intent to write, is pretty much a no-brainer. This extension was introduced by AMD with 3DNow! (TM) in 1998, and only adopted by Intel with Broadwell, last year (!). PREFETCHW essentially loads the data in cache, but pre-emptively marks the cache line as written. This is useful in multi-core systems for reducing the number of cache coherency messages. The core acquires the line exclusively when it is loaded, as opposed to having to load the line and subsequently confirm that no other threads are using it. This is a common scenario for locks where we know that we’ll write to the cache line to acquire the lock and typically first have to wait until the lock is free. We can sometimes perform an eager compare-and-swap to acquire the cache line with a write, but not every lock is amenable to this technique. PREFETCHW is a general solution to this performance problem for synchronization primitives. On another note, it is possible that some songs would simply not exist if Ms. Swift also had more options for establishing exclusivity.

Got A Long List Of Ex-Dependencies

->She ran out of instruction cache<-

The execution model for contemporary out of order machines is closer to dataflow machines than to the von Neumann architecture exposed by typical instruction sets. This is particularly salient on x86 and x86-64, with their relative paucity of architectural general purpose registers (GPR). Program express operations by manipulating values in a set of 6 or 15 GPRs, and 8 or 16 SSE registers. However, hardware has many more registers and may perform multiple operations at the same time, so hardware scheduling logic transforms the instruction stream into a dependency graph, on the fly.

For that transformation to uncover maximal instruction-level parallelism and fully exploit our superscalar chips, we need as many independent subgraphs as possible. Loading a value from memory is a natural way to seed an independent subgraph, but it’s counter productive to hit the cache (or worse, memory) to load a common constant like 0 or a mask full of 1, and literals use a lot of instruction cache space.

This is where dependency breaking idioms come in and show us incredible things: early in the execution pipeline, the result of these instructions is marked as independent of the input, as if they had loaded a constant, only with fewer instruction bytes. For example, XOR x, x always results in 0; same with PXOR for SSE. PCMPEQD compares its inputs for equality; PCMPEQD x, x will thus always fill x with 1s, as x == x (bitwise).

As hardware designers get increased gates to work with, they add hardware to detect additional cases more quickly: the earlier we can detect these special case, the more execution resources are left for real work. Some microarchitectures even added support for XCHG at the rename stage of the pipeline: exchanging two registers is handled when transforming to a dependency graph, without any real data movement.

This is fortuitous. As contemporary compilers push static single assignment form further in the compilation pipeline, including register allocation, we need something like XCHG to implement Phi functions (page 5 of this comprehensive PDF) without causing register spillage.

So we’re gonna be using register renaming instructions forever, or our performance is gonna go down in flames.

Everything Will Be Alright If We Just Keep Computing Like We’re 22

->Happy Birthday, Pentium<-

The Pentium is turning 22 this year, and we’re happy, free, confused, and a little lonely. Our favorite instruction introduced in the 586 is RDTSC (PDF), which gives unprivileged programs efficient access to a high precision timer: the chip’s cycle counter (it literally increments for every cycle of the processor’s clock). By itself, RDTSC can be used for timestamping or, with cooperation from the operating system, to pull operations like gettimeofday(2) into userspace. However, RDTSC can’t be used by itself reliably: it doesn’t act as a processor barrier that prevents possible reordering of the instructions you want to benchmark! That’s where CPUID (LFENCE also works) comes in. By some interesting twist, the instruction set architects decided that CPUID could be a handy instruction which they could use to guarantee no reordering (on top of its normal duties telling you what kind of processor you’re running). Combined with fully serializing instructions like CPUID, RDTSC(P) is the most reliable tool at our disposal to gauge the performance of microbenchmarks. The *P variant is a fully serializing instruction in its own right, so you don’t need CPUID.

High performance processors have grown so ridiculously complicated that it’s likely no one has a complete and accurate mental model of their performance; at some point, we all have to double check with experiments, and the low overhead of RDTSC enables us to time very short code sequences, on the order of 100 cycles per iteration or less. In step with that complexity growth, variable frequency clocks became popular. Understandably, a deviation from wall-clock syntony can screw up your cycle-counting benchmarks, so recently RDTSC has been defined to increment at the rate of the fastest frequency the processor will run, constantly. Of course, this introduces skew of its own, but hopefully only while your processor comes up to speed at the start of your benchmark, and your benchmark is taking stats across millions of runs anyway, right? :)

With its superscalar execution, the Pentium sounded the end of the naïve “cycle counting” static performance model; it’s a good thing that, as they took away that performance model, Intel gave us tools to better understand how chips behave, like their not-sure-how-it’s-free code analyzer to go with classical sampling utilities like perf and VTune.

I Knew You Were Trouble When I Ran Perf

#### So shame on GCC #### Flew me to branches I shouldn’t go #### Now my IPC is on the cold hard ground #### [screaming] ->Branch Trouble<-

Branch misprediction can be less of an issue now that systems are so heavily bottlenecked on memory, but it still regularly hurts the performance in what would otherwise be tight code. The Pentium 4, with 20 to 31 pipeline stages (depending on the iteration), was a particularly bad example. On earlier Pentiums, pretty much any program would get performance improvements compared to the previous generation. That wasn’t the case for the P4: initially it wasn’t clear that higher-clocked P4s were actually quicker than lower-clocked Athlons or even older and cheaper P III, at least for real code in the wild! Developers had to learn all sorts of branch elimination tricks to extract good performance out of the P4. That includes adopting a coding style that let compilers emit CMOVcc (conditional move) or SETcc (convert a flag bit to a one-byte boolean), but also other tricks like using arithmetic shifts to propagate the sign bit, or SBB (subtract with borrow) a register from itself to convert the carry flag into a mask. The amazing part is that none of these instructions were well supported by the original P4!

Agner Fog’s tables of instruction latencies (PDF) quantify this badness:

Instruction Latency
CMOVcc 6 cycles
SETcc 5 cycles
Shifts (constant)    4 cycles
SBB 6 cycles

As Linus Torvalds points out, this sad situation meant that we couldn’t simply convert simple branches to branch-free code: if the branch was easily predictable, the branch-free alternative would be markedly slower. Thankfully, processors are getting better, and it looks like, even on Westmere, it’s always safe to convert simple branches to conditional moves, regardless of probabilities.

* The Cache Hits are John Wittrock and Aaron Dornbrand-Lo