Intern Series: Diary of an Ad Server Intern

These days, the guys and gals on the data pipeline team find themselves in an interesting position: wrangling 20TB of billing data and records into databases and data warehouses every single day. Our optimization team couldn’t be more pleased: the more data they have to work with, the better and more fine-tuned their bidding algorithms can be. However, this also presents an immense technical challenge for the rest of the company. Not only do we store all of this data, but we also provide reports for clients that let them know how their campaigns are doing on our platform, and we ship these reports once an hour. Finally, all of this data has to be redundant. No one would be happy if a datacenter’s worth of logs went missing.

To solve the storage and network transport problems induced by our growth, we built a system called Packrat. Its function is simple: receive data in rows, maybe store a copy, and then ship it out elsewhere. Packrat is developed and maintained by the ad server team, since it interfaces so closely with our core ad server and bidding engine, but its real purpose is to move the data into the hands of the people who can deal with it most effectively: the data team. To further our data redundancy effort, each datacenter (we have three globally) restreams all of its data to another. As you might imagine, we transmit a tremendous amount of data all over the world, and network bandwidth is a huge factor in capacity planning and our overall growth.
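
To make that flow concrete, here is a deliberately toy sketch of the store-and-forward pattern described above. Every name in it is hypothetical; none of this comes from the real Packrat codebase.

    import socket

    # Toy store-and-forward relay in the spirit of Packrat: accept a row,
    # optionally keep a local copy for redundancy, then ship it downstream
    # (e.g. to the data team, or to another datacenter for restreaming).
    # All names here are hypothetical.
    class RowRelay:
        def __init__(self, downstream_addr, archive_path=None):
            self.downstream_addr = downstream_addr  # (host, port) of the next hop
            self.archive_path = archive_path        # local copy, if we keep one

        def handle_row(self, row: bytes) -> None:
            if self.archive_path is not None:
                # "Maybe store a copy" -- append the row to a local archive.
                with open(self.archive_path, "ab") as archive:
                    archive.write(row + b"\n")
            # Ship it out elsewhere.
            with socket.create_connection(self.downstream_addr) as conn:
                conn.sendall(row + b"\n")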

When I started at AppNexus, I was told that one of my big projects for the summer would be compressing large portions of our data before it crosses the wire. I’ll admit, I was a little unsure of how it would turn out. I had just walked in the door, and by the end of my first week I had been given access to a system that was orders of magnitude larger, faster, and more uptime-critical than anything I had previously touched. Let’s just say that I had a busy first week.

One of the first things I figured out about engineering at AppNexus is that data means everything. When you’re working on a platform with as much reach and as many realtime performance requirements as the AppNexus ad serving system, you have to be sure that any changes you make won’t break reporting or the data pipeline.

I was given quite a bit of leeway to design a system that would compress records and billing data on one end of the data pipe and then (rather crucially) decompress it on the other. The design freedom that’s given to engineers isn’t entirely without restriction, though – you’re expected to prove that your proposal is the best option and back it up with data. To that end, I undertook a survey of various compression techniques, testing a number of algorithms against real-world data to see what would work best, both for our performance needs and for our goal of minimizing bandwidth usage and network transfer time. My microbenchmark tested Gzip, Bzip2, Zlib, FastLZ, and Snappy. After a variety of tests, including comparisons of compression ratio and the total time to compress and decompress a given piece of data, Snappy came out the winner, just edging out FastLZ in overall computation time. However, if you’re looking to drastically reduce the size of your data (when was the last time you backed up your parents’ computer, for instance?), Bzip2 is definitely the way to go, offering compression ratios almost a factor of two better in many cases.

"time-size""compression zoom"

The great part about taking on a project like this is that most of the effort is in the design phase. Wiring in a compression library is nothing new; figuring out the codebase and the best way to integrate it is the really interesting part. System design is something I’m really interested in, and my team members worked hard to make sure I could take on projects that would make an impact.

After picking a compression algorithm, the remaining design decisions were really left up to me. When I was ready to release something, my code was reviewed and I got very helpful constructive criticism. Overall, I was given enough autonomy to figure out the best way to work things out. And hey, how is one supposed to figure out a massive codebase if they don’t have the ability to dig around, make a couple of mistakes, and actually learn it?

When I finally pushed my code out to production, it touched almost every major component of the AppNexus ad serving platform. It cut bandwidth usage on the Packrat network by a little over 60 percent, giving us room to grow for at least the near future without any major network infrastructure upgrades.

"bandwidth" "bandwidth2"

Overall, I couldn’t be happier with how the summer turned out. I learned more about more elements of practical software engineering than I ever thought I would, and I got to make a fairly decent impact on a company that’s growing by leaps and bounds. And all in the span of about three months.

About John the Intern

John was an intern for the Ad Server Team during the summer of 2012.

This entry was posted in Architecture, Back-end Feature, Culture.

3 Comments
  • http://www.facebook.com/ashersnyder Asher Snyder

    Great writeup!

  • Sam IT

    Not to compare apples to oranges but I think it’s worth mentioning that OpenVPN has an option to compress the data stream using LZO.

    This is how I set things up between my WAN endpoints, but hey, I don’t need to transfer 15 TB per day.

    All details aside (VPN, compression algorithm, complexity/cost of the implementation, etc.), from an architectural point of view, I would prefer to implement this at the “network layer” because it would be super-easy to integrate such a solution into an existing software stack.

    Feel free to correct me if you think I’m wrong or to elaborate on why you guys chose to implement it this way, etc.

    By the way, AppNexus seems like a fun place to work!

    • http://www.facebook.com/sbahra Samy Al Bahra

      Hi Sam,

      Apologies for the late response. Our systems are very much uptime and latency-constrained, so it is important for us to have fine-tuned mechanisms to manage the scheduling and routing of requests across a myriad of load levels. By implementing this in our stack, we not only pay a minimal cost to latency (no hops across much more expensive mediums whether it is the userspace-kernelspace boundary or a network interconnect) but we can also apply more complex compression logic. For example, we may want to only compress messages that will cross a packet boundary, compress messages of a certain type or maybe only opportunistically compress messages according to the availability of CPU. By having complete control of the compression policies with direct access to application-level logic, we can more effectively explore the options that provide the highest ROI.

      The compression abilities that were added were not specific to Packrat. These changes were made to our request routing sub-system, which means that fine-tuned compression is now an option for all of our software on AdServer at zero cost. For example, some portions of our stack have a strict 10ms real-time requirement; we can now explore adopting a compression policy fine-tuned for that time-scale.

      Personally, I think it is better to architect for scalability, flexibility, reliability and observability. In my opinion, a third party network-level component contributes less to all of these factors. Based on the trade-offs, we decided to go with the route of building this technology into our software stack (which also doesn’t exclude hardware-assisted approaches in the future).

      By the way, we’re already pushing over 30TB a day… :-)
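
For what it’s worth, the kind of application-level policy Samy describes above might look something like the purely hypothetical sketch below. The thresholds, message types, and helpers are all made up for illustration and are not the actual request-routing code.

    import os

    import snappy  # python-snappy bindings, for illustration only

    MTU_PAYLOAD = 1400        # rough single-packet budget (hypothetical)
    CPU_BUSY_THRESHOLD = 0.8  # back off above this per-core load (hypothetical)

    def should_compress(message: bytes, message_type: str) -> bool:
        # Only bother when the message would spill across a packet boundary.
        if len(message) <= MTU_PAYLOAD:
            return False
        # Some message types are never worth compressing.
        if message_type in {"already_compressed", "heartbeat"}:
            return False
        # Opportunistic: skip compression when the box is already CPU-bound.
        load_per_core = os.getloadavg()[0] / os.cpu_count()
        return load_per_core < CPU_BUSY_THRESHOLD

    def maybe_compress(message: bytes, message_type: str) -> bytes:
        # Framing/flagging (so the receiver knows what it got) is omitted here.
        if should_compress(message, message_type):
            return snappy.compress(message)
        return message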