These days, the guys and gals on the data pipeline team find themselves in an interesting position: wrangling 20TB of billing data and records into databases and data warehouses every single day. Our optimization team couldn’t be more pleased: the more data they have to work with, the better and more fine-tuned their bidding algorithms can be. However, this also presents an immense technical challenge for the rest of the company. Not only do we store all of this data, but we also provide reports for clients that let them know how their campaigns are doing on our platform, and we ship these reports once an hour. Finally, all of this data has to be redundant. No one would be happy if a datacenter’s worth of logs went missing.
To solve the storage and network transport problems induced by our growth, we built a system called Packrat. Its function is simple: receive data in rows, maybe store a copy, and then ship it out elsewhere. Packrat is developed and maintained by the ad server team, since it interfaces so closely with our core ad server and bidding engine, but its real purpose is to move the data to the people who can deal with it effectively: the data team. To further our data redundancy effort, each datacenter (three globally) restreams all its data to another. As you might imagine, we transmit a tremendous amount of data all over the world, and network bandwidth is a huge factor in capacity planning and our overall growth.
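To make the per-row flow concrete, here's a minimal Python sketch of the idea: receive a row, maybe store a copy, then ship it downstream, with datacenters restreaming to one another in a ring. Every name here (`local_store`, `downstream`, `restream_peer`, the generic datacenter labels) is an illustrative assumption, not Packrat's actual implementation:

```python
# Illustrative sketch only -- not the real Packrat code. Models the flow
# described above: receive a row, optionally keep a local copy, then
# forward it to the next hop.

local_store = []   # stand-in for the optional local copy
downstream = []    # stand-in for the next hop in the pipeline

def handle_row(row, keep_copy=False):
    """Receive one row, optionally persist it, and forward it on."""
    if keep_copy:
        local_store.append(row)   # "maybe store a copy"
    downstream.append(row)        # "ship it out elsewhere"

# Each of the three datacenters restreams its data to another,
# forming a ring for redundancy (labels are generic placeholders).
restream_peer = {"dc1": "dc2", "dc2": "dc3", "dc3": "dc1"}

handle_row(b"log line 1\n", keep_copy=True)
handle_row(b"log line 2\n")
```

The ring shape means every datacenter's data lives in at least two places, which is the redundancy property the paragraph above is after.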
When I started at AppNexus, I was told that one of my big projects for the summer would be compressing large portions of our data before it crosses the wire. I’ll admit, I was a little unsure of how it would turn out. I had just walked in the door, and by the end of my first week, I had been given access to a system that was orders of magnitude larger, faster, and more uptime-dependent than anything I had previously touched. Let’s just say that I had a busy first week.
One of the first things I figured out about engineering at AppNexus is that data means everything. When you’re working on a platform with as much reach and as many realtime performance requirements as the AppNexus ad serving system, you have to be sure that any changes you make won’t break reporting or the data pipeline.
I was given quite a bit of leeway to design a system that would compress records and billing data on one end of the data pipe and then (rather crucially) decompress them on the other. The design freedom that’s given to engineers isn’t entirely without restriction, though – you’re expected to prove that your proposal is the best option, and then back it up with data. To that end, I undertook a survey of various compression techniques, testing a myriad of algorithms against real-world data to see what would work best for both our performance needs and our need to minimize bandwidth usage and network transfer time. My microbenchmark tested Gzip, Bzip2, Zlib, FastLZ, and Snappy. After a variety of tests, including comparing compression ratio and the total computation time of compression and decompression for a given piece of data, Snappy came out the winner, just edging out FastLZ in terms of overall computation time. However, if you’re looking to drastically reduce the size of your data (when was the last time you backed up your parents’ computer, for instance?), Bzip2 is definitely the way to go, offering compression ratios almost a factor of two better in many cases.
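To give a feel for what that kind of microbenchmark looks like, here's a minimal sketch in Python using only the standard library's gzip, bzip2, and zlib codecs (Snappy and FastLZ require third-party bindings, so they're left out here). The sample data and harness are illustrative stand-ins, not our actual test setup:

```python
import bz2
import gzip
import time
import zlib

def benchmark(name, compress, decompress, data):
    """Measure compression ratio and total round-trip time for one codec."""
    start = time.perf_counter()
    packed = compress(data)
    unpacked = decompress(packed)
    elapsed = time.perf_counter() - start
    assert unpacked == data  # the round trip must be lossless
    ratio = len(data) / len(packed)
    return name, ratio, elapsed

# Synthetic stand-in for a batch of log rows (repetitive, like real logs).
data = b"ts=1338504952 member=42 price=0.0031 seg=1,7,19\n" * 20000

codecs = [
    ("gzip", gzip.compress, gzip.decompress),
    ("bzip2", bz2.compress, bz2.decompress),
    ("zlib", zlib.compress, zlib.decompress),
]

for name, c, d in codecs:
    n, ratio, t = benchmark(name, c, d, data)
    print(f"{n:6s} ratio={ratio:7.1f}x  time={t * 1000:7.1f} ms")
```

Comparing ratio against total compress-plus-decompress time is what surfaces the tradeoff described above: heavier codecs like bzip2 buy a better ratio at a steep CPU cost, which is exactly why a fast codec can win for a latency-sensitive pipeline.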
The great part about taking on a project like this is that most of the effort is in the design phase. Implementing the use of a compression library is nothing new, but figuring out the codebase and the best method of implementation is really the interesting part. System design is something I’m really interested in, and my team members worked hard to make sure I could take on projects that would make an impact.
After picking a compression algorithm, the remaining design decisions were really left up to me. When I was ready to release something, my code was reviewed and I got very helpful constructive criticism. Overall, I was given enough autonomy to figure out the best way to work things out. And hey, how is anyone supposed to figure out a massive codebase if they don’t have the ability to dig around, make a couple of mistakes, and actually learn it?
When I finally pushed my code out to production, it touched almost every major component of the AppNexus ad serving platform. It cut bandwidth usage on the Packrat network by a little over 60 percent, giving us room to grow for at least the near future without any major network infrastructure upgrades.
Overall, I couldn’t be happier with how the summer turned out. I learned more, about more elements of practical software engineering, than I ever thought I would, and got to make a fairly decent impact on a company that’s growing by leaps and bounds. And all in the span of about three months.