On September 17, 2013, starting at 17:54 UTC (1:54 PM America/New_York), the AppNexus platform experienced a technical failure that initially halted ad serving entirely and later partially degraded it. The entire incident lasted approximately two and a half hours. We messed up and we apologize. Here is what happened and what we are doing to make sure it does not happen again.
I suppose every software engineer accumulates a few stories about unusually elusive bugs. These stories are fun to reminisce about because, often, there’s a perfect set of circumstances that led to the perfect bug. One of my projects while interning on the data team has been to work on the Job Management Framework (JMF)—an internal (web) service used by the data team to manage and monitor various data pipeline jobs such as aggregations, syncs, purges, etc. This bug chasing story began a few weeks ago, when I was preparing JMF for its weekly deployment after implementing some routine bug fixes.
Dwight Merriman is a tech legend and entrepreneur extraordinaire. Dwight co-founded DoubleClick in 1995 and served as the company’s CTO for a decade. As CTO, Dwight designed the infrastructure for the DART ad serving technology that now drives Google’s profits. After selling DoubleClick in 2005, he and fellow executive Kevin Ryan left to start their own company. They ended up starting five, including Gilt Groupe, 10gen, and businessinsider.com. No big deal.
These days Dwight is focused on 10gen, the company behind MongoDB, a leading open source NoSQL database. At the June 12, 2013, installment of AppNexus Engineering@Scale, Dwight sat down with AppNexus CEO and Co-Founder Brian O’Kelley to talk scaling and the future of big data.
When DoubleClick launched, much of what now constitutes a tech stack didn’t exist. As a result, scaling in the early days of DoubleClick wasn’t about improving or expanding a tech stack but about creating one. Take geolocation software for example. In 1995 Dwight wrote his own geotargeting code because that critical tool for Internet ad tech hadn’t yet been invented. Even basic technology that did exist – like web browsers – had so many scaling limitations that Dwight and his team at DoubleClick developed their own homegrown solutions to meet the company’s scaling needs.
These days, as the former CEO and now Chairman of 10gen, Dwight has a full tech stack to scale. Within that tech stack, he believes that the data layer poses the most challenges for scalability – specifically horizontal scalability. Two things make it particularly hard to scale traditional databases horizontally: distributed joins and distributed transactions.
Few people are more familiar with website scalability problems than Theo Schlossnagle. Not only is Theo the founder and CEO of OmniTI, he is also the author of Scalable Internet Architectures, a book that draws on his 15 years of experience to provide developers with a blueprint for tackling the biggest obstacles to successful scaling. Theo shared his wisdom in a recent AppNexus Engineering@Scale talk.
Theo kicks the discussion off by breaking down the three biggest challenges of scaling systems: storing and accessing data, messaging, and caching.
When you face the issue of scaling, your first step is to decide whether to scale up (bigger boxes) or scale out (more boxes). It’s a simple question but one that people often get wrong. Theo’s rule of thumb is that you should never scale out when you know you can scale up: if your projections show you growing at or below the pace of Moore’s Law, you should always just use bigger boxes. Scaling out spends engineers’ time on a technical infrastructure problem when that time could be better spent elsewhere.
If you are unsure of whether or not you can scale up, Theo advises doing the following:
Understand your problem: This seems obvious, but it is important to understand what exact problems you are trying to address.
Project your possible needs and growth over the next 12 – 24 months: You do not want to deploy a solution that will be immediately obsolete because the size of the problem changed while you were building.
Deeply understand the technology at hand: Remember “new” technology is often not thoroughly tested, well understood, or supported. The older technologies have been in production systems for years, are well understood, well supported, and have strong communities behind them. Just because there is a new hotness out there does not mean it is a good match for your needs.
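As a rough illustration of the growth projection and the Moore’s Law rule of thumb above, here is a minimal sketch. It is not from Theo’s talk: the function names, the sample numbers, and the choice to model Moore’s Law as single-box capacity doubling every 24 months are all assumptions made for the example.

```python
# Assumption: "Moore's Law pace" is modeled as the capacity of a single
# box doubling every 24 months. Other doubling periods are often quoted.
MOORE_DOUBLING_MONTHS = 24.0


def projected_load(current_load, monthly_growth, months):
    """Compound the current load forward at a fixed monthly growth rate."""
    return current_load * (1.0 + monthly_growth) ** months


def moore_capacity(current_capacity, months):
    """Capacity of one box after `months`, under the doubling assumption."""
    return current_capacity * 2.0 ** (months / MOORE_DOUBLING_MONTHS)


def can_scale_up(current_load, current_capacity, monthly_growth, horizon_months=24):
    """True if projected load fits on a single bigger box in every month of the horizon."""
    return all(
        projected_load(current_load, monthly_growth, m)
        <= moore_capacity(current_capacity, m)
        for m in range(horizon_months + 1)
    )


if __name__ == "__main__":
    # Hypothetical numbers: 50 units of load on a box that handles 100.
    print(can_scale_up(50.0, 100.0, monthly_growth=0.02))  # slow growth
    print(can_scale_up(50.0, 100.0, monthly_growth=0.10))  # fast growth
```

With these made-up inputs, 2% monthly growth stays within what a bigger box could handle over 24 months, while 10% monthly growth outruns it. A real projection would use measured load and actual hardware pricing rather than a fixed doubling curve.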
The importance of how your subsystems communicate with each other, and of how you make your data available for consumption, can’t be overemphasized.
The latest AppNexus Engineering@Scale talk comes to you from Portland, where AppNexus just purchased 10,000 square feet of downtown office space. AppNexus has been aggressively growing its Portland presence and can’t wait to move into its downtown office.
In this Portlandia installment of AppNexus Engineering@Scale, two AppNexians share the stage. First up: Travis Johnson, Director of Engineering, shares insight into how the AppNexus User Interface (UI) team is using Grunt for better testing. Then Nathan Wall, one of the UI team’s software engineers, introduces the concept of high integrity coding. Caution: it might blow your mind.
Building a test-driven culture with Grunt (Travis Johnson)
About six months ago, AppNexus began to focus on creating a test-driven culture in the UI team. Drawing on advice from Daniel “dB” Doubrovkine, the team knew they needed creative solutions that would let them test early, often, and fast.
(Update: The UI team has recently stopped using grunt-regarde and has switched to grunt-contrib-watch.)
Don’t worry about it and assume it is a safe environment.
Lock the environment.