2013-09-17 Outage Postmortem

2 Comments

On September 17, 2013 starting at 17:54 UTC (1:54 PM America/New_York) the AppNexus platform experienced a technical failure that initially fully halted ad serving and later partially degraded ad serving with the entire incident lasting approximately two and a half hours.  We messed up and we apologize.  Here is what happened and what we are doing to make sure it does not happen again.

What happened?

At 17:54 UTC (1:54 PM America/New_York), a data update from our internal database caused a crash in the server clusters that we refer to as the “impression bus” (or “impbus”). The impbus is the part of our architecture that receives ad requests, conducts auctions and serves the winning ads. This data update had passed our production validation tests and was being distributed to the approximately 900 impbus systems that we have deployed worldwide.  As a result, all impbus systems worldwide crashed, at nearly exactly the same time. Because the impbus is the entry point to the AppNexus platform, this effectively stopped all ad serving.

Within three minutes, by 17:57 UTC (1:57 PM America/New_York), two engineering teams were assembled and working on restarting the servers worldwide and determining the root cause of the crash.

By 18:30 UTC (2:30PM America/New_York), one engineering team restored enough capacity in our Amsterdam data center (AMS1) to service 50% of Amsterdam’s regular volume of direct transactions. When we refer to “direct transactions,” we mean traffic that comes to us directly from ad tags in pages, versus traffic that comes through inter-exchange supply sources. We prioritize direct traffic because failure to service direct traffic results in broken pages, whereas ignoring inter-exchange supply means that we will not be participating in those auctions. In that case,  a non-AppNexus buyer will win the auction and an advertisement will be served without any negative impact on the end user experience.

Our engineers restored service in AMS1 using a new snapshot of data from the internal database. This snapshot was tested and shown to not trigger the bug. Generating a snapshot usually takes a significant amount of  time (on the order of 45 minutes) but in the case of AMS1 we benefited from the fact that  the automatic process that periodically generates new snapshots was just completing at the time of failure.  As such we had a good snapshot ready to use in AMS1.   The engineers continued their work to bring the other 5 data centers online which required generating new snapshots in each of the other data centers.

By 18:47 UTC (2:47PM America/New_York), the second team had identified the root cause of the crash in the impbus and created and tested a fix. The root cause was an old bug in the impbus, its effect (i.e. crashing) triggered by the change in data from the internal database. The data in question is something that very rarely changes, and our test coverage for this data was incomplete. Further, the bug was also of a nature that it was not caught by the systems that validate all data changes going to production. (We call these systems the “Validation Engines” – think of them as “canaries in the coal mine”). The reason the Validation Engines did not catch the problem is because the failure relied on time having to pass between when the delta was applied and when the effect of the delta caused the server to crash. As of yesterday, we had things configured so that once a candidate data change is processed by the Validation Engine, it is eligible for distribution without waiting for any time to pass. (This is something we are going to change).

The problem is essentially a double free.  In order to help keep latency as low as possible in our systems, we use what we call “time based memory reclamation” in parts of the system to avoid other more expensive memory management schemes (locking, reference counting, etc). For this case, the data update that caused the problem was a delete on a rarely-changed in-memory object. The result of the processing of the update is to unlink the deleted object from other objects, and schedule the object’s memory for deletion at what is expected to be a safe time in the future.  This future time is a time when any thread that could have been using the old version at the time of update would no longer be using it. There was a bug in the code that deleted the object twice, and when it finally executed, it caused the crash.  So in the Validation Engine, the data update was processed which resulted in scheduling the memory delete for later, and because that operation was successful (as the delete did not actually happen yet), the change was deemed good and distributed to all the impbus servers.

Once the testing of the new impbus was complete, we made the decision to deploy the updated impbus to all data centers worldwide.

At 19:11 UTC (3:11PM America/New_York), we discovered that Mr. Murphy is alive and well.  We have never had to do a simultaneous deployment to all 6 data centers.  While we originally constructed our deployment system to handle large scale deployments, our business has been growing steadily and we did not scale the deployment system in proportion to how we scaled the production server population. Thus, when we did the global deploy, it failed. The teams dealt with this, turning it into a rolling deploy, which further added delay to the restart process.

By 20:00 UTC (4:00PM America/New_York), we had completed the deployment and were handling 100% of direct traffic worldwide, and over the next hour or so added all remaining supply traffic.

Now what?

Our next steps fall into three areas: changes to how we validate and distribute data to the production systems, improvements to how we test impbus, and a review of our emergency production processes.

In the area of data distribution, we will be making two major changes to how we deliver any data changes to our production systems in order to better protect from latent bugs (or broken data).

  •  we will be changing our current validation systems to allow changes to pass only after the deferred deletion time window passes. In yesterday’s event, the validation systems did expose the problem by eventually crashing, but because the expression of the bug required both the specific data as well a time delay to let the deferred deletion happen, the effect of the bug on the validation system was delayed until the change had already affected the impressions buses worldwide.
  •  we will be adding a second tier of validation systems that processes the data changes in the presence of production request traffic to ensure that we test the data changes under the same conditions as regular production systems.

Only after data changes pass through both of these validation tiers will those changes be given to the rest of the production systems.

In the area of testing, we are reviewing our software testing process to ensure that corner cases like the one that triggered yesterday’s outage are fully covered. This bug has been present in our systems since April 4th, but because the portion of the data model that triggered it is almost never changed (until yesterday, had not since April 4th…), we do not have adequate test coverage for this portion of the data model. This will be remedied with more tests as well as different kinds of tests.

In terms of our operational processes and deployment systems, we are reviewing our emergency procedures to ensure that we are able to stabilize and restore service in the shortest amount of time. Specifically, we going to re-engineer our recovery processes so that in the event of a failure of any size due to a data or configuration change, we have a clear and predictable path to return to a “known good” state to limit service disruption. This work is well underway and we hope to begin testing it later this week. Further, we are going to enhance and test our deployment systems to be sure that in the event a global deployment is needed, we have the needed capacity to perform such an operation.

- o -

We know that words about how sorry we are and how much we understand the impact will only go so far, and we know that we have to earn your business each and every day.  On behalf of all of the engineering teams involved in yesterday’s outage, I can tell you that we are committed to doing everything we can to ensure that this never happens again, and in the unfortunate event that it does, that it affects you and your business as little as possible.

geir

Geir Magnusson Jr.
VP, Engineering
AppNexus

This entry was posted in Uncategorized. Bookmark the permalink.

2 Comments