AppNexus engineers have it good. They don’t lie awake at night wondering if we can handle the next increase in impressions. They don’t worry that our systems are down and we don’t know it. They don’t develop in a bubble, toss their code over the wall to a mysterious group of people, and wash their hands clean.
How do we do this, when our world is measured in milliseconds and the time between our releases is measured in hours? There are three main components to our DevOps world. They interact with each other and empower our engineers in many ways.
First, there is monitoring. At AppNexus we use Nagios internally to alert us to problems. We monitor the usual system-level checks (disk, CPU, memory), and then a slew of application-specific checks as well. Today we are monitoring approximately 12,000 different pieces of our production world (or “services” in Nagios parlance). Whenever we spin up a new server from our cloud infrastructure, Nagios is automatically configured to add monitoring for that server. Nagios sends us text messages when these components break, and escalates to others if the primary responder is unavailable.
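To make the "automatically configured" step concrete, here is a minimal sketch of how a provisioning hook might render Nagios object definitions for a new server. It is an illustration only, not a description of our actual tooling: the function name `render_nagios_config`, the check commands in the usage note, and the choice of the `generic-host` and `generic-service` templates from the Nagios sample configuration are all assumptions.

```python
def render_nagios_config(host_name, address, checks):
    """Render one Nagios host definition plus one service definition per check.

    `checks` is a list of (service_description, check_command) pairs.
    The `generic-host` / `generic-service` templates are assumed to exist,
    as they do in the Nagios sample configuration.
    """
    lines = [
        "define host {",
        "    use        generic-host",
        f"    host_name  {host_name}",
        f"    address    {address}",
        "}",
    ]
    for description, command in checks:
        lines += [
            "",
            "define service {",
            "    use                  generic-service",
            f"    host_name            {host_name}",
            f"    service_description  {description}",
            f"    check_command        {command}",
            "}",
        ]
    return "\n".join(lines) + "\n"
```

A provisioning script could call `render_nagios_config("web01", "10.0.0.5", [("Disk", "check_disk_example"), ("CPU", "check_cpu_example")])`, write the result into a directory Nagios loads its object files from, and reload Nagios. The host name, address, and command names here are hypothetical.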
The second component is metrics. As discussed in a previous post, we use Graphite for most of our metrics needs. We are collecting approximately 1 million datapoints every minute. When a server is provisioned, it is automatically configured to send metrics into Graphite for immediate use. We’ve also written a Nagios plugin that queries Graphite and alerts if the values of certain metrics go above or below specified thresholds. Our engineers watch the Graphite dashboard constantly to see where the problem areas are and to look out for anomalies as their code deploys to the various environments.
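The Graphite-backed Nagios check can be sketched roughly as follows. This is not our actual plugin, just a minimal illustration under stated assumptions: it uses Graphite's render API with `format=json`, follows the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), and only handles the "above threshold" case. The function names, the argument order, and the five-minute window are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(series):
    """Return the newest non-null value from the first series that has one."""
    for entry in series:
        # Graphite datapoints are [value, timestamp] pairs, oldest first.
        for value, _timestamp in reversed(entry["datapoints"]):
            if value is not None:
                return value
    return None


def evaluate(value, warn, crit):
    """Map a metric value to a Nagios status code and message (upper thresholds)."""
    if value is None:
        return UNKNOWN, "UNKNOWN - no datapoints returned"
    if value >= crit:
        return CRITICAL, f"CRITICAL - value {value} >= {crit}"
    if value >= warn:
        return WARNING, f"WARNING - value {value} >= {warn}"
    return OK, f"OK - value {value}"


def fetch_series(base_url, target, window="-5min"):
    """Query Graphite's render API and return the decoded JSON series list."""
    query = urllib.parse.urlencode(
        {"target": target, "from": window, "format": "json"}
    )
    with urllib.request.urlopen(f"{base_url}/render?{query}", timeout=10) as resp:
        return json.load(resp)


def main(argv):
    """Hypothetical usage: plugin <graphite-url> <target> <warn> <crit>."""
    base_url, target = argv[1], argv[2]
    warn, crit = float(argv[3]), float(argv[4])
    try:
        series = fetch_series(base_url, target)
    except Exception as exc:  # network or parse failure: report UNKNOWN
        print(f"UNKNOWN - could not query Graphite: {exc}")
        return UNKNOWN
    status, message = evaluate(latest_value(series), warn, crit)
    print(message)
    return status
```

A wrapper script would finish with `sys.exit(main(sys.argv))` so that Nagios sees the status as the process exit code. A "below threshold" check would flip the comparisons in `evaluate`.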
Deployment: Puppet and Maestro
Finally, there is code deployment. Code has to move from our revision control system (Subversion) to our sand, stage, and production environments. This code might be deployed to a small handful of machines, or to several hundred. To handle our deployments, we currently rely on Puppet, backed by a MySQL database and fronted by an in-house application we call Maestro.
At AppNexus there is no wall between engineers and operations, and automation is crucial to scaling our infrastructure. Engineers control their own destiny, and we give them the tools to dive deep into production problems, make fixes, and improve their products as quickly as they can code.