DevOps in Milliseconds


AppNexus engineers have it good. They don’t lie awake at night wondering if we can handle the next increase in impressions. They don’t worry that our systems are down and we don’t know it. They don’t develop in a bubble, toss their code over the wall to a mysterious group of people, and wash their hands of it.

How do we do this, when our world is measured in milliseconds and the time between our releases is measured in hours? There are three main components to our DevOps world. They interact with each other and empower our engineers in many ways.

Monitoring: Nagios

First, there is monitoring. At AppNexus we use Nagios internally to alert us to problems. We monitor the usual system-level checks (disk, CPU, memory), plus a slew of application-specific checks. Today we monitor approximately 12,000 different pieces of our production world (or “services” in Nagios parlance). Whenever we spin up a new server from our cloud infrastructure, Nagios is automatically configured to monitor that server. Nagios sends us text messages when these components break, and escalates to others if the primary responder is unavailable.
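
To make the automatic configuration step concrete, here is a minimal sketch of how a provisioning hook might render Nagios object definitions for a new server. It is an illustration only, not our actual tooling: the function names are hypothetical, the `linux-server` and `generic-service` templates are the ones from the Nagios sample configuration, and the check commands are assumed to exist on the Nagios side.

```python
# Hypothetical sketch: render Nagios object definitions for a newly
# provisioned server. Template names and check commands are assumptions.

# Baseline system-level checks applied to every server (assumed commands).
SYSTEM_CHECKS = {
    "Disk": "check_nrpe!check_disk",
    "CPU Load": "check_nrpe!check_load",
    "Memory": "check_nrpe!check_mem",
}


def render_host(host_name, address):
    """Return a Nagios host definition for one server."""
    return (
        "define host {\n"
        "    use linux-server\n"
        f"    host_name {host_name}\n"
        f"    address {address}\n"
        "}\n"
    )


def render_service(host_name, description, check_command):
    """Return a Nagios service definition attached to one host."""
    return (
        "define service {\n"
        "    use generic-service\n"
        f"    host_name {host_name}\n"
        f"    service_description {description}\n"
        f"    check_command {check_command}\n"
        "}\n"
    )


def render_server_config(host_name, address, extra_checks=None):
    """Combine the host, the baseline checks, and any application checks."""
    checks = dict(SYSTEM_CHECKS)
    checks.update(extra_checks or {})
    blocks = [render_host(host_name, address)]
    blocks += [render_service(host_name, d, c) for d, c in checks.items()]
    return "\n".join(blocks)
```

Writing the result of `render_server_config("web01", "10.0.0.5", {"App Health": "check_app_health"})` to a file in a directory that Nagios loads via `cfg_dir`, then reloading Nagios, would be one way to register the new server.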

[Figure: Nagios alerts for a development instance]

Metrics: Graphite

The second component is metrics. As discussed in a previous post, we use Graphite for most of our metrics needs. We are collecting approximately 1 million datapoints every minute. When a server is provisioned, it is automatically configured to send metrics into Graphite for immediate use. We’ve also written a Nagios plugin that queries Graphite and alerts if the values of certain metrics go above or below specified thresholds. Our engineers watch the Graphite dashboard constantly to see where the problem areas are and to look out for anomalies as their code deploys to the various environments.
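
A threshold check of this kind can be sketched in a few lines. The code below is not our plugin; it is a minimal illustration that assumes Graphite's JSON render endpoint (`/render?target=...&format=json`) and the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names, the five-minute window, and the argument order are hypothetical choices.

```python
import json
import urllib.parse
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def fetch_series(base_url, target, window="-5min"):
    """Query Graphite's render API and return the decoded JSON series list."""
    query = urllib.parse.urlencode(
        {"target": target, "from": window, "format": "json"}
    )
    with urllib.request.urlopen(f"{base_url}/render?{query}", timeout=10) as resp:
        return json.load(resp)


def latest_value(series):
    """Return the newest non-null value of the first series, or None."""
    if not series:
        return None
    for value, _timestamp in reversed(series[0]["datapoints"]):
        if value is not None:
            return value
    return None


def evaluate(value, warn, crit, below=False):
    """Map a value to a Nagios status.

    By default the thresholds are upper bounds; with below=True the check
    alerts when the value drops to or under them instead.
    """
    if value is None:
        return UNKNOWN, "no datapoints returned"
    if below:
        breached = lambda limit: value <= limit
    else:
        breached = lambda limit: value >= limit
    if breached(crit):
        return CRITICAL, f"value {value} breached critical threshold {crit}"
    if breached(warn):
        return WARNING, f"value {value} breached warning threshold {warn}"
    return OK, f"value {value} is within thresholds"


def main(argv):
    """Entry point: argv is [prog, graphite_url, target, warn, crit]."""
    base_url, target = argv[1], argv[2]
    warn, crit = float(argv[3]), float(argv[4])
    try:
        series = fetch_series(base_url, target)
    except Exception as exc:  # network or decoding failure
        print(f"UNKNOWN - {exc}")
        return UNKNOWN
    status, message = evaluate(latest_value(series), warn, crit)
    print(f"{LABELS[status]} - {target}: {message}")
    return status
```

Installed as a script, the file would end with `sys.exit(main(sys.argv))` so that Nagios sees the exit code. Keeping `evaluate` separate from the network call makes the threshold logic easy to test on its own.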

[Figure: The Graphite metrics dashboard]

Deployment: Puppet and Maestro

Finally, there is code deployment. Code has to move from our Revision Control System (Subversion) to our sand, stage, and production environments. This code might be deployed to a small handful of machines, or to several hundred. To handle our deployments, we currently rely on Puppet, backed by a MySQL database and fronted by an in-house application we call Maestro.
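
One common way to handle a deployment that spans hundreds of machines is to upgrade them in batches and stop if a batch fails. The sketch below illustrates that idea only; it is not Maestro's implementation, and `batches`, `rolling_deploy`, and the `deploy` callback are hypothetical names.

```python
from typing import Callable, Iterator, List, Sequence


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


def rolling_deploy(
    hosts: Sequence[str], size: int, deploy: Callable[[str], bool]
) -> List[str]:
    """Deploy batch by batch and return the hosts that succeeded.

    Every host in a batch is attempted; if any of them fails, no further
    batches are started, so the remaining machines stay on the old version.
    """
    succeeded: List[str] = []
    for batch in batches(hosts, size):
        results = [(host, deploy(host)) for host in batch]
        succeeded.extend(host for host, ok in results if ok)
        if not all(ok for _, ok in results):
            break
    return succeeded
```

Here `deploy` stands in for whatever upgrades a single machine and reports success, for example by triggering a configuration run and then checking the host's health.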

[Figure: Sneak preview of our latest Maestro]

At AppNexus there is no wall between engineers and operations, and automation is crucial to scaling our infrastructure. Engineers control their own destiny, and we give them the tools to dive deep into production problems, make fixes, and improve their products as quickly as they can code.

About Pete

I've been a DevOps guy at AppNexus for the past three years. When I'm not working, I'm either spending family time with my wife and three daughters, reading, or growing my beard.


1 Comment
  • http://tech-chops.blogspot.com/ Pete

    Maestro does a lot for us; it probably merits a separate post to do it justice! But in a nutshell, it lets us provision and deploy servers from our cloud, upgrade them to new versions of our code (rolling them out gracefully in batches), manipulate GSLB and internal load balancers, change Nagios settings, build code packages (RPMs), et cetera.