At AppNexus, the production adserving component of our business has over 600 servers (a combination of virtual and bare-metal machines) spread across 4 datacenters (AMS1 in Amsterdam, NYM1 and NYM2 in New York, and LAX1 in Los Angeles).
We aggregate metrics from all of our servers to look at the trend of various measurements over time. We collect system-level metrics such as CPU load, disk usage, and memory usage. We also collect application-level metrics, measuring speed or quantity of a particular component.
Across all of our servers, we are currently collecting over 1 million data points every minute. To collect and display these metrics, we use a remarkable open source package called Graphite.
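To make the collection step concrete, here is a minimal sketch of how a single data point might be pushed to Graphite using carbon's plaintext line protocol (one line per point: metric path, value, Unix timestamp). The metric name, the `localhost` host, and the helper function names are illustrative assumptions, not our production setup; port 2003 is carbon's usual default for this protocol.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one newline-terminated line: '<path> <value> <timestamp>'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    """Send a single data point to a carbon listener.

    host and port are assumptions; adjust them to wherever carbon listens.
    """
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Hypothetical metric path; requires a reachable carbon daemon:
# send_metric("servers.nym1.bidder01.cpu.idle", 42.5)
```

In practice a collector would batch many such lines over one connection rather than opening a socket per point.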
Graphite consists of several components, all written in Python:
- A database library called Whisper. Whisper is Graphite’s current storage format. The data is stored in files in a round-robin database fashion.
- Backend daemons called carbon that marshal the data
- Several web UIs to extract and display the data
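The round-robin idea behind the storage layer can be sketched in a few lines. This is a simplified in-memory illustration of the concept (a fixed number of slots that wrap around and overwrite the oldest data), not Whisper's actual file format or API; the class and method names are made up for the example.

```python
class RoundRobinSeries:
    """Fixed-size time series: one slot per `step` seconds, `slots` slots total."""

    def __init__(self, step, slots):
        self.step = step
        self.slots = slots
        # Each slot holds (interval_start, value) or None if never written.
        self.data = [None] * slots

    def _index(self, timestamp):
        # Timestamps wrap around the fixed-size buffer.
        return (timestamp // self.step) % self.slots

    def update(self, timestamp, value):
        interval = timestamp - (timestamp % self.step)
        self.data[self._index(timestamp)] = (interval, value)

    def fetch(self, timestamp):
        """Return the stored value for this interval, or None if absent or overwritten."""
        interval = timestamp - (timestamp % self.step)
        entry = self.data[self._index(timestamp)]
        if entry is not None and entry[0] == interval:
            return entry[1]
        return None


# One point per minute, keeping only the 3 most recent minutes.
series = RoundRobinSeries(step=60, slots=3)
for ts, value in [(0, 1.0), (60, 2.0), (120, 3.0), (180, 4.0)]:
    series.update(ts, value)
```

The appeal of this layout is that storage size is fixed up front: old points are overwritten instead of accumulating without bound.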
The metrics system gives us a fantastic window into our world. We can forecast when we are going to need to add new machines to our pool. We can prove or disprove performance increases as we roll out new releases of our components. Metrics extends our “spidey-sense” and lets us pinpoint problems. Here’s an example:
This graph shows the CPU usage of our bidders across our datacenters for the previous 24 hours.
The anomaly that jumps out at us is a significant load spike at 4 AM. What’s going on here?
Let’s look at all of our bidders (the boxes we use to submit bids for a particular ad impression on behalf of our clients) in a single datacenter.
Interesting. All of the bidders exhibit a similar spike, but there seem to be two camps of bidders: one that peaks at approximately 40% CPU idle, and another that peaks at about 55% idle. That might be worth investigating.
Let’s look at a specific bidder, but let’s add another metric on top of it (the new dashboard makes this easy via drag and drop).
Aha! This particular metric shoots up significantly at exactly 4 AM. Although I haven’t proved cause and effect, I can now take this data and go talk to Engineering and/or the business side to figure out why we got this spike, and perhaps address the problem.
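The same kind of overlay can also be requested programmatically from Graphite's render endpoint, which accepts one or more `target` parameters plus a time range. The sketch below only builds the URL; the base URL and metric paths are hypothetical stand-ins, and `averageSeries` is used here as one example of a Graphite function applied to a wildcard of series.

```python
from urllib.parse import urlencode


def render_url(base_url, targets, time_from="-24h", fmt="json"):
    """Build a Graphite render URL that overlays several targets on one graph."""
    params = [("target", t) for t in targets]
    params += [("from", time_from), ("format", fmt)]
    return f"{base_url.rstrip('/')}/render?{urlencode(params)}"


# Hypothetical host and metric names, for illustration only.
url = render_url(
    "http://graphite.example.com",
    [
        "averageSeries(servers.nym1.bidder*.cpu.idle)",
        "servers.nym1.bidder01.requests",
    ],
)
```

Fetching that URL with any HTTP client would return the last 24 hours of both series, ready to be compared around the 4 AM mark.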
As we started collecting more and more metrics and adding more and more servers we noticed two things:
- Aggregating metrics together (say, average CPU usage across a datacenter) was starting to get slow.
- We were creating custom dashboards to display multiple graphs on one page.
Enter Chris Davis, lead developer for Graphite. We hired Chris as a contractor to make improvements to the Graphite backend and UI, with the stipulation that all of the code be contributed back to the open source project for others to use. Chris has addressed our aggregation and dashboard concerns and made many other improvements to Graphite.
For aggregation of metrics, Chris created a new carbon application that aggregates metrics on the fly according to a prescribed configuration. All of our instance-level metrics get aggregated to create a clustered view to either sum up or average those metrics. The end result is that we don’t need to precompute our sums and averages; the carbon-aggregator does this for us automatically.
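Conceptually, this kind of on-the-fly aggregation amounts to grouping instance-level data points for one time interval under a shared cluster-level name and reducing them. The following is a rough sketch of that idea only, not carbon-aggregator's implementation or configuration syntax; the `servers.<datacenter>.<host>.<metric...>` naming scheme and the `all` placeholder are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def aggregate(datapoints, method="avg"):
    """Roll per-host points for one interval up to per-datacenter values.

    datapoints: iterable of (metric_path, value) pairs, where paths are
    assumed to look like 'servers.<datacenter>.<host>.<metric...>'.
    """
    buckets = defaultdict(list)
    for path, value in datapoints:
        parts = path.split(".")
        parts[2] = "all"  # collapse the host component into one cluster name
        buckets[".".join(parts)].append(value)
    reducer = sum if method == "sum" else mean
    return {path: reducer(values) for path, values in buckets.items()}


points = [
    ("servers.nym1.bidder01.cpu.idle", 40.0),
    ("servers.nym1.bidder02.cpu.idle", 60.0),
    ("servers.lax1.bidder01.cpu.idle", 55.0),
]
cluster_view = aggregate(points, method="avg")
```

Because the roll-up happens as data arrives, the aggregated series is already stored by the time someone asks for a graph, instead of being computed across hundreds of series at render time.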
For the UI, Chris created a whole new dashboard. We can drill down to our metrics in one of two ways: a tree structure and an advanced completer (not yet released on the Graphite website, but it’s on the way!). We can view lots of graphs on one page. We can drag and drop graphs to merge them together. We can apply functions to the graphs en masse. We can change the look and feel easily. Dashboards can be saved and shared. For the pro users, the metrics completer lets us get to the metrics we need quickly. The possibilities are endless, and because various teams use metrics in different ways, this flexibility is very powerful.
Having walked the path of writing my own cobbled-together metrics systems and using other open source packages, I can say that Graphite stands head and shoulders above the rest. The performance is stunning (with even more improvements to come soon), and careful consideration has been put into the UI design and function. In combination with our monitoring system (a post for another day, perhaps), a comprehensive and scalable metrics system like Graphite is an invaluable tool across all teams in the company.