Metrics Collection and Display via Graphite


At AppNexus the production adserving component of our business has over 600 servers (a combination of virtual and baremetal) spread across 4 datacenters (AMS1 in Amsterdam, NYM1 and NYM2 in New York, and LAX1 in Los Angeles).

We aggregate metrics from all of our servers to look at the trend of various measurements over time. We collect system-level metrics such as CPU load, disk usage, and memory usage. We also collect application-level metrics, measuring speed or quantity of a particular component.

Across all of our servers, we are currently collecting over 1 million data points every minute. To collect and display these metrics, we use a remarkable open source package called Graphite.

Graphite consists of several components, all written in Python:

  1. A database library called Whisper. Whisper is Graphite’s current storage format. The data is stored in files in a round-robin database fashion.
  2. Backend daemons called carbon that marshal the data.
  3. Several web UIs to extract and display the data.
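
To make the round-robin idea concrete, here is a minimal in-memory sketch in Python. It is not Whisper's actual file format or API; the class and method names are made up for illustration. It only shows the general principle: a fixed number of slots, one per time interval, with the oldest data overwritten once the ring wraps around.

```python
class RoundRobinArchive:
    """Illustrative fixed-size time-series store (hypothetical, not Whisper's API)."""

    def __init__(self, seconds_per_point, num_points):
        self.step = seconds_per_point
        self.size = num_points
        # Each slot holds (interval_start, value); None means "never written".
        self.slots = [None] * num_points

    def _slot_index(self, timestamp):
        # Map the timestamp to its interval number, then wrap around the ring.
        return (timestamp // self.step) % self.size

    def update(self, timestamp, value):
        # Store the value under the start of the interval it falls into.
        interval_start = timestamp - (timestamp % self.step)
        self.slots[self._slot_index(timestamp)] = (interval_start, value)

    def fetch(self, from_time, until_time):
        """Return one value (or None) per interval in [from_time, until_time)."""
        results = []
        start = from_time - (from_time % self.step)
        for t in range(start, until_time, self.step):
            slot = self.slots[self._slot_index(t)]
            # A slot only counts if it was written for this exact interval,
            # not on an earlier trip around the ring.
            results.append(slot[1] if slot and slot[0] == t else None)
        return results
```

Because the storage size is fixed up front, disk usage per metric stays constant no matter how long the metric has been collected; the trade-off is that data older than the retention window is lost.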

The metrics system gives us a fantastic window into our world. We can forecast when we are going to need to add new machines to our pool. We can prove or disprove performance increases as we roll out new releases of our components. Metrics extend our “spidey-sense” and let us pinpoint problems. Here’s an example:

This graph shows the CPU usage of our bidders across our datacenters for the previous 24 hours.

The anomaly that jumps out at us is that at 4 AM there is a significant load spike. What’s going on here?

Let’s look at all of our bidders (the boxes we use to submit bids for a particular ad impression on behalf of our clients) in a single datacenter.

Interesting. All of the bidders exhibit a similar spike, but there seem to be two camps of bidders: one that peaks at approximately 40% CPU idle, and another that peaks at about 55% idle. That might be worth investigating.

Let’s look at a specific bidder, but let’s add another metric on top of it (the new dashboard makes this easy via drag and drop).

Aha! This particular metric shoots up significantly at exactly 4 AM. Although I haven’t proved cause and effect, I can now take this data and go talk to Engineering and/or the business side to figure out why we got this spike, and perhaps address the problem.

As we started collecting more and more metrics and adding more and more servers we noticed two things:

  1. Aggregating metrics together (say, average CPU usage across a datacenter) was starting to get slow.
  2. We were creating custom dashboards to display multiple graphs on one page.

Enter Chris Davis, lead developer for Graphite. We hired Chris as a contractor to make improvements to the Graphite backend and UI, with the stipulation that all of the code be contributed back to the open source project for others to use. Chris has addressed our aggregation and dashboard concerns and made many other improvements to Graphite.

For aggregation of metrics, Chris created a new carbon application that aggregates metrics on the fly according to a prescribed configuration. All of our instance-level metrics get aggregated to create a clustered view to either sum up or average those metrics. The end result is that we don’t need to precompute our sums and averages; the carbon-aggregator does this for us automatically.
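
As a rough illustration of what rule-driven, on-the-fly aggregation means, here is a small Python sketch. It is not the carbon-aggregator's code or its configuration syntax; the function names, the rule tuple layout, and the metric names in the usage note are all assumptions made for the example. It takes the data points for one interval and folds every instance-level metric that matches a rule's pattern into a single summed or averaged cluster-level metric.

```python
import fnmatch
from collections import defaultdict


def matches(name, pattern):
    """Match dotted metric names segment by segment, so '*' never spans a dot."""
    name_parts = name.split(".")
    pattern_parts = pattern.split(".")
    return len(name_parts) == len(pattern_parts) and all(
        fnmatch.fnmatchcase(n, p) for n, p in zip(name_parts, pattern_parts)
    )


def aggregate(datapoints, rules):
    """Aggregate one interval's worth of data.

    datapoints: iterable of (metric_name, value)
    rules: list of (output_name, method, input_pattern), method is 'sum' or 'avg'
    Returns {output_name: aggregated_value} for every rule that matched something.
    """
    buckets = defaultdict(list)
    for name, value in datapoints:
        for output_name, method, pattern in rules:
            if method not in ("sum", "avg"):
                raise ValueError(f"unknown aggregation method: {method}")
            if matches(name, pattern):
                buckets[(output_name, method)].append(value)

    results = {}
    for (output_name, method), values in buckets.items():
        total = sum(values)
        results[output_name] = total if method == "sum" else total / len(values)
    return results
```

With a hypothetical rule such as `("nym1.bidder.all.cpu.idle", "avg", "nym1.bidder.*.cpu.idle")`, every bidder's idle percentage in that datacenter is averaged into one series as the data arrives, so the graph can read a single precomputed metric instead of combining hundreds at render time.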

For the UI, Chris created a whole new dashboard. We can drill down to our metrics in one of two ways: a tree structure and an advanced completer (not yet released on the Graphite website, but it’s on the way!). We can view lots of graphs on one page. We can drag and drop graphs to merge them together. We can apply functions to the graphs en masse. We can change the look and feel easily. Dashboards can be saved and shared. For the pro users, the metrics completer lets us get to the metrics we need quickly. The possibilities are endless, and because various teams use metrics in different ways, this flexibility is very powerful.

Having walked the path of writing my own cobbled-together metrics systems and using other open source packages, I can say that Graphite stands head and shoulders above the rest. The performance is stunning (with even more improvements to come soon), and careful consideration has been put into the UI design and function. In combination with our monitoring system (a post for another day, perhaps), a comprehensive and scalable metrics system like Graphite is an invaluable tool across all teams in the company.

About Pete

I've been a DevOps guy at AppNexus for the past three years. When I'm not working, I'm either spending family time with my wife and three daughters, reading, or growing my beard.


13 Comments
  • http://www.adopsinsider.com Ben Kneen

    AppNexus team, just wanted to say these are a blast to read. So unique and interesting – keep up the great work!

  • http://nicolas.kruchten.com/ Nicolas Kruchten

    We at Recoset also use and love Graphite, and we’ve even written a Nagios plugin to query Graphite for alarming! Check it out: code at https://github.com/recoset/check_graphite and related blog post at http://nicolas.kruchten.com/content/2011/05/statsd-graphite-and-nagios/

    • http://tech-chops.blogspot.com/ Pete

      We’re doing something very similar. My Nagios plugin looks at the last N minutes for a valid value and uses it, but I like your averaging of the last N minutes too. One nice feature we use quite a bit in our checks is applying Graphite’s derivative() function to the rawData before checking our thresholds.

      • http://nicolas.kruchten.com/ Nicolas Kruchten

        Cool. We also have a check_multi_graphite (not up on GitHub yet) to handle ‘*’ in targets so we can pull data from a bunch of campaigns, say, and check that the minimum or maximum of the average is within bounds (e.g. to answer “have any of our campaigns stopped winning anything?” type questions: is the minimum average for any campaigns.*.wins at 0?)

        • http://tech-chops.blogspot.com/ Pete

          I read your post; it’s interesting to see how different people get their data into Graphite!

          Keep checking back on the AppNexus techblog for a follow-up post. I’m going to talk about our approach to the metrics infrastructure, and the new Graphite Dashboard that we hired Chris Davis to build for us.

          Are you using the aggregator feature that Graphite now has? It lets you make rules that allow carbon to sum or average up metrics across groups of machines, so that you don’t have to apply a sumSeries() function. It’s particularly helpful when you’re summing across hundreds of machines.

  • Marc

    The new Dashboard UI that you mention was built recently. Do you know if that is the Dashboard that is in the 0.9.8 release of Graphite, or is that something else? If so, any ideas when that will be released, or where I can find the source?

    • http://tech-chops.blogspot.com/ Pete

      Marc,

      I believe the dashboard is available if you check the source code out of Bazaar at http://graphite.wikidot.com/downloads. I’ll check in with Chris Davis on when he’s planning an “official” release that contains the dashboard. Keep checking back here for a detailed post about the dashboard and the architecture behind getting our metrics into Graphite to begin with.

      Pete

    • http://tech-chops.blogspot.com/ Pete

      Marc,

      Chris Davis says the dust should be settling on his 0.9.10 release within the next two weeks.

      Pete

      • Marc

        Thanks! I look forward to seeing it.

  • Pingback: DevOps in Milliseconds | AppNexus Tech Blog

  • Pingback: Statsd, Graphite and Nagios | Recoset Machine Learning and Predictive Analytics

  • Martin

    We have been evaluating Graphite and it sure looks really nice. But there is one thing I’m wondering about: access rights. When I add a user, it looks like I can assign different kinds of access rights, but it does not seem to work. Is that something you are using? There doesn’t seem to be much to read about it out there.

  • http://tech-chops.blogspot.com/ Pete

    We used to use Ganglia to do that collection, but now we have a custom daemon that runs on each server and sends the metrics over to Graphite. You can get data into Graphite by piping your data to nc:

    cat FILE | nc 127.0.0.1 2023
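
For anyone who would rather do this from code than from the shell, the following is a rough Python equivalent of the nc one-liner. It assumes the file would contain lines in carbon's plaintext format (`metric.path value timestamp`, one per line) and reuses the host and port from the command above; the function names and the metric name in the usage note are made up for illustration, and this is not the custom daemon mentioned in the comment.

```python
import socket
import time


def format_line(metric_path, value, timestamp=None):
    """Build one plaintext-protocol line: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{metric_path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="127.0.0.1", port=2023):
    """Send pre-formatted lines to a carbon line receiver over TCP.

    Host and port default to the values used in the nc example; adjust
    them to wherever your carbon daemon is actually listening.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

A call such as `send_metrics([format_line("servers.web01.cpu.idle", 42.5)])` would send a single data point stamped with the current time, provided a carbon daemon is listening on that address.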