One of the key points of GitHub’s engineering culture —and I believe, of any good engineering culture— is our obsession with aggressively measuring everything.
Coda Hale’s seminal talk “Metrics, Metrics Everywhere” has been a cornerstone of our monitoring philosophy. Since the very early days, our engineering team has been extremely performance-focused; our commitment is to ship products that perform as well as they possibly can (“it’s not fully shipped until it’s fast”), and the only way to accomplish this is to reliably and methodically measure, monitor and graph every single part of our production systems. We firmly believe that metrics are the most important tool we have to keep GitHub fast.
In our quest to graph everything, we became one of the early adopters of Etsy’s
statsd is a Metrics (with capital M) aggregation daemon written in Node.js. Its job is taking complex Metrics and aggregating them in a server so they can be reported to your storage of choice (in our case, Graphite) as individual datapoints.
statsd is, by and large, a good idea. Its simple design based around an UDP socket, and the very Unix-y, text-based format for reporting complex metrics makes it trivial to measure every nook and cranny of our infrastructure: from the main Rails app, where each server process reports hundreds of individual metrics for each request, to the smallest, most trivial command-line tool that we run in our datacenter, as reporting a simple metric is just one
nc -u invocation away.
However, like most technologies, the idea behind
statsd only stretches up to a point. Three and a half years ago, as we spun up more and more machines to keep up with GitHub’s incessant growth, weird things started to happen to our graphs. Engineers started complaining that specific metrics (particularly, new ones), were not showing up on dashboards. For some older metrics, their values were simply too low to be right.
In the spirit of aggressively monitoring everything, we used our graphs to debug the issue with our graphs.
statsd uses a single UDP socket to receive the metrics of all our infrastructure: the simplicity of this technical decision allowed us to build a metrics cluster quickly and without hassle, starting with a single metrics aggregator back when we had very limited resources and very limited programmer time to grow our company. But this decision had now come back to bite us in the ass. A quick look at the
/proc/net/dev-generated graphs on the machine made the situation very obvious: slowly but steadily over time, the percentage of UDP packets that were being dropped in our monitoring server was increasing. From 3% upwards to 40%. We were dropping almost half of our metrics!
Before we even considered checking our network for congestion issues, we went to the most probable source of the problem: the daemon that was actually receiving and aggregating the metrics. The original
statsd was a Node.js application, back in the days when Node.js was still new and with limited tooling available. After some tweaking with V8, we managed to profile
statsd and tracked down the bottleneck to a critical area of the code: the UDP socket that served as only input for the daemon. The single-thread event-based nature of Node was a rather poor fit for the single socket; its high throughput made polling essentially a waste of cycles, and the reader callback API (again, enforced by Node’s event based design) was causing unnecessary GC pressure by creating individual “message” objects for each received packet.
Node.js has always been a foreign technology at GitHub (over the years we’ve taken down the few critical Node.js services we had running, and we’ve rewritten them in languages we have more expertise working with). Because of this, rather than trying to dig into
statsd itself, our first approach to fix the performance issues was a more pragmatic one:
statsd process was bottlenecking on CPU, we decided to load balance the incoming metrics between several processes. It quickly became obvious, however, that load-balancing metric packets is not possible with the usual tools (e.g. HAproxy) because the consistent hashing must be performed on the metric name, not the source IP.
The whole point of a metric aggregator is to unify metrics so they get reported only once to the storage backend. With several aggregators running, balancing metrics between them based on their source IP will cause the same metric to be aggregated in more than one daemon, and hence reported more than once to the backend. We’re back to the initial problem that
statsd was trying to solve.
To work around this, we wrote a custom load balancer that parsed the metric packets and used consistent hashing on the metric’s name to balance it between a cluster of StatsD instances. This was an improvement, but not quite the results we were looking for.
statsd instances that couldn’t handle our metrics load and a load balancer that was already essentially “parsing” the metrics packets, we decided to take a more drastic
approach and design the way we were aggregating metrics from a clean slate by writing a
statsd-compatible daemon called Brubeck.
Taking an existing application and rewriting it in another language very rarely gives good results. Especially in the case of a Node.js server, you go from having an event loop written in C (libuv, the cross-platform library that powers the Node.js event framework is written in C), to having, well… an event loop written in C.
A straight port of
statsd to C would hardly offer the performance improvement we required. Instead of micro-optimizing a straight port to squeeze performance out of it, we focused on redesigning the architecture of the application so it became efficient, and then implemented it as simply as possible: that way, the app will run fast even with few optimizations, and the code will be less complex and hence more reliable.
The first thing we changed in Brubeck was the event-loop based approach of the original
statsd. Evented I/O on a single socket is a waste of cycles; while receiving 4 million packets per second, polling for read events will give you unsurprisingly predictable results: there is always a packet ready to be read. Because of this, we replaced the event loop with several worker threads sharing a single listen socket.
Several threads working on aggregating the same metrics means that access to the metrics table needs to be synchronized. We used a modified version of a concurrent, read-lock-free hash table with optimistic locking on writes optimized for applications with a high read-to-write ratios, which performs exceedingly well for our use case.
The individual metrics were synchronized for aggregation with a simple spinlock per metric, which also fit our use case rather well, as we have almost no contention on the individual metrics and perform quick operations on them.
The simple, multi-threaded design of the daemon got us very far: during the last two years, our single Brubeck instance has scaled to up to 4.3 million metrics per second (with no packet drops even at peak hours — as we’ve always intended).
On top of that, last year we added a lot more headroom to our future growth by enabling the marvelous
SO_REUSEPORT, a new flag available in Linux 3.9+ that allows several threads to bind the same port simultaneously instead of having to race for the same socket, having the kernel round-robin between them. This decreases user-land contention incredibly, simplifies our code and sets us up for almost NIC-bound performance. The only possible next step would be skipping the kernel network stack, but we’re not quite there yet.
After three years running in our production servers (with very good results both in performance and reliability), today we’re finally releasing Brubeck as open-source.
Brubeck is a thank-you gift to the monitoring community for all the insight and invaluable open-source tooling that has been developed over the last couple years, and without which bootstrapping and keeping GitHub running and monitored would have been significantly more difficult.
Please skim the technical documentation of the daemon, carefully read and acknowledge its limitations, and figure out whether it’s of any use to you.