How the Grafana Alerting team scales their issue management with GitHub Projects
Hear from Grafana Labs' Armand Grillet about how his team uses GitHub Projects.
On Tuesday, March 11th, GitHub was largely unreachable for roughly 2 hours as the result of an evolving distributed denial of service (DDoS) attack. I know that you rely on…
On Tuesday, March 11th, GitHub was largely unreachable for roughly 2 hours as the result of an evolving distributed denial of service (DDoS) attack. I know that you rely on GitHub to be available all the time, and I’m sorry we let you down. I’d like to explain what happened, how we responded to it, and what we’re doing to reduce the impact of future attacks like this.
Over the last year, we have seen a large number and variety of denial of service attacks against various parts of the GitHub infrastructure. There are two broad types of attack that we think about when we’re building our mitigation strategy: volumetric and complex.
We have designed our DDoS mitigation capabilities to allow us to respond to both volumetric and complex attacks.
Volumetric attacks are intended to exhaust some resource through the sheer weight of the attack. This type of attack has been seen with increasing frequency lately through UDP based amplification attacks using protocols like DNS, SNMP, or NTP. The only way to withstand an attack like this is to have more available network capacity than the sum of all of the attacking nodes or to filter the attack traffic before it reaches your network.
Dealing with volumetric attacks is a game of numbers. Whoever has more capacity wins. With that in mind, we have taken a few steps to allow us to defend against these types of attacks.
We operate our external network connections at very low utilization. Our internet transit circuits are able to handle almost an order of magnitude more traffic than our normal daily peak. We also continually evaluate opportunities to expand our network capacity. This helps to give us some headroom for larger attacks, especially since they tend to ramp up over a period of time to their ultimate peak throughput.
In addition to managing the capacity of our own network, we’ve contracted with a leading DDoS mitigation service provider. A simple Hubot command can reroute our traffic to their network which can handle terabits per second. They’re able to absorb the attack, filter out the malicious traffic, and forward the legitimate traffic on to us for normal processing.
Complex attacks are also designed to exhaust resources, but generally by performing expensive operations rather than saturating a network connection. Examples of these are things like SSL negotiation attacks, requests against computationally intensive parts of web applications, and the “Slowloris” attack. These kinds of attacks often require significant understanding of the application architecture to mitigate, so we prefer to handle them ourselves. This allows us to make the best decisions when choosing countermeasures and tuning them to minimize the impact on legitimate traffic.
First, we devote significant engineering effort to hardening all parts of our computing infrastructure. This involves things like tuning Linux network buffer sizes, configuring load balancers with appropriate timeouts, applying rate limiting within our application tier, and so on. Building resilience into our infrastructure is a core engineering value for us that requires continuous iteration and improvement.
We’ve also purchased and installed a software and hardware platform for detecting and mitigating complex DDoS attacks. This allows us to perform detailed inspection of our traffic so that we can apply traffic filtering and access control rules to block attack traffic. Having operational control of the platform allows us to very quickly adjust our countermeasures to deal with evolving attacks.
Our DDoS mitigation partner is also able to assist with these types of attacks, and we use them as a final line of defense.
At 21:25 UTC we began investigating reports of connectivity problems to
github.com. We opened an incident on our status site at 21:29 UTC to let customers know we were aware of the problem and working to resolve it.
As we began investigating we noticed an apparent backlog of connections at our load balancing tier. When we see this, it typically corresponds with a performance problem with some part of our backend applications.
After some investigation, we discovered that we were seeing several thousand HTTP requests per second distributed across thousands of IP addresses for a crafted URL. These requests were being sent to the non-SSL HTTP port and were then being redirected to HTTPS, which was consuming capacity in our load balancers and in our application tier. Unfortunately, we did not have a pre-configured way to block these requests and it took us a while to deploy a change to block them.
By 22:35 UTC we had blocked the malicious request and the site appeared to be operating normally.
Despite the fact that things appeared to be stabilizing, we were still seeing a very high number of SSL connections on our load balancers. After some further investigation, we determined that this was an additional vector that the attack was using in an effort to exhaust our SSL processing capacity. We were able to respond quickly using our mitigation platform, but the countermeasures required significant tuning to reduce false positives which impacted legitimate customers. This resulted in approximately 25 more minutes of downtime between 23:05-23:30 UTC.
By 23:34 UTC, the site was fully operational. The attack continued for quite some time even once we had successfully mitigated it, but there were no further customer impacts.
The vast majority of attacks that we’ve seen in the last several months have been volumetric in terms of bandwidth, and we’d grown accustomed to using throughput as a way of confirming that we were under attack. This attack did not generate significantly more bandwidth but it did generate significantly more packets per second. It didn’t look like what we had grown to expect an attack to look like and we did not have the monitoring we needed to detect it as quickly as we would have liked.
Once we had identified the problem, it took us much longer than we’d like to mitigate it. We had the ability to mitigate attacks of this nature in our load balancing tier and in our DDoS mitigation platform, but they were not configured in advance. It took us valuable minutes to configure, test, and tune these countermeasures which resulted in a longer than necessary downtime.
We’re happy that we were able to successfully mitigate the attack but we have a lot of room to improve in terms of how long the process takes.
This attack was painful, and even though we were able to successfully mitigate the effects of it, it took us far too long. We know that you depend on GitHub and our entire company is focused on living up to the trust you place in us. I take problems like this personally. We will do whatever it takes to improve how we respond to problems to ensure that you can rely on GitHub being available when you need us.
Thanks for your support!