How the Grafana Alerting team scales their issue management with GitHub Projects
Hear from Grafana Labs' Armand Grillet about how his team uses GitHub Projects.
Last week GitHub was unavailable for two hours and six minutes. We understand how much you rely on GitHub and consider the availability of our service one of the core…
Last week GitHub was unavailable for two hours and six minutes. We understand how much you rely on GitHub and consider the availability of our service one of the core features we offer. Over the last eight years we have made considerable progress towards ensuring that you can depend on GitHub to be there for you and for developers worldwide, but a week ago we failed to maintain the level of uptime you rightfully expect. We are deeply sorry for this, and would like to share with you the events that took place and the steps we’re taking to ensure you’re able to access GitHub.
At 00:23am UTC on Thursday, January 28th, 2016 (4:23pm PST, Wednesday, January 27th) our primary data center experienced a brief disruption in the systems that supply power to our servers and equipment. Slightly over 25% of our servers and several network devices rebooted as a result. This left our infrastructure in a partially operational state and generated alerts to multiple on-call engineers. Our load balancing equipment and a large number of our frontend applications servers were unaffected, but the systems they depend on to service your requests were unavailable. In response, our application began to deliver HTTP 503 response codes, which carry the unicorn image you see on our error page.
Our early response to the event was complicated by the fact that many of our ChatOps systems were on servers that had rebooted. We do have redundancy built into our ChatOps systems, but this failure still caused some amount of confusion and delay at the very beginning of our response. One of the biggest customer-facing effects of this delay was that status.github.com wasn’t set to status red until 00:32am UTC, eight minutes after the site became inaccessible. We consider this to be an unacceptably long delay, and will ensure faster communication to our users in the future.
Initial notifications for unreachable servers and a spike in exceptions related to Redis connectivity directed our team to investigate a possible outage in our internal network. We also saw an increase in connection attempts that pointed to network problems. While later investigation revealed that a DDoS attack was not the underlying problem, we spent time early on bringing up DDoS defenses and investigating network health. Because we have experience mitigating DDoS attacks, our response procedure is now habit and we are pleased we could act quickly and confidently without distracting other efforts to resolve the incident.
With our DDoS shields up, the response team began to methodically inspect our infrastructure and correlate these findings back to the initial outage alerts. The inability to reach all members of several Redis clusters led us to investigate uptime for devices across the facility. We discovered that some servers were reporting uptime of several minutes, but our network equipment was reporting uptimes that revealed they had not rebooted. Using this, we determined that all of the offline servers shared the same hardware class, and the ones that rebooted without issue were a different hardware class. The affected servers spanned many racks and rows in our data center, which resulted in several clusters experiencing reboots of all of their member servers, despite the clusters’ members being distributed across different racks.
As the minutes ticked by, we noticed that our application processes were not starting up as expected. Engineers began taking a look at the process table and logs on our application servers. These explained that the lack of backend capacity was a result of processes failing to start due to our Redis clusters being offline. We had inadvertently added a hard dependency on our Redis cluster being available within the boot path of our application code.
By this point, we had a fairly clear picture of what was required to restore service and began working towards that end. We needed to repair our servers that were not booting, and we needed to get our Redis clusters back up to allow our application processes to restart. Remote access console screenshots from the failed hardware showed boot failures because the physical drives were no longer recognized. One group of engineers split off to work with the on-site facilities technicians to bring these servers back online by draining the flea power to bring them up from a cold state so the disks would be visible. Another group began rebuilding the affected Redis clusters on alternate hardware. These efforts were complicated by a number of crucial internal systems residing on the offline hardware. This made provisioning new servers more difficult.
Once the Redis cluster data was restored onto standby equipment, we were able to bring the Redis-server processes back online. Internal checks showed application processes recovering, and a healthy response from the application servers allowed our HAProxy load balancers to return these servers to the backend server pool. After verifying site operation, the maintenance page was removed and we moved to status yellow. This occurred two hours and six minutes after the initial outage.
The following hours were spent confirming that all systems were performing normally, and verifying there was no data loss from this incident. We are grateful that much of the disaster mitigation work put in place by our engineers was successful in guaranteeing that all of your code, issues, pull requests, and other critical data remained safe and secure.
Complex systems are defined by the interaction of many discrete components working together to achieve an end result. Understanding the dependencies for each component in a complex system is important, but unless these dependencies are rigorously tested it is possible for systems to fail in unique and novel ways. Over the past week, we have devoted significant time and effort towards understanding the nature of the cascading failure which led to GitHub being unavailable for over two hours. We don’t believe it is possible to fully prevent the events that resulted in a large part of our infrastructure losing power, but we can take steps to ensure recovery occurs in a fast and reliable manner. We can also take steps to mitigate the negative impact of these events on our users.
We identified the hardware issue resulting in servers being unable to view their own drives after power-cycling as a known firmware issue that we are updating across our fleet. Updating our tooling to automatically open issues for the team when new firmware updates are available will force us to review the changelogs against our environment.
We will be updating our application’s test suite to explicitly ensure that our application processes start even when certain external systems are unavailable and we are improving our circuit breakers so we can gracefully degrade functionality when these backend services are down. Obviously there are limits to this approach and there exists a minimum set of requirements needed to serve requests, but we can be more aggressive in paring down the list of these dependencies.
We are reviewing the availability requirements of our internal systems that are responsible for crucial operations tasks such as provisioning new servers so that they are on-par with our user facing systems. Ultimately, if these systems are required to recover from an unexpected outage situation, they must be as reliable as the system being recovered.
A number of less technical improvements are also being implemented. Strengthening our cross-team communications would have shaved minutes off the recovery time. Predefining escalation strategies during situations that require all hands on deck would have enabled our incident coordinators to spend more time managing recovery efforts and less time navigating documentation. Improving our messaging to you during this event would have helped you better understand what was happening and set expectations about when you could expect future updates.
We realize how important GitHub is to the workflows that enable your projects and businesses to succeed. All of us at GitHub would like to apologize for the impact of this outage. We will continue to analyze the events leading up to this incident and the steps we took to restore service. This work will guide us as we improve the systems and processes that power GitHub.