A Note on the Recent Outages
Following three months of near 100% uptime, we’ve just been through three major outages in as many days. I wanted to take some time to detail the problems and what we intend to do to prevent similar downtime in the future.
Outage #1 (02/02/2010 9:55:09AM PST) was initiated by a load spike on one of our file servers (fs1a). When a file server stops responding to heartbeat, the slave server in the pair kills the master and takes over. In this case, the master was not killed quickly enough and the storage partitions did not migrate cleanly to the slave. Cleanup on the split-brain file server pair was then delayed by an inefficient DRBD configuration that we’ve been meaning to update. Rolling out improvements to that configuration should prevent this type of problem in the future.
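For the curious, the improvements are along these lines. This is a minimal, hypothetical sketch rather than our production config (the resource name and tuning values are illustrative): DRBD 8.x lets you declare automatic split-brain recovery policies and a resync rate, so a confused pair can repair itself instead of waiting for manual cleanup.

```
# Sketch of a DRBD 8.x resource with automatic split-brain recovery.
# Resource name and values are illustrative only.
resource fs_pair {
  net {
    after-sb-0pri discard-zero-changes;  # neither node primary: keep the side that wrote data
    after-sb-1pri discard-secondary;     # one primary: discard the secondary's changes
    after-sb-2pri disconnect;            # both primary: stop and wait for a human
  }
  syncer {
    rate 40M;  # let resyncs use more bandwidth so recovery finishes sooner
  }
}
```

The tradeoff with automatic policies is that they deliberately throw away data on one side of the pair, which is exactly why they need careful testing before they go anywhere near production.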
Outage #2 (02/03/2010 6:10:08PM PST) looked like a power outage at first, since so many machines were affected, but the root cause was the deployment of a faulty DRBD configuration update that propagated to all machines (courtesy of Puppet). Pairs of machines began halting replication to protect themselves against corruption from the invalid configuration file. Eventually the load balancer pair was affected and we could no longer serve even the Angry Unicorn page. The way the servers went down, the number of servers that went down, and the time it takes to resync downed pairs combined to make this a lengthy outage. There are several steps to preventing this kind of outage in the future. First and most obvious is to maintain tighter control and testing of proposed system-wide configuration changes. We also plan to deploy (well-tested) changes to the DRBD configuration that will reduce cleanup times and automate the startup process for downed machines. Together, these changes will mean shorter recovery times both for single failovers and for wider machine-level restarts.
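As one concrete (and again hypothetical) example of what “tighter control” means: a config file shouldn’t reach Puppet’s distribution point unless it has at least parsed cleanly somewhere safe first. Here is a sketch in Python, assuming drbdadm’s `-c`/`--config-file` option and a made-up staging path; none of the paths or names reflect our actual setup.

```python
#!/usr/bin/env python3
"""Pre-deploy gate for a candidate DRBD configuration (hypothetical sketch).

`drbdadm ... dump` reads the given config and exits non-zero on parse
errors, without touching any running resources, so a bad file can be
caught before Puppet fans it out to every machine.
"""
import subprocess
import sys

# Hypothetical path -- not real GitHub infrastructure.
CANDIDATE = "/etc/puppet/staging/drbd.conf"

def config_parses(path):
    # Assumes drbdadm's -c/--config-file option; `dump` only parses and
    # pretty-prints the config, so this is safe to run anywhere.
    result = subprocess.run(
        ["drbdadm", "-c", path, "dump"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    if not config_parses(CANDIDATE):
        sys.exit("refusing to deploy: candidate DRBD config failed to parse")
    print("candidate config parsed cleanly; OK to hand off to Puppet")
```

A parse check is only the first gate, of course; the same hook could push the candidate to a single canary pair and watch replication health before the change goes system-wide.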
Outage #3 (02/04/2010 2:37:08AM PST) was caused by massive load spikes across all five file servers. To prevent extended downtime, we marked all file servers as offline (preventing them from going into failover) and looked for the cause of the load. After inspecting the HTTP logs, we identified a Yahoo! spider that was making thousands of requests but never waiting for responses. After banning the spider, the load returned to normal and we were able to bring the file servers back online. We are reviewing our rate limiting strategy and will be making improvements over time to get the best performance for legitimate users and the best protection from anomalous behavior.
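To give a flavor of what better rate limiting can look like, here is a minimal sketch of one common approach, a per-client token bucket. This is not our actual implementation; the class, the numbers, and the idea of keying on client IP are all illustrative.

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client token bucket. Each request costs one token; tokens
    refill at `rate` per second, up to a burst ceiling of `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: float(capacity))  # buckets start full
        self.last_seen = defaultdict(time.time)

    def allow(self, client):
        now = time.time()
        elapsed = now - self.last_seen[client]
        self.last_seen[client] = now
        # Refill in proportion to time elapsed, capped at the burst size.
        self.tokens[client] = min(self.capacity,
                                  self.tokens[client] + elapsed * self.rate)
        if self.tokens[client] >= 1.0:
            self.tokens[client] -= 1.0
            return True
        return False

# e.g. limiter = TokenBucket(rate=5, capacity=20)
# if not limiter.allow(client_ip): reject the request
```

With a check like `limiter.allow(client_ip)` guarding each request, a well-behaved browser never notices the bucket, while a spider firing thousands of requests without waiting for responses drains its bucket almost immediately and gets turned away until it slows down.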
In order to execute the improvements to various infrastructure elements, we will be having scheduled maintenance windows at 10PM PST over the next week. Most of these changes will not require any downtime, but some of them may result in temporary unavailability of file server partitions. As we perform the maintenance, we’ll keep you updated via the GitHub Twitter account, so make sure to check there for the latest maintenance news.
We sincerely apologize for the recent problems and are working very hard to address each flaw. Stability is one of our biggest goals this year, and I look forward to making your GitHub experience as flawless as possible.