Postmortem of last week’s fileserver failure
As you may know, on Thursday, 2010-06-10, we had a machine in one of our fileserver pairs fail. This caused an outage for all users on that fileserver and corrupted files in a number of repos on it.
Lead up
Prior to the failure, we had been seeing some anomalous behaviour from the A server in this pair. The machine had been taken offline and memory tests performed, but no issue was found. Because we were concerned about the machine being out of active rotation if an issue were to arise on the B server, we put the A server back into rotation, leaving it in standby mode so that disk syncs could occur.
The failure
At approximately 14:55 PST, load on B spiked. This led to a failover to A. Over the next hour, reports of repo corruption and rollbacks were investigated. It became apparent that most of the issues were with repos that had been pushed to since the failover, so at 16:40 PST the fileserver pair was taken offline. After verifying that the corruption was not occurring on B, we decided to bring it back online. This was completed and the B server was put back into service at 18:40 PST.
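As an illustration of how the affected set could be narrowed down, here is a minimal sketch (not our actual tooling) that lists bare repos whose refs changed after the failover. It assumes repositories live under a hypothetical /data/repositories directory and that a push touches files under refs/ or packed-refs:

```python
#!/usr/bin/env python3
"""Sketch only: list bare repositories whose refs changed after a cutoff.

Assumes bare repos sit directly under REPO_ROOT and that a push updates
files under refs/ or packed-refs, so a newer mtime on one of those files
marks the repo as "pushed to since the failover".
"""
import os
from datetime import datetime

REPO_ROOT = "/data/repositories"          # assumed layout, for illustration
CUTOFF = datetime(2010, 6, 10, 14, 55)    # approximate failover time (local)

def pushed_since(repo_path, cutoff):
    """Return True if any ref file in the repo was modified after cutoff."""
    candidates = [os.path.join(repo_path, "packed-refs")]
    for dirpath, _dirnames, filenames in os.walk(os.path.join(repo_path, "refs")):
        candidates.extend(os.path.join(dirpath, name) for name in filenames)
    for path in candidates:
        try:
            mtime = datetime.fromtimestamp(os.path.getmtime(path))
        except OSError:
            continue                      # file may not exist; skip it
        if mtime > cutoff:
            return True
    return False

if __name__ == "__main__":
    for entry in sorted(os.listdir(REPO_ROOT)):
        repo = os.path.join(REPO_ROOT, entry)
        if entry.endswith(".git") and pushed_since(repo, CUTOFF):
            print(entry)
```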
Recovery
Scanning and recovery began immediately after the B server was put back into service and proceeded through the weekend. Every repo on the server was scanned with git fsck. Any repo that failed this check was re-scanned and its corrupted objects were restored from the last uncorrupted disk snapshot or from backup. A small handful of repos were pushed to during the time A was serving, and those pushes were unrecoverable. Owners of the unrecoverable repos were notified of the issue and given instructions on how to push the missing commits to their repos.
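For readers curious what that scan pass might look like, here is a minimal sketch, assuming the bare repos sit under a hypothetical /data/repositories directory. It simply runs git fsck --full in each one and records the failures so they can be handed to the restore step; it is an illustration, not the script we actually ran:

```python
#!/usr/bin/env python3
"""Sketch only: run `git fsck --full` across every bare repo and record
which ones need objects restored from a snapshot or backup. The paths are
assumptions, not the real layout or recovery tooling."""
import os
import subprocess

REPO_ROOT = "/data/repositories"        # assumed location of the bare repos
FAILED_LIST = "/tmp/fsck-failures.txt"  # hand this list to the restore step

def fsck(repo_path):
    """Return (ok, combined output) for a single repository."""
    result = subprocess.run(
        ["git", "--git-dir", repo_path, "fsck", "--full"],
        capture_output=True, text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr

if __name__ == "__main__":
    failures = []
    for entry in sorted(os.listdir(REPO_ROOT)):
        repo = os.path.join(REPO_ROOT, entry)
        if not entry.endswith(".git"):
            continue
        ok, output = fsck(repo)
        if not ok:
            failures.append(entry)
            print("CORRUPT: %s\n%s" % (entry, output))
    with open(FAILED_LIST, "w") as handle:
        handle.write("\n".join(failures) + "\n")
```

An owner who still had the missing commits in a local clone could then restore them with an ordinary git push once their repo came back clean.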
Resolution
After investigation, we’ve concluded that faulty hardware on the A server was the cause. The server has been replaced with new hardware and is currently being tested. We are updating our post-failover procedures to ensure filesystem snapshots remain intact and uncorrupted, and we are updating our snapshot job to run fscks so that corruption is identified early.
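As a rough illustration of that second change, the sketch below bolts a verification step onto a snapshot job: after a snapshot is taken, the repos inside it are checked and the job fails loudly if any are corrupt, so a bad snapshot is never trusted for recovery. The mount point and the way alerts are raised are assumptions, not our production setup:

```python
#!/usr/bin/env python3
"""Sketch only: verify the repos inside a freshly taken filesystem snapshot
so corruption is caught early. The snapshot mount point is an assumption."""
import os
import subprocess
import sys

SNAPSHOT_MOUNT = "/snapshots/latest"   # assumed read-only mount of the newest snapshot

def corrupt_repos(mount_point):
    """Return (repo, error output) for every repo in the snapshot that fails fsck."""
    bad = []
    for entry in sorted(os.listdir(mount_point)):
        if not entry.endswith(".git"):
            continue
        result = subprocess.run(
            ["git", "--git-dir", os.path.join(mount_point, entry), "fsck", "--full"],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            bad.append((entry, result.stderr.strip()))
    return bad

if __name__ == "__main__":
    problems = corrupt_repos(SNAPSHOT_MOUNT)
    for name, detail in problems:
        print("fsck failed in snapshot copy of %s: %s" % (name, detail), file=sys.stderr)
    # A non-zero exit lets the surrounding snapshot job flag this snapshot as bad.
    sys.exit(1 if problems else 0)
```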