Postmortem of last week’s fileserver failure
As you may know, on Thursday, 2010-06-10, we had a machine in one of our fileserver pairs fail. This caused an outage for all users on that fileserver and corrupted files in a number of repos on that fileserver.
Lead up
Prior to the failure, we had been seeing some anomalous behavior from the A server in this pair. The machine had been taken offline and memory tests performed, but no issue was found. Because we were concerned about having the machine out of active rotation if an issue arose on the B server, we put the A server back into rotation, leaving it in standby mode so that disk syncs could continue.
The failure
At approximately 14:55 PST, load on the B server spiked, which led to a failover to the A server. Over the next hour, reports of repo corruption and rollbacks were investigated. It became apparent that most of the issues were with repos that had been pushed to since the failover, so at 16:40 PST the fileserver pair was taken offline. After verifying that the corruption was not occurring on the B server, we decided to bring it back online. This was completed and the B server was put back into service at 18:40 PST.
Recovery
Scanning and recovery began immediately after the B server was put back into service and proceeded through the weekend. Every repo on the server was scanned with git fsck. Any repo that failed this check was re-scanned, and its corrupted objects were restored from the last uncorrupted disk snapshot or from backup. A small handful of repos were pushed to during the time the A server was serving, and those pushes were unrecoverable. Owners of the unrecoverable repos were notified of the issue and given instructions on how to push the missing commits back to their repos.
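To give a sense of what a scan like this looks like, here is a minimal sketch of iterating over bare repos and running git fsck against each one. The repo root path and naming convention are hypothetical placeholders, not our actual tooling.

```python
#!/usr/bin/env python3
"""Illustrative sketch: scan every bare repo under a root directory with
git fsck --full and report the ones whose object store fails the check.
The path below is a hypothetical example."""

import subprocess
from pathlib import Path

REPO_ROOT = Path("/data/repositories")  # hypothetical location of the bare repos


def fsck_repo(repo: Path) -> bool:
    """Return True if git fsck --full finds no problems in the repo."""
    result = subprocess.run(
        ["git", "--git-dir", str(repo), "fsck", "--full"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def main() -> None:
    corrupted = []
    for repo in sorted(REPO_ROOT.glob("*.git")):
        if not fsck_repo(repo):
            corrupted.append(repo)
            print(f"CORRUPT: {repo}")
    print(f"Scan complete, {len(corrupted)} repos failed fsck")


if __name__ == "__main__":
    main()
```

Repos flagged by a scan like this would then be candidates for restoring objects from a known-good snapshot or from backup.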
Resolution
After investigation, we've concluded that faulty hardware on the A server was the cause. The server has been replaced with new hardware and is currently being tested. We are updating our post-failover procedures to ensure filesystem snapshots remain intact and uncorrupted, and we are updating our snapshot job to perform fscks so that corruption is identified early.
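As an illustration of the kind of check being added to the snapshot job, the sketch below verifies a freshly taken snapshot by running git fsck over the repos it contains. The snapshot mount point and the alerting behavior are assumptions for the example, not a description of our actual job.

```python
#!/usr/bin/env python3
"""Minimal sketch of a post-snapshot verification step: after a filesystem
snapshot is taken, run git fsck against the repos inside the mounted snapshot
so corruption is caught before the snapshot is trusted for recovery.
The mount point and exit-code alerting are hypothetical placeholders."""

import subprocess
import sys
from pathlib import Path

SNAPSHOT_MOUNT = Path("/mnt/snapshots/latest")  # hypothetical read-only snapshot mount


def verify_snapshot(mount: Path) -> list:
    """Return the list of repos inside the snapshot that fail git fsck."""
    failures = []
    for repo in sorted(mount.glob("*.git")):
        result = subprocess.run(
            ["git", "--git-dir", str(repo), "fsck"],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            failures.append(repo)
    return failures


if __name__ == "__main__":
    bad = verify_snapshot(SNAPSHOT_MOUNT)
    if bad:
        # In a real job this would alert the on-call admin; here we just exit non-zero.
        print("Snapshot verification failed for:", ", ".join(str(p) for p in bad))
        sys.exit(1)
    print("Snapshot verified clean")
```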