Postmortem of last week’s fileserver failure
As you may know, on Thursday, 2010-06-10, we had a machine in one of our fileserver pairs fail. This caused an outage for all users on that fileserver and corrupted files in a number of repos on the server.
Lead up
Prior to the failure, we had been seeing some anomalous behavior from the A server in this pair. The machine had been taken offline and memory tests performed, but no issue was found. Because we were concerned about the machine being out of active rotation if an issue arose on the B server, we put the A server back into rotation, leaving it in standby mode so that disk syncs could continue.
The failure
At approximately 14:55 PST, load on the B server spiked. This led to a failover to the A server. Over the next hour, we investigated reports of repo corruption and rollbacks. It became apparent that most of the issues were with repos that had been pushed to since the failover, so at 16:40 PST the fileserver pair was taken offline. After verifying that the corruption was not occurring on the B server, we decided to bring it back online, and it was returned to service at 18:40 PST.
Recovery
Scanning and recovery began immediately after the B server was put back into service and proceeded through the weekend. Every repo on the server was scanned with git fsck. Any repo that failed this check was re-scanned, and its corrupted objects were restored from the last uncorrupted disk snapshot or from backup. A small handful of repos had been pushed to while the A server was in service, and those pushes were unrecoverable. Owners of the unrecoverable repos were notified of the issue and given instructions on how to push the missing commits back to their repos.
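For illustration, a fleet-wide scan like the one described can be sketched as a small shell function that runs git fsck against every bare repo under a directory and prints the paths of any that fail. This is a minimal sketch, not our actual tooling; the function name, directory layout (bare repos ending in .git), and output handling are all assumptions.

```shell
# Hypothetical sketch of the scan step: walk a directory of bare repos,
# run "git fsck --full" on each, and report the ones that fail the check.
scan_repos() {
    # $1: directory containing bare repositories (e.g. /data/repositories)
    for repo in "$1"/*.git; do
        [ -d "$repo" ] || continue  # skip if the glob matched nothing
        if ! git --git-dir="$repo" fsck --full >/dev/null 2>&1; then
            # fsck reported corruption (or the repo is unreadable)
            printf '%s\n' "$repo"
        fi
    done
}
```

A run such as `scan_repos /data/repositories > corrupt-repos.txt` would then leave a list of candidates for restoration from snapshot or backup.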
Resolution
After investigation, we've concluded that faulty hardware on the A server was the cause. The server has been replaced with new hardware, which is currently being tested. We are updating our post-failover procedures to ensure filesystem snapshots remain intact and uncorrupted, and we are also updating our snapshot job to perform fscks so that corruption is identified early.