Postmortem of last week’s fileserver failure
As you may know, on Thursday, 2010-06-10, we had a machine in one of our fileserver pairs fail. This caused an outage for all users on that fileserver, and corrupted files in a number of repos on that fileserver.
Lead up
Prior to the failure, we had been seeing some anomalous behaviour from the A server in this pair. The machine had been taken offline and memory tests performed, but no issue was found. Because we were concerned about being without a standby if an issue arose on the B server, we put the A server back into rotation. The server was left in standby mode so that disk syncs could continue.
The failure
At approximately 14:55 PST, load on B spiked. This led to a failover to A. Over the next hour, reports of repo corruption and rollbacks were investigated. It was apparent that most of the issues were with repos that had been pushed to since the failover, so at 16:40 PST the fileserver pair was taken offline. The decision was made to bring B back online after verifying that the corruption was not being caused on it. This was completed and the B server was put back into service at 18:40 PST.
Recovery
Scanning and recovery began immediately after the B server was put back into service, and proceeded through the weekend. Every repo on the server was scanned with git fsck. Any repo that failed this check was re-scanned and its corrupted objects were restored from the last uncorrupted disk snapshot or from backup. A small handful of repos were pushed to during the time A was serving and these pushes were unrecoverable. Owners of the unrecoverable repos were notified of the issue and given instructions on how to push the missing commits to their repos.
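To illustrate the kind of scan described above, here is a minimal sketch of walking a directory of bare repositories and running git fsck in each, collecting the ones that report errors. The paths and layout are assumptions for illustration, not our actual tooling.

```python
# Illustrative sketch only: REPO_ROOT and the directory layout are hypothetical.
# Runs `git fsck --full` in each bare repo and reports the ones that fail.
import subprocess
from pathlib import Path

REPO_ROOT = Path("/data/repositories")  # hypothetical location of bare repos

def fsck_repo(repo: Path) -> bool:
    """Return True if the repository passes `git fsck --full`."""
    result = subprocess.run(
        ["git", "--git-dir", str(repo), "fsck", "--full"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

def scan_all(root: Path) -> list:
    """Scan every *.git directory under root; return those that fail fsck."""
    corrupted = []
    for repo in sorted(root.glob("*/*.git")):
        if not fsck_repo(repo):
            corrupted.append(repo)
            print(f"CORRUPT: {repo}")
    return corrupted

if __name__ == "__main__":
    failed = scan_all(REPO_ROOT)
    print(f"{len(failed)} repositories need restoration from snapshot or backup")
```

Repos flagged by a scan like this would then be restored from the most recent clean snapshot or from backup, as described above.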
Resolution
After investigation, we’ve concluded that faulty hardware on the A server was the cause. The server has been replaced with new hardware and is currently being tested. We are updating our post-failover procedures to ensure filesystem snapshots remain intact and uncorrupted. We are also updating our snapshot job to perform fscks to identify corruption early.