As you may know, on thursday, 2010-06-10, we had a machine in one of our fileserver pairs fail. This caused an outage for all users on that fileserver, and corrupted files in a number of repos on the file server.
Prior to the failure, we had been seeing some anomalous behaviour from the A server in this pair. The machine had been taken offline and memory tests performed, but no issue was found. Concerned about the machine being out of active rotation if an issue were to arise on the B server, the A server was put back into rotation. The server was left in standby mode so that disk syncs could occur.
At approximately 14:55 PST, load on B spiked. This lead to a fallover to A. Over the next hour, reports of repo corruption and rollbacks were investigated. It was apparent that most of the issues were with repos that had been pushed to since the fallover, so at 16:40 PST, the fileserver pair was taken offline. The decision was made to bring B back online, after verifying that the corruption was not being caused on it. This was completed and the B server was put back into service at 18:40 PST.
Scanning and recovery began immediately after the B server was put back into service, and proceded through the weekend. Every repo on the server was scanned with
git fsck. Any repo that failed this check was re-scanned and its corrupted objects were restored from the last uncorrupted disk snapshot or from backup. A small handful of repos were pushed to during the time A was serving and these pushes were unrecoverable. Owners of the unrecoverable repos were notified of the issue and given instructions on how to push the missing commits to their repos.
After investigation we’ve concluded that faulty hardware on the A server was the cause. The server has been replaced with new hardware and is currently being tested. We are updating our post-fallover procedures to ensure filesystem snapshots remain intact and uncorrupted. We are also updating our snapshot job to perform fscks to identify corruption early.