A few hours ago I was upgrading our continuous integration setup when a configuration error caused it to run against our production environment rather than our testing environment.
Before every run of our test suite we destroy then re-create the database so that we have a known, clean starting point. This also allows us to continuously integrate topic branches with potentially different database schemas. Due to the configuration error GitHub’s production database was destroyed then re-created. Not good.
We immediately began restoring the database from our most recent backup. Unfortunately, while most tables in the GitHub database are small, our “events” table is large. This significantly slowed the restoration process.
Eventually the decision was made to skip the events table in order to speed up the restoration process. As a result, your dashboard and profile might currently be blank – rather annoying, but hopefully only temporary. We will be restoring the events table bit by bit over the next few days in an attempt to minimize downtime.
Worse, however, is that we may have lost some data from between the last good database backup and the time of the deletion. Newly created users and repositories are being restored, but pull request state changes and similar might be gone.
Obviously, this should have never had happened. It should be very difficult to cause a database failure like this and very easy to recover from it.
Our plan moving forward:
- Completely isolate the test environment from the production environment, i.e. make production hosts unreachable from the testing VM.
- Reduce the size and growth rate of our events table. This is already well underway but is now one of our top priorities.
- Begin storing binlogs to reduce data loss in the event of a future db restoration. The completion of #2 will help make this much easier.
We’re very sorry about this, especially if we ruined your work day or Sunday afternoon. Please email firstname.lastname@example.org if you are still having problems or need to discuss the outage further.