Today’s Outage
A few hours ago I was upgrading our continuous integration setup when a configuration error caused it to run against our production environment rather than our testing environment.
Before every run of our test suite we destroy then re-create the database so that we have a known, clean starting point. This also allows us to continuously integrate topic branches with potentially different database schemas. Due to the configuration error GitHub’s production database was destroyed then re-created. Not good.
We immediately began restoring the database from our most recent backup. Unfortunately, while most tables in the GitHub database are small, our “events” table is large. This significantly slowed the restoration process.
Eventually the decision was made to skip the events table in order to speed up the restoration process. As a result, your dashboard and profile might currently be blank – rather annoying, but hopefully only temporary. We will be restoring the events table bit by bit over the next few days in an attempt to minimize downtime.
Worse, however, is that we may have lost some data from between the last good database backup and the time of the deletion. Newly created users and repositories are being restored, but pull request state changes and similar might be gone.
Obviously, this should never have happened. It should be very difficult to cause a database failure like this and very easy to recover from it.
Our plan moving forward:
- Completely isolate the test environment from the production environment, i.e. make production hosts unreachable from the testing VM.
- Reduce the size and growth rate of our events table. This is already well underway but is now one of our top priorities.
- Begin storing binary logs (binlogs) so that a future database restoration loses less data. Shrinking the events table, the previous item, will make this much easier.
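Network isolation is the real fix, but a cheap second layer is to make the destroy-and-re-create step itself refuse to run against anything that does not look like a test database. The following is only a minimal sketch of that idea, not our actual tooling: the names `assert_safe_to_reset`, `reset_database`, `ALLOWED_ENVS`, `ALLOWED_HOSTS` and the `_test` suffix convention are all assumptions made for illustration.

```python
class UnsafeEnvironmentError(RuntimeError):
    """Raised when a destructive task targets something other than a test database."""


# Hypothetical allowlists: only these environments and hosts may be reset.
ALLOWED_ENVS = {"test"}
ALLOWED_HOSTS = {"localhost", "127.0.0.1"}


def assert_safe_to_reset(env_name, db_host, db_name):
    """Refuse to continue unless every check says this is a test database."""
    if env_name not in ALLOWED_ENVS:
        raise UnsafeEnvironmentError(f"environment {env_name!r} is not a test environment")
    if db_host not in ALLOWED_HOSTS:
        raise UnsafeEnvironmentError(f"host {db_host!r} is not on the test allowlist")
    if not db_name.endswith("_test"):
        raise UnsafeEnvironmentError(f"database {db_name!r} does not end in '_test'")


def reset_database(env_name, db_host, db_name, drop, create):
    """Drop and re-create a database, but only after the safety checks pass.

    `drop` and `create` are caller-supplied callables that take the database
    name; they stand in for whatever the real database client would do.
    """
    assert_safe_to_reset(env_name, db_host, db_name)
    drop(db_name)
    create(db_name)
```

The checks use allowlists rather than a blocklist of known production names, so a misconfigured or unexpected value fails closed: the reset is skipped unless the environment, host and database name all match what a test run is expected to use.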
We’re very sorry about this, especially if we ruined your work day or Sunday afternoon. Please email support@github.com if you are still having problems or need to discuss the outage further.