In September, we experienced two incidents that resulted in degraded performance across GitHub services.
September 5 16:24 UTC (lasting 19 minutes)
On September 5, from 16:24-16:43 UTC, multiple GitHub services were down or degraded due to an outage in one of our primary databases. The primary host for a shared datastore for GitHub experienced an underlying file system write error, which affected availability for the majority of public-facing GitHub services. SAML login was affected, as was access to GitHub Actions, GitHub Issues, pull requests, GitHub Pages, GitHub API, Webhooks, GitHub Codespaces, and GitHub Packages.
The primary database suffered a partial host failure when the disk storage for the operating system became unreachable. In this case, our automatic failover was unable to detect the partial file system failure mode. We mitigated by manually failing over to a healthy host, initiated 17 minutes after our first alert and completed 2 minutes later.
With the incident mitigated, we have assessed the impact on each affected service in more detail and identified resilience improvements to reduce the scope of any future incident involving this shared dependency. Some of those improvements are complete, and the rest will be completed within our standard repair item SLAs. To increase the resiliency of our system, we have improved our automation so that it detects this type of partial host failure and initiates a failover. Additionally, we have identified a source of resource contention that is consistent with this type of failure and applied a patch to reduce the likelihood of recurrence.
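As an illustration of the general idea only, and not a description of GitHub's actual automation, a health check can catch a partial file system failure by attempting a real write instead of only checking that the host or process responds. The sketch below is a hypothetical example: the names `can_write` and `should_fail_over`, the probe file prefix, and the threshold of three consecutive failures are all assumptions made for illustration.

```python
import os
import tempfile


def can_write(directory: str) -> bool:
    """Return True if a small probe file can be created, written, and fsynced.

    Hypothetical probe: a host can answer pings and accept connections
    while its disk rejects writes, so the check exercises the write path.
    """
    try:
        fd, path = tempfile.mkstemp(dir=directory, prefix=".probe-")
    except OSError:
        return False
    try:
        os.write(fd, b"ok")
        os.fsync(fd)  # force the data to the device, not just the page cache
        return True
    except OSError:
        return False
    finally:
        # Best-effort cleanup; errors here do not change the probe result.
        for cleanup in (lambda: os.close(fd), lambda: os.unlink(path)):
            try:
                cleanup()
            except OSError:
                pass


def should_fail_over(results: list[bool], threshold: int = 3) -> bool:
    """Return True once the last `threshold` probes have all failed.

    Requiring consecutive failures (an assumed policy) avoids failing
    over on a single transient error.
    """
    if len(results) < threshold:
        return False
    return not any(results[-threshold:])
```

A probe like this only reports failures that surface as errors. A file system that hangs instead of returning an error would block the call, so a production check would also need a timeout around the probe.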
September 19 20:36 UTC (lasting 7 hours 30 minutes)
On September 19 at 20:36 UTC, an incident during a migration of the primary datastore for GitHub Projects disrupted 95% of GitHub Projects data availability for 3.5 hours. A misconfigured index constraint on the primary GitHub Projects database table caused GitHub Projects to become fully unavailable between 20:36 UTC and 00:06 UTC. By 00:06 UTC, we had restored GitHub Projects data to its state from the beginning of the incident. New project data created by users while the incident was being mitigated was fully recovered and available to users by 04:28 UTC.
In addition, a database replication interruption caused by our remediation steps created limited availability for some Git Operations, APIs, and GitHub Issues for 1.25 hours from 21:48 UTC to 23:00 UTC.
To prevent similar incidents in the future, we have improved validation of data migrations in testing and during rollout. We have evaluated the constraints applied to data migrations and are improving them to prevent the unexpected behavior that led to this data loss. To reduce the time to mitigate similar incidents, we are also rolling out improvements that reduce both the time to restore data and the time to fix replication issues.
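One common form of migration validation, shown here purely as a sketch and not as the checks GitHub uses, is to run the migration inside a transaction, compare a simple invariant before and after, and roll back if it does not hold. The example uses SQLite from the Python standard library; the function name `migrate_with_validation` and the choice of row count as the invariant are assumptions for illustration.

```python
import sqlite3


def migrate_with_validation(conn: sqlite3.Connection, migration_sql: str, table: str) -> bool:
    """Apply one migration statement and keep it only if the row count is unchanged.

    Assumes `conn` was opened with isolation_level=None so that the
    explicit BEGIN/COMMIT/ROLLBACK below control the transaction, and
    that `table` is a trusted identifier (it is interpolated into SQL).
    """
    count_sql = f"SELECT COUNT(*) FROM {table}"
    before = conn.execute(count_sql).fetchone()[0]
    try:
        conn.execute("BEGIN")
        conn.execute(migration_sql)
        after = conn.execute(count_sql).fetchone()[0]
        if after != before:
            raise ValueError(f"row count changed from {before} to {after}")
        conn.execute("COMMIT")
        return True
    except (sqlite3.Error, ValueError):
        # Undo the migration if it failed or broke the invariant.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        return False
```

A row count is a deliberately coarse invariant. It catches a migration that drops rows, but a real rollout check would likely compare more than counts, for example checksums or sampled records.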