In August, we experienced two distinct incidents that resulted in significant impact and a degraded state of availability for Git operations, API requests, webhooks, issues, pull requests, GitHub Pages, GitHub Packages, and GitHub Actions services.
The first incident, on August 10, was caused when one of our MySQL database primaries entered a degraded state, affecting a number of internal services. GitHub.com services that require write access to this particular database cluster were impacted, which left some users unable to perform operations.
Our investigation identified an edge case in one of our most active applications that generated a poorly performing query capable of impacting overall database capacity. Combined with application retry and queueing logic, this placed the MySQL primary into a state from which the cluster was unable to recover automatically.
We have addressed this query, as well as some of the application retry logic, to reduce the chance of recurrence.
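To illustrate why retry logic matters here: clients that retry immediately and without limit can keep an already struggling database saturated. The following is a minimal sketch of one common way to bound that pressure, using capped exponential backoff with jitter and a fixed attempt limit. It is not the change we made to our applications; the names `call_with_retries`, `backoff_delays`, and `RetryBudgetExceeded`, and all default values, are hypothetical and chosen only for illustration.

```python
import random
import time


class RetryBudgetExceeded(Exception):
    """Raised when an operation still fails after all allowed attempts."""


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one sleep duration per retry (max_attempts - 1 in total).

    Each delay is a random fraction of min(cap, base * 2**attempt), so
    clients spread out instead of retrying in lockstep.
    """
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * (2 ** attempt))


def call_with_retries(operation, max_attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Run operation(), retrying on failure a bounded number of times."""
    last_error = None
    delays = backoff_delays(max_attempts, base, cap, rng)
    for _ in range(max_attempts):
        try:
            return operation()
        except Exception as error:  # real code should catch only retryable errors
            last_error = error
            delay = next(delays, None)
            if delay is None:
                break  # no retries left: stop adding load
            sleep(delay)
    raise RetryBudgetExceeded(
        f"gave up after {max_attempts} attempts"
    ) from last_error
```

The `sleep` and `rng` parameters exist only so the behavior can be exercised deterministically in tests. In practice, the attempt limit and the cap would be tuned to the capacity of the system being called.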
One of the novel elements of this incident was the breadth of impact across multiple services. This led to a discussion about the overall service status as we were reporting it during the incident, so we’d like to take this opportunity to discuss the approach we took at the time, as well as how we look to increase our learning potential after an incident.
When we first introduced the monthly availability report, we aimed to provide post-incident reviews for major incidents that impact service availability, in addition to background on how we’re continuing to evolve the process. As part of our standard post-incident analysis process, we are using this incident as a valuable source of data to evaluate the responsiveness of our internal metrics and alerting. These systems guide our responders during incidents on both when to update our status and what degree of impact to report. As a result, we’re continuing to tune and optimize these activities to ensure we can update our status both quickly and accurately, so that we continue to earn the trust our users place in us every day.
In the second incident, following ongoing maintenance of the Actions service, our service monitors detected a high error rate on workflow runs for new and in-progress jobs. This resulted in the failure of all queued jobs for a period of time. This was a new incident, unrelated to the earlier issue on August 10. We immediately reverted recent Actions deployments and started to investigate the issue.
The incident was caused by work to set up a new Actions Premium Runner microservice in the Actions service. The impacting portion of this work involved alterations to the service discovery process within the Actions microservices architecture. A bad service record pushed to this system left many of the microservices unable to make service-to-service calls.
Ultimately, the mitigation for this incident was to remove the bad record from the service discovery infrastructure. After investigating whether this mitigation would address the incident, we confirmed that the bad record was the root cause of the issue and that removing it would restore the Actions service with no unintended side effects.
We have prioritized several changes as a result of this incident, including fixing this part of the Actions microservice discovery process to properly handle potential bad records. We’ve also added a broader scope of visibility into what’s changed recently across all of the Actions microservices, so we can quickly focus investigations in the correct place.
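As a rough illustration of what handling bad records can look like, the sketch below validates each service record before it enters a lookup table and sets aside any record that fails validation, so that one malformed entry does not break resolution for every other service. This is not a description of the Actions service discovery system: the record shape (`name`, `host`, `port`) and the names `ServiceRecord`, `validate_record`, and `load_registry` are assumptions made purely for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRecord:
    """Hypothetical shape of a service discovery entry."""
    name: str
    host: str
    port: int


def validate_record(raw):
    """Return a ServiceRecord, or raise ValueError if raw is malformed."""
    if not isinstance(raw, dict):
        raise ValueError("record must be a mapping")
    name, host, port = raw.get("name"), raw.get("host"), raw.get("port")
    if not isinstance(name, str) or not name.strip():
        raise ValueError("missing or empty service name")
    if not isinstance(host, str) or not host.strip():
        raise ValueError("missing or empty host")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(port, bool) or not isinstance(port, int) or not 1 <= port <= 65535:
        raise ValueError("port must be an integer between 1 and 65535")
    return ServiceRecord(name.strip(), host.strip(), port)


def load_registry(raw_records):
    """Build a name -> record lookup, setting aside records that fail validation."""
    registry, rejected = {}, []
    for raw in raw_records:
        try:
            record = validate_record(raw)
        except ValueError as error:
            # Keep the bad record and the reason so it can be reported and fixed.
            rejected.append((raw, str(error)))
            continue
        registry[record.name] = record
    return registry, rejected
```

Returning the rejected records alongside the registry keeps the failure visible: callers can alert on a non-empty `rejected` list while healthy services continue to resolve.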
We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. Please follow our status page for real-time updates and watch our blog for next month’s availability report. To learn more about what we’re working on, check out the GitHub Engineering blog.