In March, we experienced three incidents resulting in significant impact and degraded state of availability for issues, pull requests, webhooks, API requests, GitHub Pages, and GitHub Actions services.
Follow up to March 1 09:59 UTC (lasting one hour and 42 minutes)
As mentioned in the February availability report, our service monitors detected a high error rate on creating check suites for workflow runs, which affected the Actions service. This incident resulted in the failure or delay of some queued jobs for a period of time. Additionally, some customers using the search/filter functionality to find workflows may have experienced incomplete search results.
Upon further investigation of this incident, we identified that the issue was caused by check suite IDs exceeding the maximum value of Int32 (2,147,483,647). We had anticipated that check suite IDs and check run IDs would cross this limit and had migrated the relevant database columns to bigint six months prior. However, our codebase, which consists of Ruby, Go, and C#, does not use explicit Int32 type casts, and we failed to identify that a GraphQL library we depend on uses Int32 when unmarshalling JSON.
When Actions identifies that a job needs to be run on a repository (triggered by webhooks or cron schedules), we first create a check suite. Those individual check suites were successfully created, since the database could handle values greater than Int32, but processing the responses failed because an external library we were using expected an Int32. As a result, jobs failed to be queued and the check suites were left in a pending state. We deployed a code fix to mitigate the incident after validating that it would not lead to data integrity issues in other microservices that may rely on check suite IDs.
Another impact of check suite IDs exceeding max Int32 was on searching for workflow runs via the UI and the API. We used the check suite ID to index this data, and the service handling search was similarly affected. The index had to be rebuilt, and while that was in progress, search results were incomplete.
To help avoid this class of failure in the future, we have audited and updated our usage of all external libraries. Furthermore, through drill testing exercises, we've mocked other possible failure points, such as check run IDs exceeding max Int32, to validate our fixes and avoid repeat incidents.
March 12 19:11 UTC (lasting one hour and 10 minutes)
This incident was caused by a database migration that flipped the order of the columns in an index to improve a query's performance, resulting in a degraded state of availability for GitHub.com. Reversing the index caused a full table scan, because a generated ActiveRecord query had an unnoticed dependency on the original column order. The performance degradation from the table scan had a cascading effect on query response times, which resulted in timeouts across dependent services.
To mitigate this class of issue, we're evaluating better tooling to identify index regressions before they ship. We've also created an inventory of the indexes used by generated queries to further ensure we comply with ActiveRecord best practices.
March 15 20:38 UTC (lasting one hour and 18 minutes)
Our service monitors detected high failure rates and an inability to run hosted jobs for the Actions service. During this incident, hosted jobs were queued for extended periods and eventually abandoned because we could not serve them with virtual environments. Due to this backlog, we were unable to process build requests delayed by more than 30 minutes.
This issue was caused by an outage of the authentication provider used by the infrastructure that runs our hosted compute, which left us unable to provide hosted builds for the duration of the outage. Once the authentication service was restored, we quickly processed the backlog of requests and returned to normal operation.
In order to mitigate this class of issue in the future, we are exploring ways to extend the lifetime of authentication tokens so that we can ride out brief outages and rely less heavily on this authentication mechanism.
We place great importance on the reliability of our service, and we know that the more we invest in the infrastructure and tooling that powers GitHub, the faster and more delightful an experience we can deliver.
Last month, we shared details on how Actions renders large-scale logs, and we’ll continue to spotlight our ongoing investments in scaling the Actions service to be faster and even more reliable. To learn more about our efforts in making GitHub more resilient every day, check out the GitHub engineering blog.