In October, we experienced two incidents that resulted in degraded performance across GitHub services.
October 17 10:59 UTC (lasting 2 hours and 49 minutes)
From 10:59 UTC to 13:48 UTC on October 17, the GitHub Codespaces service was degraded due to an authentication outage. The issue impacted 67% of users during this period, who saw failures when creating and starting their Codespaces. The regional authentication layer was throttled by a global third-party dependency because of increased load from onboarding a new Codespaces region. The Codespaces team mitigated the issue manually by reducing load on the external dependency. Following the incident, the Codespaces team is evaluating and implementing scaling improvements to make the service more resilient to increasing demand. These include regional-level caching to minimize calls to the dependency, and measures to keep the authentication service healthy when errors occur.
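As an illustration of the regional-level caching idea, here is a minimal sketch in Python. It is not the Codespaces implementation: the class name `RegionalTTLCache`, the `fetch` callable standing in for the third-party dependency, and the choice to serve a stale entry when the dependency fails are all assumptions made for this example.

```python
import time
from typing import Callable, Dict, Tuple


class RegionalTTLCache:
    """Hypothetical per-region cache placed in front of a rate-limited dependency."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # stand-in for the call to the external dependency
        self._ttl = ttl_seconds      # how long a cached value is considered fresh
        self._clock = clock          # injectable so the cache can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Fresh hit: no call to the dependency at all.
            return entry[1]
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                # Assumption: when the dependency errors (for example, throttling),
                # serving the expired value is preferable to failing the request.
                return entry[1]
            raise
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a cache like this, repeated lookups for the same key within the TTL reach the dependency once per region rather than once per request. Whether serving stale data is acceptable depends on what is being cached; for credentials, a short TTL is usually the safer choice.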
October 25 09:13 UTC (lasting 3 hours and 27 minutes cumulatively)
From October 25 through October 26, GitHub Copilot experienced multiple short, partial outages that affected code completions.
GitHub Copilot completions are currently hosted in multiple regions globally. Users are typically routed to the nearest geographic region, but may be routed to other regions when the nearest region is unhealthy. At 09:13 UTC on October 25, GitHub Copilot began experiencing partial outages of individual regions, each lasting approximately 12 minutes. An automated process was upgrading the nodes that host the completion model, and a subset of GitHub Copilot users experienced completion errors while their region was affected. The issue was fully resolved at 02:40 UTC on October 26.
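The routing behavior described above (prefer the nearest region, fall back to another when it is unhealthy) can be sketched roughly as follows. This is an illustrative Python example only: the `Region` type, its `distance_km` and `healthy` fields, the `pick_region` function, and the region names are hypothetical and do not describe GitHub's load balancer.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Region:
    name: str           # hypothetical region identifier
    distance_km: float  # assumed proxy for "nearest": distance from the user
    healthy: bool       # assumed result of a health check


def pick_region(regions: Iterable[Region]) -> Optional[Region]:
    """Return the nearest healthy region, or None if no region is healthy."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda r: r.distance_km)


# Usage: the nearest region is mid-upgrade, so the next nearest one is chosen.
candidates = [
    Region("region-a", 120.0, healthy=False),
    Region("region-b", 900.0, healthy=True),
    Region("region-c", 6400.0, healthy=True),
]
selected = pick_region(candidates)
```

In this sketch, the quality of the fallback depends entirely on how quickly `healthy` reflects reality: if a region is marked unhealthy only after requests start failing, users still see errors during the gap.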
To prevent similar outages in the future, we have taken steps to disable the automated upgrade behavior that we identified as the root cause, and we are prioritizing improvements to our global load balancing during regional outages.