GitHub Availability Report: January 2024
In January, we experienced three incidents that resulted in degraded performance across GitHub services.
In January, we experienced three incidents that resulted in degraded performance across GitHub services.
January 09 12:20 UTC (lasting 140 minutes)
On January 9 between 12:20 and 14:40 UTC, services in one of our three sites experienced elevated latency for connections. This led to a sustained period of timed-out requests across a number of services, including but not limited to our Git backend. An average of 5% and max of 10% of requests failed with a 5xx response or timed out during this period.
This was caused by an upgrade of hosts, which led to temporarily reduced capacity as the upgrade rolled through the fleet. While these hosts had plenty of capacity to handle the increased load, we found that the configured connection limit was lower than it should have been. We have increased that limit to prevent this from recurring. We have also identified improvements to our monitoring of connection limits and behavior and changes to reduce the risk of host upgrades leading to reduced capacity.
January 21 02:01 UTC (lasting 7 hours 3 minutes)
On January 21 at 2:01 UTC, we experienced an incident that affected customers using GitHub Codespaces. Customers encountered issues creating and resuming Codespaces in multiple regions due to operational issues with compute and storage resources.
Around 25% of customers were impacted, primarily in East US and West Europe. We re-routed traffic for Codespace creations to less impacted regions, but existing Codespaces in these regions may have been unable to resume during the incident.
By 7:30 UTC, we had recovered connectivity to all regions except West Europe, which had an extended recovery time due to increased load in that particular region. The incident was resolved on January 21 at 9:34 UTC once Codespace creations and resumes were working normally in all regions.
We are working to improve our alerting and resiliency to reduce the duration and impact of region-specific outages.
January 31 12:30 UTC (lasting 147 minutes)
On January 31, we deployed an infrastructure change to our load balancers in preparation towards our longer term goal of IPv6 enablement at GitHub.com. This change was deployed to a subset of our global edge sites. The change had the unintended consequence of causing IPv4 addresses to start being passed as an IPv4-mapped IPv6-compatible address (for example, 10.1.2.3 became ::ffff:10.1.2.3) to our IP Allow List functionality. While our IP Allow List functionality was developed with IPv6 in mind, it wasn’t developed to handle these mapped addresses, and hence, started blocking requests as it deemed these to be not in the defined list of allowed addresses. Request error rates peaked at 0.23% of all requests.
In addition to changes deployed to remediate the issues, we have taken steps to improve testing and monitoring to better catch these issues in the future.
Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.
Tags:
Written by
Related posts
The top 10 gifts for the developer in your life
Whether you’re hunting for the perfect gift for your significant other, the colleague you drew in the office gift exchange, or maybe (just maybe) even for yourself, we’ve got you covered with our top 10 gifts that any developer would love.
Congratulations to the winners of the 2024 Gaady Awards
The Gaady Awards are like the Emmy Awards for the field of digital accessibility. And, just like the Emmys, the Gaadys are a reason to celebrate! On November 21, GitHub was honored to roll out the red carpet for the accessibility community at our San Francisco headquarters.
Students: Start building your skills with the GitHub Foundations certification
The GitHub Foundations Certification exam fee is now waived for all students verified through GitHub Education.