GitHub Availability Report: January 2025

In January, we experienced three incidents that resulted in degraded performance across GitHub services.

January 09 01:26 UTC (lasting 31 minutes)

On January 9, 2025, between 01:26 UTC and 01:56 UTC, GitHub experienced widespread disruption to many services, with users receiving 500 responses when trying to access various functionality. This was due to a deployment that introduced a query which saturated a primary database server. The error rate averaged 6% and peaked at 6.85% of update requests.

We mitigated the incident by identifying the source of the problematic query and rolling back the deployment. Our internal tooling and dashboards surfaced the relevant data, helping us quickly identify the errant query; it took a total of 14 minutes from the time we engaged to finding it.

However, we are investing in tooling to detect problematic queries prior to deployment, both to prevent issues like this one and to reduce our time to detection and mitigation in the future.
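As a rough sketch of what that kind of tooling could look like (an illustrative example, not GitHub's actual implementation), a CI step can run EXPLAIN against a read replica for each query a deployment introduces and fail the build if the estimated number of examined rows is too high. The replica host, credentials, query list, and threshold below are all assumptions:

```python
# Hypothetical pre-deployment check (not GitHub's actual tooling): run
# EXPLAIN FORMAT=JSON for queries introduced by a deployment against a read
# replica, and fail CI when the estimated rows examined exceed a threshold.
import json
import os
import sys

import mysql.connector  # pip install mysql-connector-python

# Queries added by the deployment under review (illustrative placeholder).
NEW_QUERIES = [
    "SELECT * FROM repositories WHERE description LIKE '%infra%'",
]

MAX_EXAMINED_ROWS = 100_000  # fail the check above this estimate


def estimated_rows(plan) -> int:
    """Recursively sum rows_examined_per_scan across the JSON query plan."""
    total = 0
    if isinstance(plan, dict):
        total += int(plan.get("rows_examined_per_scan", 0))
        for value in plan.values():
            total += estimated_rows(value)
    elif isinstance(plan, list):
        for value in plan:
            total += estimated_rows(value)
    return total


def main() -> int:
    conn = mysql.connector.connect(
        host="db-replica.internal.example",  # hypothetical read replica
        user="ci_explain",
        password=os.environ["CI_DB_PASSWORD"],
        database="app",
    )
    cursor = conn.cursor()
    failing = []
    for query in NEW_QUERIES:
        cursor.execute("EXPLAIN FORMAT=JSON " + query)
        (plan_json,) = cursor.fetchone()
        rows = estimated_rows(json.loads(plan_json))
        if rows > MAX_EXAMINED_ROWS:
            failing.append((query, rows))
    for query, rows in failing:
        print(f"query may examine ~{rows} rows: {query}")
    return 1 if failing else 0


if __name__ == "__main__":
    sys.exit(main())
```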

January 13 23:35 UTC (lasting 49 minutes)

On January 13, 2025, between 23:35 UTC and 00:24 UTC, all Git operations were unavailable. A configuration change related to traffic routing and testing caused our internal load balancer to drop requests between the services that Git relies upon.

We mitigated the incident by rolling back the configuration change.

We are improving our monitoring and deployment practices to reduce our time to detection and to enable automated mitigation for issues like this in the future.
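One common way to shorten detection and automate mitigation for this class of failure is to exercise a routing change against a canary load balancer before it reaches the whole fleet, and to roll back automatically when canary probes fail. The sketch below is a minimal illustration of that pattern; the canary URL and the lb-config rollback command are hypothetical placeholders, not GitHub's deployment tooling:

```python
# Minimal sketch of canary validation for a traffic-routing config change.
# The canary URL and the "lb-config" rollback command are hypothetical.
import subprocess
import urllib.error
import urllib.request

CANARY_URL = "https://lb-canary.internal.example/healthz"  # hypothetical
PROBES = 20
MAX_FAILURES = 1  # tolerate at most one transient failure


def probe_canary() -> int:
    """Send health-check requests through the canary and count failures."""
    failures = 0
    for _ in range(PROBES):
        try:
            # urlopen raises HTTPError (a URLError subclass) on 4xx/5xx.
            with urllib.request.urlopen(CANARY_URL, timeout=2):
                pass
        except (urllib.error.URLError, TimeoutError):
            failures += 1
    return failures


def main() -> None:
    failures = probe_canary()
    if failures > MAX_FAILURES:
        # Roll back with the same (placeholder) tooling that applied the change.
        subprocess.run(["lb-config", "rollback", "--to", "last-known-good"],
                       check=True)
        raise SystemExit(f"canary failed {failures}/{PROBES} probes; rolled back")
    print(f"canary healthy ({PROBES - failures}/{PROBES} probes succeeded)")


if __name__ == "__main__":
    main()
```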

January 30 14:22 UTC (lasting 26 minutes)

On January 30, 2025, between 14:22 UTC and 14:48 UTC, web requests to github.com experienced failures (at peak the error rate was 44%), with the average successful request taking over three seconds to complete.

This outage was caused by a hardware failure in the caching layer that supports rate limiting. In addition, the impact was prolonged due to a lack of automated failover for the caching layer. A manual failover of the primary to trusted hardware was performed following recovery to ensure that the issue would not reoccur under similar circumstances.

As a result of this incident, we are moving to a high availability cache configuration and adding resilience to cache failures at this layer, so that requests can still be served should similar circumstances arise in the future.
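One way to picture that resilience is a rate limiter that fails open when its cache is unreachable, so a cache outage degrades to temporarily unenforced limits rather than failed web requests. The sketch below uses redis-py as a stand-in for the cache layer; the host, key scheme, and limits are assumptions, and this is not a description of GitHub's actual rate limiter:

```python
# Illustrative fixed-window rate limiter that fails open when the cache
# backing it is unavailable. redis-py stands in for the cache layer here;
# the host, key scheme, and limits are hypothetical.
import redis

WINDOW_SECONDS = 60
LIMIT_PER_WINDOW = 5000

cache = redis.Redis(
    host="cache-primary.internal.example",  # hypothetical cache host
    port=6379,
    socket_connect_timeout=0.05,
    socket_timeout=0.05,
)


def allow_request(client_id: str) -> bool:
    """Return True if the client is within its limit, or if the cache is down."""
    key = f"ratelimit:{client_id}"
    try:
        count = cache.incr(key)
        if count == 1:
            cache.expire(key, WINDOW_SECONDS)  # start a new window
        return count <= LIMIT_PER_WINDOW
    except redis.RedisError:
        # Cache failure: fail open so the web tier keeps serving requests,
        # and rely on monitoring and failover to restore enforcement.
        return True
```

Failing open trades strict enforcement for availability; whether that trade-off is acceptable depends on how much protection the rate limit is providing at the time.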


Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.
