GitHub Availability Report: July 2020
Last month we introduced GitHub’s monthly availability report to address service disruptions and share our learnings with the community.

In July we experienced one specific incident resulting in a degraded state of availability for GitHub.com. We’d like to share our learnings from this incident with the community in the spirit of being transparent about our service disruptions, and helping other services improve their own operations.
July 13 08:18 UTC (lasting four hours and 25 minutes)
The incident began when our production Kubernetes Pods started being marked as unavailable. This cascaded through our clusters, reducing capacity and ultimately bringing down our services. Investigation into the Pods revealed that a single container within the Pod was exceeding its defined memory limits and being terminated. Even though that container is not required for production traffic to be processed, Kubernetes requires all containers in a Pod to be healthy before the Pod is marked as available.
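To illustrate why one non-essential container can take a whole Pod out of service, here is a minimal sketch of the readiness rule described above. It is a simplified model, not Kubernetes code: the `Container` class, the `serves_traffic` flag, and the container names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Container:
    name: str
    ready: bool
    # Hypothetical flag for illustration only; Kubernetes has no such field.
    serves_traffic: bool


def pod_is_ready(containers: List[Container]) -> bool:
    """Model the rule that a Pod is available only if every container is ready."""
    return all(c.ready for c in containers)


pod = [
    Container("app", ready=True, serves_traffic=True),
    # A helper container that was terminated for exceeding its memory limit.
    Container("helper", ready=False, serves_traffic=False),
]

# The helper does not serve traffic, yet it still makes the Pod unavailable.
print(pod_is_ready(pod))
```

Note that `serves_traffic` never enters the readiness decision, which mirrors the behavior in the incident: the Pod was marked unavailable regardless of whether the failing container mattered to production traffic.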
Normally when a Pod runs into this failure mode, the cluster will recover within a minute or so. In this case, the container in the Pod was configured with an ImagePullPolicy of Always, which instructed Kubernetes to fetch a new container image every time. However, due to a routine DNS maintenance operation that had been completed earlier, our clusters were unable to reach our registry, so Pods failed to start. The impact increased when a redeploy was triggered in an attempt to mitigate the issue, and we saw the failure start to propagate across our production clusters. It wasn’t until we restarted the process with the cached DNS records that we were able to successfully fetch container images, redeploy, and recover our services.
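The interaction between the pull policy and an unreachable registry can be sketched as a small decision function. This is a simplified model of the documented `Always`, `IfNotPresent`, and `Never` policies, not the kubelet's actual logic; the function name and its parameters are assumptions made for this example.

```python
def can_start_container(pull_policy: str, image_cached: bool, registry_reachable: bool) -> bool:
    """Simplified model: can a container start, given its image pull policy?"""
    if pull_policy == "Never":
        # Never contacts the registry; only a locally present image works.
        return image_cached
    if pull_policy == "IfNotPresent":
        # Uses the local image if present, otherwise needs the registry.
        return image_cached or registry_reachable
    if pull_policy == "Always":
        # Contacts the registry on every start, even if the image is cached.
        return registry_reachable
    raise ValueError(f"unknown pull policy: {pull_policy}")


# The incident scenario: image already on the node, registry unreachable.
for policy in ("Always", "IfNotPresent", "Never"):
    print(policy, can_start_container(policy, image_cached=True, registry_reachable=False))
```

Under this model, only the `Always` policy turns a registry outage into a failure to restart a container whose image is already on the node, which is why a restart that usually takes about a minute did not recover on its own.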
Moving forward, we’ve identified a number of areas to address this quarter:
- Enhancing monitoring to ensure that Pod restarts do not fail again in this same pattern
- Minimizing our dependency on the image registry
- Expanding validation during DNS changes
- Reevaluating all the existing Kubernetes deployment policies
In parallel, we have an ongoing workstream to improve our approach to progressive deployments that will provide the ability to carefully evaluate the impact of deployments in a more incremental fashion. This is part of a broader engineering initiative focused on reliability that we will have more details on in the coming months.
In summary
We place great importance on the reliability of our service and on the trust that our users place in us every day. We look forward to continuing to share more details of our journey and hope you can learn from our experiences along the way.