GitHub Availability Report: July 2020
Last month we introduced GitHub’s monthly availability report to address service disruptions and share our learnings with the community.
In July we experienced one specific incident resulting in a degraded state of availability for GitHub.com. We’d like to share our learnings from this incident with the community in the spirit of being transparent about our service disruptions, and helping other services improve their own operations.
July 13 08:18 UTC (lasting for four hours, 25 minutes)
The incident started when our production Kubernetes Pods started getting marked as unavailable. This cascaded through our clusters resulting in a reduction in capacity, which ultimately brought down our services. Investigation into the Pods revealed that a single container within the Pod was exceeding its defined memory limits and being terminated. Even though that container is not required for production traffic to be processed, the nature of Kubernetes requires that all containers be healthy for a Pod to be marked as available.
Normally when a Pod runs into this failure mode, the cluster will recover within a minute or so. In this case, the container in the Pod was configured with an ImagePullPolicy of Always, which instructed Kubernetes to fetch a new container image every time. However, due to a routine DNS maintenance operation that had been completed earlier, our clusters were unable to successfully reach our registry resulting in Pods failing to start. This issue impact was increased when a redeploy was triggered in an attempt to mitigate, and we saw the failure start to propagate across our production clusters. It wasn’t until we restarted the process with the cached DNS records that we were able to successfully fetch container images, redeploy, and recover our services.
Moving forward, we’ve identified a number of areas to address this quarter:
- Enhancing monitoring ensuring Pod restarts would not fail again based on this same pattern
- Minimizing our dependency on the image registry
- Expanding validation during DNS changes
- Reevaluating all the existing Kubernetes deployment policies
In parallel, we have an ongoing workstream to improve our approach to progressive deployments that will provide the ability to carefully evaluate the impact of deployments in a more incremental fashion. This is part of a broader engineering initiative focused on reliability that we will have more details on in the coming months.
In summary
We place great importance in the reliability of our service along with the trust that our users place in us every day. We look forward to continuing to share more details of our journey and hope you can learn from our experiences along the way.
Tags:
Written by
Related posts
Announcing GitHub Secure Open Source Fund: Help secure the open source ecosystem for everyone
Applications for the new GitHub Secure Open Source Fund are now open! Applications will be reviewed on a rolling basis until they close on January 7 at 11:59 pm PT. Programming and funding will begin in early 2025.
Software is a team sport: Building the future of software development together
Microsoft and GitHub are committed to empowering developers around the world to innovate, collaborate, and create solutions that’ll shape the next generation of technology.
Does GitHub Copilot improve code quality? Here’s what the data says
Findings in our latest study show that the quality of code written with GitHub Copilot is significantly more functional, readable, reliable, maintainable, and concise.