GitHub Availability Report: October 2020
In October, we experienced one incident resulting in significant impact and degraded state of availability for multiple services.
Introduction
In October, we experienced one incident resulting in significant impact and degraded state of availability for issues, pull requests, webhooks, GitHub Actions, and GitHub Pages services.
October 9 21:30 UTC (lasting two hours and 32 minutes)
While reprovisioning ZooKeeper nodes as part of routine upgrades, new hosts were introduced too quickly. This resulted in the election of a second leader, effectively creating a logically distinct second ZooKeeper cluster where there should have been only one.
While the ZooKeeper hosts were in this state, a single Kafka broker in the cluster that powers our internal background job system connected to the newly formed second ZooKeeper cluster and elected itself as the Kafka controller. At this point, two distinct Kafka clusters were serving conflicting cluster state information to clients. This incorrect state caused write failures for approximately 10% of requests to our background job service, resulting in a backup of jobs as we migrated traffic and worker capacity to our secondary job processing system.
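One way to notice this kind of condition is to check that exactly one broker believes it is the controller. Kafka brokers expose an `ActiveControllerCount` metric, and the sum across a healthy cluster is 1. The sketch below is illustrative only and is not a description of our monitoring: the broker names are hypothetical, and it assumes the per-broker values have already been collected by some other means.

```python
from typing import Mapping


def controller_check(active_controller_count: Mapping[str, int]) -> str:
    """Classify cluster health from per-broker ActiveControllerCount values."""
    total = sum(active_controller_count.values())
    if total == 1:
        return "ok"
    if total == 0:
        return "no-controller"
    # More than one broker claims the controller role at the same time.
    return "multiple-controllers"


# Hypothetical sample: two brokers both report themselves as controller.
samples = {"broker-1": 1, "broker-2": 0, "broker-3": 1}
print(controller_check(samples))  # prints "multiple-controllers"
```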
No background jobs were lost during this incident. While we experienced significant queue backups for some systems, the retry behavior in our clients and the presence of redundant queueing systems mitigated their impact.
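The combination of client retries and a redundant queue can be sketched roughly as follows. This is a minimal illustration rather than our client code: `InMemoryQueue`, `QueueUnavailable`, and `enqueue_with_fallback` are hypothetical names, the queues are in-memory stand-ins, and a real client would also wait between attempts.

```python
from collections import deque


class QueueUnavailable(Exception):
    """Raised when a queue rejects a write."""


class InMemoryQueue:
    """Stand-in for a job queue that fails its first `fail_first` writes."""

    def __init__(self, name, fail_first=0):
        self.name = name
        self.jobs = deque()
        self._failures_left = fail_first

    def enqueue(self, job):
        if self._failures_left > 0:
            self._failures_left -= 1
            raise QueueUnavailable(self.name)
        self.jobs.append(job)


def enqueue_with_fallback(job, primary, secondary, max_attempts=3):
    """Try the primary queue up to max_attempts times, then use the secondary."""
    for _ in range(max_attempts):
        try:
            primary.enqueue(job)
            return primary.name
        except QueueUnavailable:
            continue  # a real client would back off before retrying
    secondary.enqueue(job)
    return secondary.name


primary = InMemoryQueue("primary", fail_first=5)
secondary = InMemoryQueue("secondary")
print(enqueue_with_fallback("job-1", primary, secondary))  # prints "secondary"
print(enqueue_with_fallback("job-2", primary, secondary))  # prints "primary"
```

In this sketch every job ends up in exactly one queue, which is the property that matters: a write failure delays a job but does not drop it.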
To avoid this class of failure in the future, we have updated our ZooKeeper provisioning checklist and plan to introduce automation for ZooKeeper and Kafka cluster maintenance.
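As an illustration of the kind of gate such automation might apply, the sketch below refuses to proceed to the next host unless the ensemble reports exactly one leader and the expected number of followers. It is not our tooling. It assumes each host's output from ZooKeeper's `mntr` command (tab-separated key/value lines, including `zk_server_state`) has already been fetched, and the function names and host names are hypothetical.

```python
def parse_server_state(mntr_output: str) -> str:
    """Extract zk_server_state (e.g. leader or follower) from mntr output."""
    for line in mntr_output.splitlines():
        key, _, value = line.partition("\t")
        if key.strip() == "zk_server_state":
            return value.strip()
    return "unknown"


def safe_to_add_next_host(states: dict, expected_size: int) -> bool:
    """Return True only if every expected host answered and exactly one leads."""
    leaders = [host for host, state in states.items() if state == "leader"]
    followers = [host for host, state in states.items() if state == "follower"]
    return (
        len(states) == expected_size
        and len(leaders) == 1
        and len(followers) == expected_size - 1
    )


# Hypothetical mntr output collected from three hosts; two claim leadership.
raw = {
    "zk-1": "zk_version\t3.4.0\nzk_server_state\tleader\n",
    "zk-2": "zk_version\t3.4.0\nzk_server_state\tfollower\n",
    "zk-3": "zk_version\t3.4.0\nzk_server_state\tleader\n",
}
states = {host: parse_server_state(text) for host, text in raw.items()}
print(safe_to_add_next_host(states, expected_size=3))  # prints "False"
```

Aggregating the view across all hosts matters here, because each half of a split ensemble can look healthy when inspected on its own.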
In summary
To learn more about what we are working on, check out our new Building GitHub blog series, which provides deep dives on how teams across the GitHub engineering organization identify and address opportunities to improve our internal development tooling and infrastructure.