Post-Mortem of This Morning’s Outage
At 07:53 PDT this morning the site was hit with an abnormal number of SSH connections. The script that runs after an SSH connection is accepted makes an RPC call to the backend to check for the existence of the repository, so that we can display a nice error message if it is not present. The vast number of these calls arriving simultaneously caused delays in the backend that cascaded to the frontends, where the scripts piled up waiting for their RPC results. This, in turn, caused load to spike on the frontends, further exacerbating the problem. I removed the RPC call from the SSH script to prevent this bottleneck, and soon after the barrage of SSH connections ceased.
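To make the failure mode concrete, here is a minimal sketch of the kind of check the SSH script was performing. The RPC endpoint, client, method name, and timeout below are hypothetical placeholders rather than our actual code; the point is that every accepted connection blocks until the backend answers, so a slow backend causes these scripts to pile up on the frontends.

```python
# Minimal sketch of the repository-existence check the SSH script performed.
# The RPC endpoint, method name, and timeout are hypothetical placeholders.
import socket
import sys
import xmlrpc.client  # stand-in for whatever RPC stack the backend exposes

BACKEND_RPC_URL = "http://backend.internal:8000/"  # placeholder endpoint


def repository_exists(repo_path, timeout=2.0):
    """Ask the backend whether the repository exists.

    Each accepted SSH connection blocks here until the backend answers,
    which is why these scripts pile up when the backend slows down.
    """
    socket.setdefaulttimeout(timeout)  # crude timeout so a slow backend can't hang us forever
    backend = xmlrpc.client.ServerProxy(BACKEND_RPC_URL)
    return backend.repo_exists(repo_path)  # hypothetical remote method


if __name__ == "__main__":
    repo = sys.argv[1]
    if not repository_exists(repo):
        sys.stderr.write("ERROR: repository %s not found\n" % repo)
        sys.exit(1)
```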
Another, unrelated problem caused the outage to continue even after the SSH connection load returned to nominal levels. Last night I deployed some package upgrades to our RPC stack that had tested out fine in staging for two days. While debugging the SSH problem, I restarted the backend RPC servers to rule them out as the source of the problem. This was the first time these processes had been restarted since the package upgrades, as the upgrades were deemed backward compatible and staging had shown no problems in this regard. However, it appears that these restarts put the RPC servers into a broken state, and they began serving requests only sporadically. After failing to identify the problem within a short period, we decided to roll back to the previous known working state. Once the packages were rolled back and the daemons restarted, the site picked up and began operating normally.
Full site operation returned at 09:34 PDT (some sporadic uptime was seen during the outage).
Over the next week we will be doing several things:
- Further testing on staging to attempt to reproduce the behavior seen on production and resolve the underlying issue.
- Better SSH script logging to more quickly identify abnormal behavior.
- Working towards a more fine-grained rolling deploy of infrastructure packages to limit the impact of unforeseen problems (a rough sketch of that approach follows this list).
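As a rough illustration of what we mean by a fine-grained rolling deploy, the sketch below upgrades and restarts one host at a time and halts if a freshly restarted host fails its health check. The host names, deploy commands, service name, and health-check endpoint are placeholders and not our actual tooling.

```python
# Sketch of a one-host-at-a-time rolling deploy with a health check between
# hosts. Host list, deploy commands, and health-check URL are placeholders.
import subprocess
import time
import urllib.request

HOSTS = ["fe1.internal", "fe2.internal", "fe3.internal"]  # placeholder hosts
HEALTH_URL = "http://%s/status"                           # placeholder check


def healthy(host, retries=5, delay=10):
    """Poll the host's status endpoint until it responds or we give up."""
    for _ in range(retries):
        try:
            with urllib.request.urlopen(HEALTH_URL % host, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except Exception:
            pass
        time.sleep(delay)
    return False


def rolling_deploy(package):
    for host in HOSTS:
        # Upgrade and restart one host at a time (placeholder commands).
        subprocess.run(["ssh", host, "sudo", "apt-get", "install", "-y", package],
                       check=True)
        subprocess.run(["ssh", host, "sudo", "service", "rpc-backend", "restart"],
                       check=True)
        # Stop the deploy if the restarted host never comes back healthy.
        if not healthy(host):
            raise RuntimeError("halting deploy: %s failed its health check" % host)


if __name__ == "__main__":
    rolling_deploy("our-rpc-stack")
```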
On a positive note, the outage led me to identify the source of several subtle bugs that have been eluding our detection for a few weeks. We are all rapidly learning the quirks of our new architecture in a production environment, and every problem leads to a more robust system in the future. Thanks for your patience over the last month and during the coming months as we work to improve the GitHub experience on every level.