How the Grafana Alerting team scales their issue management with GitHub Projects
Hear from Grafana Labs' Armand Grillet about how his team uses GitHub Projects.
In May, we experienced two incidents resulting in significant impact to multiple GitHub services.
In May, we experienced two incidents resulting in significant impact and degraded state of availability for API requests, GitHub Pages, GitHub Actions and the GitHub Packages service, specifically the GitHub Packages Container registry service.
This incident was caused by failures in an underlying MySQL database, which caused some operations to time out for the GitHub Container registry service. During this incident, some customers viewing packages in the UI or interacting with the registry through “docker push” and “docker pull” may have experienced failures as the engineering team investigated the incident. After performing a failover to one of our database replicas, the affected systems were properly restored.
Our internal engineering team is now prioritizing work that will help ensure reduced impact to customers should such underlying outages happen again. This work includes creating internal documentation, dashboards, and enhanced alerts to quickly triage the cause of operation failures. We will also continue to actively maintain and increase replicas in different regions and availability zones that serve as a line of defense against unexpected region outages.
This incident was caused when a foreign key for scoped tokens exceeded max INT32, which resulted in high failure rates for GitHub Actions and GitHub Pages. It also prevented some access to operations against the GitHub API and low-level git commands, such as “push” and “pull”, using scoped tokens. We mitigated this with a long-running schema migration to change the foreign key to INT64.
Once the foreign key migration was successful, the internal engineering teams then worked to slowly remove token records stored in our cache layer that were considered invalid. After these cached records were removed, newly created API tokens were able to generate new records and API calls resumed working as expected.
Alerting and linting are already in place to help prevent integer overflows in the database. Unfortunately, these mechanisms were not sufficient in this case due to it being a foreign key that predated our linting. In response, we are manually auditing all our INT32 columns and investigating further improvements to our automation to help prevent these types of issues moving forward.
Given the nature of this overflow, only a single GitHub Action used on a single repository received unauthorized access grants for a short period of time. We revoked these grants and confirmed that no unauthorized access was gained through the use of this Action in this repository.
Our internal engineering teams are actively working on reducing the impact and likelihood of this class of issue happening in the future. This work includes tooling to prevent database inconsistencies and improved alerting to allow faster remediation.
From our open source release of the GitHub Artifact Exporter to our adoption of OpenTelemetry SDKs, you can learn more about what we are working on to improve our internal development tooling and infrastructure in the GitHub Engineering Blog.