How the Grafana Alerting team scales their issue management with GitHub Projects
Hear from Grafana Labs' Armand Grillet about how his team uses GitHub Projects.
Introduction In August, we experienced no incidents resulting in service downtime. This month's GitHub Availability Report will dive into updates to the GitHub Status Page and provide follow-up details on…
In August, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will dive into updates to the GitHub Status Page and provide follow-up details on how we’ve addressed the incident mentioned in July’s report.
At GitHub, we are always striving to be more transparent and clear with our users. We know our customers trust us to communicate the current availability of our services and we are now making that more clear for everyone to see. Previously, our status page shared a 90-day history of GitHub’s availability by service, but this history can distract users from what’s happening actively, which during an incident is the most important piece of information. Starting today, the status page will display our current availability and inform users of any degradation in service with real-time updates from the team. The history view will continue to be available and can be found under the “incident history” link. As a reminder, if you want to receive updates on any status changes, you can subscribe to get notifications whenever GitHub creates, updates, or resolves an incident.
Check out the new GitHub Status Page.
Since the incident mentioned in July’s GitHub Availability Report, we’ve worked on a number of improvements to both our deployment tooling and to the way we configure our Kubernetes deployments, with the goal of improving the reliability posture of our systems.
First, we audited all the Kubernetes deployments used in production to remove all usages of the ImagePullPolicy of Always configuration.
Our philosophy when dealing with Kubernetes configuration is to make it easy for internal users to ship their code to production while continuing to follow best practices. For this reason, we implemented a change that automatically replaces the ImagePullPolicy of Always setting in all our Kubernetes-deployed applications, while still allowing experienced users with particular needs to opt out of this automation.
Second, we implemented a mechanism equivalent to the one of Kubernetes mutating admission controllers that we use to inject the latest version of sidecar containers, identified by the SHA256 digest of the image.
These changes allowed us to remove the strong coupling between Kubernetes Pods and the availability of our Docker registry in case of container restarts. We have more improvements in the pipeline that will help further increase the resilience of our Kubernetes deployments and we plan to share more information about those in the future.
We’re excited to be able to share these updates with you and look forward to future updates as we continue our efforts in making GitHub more resilient every day.