DevOps success looks different for everyone. But like open source, sharing best practices helps us all build better software. In this Q&A, Nina Kaufman, Senior Software Engineer on GitHub’s Deploy Team, explains how automation ensures code gets deployed to github.com safely and reliably.
I’m a software engineer, but you could also call me an infrastructure engineer. Aside from myself, our team is made up of five other engineers, a manager, and one product manager. Day to day, our biggest goal is ensuring our teams across GitHub can deploy with a high velocity, safely and securely to github.com.
We support hundreds of engineers, as well as hundreds of applications that are being deployed 24/7. For github.com alone, we have between 120 and 150 deploys a week just to production and in the past week we shipped 421 pull requests within those deploys.
Ultimately, we push code to production on our own GitHub cloud platform, on our data centers, utilizing features provided by the GitHub UI and API along the way. The deployment process can be initiated with ChatOps, a series of Hubot commands. They enable us to automate all sorts of workflows and have a pretty simple interface for people to engage with in order to roll out their changes.
When folks have a change that they’d like to ship or deploy to github.com, they just need to run
.deploy with a link to their pull request and the system will automatically deconstruct what’s within that link, using GitHub’s API for understanding important details such as the required CI checks, authorization, and authentication. Once the deployment has progressed through a series of stages—which we will talk about in more detail later—you’re able to merge your pull request in GitHub, and from there you can continue on with your day, continue making improvements, and shipping features. The system will know exactly how to deploy it, which servers are involved, and what systems to run. The person running the command has no need to know that it’s all happening. Before any changes are made, we run a series of authentication processes to ensure a user even has the right access to run these commands.
After you hit
.deploy and your changes go through a series of stages, we use canary deployments (Canary) to gradually roll out and verify new functionality before sending the changes to full production. Canary is a smaller subset of our production hosts that contains a new change, so that if there’s a breaking change, not everyone will encounter the error. It would only be a very small percentage of servers.
During deployment, you can look at a series of dashboards to monitor for errors. You can drill in and see how your change is making an impact on users; if it’s increasing errors, or if it has engagement, things like that. You’re able to get a pretty good grasp on what you’re shipping and the impact it has during the deployment process since it rolls out changes a bit at a time, not all at once.
Our team has a set of service level objectives (SLOs) defined, so we have metrics that we measure our success for deployment time, local development setup time, and more. We also have developer satisfaction surveys that we conduct internally, as well as interviews to measure the way folks perceive deployments and ways to improve. We’re always looking back at those metrics to see where we can improve and make changes to our process.
One of the other things that we look at is the number of rollbacks that we do over a period of time, or how often we ship something that ends up breaking or not performing in the way we expect. We found that our rate of rollbacks was fairly low. Folks were generally shipping changes that look good, are successful, and perform as everyone expects and intends them to. In this case, we could shift to a culture of trust around saying, “Hey, developers know what they’re shipping. It’s been tested. They’re going to make sure that things work.” Then, generally it can progress and go to a full rollout with minimal intervention, if any at all. At this point, the people who are shipping can just merge their changes.
Along with automation, what impact does this “culture of trust” have on the organization and the developer experience?
We’re doing the same number of deployments as we did several years ago, but with our adoption of batched changes, increased automation, and canary deploys, we’ve actually increased throughput. Previously, we deployed around 150 deployments a week. Now we batch a lot more changes together than we used to, so while we’re still doing roughly the same amount of deployments, we’re shipping more changes on each deploy.
There’s less wait time to get your changes out, and it also means that you have a greater confidence level that your changes are going to play nicely with other people’s changes. Since we’re shipping hundreds of pull requests—over 400 in a single week—you want to make sure that your changes are compatible with as many others at a given time as possible, and we don’t want folks to wait hours in line to be able to ship a change.
What DevOps best practices or advice would you recommend to teams who want to improve their process, workflow, or developer experience?
I would recommend treating infrastructure as a product and treating internal users as if they were external. Implement developer surveys, satisfaction scores, and interview people across the organization to see what pain points they have, or what a day in their life would look like. Having the empathy to understand the problems of other engineers within the organization can definitely improve the product. Sometimes as a result of surveys or interviews, you might find responses that are surprising or hard to hear, but it ultimately makes for a much better product.
We use an NSAT score to measure our satisfaction internally. From May 2020 to September 2020, we committed to raising that score even more. We still have many improvements to make, but this took an entire company-wide effort. It brought greater harmony between our feature teams and our infrastructure teams, so we integrated folks on the user-facing teams and we brought them in with infrastructure. We asked, “How would you like to improve the deploy interface and how would you like to see these things happen?” Even though those folks don’t do infrastructure day to day, we gained valuable insights from their experience with customer-facing products. We were able to work together to make UI/UX changes that improved shipping velocity, reduced support hours spent debugging deploy-related issues, and ultimately increased developer satisfaction.
Another thing that’s really useful is unifying and simplifying the tools and processes that you have so that folks don’t need to worry about finding them. At GitHub, we rely on GitHub itself for everything. We use it for our authentication and for ensuring that folks have access to the right things to be able to run certain commands. We use project management on GitHub, GitHub Actions, and all sorts of different tools. Going back to the idea of treating it like an external product, it’s important to have discoverable documentation for any internal APIs, dedicated support channels, and first-class customer support in the same way you would have for anyone outside the company.
At the end of the day, successful DevOps comes down to people, not processes. How does your team stay connected?
GitHub is a global company. We have folks on my team in Berlin, Vancouver, and all over the US. Going out of our way to come together and keep up with one another is really important because the team camaraderie and overall gratification of working with one another will push and propel a lot of the features forward more quickly than they would otherwise. We do bi-weekly coffee chats, book clubs, and art chats where people share their projects to stay connected.
Sometimes I wake up and check notifications on my GitHub issues, and it feels like waking up on my birthday. If I ran into a problem and documented it, someone in Berlin will take it and fix it while I’m asleep. To me, it doesn’t feel like I’m isolated because there’s always an ongoing conversation. There are always updates to see when I wake up. We’re always deploying GitHub, always improving it.