Extending GitOps to reliability-as-code with GitHub and StackPulse

GitHub Partnerships

Updated on February 4, 2022

This is a partner post by Leonid Belkind, the Co-Founder and CTO at StackPulse.

Over the past decade, engineering-led practices have replaced traditional IT operations across the software development lifecycle. This has allowed developers to build, test and deploy software services faster than was ever possible before. Consider the following examples:

In each instance, a mostly manual process has been replaced with automation, removing bottlenecks and increasing deployment velocity. But when it comes to ensuring the reliability of software services in production, the process is still predominantly manual. Even in the most technologically advanced organizations, when a production monitoring system raises an alert, what follows is a mostly manual process of going through the following steps:

Additionally, throughout the above steps, the team must keep communicating the current state and the planned steps and timelines to relevant stakeholders, invite additional stakeholders to participate in the process, and so on.

Turning operations into code

StackPulse is a platform that allows defining the above processes and steps as code, in much the same way that testing or deployment procedures are defined as code. This code, following software engineering best practices, can be modular and generic, with clear interfaces between modules, so its developers can model operational processes the same way they model business logic. The operational process can then be subjected to versioning, testing and review cycles, ensuring that it “delivers on its promise” when executed in production.

Let’s take a concrete example of an operational process. Every time we receive an alert that a database server is responding to requests more slowly than usual, we would start handling it by taking the following actions:

Identify which database server originated the alert.
Check whether all database requests are equally slow or only specific ones, and group them by source, data type, operation, etc.
Check whether the services impacted by the database slowness are failing to deliver their responses as expected, and which business processes are affected.
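
As a rough illustration of the second check, the sketch below groups request records by a chosen attribute and keeps the groups whose mean latency exceeds a threshold. The record fields (`source`, `operation`, `latency_ms`), the function name `slow_groups` and the threshold value are assumptions made for this example and are not part of StackPulse.

```python
from collections import defaultdict
from statistics import mean


def slow_groups(requests, key, threshold_ms):
    """Return {group: mean latency} for groups whose mean exceeds threshold_ms.

    `requests` is a list of dicts and `key` names the field to group by.
    All field names here are hypothetical.
    """
    by_group = defaultdict(list)
    for request in requests:
        by_group[request[key]].append(request["latency_ms"])
    return {
        group: mean(latencies)
        for group, latencies in by_group.items()
        if mean(latencies) > threshold_ms
    }


# Hypothetical sample data: only "report" queries are slow.
sample = [
    {"source": "billing", "operation": "report", "latency_ms": 900},
    {"source": "billing", "operation": "lookup", "latency_ms": 20},
    {"source": "checkout", "operation": "report", "latency_ms": 1100},
    {"source": "checkout", "operation": "lookup", "latency_ms": 30},
]

slow_by_operation = slow_groups(sample, "operation", threshold_ms=600)
slow_by_source = slow_groups(sample, "source", threshold_ms=600)
```

With this sample, grouping by operation isolates the slow `report` queries, while grouping by source flags nothing, which suggests the slowness is tied to a type of request rather than to a particular caller.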

Similarly, we could define the subsequent steps of remediating and resolving various issues, creating well-defined processes both for generic remediations and for handling very specific situations.

We could then describe the above-mentioned steps as a process defined along the following lines:

apiVersion: stackpulse.io/v1
kind: Playbook
metadata:
  name: enrich-and-diagnose-database-alerts
  description: This playbook enriches alerts related to database services and tries to identify most common sources of possible problems
parameters:
  - name: database_server
    type: var
    description: Address of the database server

steps:

  # Retrieve general information about the database server
  - name: stackpulse/general_db_info
    id: get_db_info
    env:
      DATABASE: '{{ $.params.database_server }}'
      AUTH: '{{ secret "database_access_key" }}'

  # Get a breakdown of database requests latency for the past 30 minutes
  - name: stackpulse/db_requests_latency_analysis
    id: get_db_latency
    env:
      DATABASE: '{{ $.params.database_server }}'
      AUTH: '{{ secret "database_access_key" }}'
      TIME: "30m"

* * *

The above code is an expression of an operational process, consisting of well-defined steps each receiving arguments and performing a certain action. In this case, the actions are taken during the enrichment phase of the incident, but the principle is not limited in any way to this stage.
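
To make the idea of steps that receive arguments and perform an action more concrete, here is a minimal Python sketch of how such a definition could be executed. It is not StackPulse's implementation: the `Step` class, the `run_playbook` function and the simplified handling of `{{ $.params.name }}` placeholders are assumptions made purely for illustration, and the actions are stand-ins that only format a string.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Matches placeholders of the form {{ $.params.some_name }}
PARAM_PATTERN = re.compile(r"\{\{\s*\$\.params\.(\w+)\s*\}\}")


@dataclass
class Step:
    """One unit of an operational process (hypothetical model)."""
    id: str
    action: Callable[[Dict[str, str]], str]
    env: Dict[str, str] = field(default_factory=dict)


def render(value: str, params: Dict[str, str]) -> str:
    """Replace each {{ $.params.x }} placeholder with params['x']."""
    return PARAM_PATTERN.sub(lambda match: params[match.group(1)], value)


def run_playbook(steps: List[Step], params: Dict[str, str]) -> Dict[str, str]:
    """Run steps in order and collect each step's output under its id."""
    outputs: Dict[str, str] = {}
    for step in steps:
        env = {name: render(value, params) for name, value in step.env.items()}
        outputs[step.id] = step.action(env)
    return outputs


# Stand-in actions; real steps would query the database server.
steps = [
    Step("get_db_info",
         lambda env: f"info for {env['DATABASE']}",
         {"DATABASE": "{{ $.params.database_server }}"}),
    Step("get_db_latency",
         lambda env: f"latency over {env['TIME']} for {env['DATABASE']}",
         {"DATABASE": "{{ $.params.database_server }}", "TIME": "30m"}),
]

results = run_playbook(steps, {"database_server": "db-1.example.internal"})
```

Because the server address arrives as a parameter rather than being written into the steps, the same definition can be run against a different database by changing only the input.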

When building a platform for expressing processes as code, it was important for us to maintain a few basic principles:

Building such processes as code should be very easy and should allow focusing on the logic of the process
Expressing logical connections between various steps should be possible, in order to translate multi-step processes into their as-code equivalents
The result should be portable between various environments to allow reuse

By turning the definition of our operational processes into a specialized version of software, we can now subject it to the same software development lifecycle principles already adopted by the organization to ensure high quality and efficient delivery.

Enabling SDLC for reliability

Once our processes become another form of software code, the infrastructure already in place for managing other forms of software can be leveraged so that developing, maintaining and reviewing them becomes yet another pillar of software development.

For example, every GitHub repository containing business logic modules, in addition to containing the automated tests and deployment logic for these modules, can now add the representation of enrichment, diagnostics and remediation processes required to operationalize the relevant module.

These processes can rely on generic components provided as shared infrastructure or be completely unique. They can be modified in branches, with merges into production taking the form of pull requests and subsequent GitHub Actions pipelines. Versions of the processes can be clearly tracked via the repositories, requests for changes can be submitted via issues, and the pace of change can be seen via Insights, just as with any other form of software code.
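
One check that such a pipeline could run on every pull request is a structural validation of the process definitions. The sketch below is a hypothetical example in Python: the required fields are inferred from the sample playbook earlier in this post rather than taken from a published StackPulse schema, and it assumes the definition has already been parsed into a dictionary.

```python
def validate_playbook(playbook):
    """Return a list of problems found in a parsed playbook dictionary.

    The rules are assumptions based on the sample playbook, not an
    official schema.
    """
    problems = []
    for key in ("apiVersion", "kind", "metadata", "steps"):
        if key not in playbook:
            problems.append(f"missing top-level field: {key}")

    seen_ids = set()
    for index, step in enumerate(playbook.get("steps", [])):
        if "name" not in step:
            problems.append(f"step {index} has no name")
        step_id = step.get("id")
        if step_id is None:
            problems.append(f"step {index} has no id")
        elif step_id in seen_ids:
            problems.append(f"duplicate step id: {step_id}")
        else:
            seen_ids.add(step_id)
    return problems


good = {
    "apiVersion": "stackpulse.io/v1",
    "kind": "Playbook",
    "metadata": {"name": "enrich-and-diagnose-database-alerts"},
    "steps": [
        {"name": "stackpulse/general_db_info", "id": "get_db_info"},
        {"name": "stackpulse/db_requests_latency_analysis", "id": "get_db_latency"},
    ],
}
# Deliberately broken: missing fields, a nameless step and a repeated id.
bad = {"kind": "Playbook", "steps": [{"id": "a"}, {"name": "x", "id": "a"}]}

good_problems = validate_playbook(good)
bad_problems = validate_playbook(bad)
```

In a pipeline, a non-empty result would fail the check and block the merge, so a malformed process definition is caught in review rather than during an incident.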

Some of these processes may remain private, whereas others, related to operationalizing common open source components or libraries, can be shared with peers through public repositories, enabling collaboration and driving operational excellence not only within a single organization but across the whole industry.

To sum up, turning operational processes into code allows developers to leverage the practices of agile development, continuous integration/deployment and GitOps to manage the processes that ensure the reliability of their services. This is the next evolutionary step in adopting the “You build it, you run it” principle and in meeting SLOs.

Sample processes can be found in the StackPulse public playbooks repository
GitHub Action enabling GitOps flows can be found in the GitHub Actions Marketplace
Get started with the free edition of StackPulse today
