Keeping your data pipelines healthy with the Great Expectations GitHub Action
This post is the second in our series on using GitHub for MLOps and data science. Just joining in? Get started with part one.
Most continuous integration (CI) tools only focus on providing ways to help validate code. However, data professionals—particularly data engineers, data scientists, and machine learning engineers—spend a large portion of their time cleaning data and maintaining data pipelines.
In this post, we show you how you can use GitHub Actions together with the open source project Great Expectations to automatically test, document, and profile your data pipelines as part of your traditional CI workflows. Checking data in this way can help data teams save time and promote analytic integrity of their data. We’ll also show how to generate dashboards that give you insight into data problems as part of your CI process. Below is a demonstration of this at work, triggered by a change to a SQL query in a pull request that causes a data integrity issue:
In the above example, a GitHub Actions workflow is triggered by a change to a SQL file in a pull request. In response, the proposed SQL query from the pull request is run against a development database, and the Great Expectations GitHub Action validates the results. The validation fails, so GitHub Actions comments on the pull request with a link to a data validation dashboard. The dashboard lists each place where the observed values diverge from the expectations, labeled “Observed Value” and “Expectations”, respectively.
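The flow above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Great Expectations API or the Action's implementation: the in-memory SQLite database stands in for a development database, and the `orders` table, the two queries, and the helper names (`run_query`, `validate`, `pr_comment`) are all assumptions made up for this example.

```python
import sqlite3


def count_nulls(values):
    """Metric: how many values in a column are NULL."""
    return sum(1 for v in values if v is None)


# Hypothetical expectations: (description, column, metric function, expected value).
EXPECTATIONS = [
    ("amount is never null", "amount", count_nulls, 0),
    ("order_id is never null", "order_id", count_nulls, 0),
]


def make_dev_database():
    """Build a throwaway development database with one bad row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 9.5), (2, None), (3, 20.0)],
    )
    return conn


def run_query(conn, sql):
    """Run a query and return its rows as dictionaries keyed by column name."""
    cursor = conn.execute(sql)
    columns = [d[0] for d in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def validate(rows):
    """Compare the observed value of each metric with its expected value."""
    results = []
    for description, column, metric, expected in EXPECTATIONS:
        observed = metric([row[column] for row in rows])
        results.append({
            "expectation": description,
            "expected": expected,
            "observed": observed,
            "success": observed == expected,
        })
    return results


def pr_comment(results):
    """Return the text of a pull request comment, or None if everything passed."""
    failures = [r for r in results if not r["success"]]
    if not failures:
        return None
    lines = ["Data validation failed:"]
    for r in failures:
        lines.append(
            f"- {r['expectation']}: expected {r['expected']}, observed {r['observed']}"
        )
    return "\n".join(lines)


conn = make_dev_database()
# The query on the main branch filters out incomplete rows; the proposed change drops the filter.
current_sql = "SELECT order_id, amount FROM orders WHERE amount IS NOT NULL"
proposed_sql = "SELECT order_id, amount FROM orders"

current_results = validate(run_query(conn, current_sql))
proposed_results = validate(run_query(conn, proposed_sql))

print(pr_comment(current_results))
print(pr_comment(proposed_results))
```

In a real workflow the comment would be posted by GitHub Actions and would link to the generated dashboard rather than inline the results; the sketch only shows why a small SQL change can turn a passing validation into a failing one.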
Great Expectations can automatically create expectations, which act as unit tests for your data, by profiling that data. Users can also add expectations manually using a declarative API. Many people take a hybrid approach: they let Great Expectations bootstrap an initial set of rules and iterate from there.
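To make the hybrid approach concrete, here is a minimal sketch in plain Python. It does not use the Great Expectations library, and every name in it (`profile`, `check`, the dictionary-based expectation format, the sample columns) is a hypothetical stand-in chosen for illustration. It profiles a trusted sample to bootstrap rules, then adds and adjusts rules by hand.

```python
def profile(rows):
    """Bootstrap expectations from a trusted sample of rows (list of dicts)."""
    expectations = []
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        # If the sample has no missing values, expect that to stay true.
        if all(v is not None for v in values):
            expectations.append({"type": "not_null", "column": column})
        # If every value is numeric, expect future values within the observed range.
        numbers = [
            v for v in values
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        if numbers and len(numbers) == len(values):
            expectations.append({
                "type": "between",
                "column": column,
                "min": min(numbers),
                "max": max(numbers),
            })
    return expectations


def check(expectation, rows):
    """Return True if every row satisfies the declarative expectation."""
    values = [row[expectation["column"]] for row in rows]
    kind = expectation["type"]
    if kind == "not_null":
        return all(v is not None for v in values)
    if kind == "between":
        return all(
            v is not None and expectation["min"] <= v <= expectation["max"]
            for v in values
        )
    if kind == "in_set":
        return all(v in expectation["values"] for v in values)
    raise ValueError(f"unknown expectation type: {kind}")


sample = [
    {"age": 31, "country": "DE"},
    {"age": 45, "country": "US"},
]

# Step 1: let the profiler propose an initial suite.
suite = profile(sample)

# Step 2: add a rule by hand that the profiler could not infer.
suite.append({"type": "in_set", "column": "country", "values": {"DE", "US", "FR"}})

# Step 3: iterate, here by widening a profiled range that is too tight.
for expectation in suite:
    if expectation["type"] == "between" and expectation["column"] == "age":
        expectation["min"], expectation["max"] = 0, 120

new_batch = [
    {"age": 52, "country": "FR"},
    {"age": 29, "country": "XX"},
]
failed = [e for e in suite if not check(e, new_batch)]
print(failed)
```

Because each rule is data rather than code, the profiled rules and the hand-written ones live in the same suite and are checked the same way, which is what makes iterating on a bootstrapped suite practical.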
Great Expectations can also connect to a variety of external data sources such as S3, GCS, and Azure Blob Storage, as well as a large number of databases. To learn more about how to configure Great Expectations with your project, take a look at the documentation.
To make it easy to use Great Expectations in your GitHub Actions workflow, we have partnered with the Great Expectations team to create a Great Expectations GitHub Action. Head over to the Great Expectations Action repository to learn more about getting started.
This blog post is the latest in a series on how to use GitHub for machine learning ops (MLOps) and data science. We have an active machine learning and data science community on GitHub, and we want to highlight the ways our products can be useful for this community, too. For example, GitHub Actions doesn’t just do CI/CD; it provides powerful and flexible automation for ML engineers and data scientists. For more, check out our MLOps page, which includes links to blog posts, GitHub Actions, talks, and examples that are relevant to this topic.
Looking for more on CI, Great Expectations, or GitHub Actions? Visit our MLOps page above, or check out these helpful resources:
- GitHub Actions and CI/CD best practices, cheat sheets, and videos
- GitHub Actions official documentation
- Documentation for self-hosted runners, which could be useful if your data is only accessible within a private network
- Great Expectations Action repository
- Great Expectations project repository
- Great Expectations website