Keeping your data pipelines healthy with the Great Expectations GitHub Action
This post is the second in our series on using GitHub for MLOps and data science. Just joining in? Get started with part one.
Most continuous integration (CI) tools only focus on providing ways to help validate code. However, data professionals—particularly data engineers, data scientists, and machine learning engineers—spend a large portion of their time cleaning data and maintaining data pipelines.
In this post, we show how to use GitHub Actions together with the open source project Great Expectations to automatically test, document, and profile your data pipelines as part of your traditional CI workflows. Checking data in this way can save data teams time and help protect the analytic integrity of their data. We’ll also show how to generate dashboards that give you insight into data problems as part of your CI process. Below is a demonstration of this at work, triggered by a change to a SQL query in a pull request that causes a data integrity issue:
In the above example, a GitHub Actions workflow is triggered by a change to a SQL file in a pull request. In response, the proposed SQL query from the pull request is run against a development database, and the Great Expectations GitHub Action validates the results. In this case the validation fails, which triggers GitHub Actions to comment on the pull request with a link to a data validation dashboard. The dashboard lists each place where an expectation diverges from the observed value(s), shown under the labels “Expectations” and “Observed Value”, respectively.
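The flow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Great Expectations Action itself: an in-memory SQLite database stands in for the development database, and the tables, the query, and the helper `validate_not_null` are all hypothetical.

```python
import sqlite3


def validate_not_null(rows, column):
    """Compare a 'no nulls' expectation with what the query actually returned."""
    null_count = sum(1 for row in rows if row[column] is None)
    return {
        "expectation": f"{column} has no null values",
        "observed_value": f"{null_count} null value(s) in {len(rows)} row(s)",
        "success": null_count == 0,
    }


# Hypothetical stand-in for the development database.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (id INTEGER, email TEXT);
    INSERT INTO orders VALUES (1, 10), (2, 11), (3, 12);
    INSERT INTO customers VALUES (10, 'a@example.com'), (11, 'b@example.com');
""")

# Hypothetical query proposed in the pull request. The LEFT JOIN leaves
# order 3 without an email because customer 12 does not exist.
proposed_query = """
    SELECT o.id AS order_id, c.email AS email
    FROM orders o LEFT JOIN customers c ON c.id = o.customer_id
"""
rows = conn.execute(proposed_query).fetchall()

result = validate_not_null(rows, "email")
if not result["success"]:
    print(f"FAILED: expected {result['expectation']}; "
          f"observed {result['observed_value']}")
```

In a CI job, a failed result like this would typically end the step with a non-zero exit code so that the pull request is flagged before the query reaches production data.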
Great Expectations can automatically create expectations, which act as unit tests for your data, by profiling the data itself. Users can also add expectations manually using a declarative API. Many people take a hybrid approach: they let Great Expectations bootstrap an initial set of rules and iterate from there.
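To make the hybrid approach concrete, here is a minimal sketch in plain Python. It does not call the Great Expectations library: the functions `profile_column` and `check`, the sample data, and the dictionary layout are hypothetical stand-ins, and only the expectation names are borrowed from Great Expectations' naming style. A toy profiler bootstraps rules from historical data, one rule is then adjusted by hand, and the resulting suite is checked against a new batch.

```python
def profile_column(rows, column):
    """Toy profiler: derive expectations from the data that was observed."""
    values = [row[column] for row in rows if row[column] is not None]
    expectations = [{
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {"column": column,
                   "min_value": min(values), "max_value": max(values)},
    }]
    # Only expect "no nulls" if the profiled data contained none.
    if len(values) == len(rows):
        expectations.append({
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": column},
        })
    return expectations


def check(expectation, rows):
    """Evaluate one declarative expectation against a batch of rows."""
    kwargs = expectation["kwargs"]
    values = [row[kwargs["column"]] for row in rows]
    kind = expectation["expectation_type"]
    if kind == "expect_column_values_to_not_be_null":
        return all(v is not None for v in values)
    if kind == "expect_column_values_to_be_between":
        return all(kwargs["min_value"] <= v <= kwargs["max_value"]
                   for v in values if v is not None)
    raise ValueError(f"unknown expectation: {kind}")


# Step 1: bootstrap a suite from historical data (profiled range is 23..41).
historical = [{"age": 23}, {"age": 41}, {"age": 35}]
suite = profile_column(historical, "age")

# Step 2: iterate by hand, widening the range to a rule we actually believe.
suite[0]["kwargs"].update(min_value=0, max_value=120)

# Step 3: validate a new batch against the edited suite.
new_batch = [{"age": 19}, {"age": 67}, {"age": None}]
results = {e["expectation_type"]: check(e, new_batch) for e in suite}
```

Keeping expectations as data rather than as ad hoc assertions is what lets a tool generate them, store them alongside the code, and render them in a report.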
Great Expectations can also connect to a variety of external data sources such as S3, GCS, and Azure Blob Storage, as well as a large number of databases. To learn more about how to configure Great Expectations with your project, take a look at the documentation.
Try it yourself
To make it easy to use Great Expectations in your GitHub Actions workflow, we have partnered with the Great Expectations team to create a Great Expectations GitHub Action. Head over to the Great Expectations Action repository to learn more about getting started.
Using GitHub for MLOps and data science
This blog post is the latest in a series on how to use GitHub for machine learning operations (MLOps) and data science. We have an active machine learning and data science community on GitHub, and we want to highlight the ways our products can be useful for this community, too. For example, GitHub Actions doesn’t just do CI/CD; it also provides powerful and flexible automation for ML engineers and data scientists. For more, check out our MLOps page, which includes links to blog posts, GitHub Actions, talks, and examples that are relevant to this topic.
Additional resources
Looking for more on CI, Great Expectations, or GitHub Actions? Visit the MLOps page mentioned above, or check out these helpful resources:
- GitHub Actions and CI/CD best practices, cheat sheets, and videos
- GitHub Actions official documentation
- Documentation for self-hosted runners, which could be useful if your data is only accessible within a private network
- Great Expectations Action repository
- Great Expectations project repository
- Great Expectations website