Behind the scenes of GitHub Token Scanning

Image of Patrick Toomey

Several years ago we started scanning all pushes to public repositories for GitHub OAuth tokens and personal access tokens. Now we’re extending this capability to include tokens from cloud service providers and additional credentials, such as unencrypted SSH private keys associated with a user’s GitHub account.

We live in amazing times for software development. Capabilities that were once only available to large technology companies are now accessible to the smallest of startups. Developers can leverage cloud services to quickly perform continuous integration testing, deploy their code to fully scalable infrastructure, accept credit card payments from customers, and nearly anything else you can imagine.

Composing cloud services like this is the norm going forward, but it comes with inherent security complexities. Each cloud service a developer typically uses requires one or more credentials, often in the form of API tokens. In the wrong hands, they can be used to access sensitive customer data—or vast computing resources for mining cryptocurrency, presenting significant risks to both users and cloud service providers.

diff --git a/config.yml b/config.yml
index 1110370..da7b5de 100644
--- a/config.yml
+++ b/config.yml
@@ -1,2 +1,2 @@
-  github_token: 4e862a8ec12ef0ad7c1337e8b16d98f4d764b8f6
+  github_token: <%= ENV["GITHUB_TOKEN"] %>

Fig 1: Your master branch might look innocent, but exposed credentials in your project’s history could prove costly.

Token Scanning 1.0

GitHub, and our users, aren’t immune to this problem. That is why we started scanning for GitHub OAuth tokens in public repositories several years ago. But, our users are not just GitHub customers. They are also customers of cloud infrastructure providers, cloud payment processing providers, and other cloud service providers that have become commonplace in modern development.

Our existing solution focused on GitHub OAuth tokens exclusively and was never designed for extensibility. The existing code leveraged hand-tuned assembly that was extremely fast at finding 40-hex character strings (the format of GitHub OAuth tokens). This bit of code was patched into Git and run inline whenever code was pushed to GitHub. It was an amazing piece of work, but could not support multiple credential formats. Our vision was to support all of the popular cloud service providers.

Token Scanning 2.0

So began our journey into the next generation of GitHub Token Scanning. The obvious path to a more extensible scanner is some form of regular expression support. However, scanning for credentials using typical regular expression libraries doesn’t scale from a performance perspective, as they optimize for a slightly different problem than the one we have.

The vast majority of regular expression libraries are designed to return the first match in a set of patterns. Given we have at least one pattern for each cloud service provider, we require all matches are returned and not just the first. The only way to ensure this with traditional libraries is to scan a given input once for each pattern. However, this increases the scan time dramatically for large repositories or large sets of patterns. Fortunately, scanning Git data for credentials is just a specific case of a general problem. For example, high-performance application-level firewalls similarly need to scan high-volume network traffic for sets of patterns to identify known viruses or malware. If you squint, scanning high-volume Git push data for credentials is a very similar problem.

Our research eventually lead us to a GitHub repository hosting the amazing Hyperscan library by Intel. This library is incredibly performant and provides exactly what we need. We will explore the technical details in more depth in a follow-up engineering post. But, in short, Hyperscan let us replace all of the assembly code patches to Git with a new standalone scanner, written in Go, that has scaled nicely.

In parallel with working on the implementation, we reached out to several cloud service providers we thought would be interested in testing out Token Scanning in a private beta. They were all enthusiastic to participate, as many of them had contacted us in the past looking for a solution to this widespread problem.

Since April, we’ve worked with cloud service providers in private beta to scan all changes to public repositories and public Gists for credentials (GitHub doesn’t scan private code). Each candidate credential is sent to the provider, including some basic metadata such as the repository name and the commit that introduced the credential. The provider can then validate the credential and decide if the credential should be revoked depending on the associated risks to the user or provider. Either way, the provider typically contacts the owner of the credential, letting them know what occurred and what action was taken.

Where we go from here

We have received amazing feedback from both providers and users during the private beta. Cloud service providers have told us that GitHub Token Scanning has been tremendously effective in helping them identify credentials before malicious users. And, while GitHub users were not aware that Token Scanning was in beta, we did take notice of a tweet from a GitHub user that included a number of enthusiastic exclamation marks and 🎉 emojis. This user was extremely grateful for having received a notification from a participating cloud service provider less than a minute after they had accidentally pushed a highly sensitive credential to a public repository.

During the beta we have scanned millions of public repository changes and identified millions of candidate credentials. As announced yesterday at GitHub Universe, Token Scanning is now in public beta, and supports an increasing number of cloud providers. We’re excited by the impact of Token Scanning today and have lots of ideas about how to make it even more powerful in the future. Dealing with credentials is an unavoidable part of modern development. With GitHub by your side, we hope to minimize the security impact of such accidents.

Catch early bird pricing for GitHub Satellite

Sync up with us and leading developers from around the world in Berlin, May 22-23, and get €100 off regular-priced tickets until April 11.

Get tickets

Join us at Maintainerati

Calling all maintainers: Unite at Maintainerati, a one-day unconference to gather, present, and discuss the day after GitHub Satellite.