Introducing the GHES repository cache
If you’re a GHES customer with heavy read traffic on your monorepo, check out the repository cache, especially if you have CI workloads distributed around the world.
Continuous integration (CI) runners often drive the lion’s share of load on a Git server. They can build everything from merges to main, pull requests, or even every single commit. Any of these mean frequent clones and pulls. A large enough farm of CI runners may cause slowdowns for users of Git. A related problem for large organizations is geo-distribution, such as a CI farm in Europe pulling from a Git server in North America. Each runner has to pull each change over the same high-latency, possibly expensive, link. It can be cheaper, faster, and even more reliable if a single machine pulls the changes from North America and then distributes them to the runners locally in Europe.
Since Git traffic isn’t suitable for classic content delivery networks, and geo-replicas can cause slowdowns for developers trying to push data, large organizations need something different. Enter the GitHub Enterprise Server repository cache, now available in public beta.
The repository cache is an eventually-consistent replica of your Git data and is updated by a background job. This offers the data locality and convenience of geo-replicas, but doesn’t participate in the initial receipt of Git data, meaning that developer push workflows aren’t affected.
The other advantage repository caches offer is selective replication: you choose exactly which repositories are replicated to specific caches. Several enterprise customers expressed a need for strict data residency. There are repositories that should be replicated elsewhere, and repositories that should remain solely in their home location. Replication policies allow you to identify these repositories and honor these restrictions.
There are a few tradeoffs when using a repository cache. Repository caches are purely read-only; all writes have to go to the primary. This is why we recommend that only CI farms use repository caches, and that developers continue to work with the primary. We’ve detailed other strategies for speeding up developer workflows with large monorepos, and those strategies are compatible with offloading CI traffic to repository caches. Also, it can take up to several minutes for new Git data to arrive at the cache. Readers must implement a backoff-and-retry strategy for commits which they expect but don’t find immediately.
This feature shipped in beta with GHES 3.3 and has been improved in GHES 3.4. There are a handful of known issues, listed below, but if these aren’t showstoppers for you, we’d love to have you try it out and share feedback with us. Get started by visiting the documentation.
Known cache server issues in GHES 3.3 and 3.4:
- LFS files are not replicated. Instead, LFS acts as a proxy and will stream the content from the primary instance. If you make heavy use of LFS, then you may not reap much benefit from the cache server in its current state. LFS caching will be implemented in a future release of the cache server.
- Repositories which aren’t allowed by replication policy may still be streamed through the cache server. It’s important to note that the cache server is not an authorization boundary. If a user with access to a repository attempts to access it through a cache server, the cache server will proxy the data, but will not store the data. This behavior will be changed in a future release of the cache server.
Tags:
Written by
Related posts
GitHub Availability Report: December 2024
In December, we experienced two incidents that resulted in degraded performance across GitHub services.
Inside the research: How GitHub Copilot impacts the nature of work for open source maintainers
An interview with economic researchers analyzing the causal effect of GitHub Copilot on how open source maintainers work.
OpenAI’s latest o1 model now available in GitHub Copilot and GitHub Models
The December 17 release of OpenAI’s o1 model is now available in GitHub Copilot and GitHub Models, bringing advanced coding capabilities to your workflows.