Counting Objects

The Systems Team at GitHub works to solve complex bugs and performance
bottlenecks at the lowest levels of our infrastructure. Over the past two years
we’ve undertaken a major project to improve the performance of Git network
operations (like clones or fetches) for the repositories we host.

Two years ago, if you cloned a large repository from any Git host, you probably
found yourself waiting several minutes before data started being sent to your
machine, while the server proudly announced that it was “counting objects”.

$ git clone https://github.com/torvalds/linux
Cloning into 'linux'...
remote: Counting objects: 4350078, done.
remote: Compressing objects: 100% (4677/4677), done.
Receiving objects:   4% (191786/4350078), 78.19 MiB | 8.70 MiB/s

However, if you try to clone the equivalent repository hosted on GitHub.com
today, or on a GitHub Enterprise instance, you’ll notice that data starts being
sent immediately and the transfer is only limited by your available bandwidth to
the server.

Inquiring minds may wonder: What are these objects, why did you have to count
them every single time, and why are you not counting them anymore?

The objects

All data in a Git repository is stored in the shape of a Directed Acyclic
Graph. The history of the repository is ordered throughout time by the
way commits are linked to each other. Each commit has a link to its parent, the
commit that came right before it; some commits have two parents, the result of
merging two branches together. At the same time, each commit has a link to a
tree. The tree is a snapshot of the contents of the working directory at the
moment the commit was created. It contains links to all the files in the
root of your repository, and links to subtrees, which represent folders and
recursively link to more files and subtrees.

Git repository DAG organization

The result is a highly dense forest of interlinked nodes stored in the shape of
a graph, but essentially accessed as a key-value store (each
object in the database is only indexed by the SHA1 of its contents).
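
To make that structure concrete, here is a minimal sketch of how these objects
relate to each other. The struct names and fields below are illustrative, not
Git's actual data structures:

/*
 * Illustrative sketch of Git's object model: these are not Git's real
 * structs, just the relationships described above.
 */
typedef unsigned char object_id[20];    /* the SHA1 of an object's contents */

struct tree_entry {
    const char *name;       /* file or folder name */
    object_id   oid;        /* a blob (file contents) or another tree */
    int         is_tree;    /* nonzero when this entry is a subtree */
};

struct tree {                           /* a snapshot of one directory */
    struct tree_entry *entries;
    unsigned           nr_entries;
};

struct commit {
    object_id   tree;                   /* snapshot of the working directory */
    object_id   parents[2];             /* one parent, or two for a merge */
    unsigned    nr_parents;
    const char *message;
};

/* The object database behaves like a key-value store: any object can be
 * looked up by the SHA1 of its contents. */
void *read_object(const object_id oid);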

The counting

When you connect to a Git daemon to perform a fetch, the client and the server
perform a negotiation. The server shows the tips of all the branches it has,
and the client replies by comparing those against the tips of its own branches.
“I’ve already got all these objects. I want those! And those!”

Object negotiation in a Git repository

There’s a very specific case of a fetch operation: a fresh clone, where little
negotiation is necessary. The server offers the tips of its branches, and the
client wants all of them, because it doesn’t have any.

This is when this whole ordeal starts becoming expensive, because the server
has little information on what to actually send. Git doesn’t keep a definite
list of all objects reachable from the graph, and it cannot send every single
object in its database as a whole, because it could very well be that some of
those objects are not reachable at all in the repository and should be thrown
away instead of sent to the client. The only things Git knows are the tips of
all branches, so its only option is to walk down the graph, all the way to the
beginning of the history, listing every single object that needs to be sent.

Each commit in the history is loaded and added to the list of objects to send,
and from each commit its tree is also expanded, and blobs and subtrees are also
added to the list. This process scales with the size of the history of the
repository (i.e. the amount of commits) and the actual size of the repository.
The more files that exist in the repository, the longer it takes to iterate
through the trees of each commit to find the few blobs that haven’t been listed
yet.

This is the “Counting Objects” phase, and as you may gather, some repositories
have a pretty big graph, with lots of objects to count.
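
In rough terms, the walk looks like the sketch below, reusing the illustrative
types from the earlier sketch. The helpers (read_commit, read_tree, set_add and
the small stack) are assumed for illustration; the real logic lives in Git's
revision walker:

/*
 * Sketch of the "Counting Objects" walk: start from the branch tips and
 * list every reachable commit, tree and blob exactly once.
 */
static void walk_tree(const object_id tree_oid, struct object_set *to_send)
{
    if (!set_add(to_send, tree_oid))
        return;                                 /* subtree already listed */
    struct tree *t = read_tree(tree_oid);       /* load the tree object */
    for (unsigned i = 0; i < t->nr_entries; i++) {
        if (t->entries[i].is_tree)
            walk_tree(t->entries[i].oid, to_send);
        else
            set_add(to_send, t->entries[i].oid);    /* a blob */
    }
}

static void count_objects(const object_id *tips, unsigned nr_tips,
                          struct object_set *to_send)
{
    struct oid_stack pending = OID_STACK_INIT;

    for (unsigned i = 0; i < nr_tips; i++)
        stack_push(&pending, tips[i]);

    while (!stack_empty(&pending)) {
        object_id oid;
        stack_pop(&pending, oid);
        if (!set_add(to_send, oid))             /* returns 0 if already listed */
            continue;
        struct commit *c = read_commit(oid);    /* load the commit itself */
        walk_tree(c->tree, to_send);            /* expand its snapshot */
        for (unsigned i = 0; i < c->nr_parents; i++)
            stack_push(&pending, c->parents[i]);    /* keep walking history */
    }
}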

The problem

At GitHub we serve some of the highest-traffic Git repositories on the internet,
repositories which generally are also very large. A clone of the Linux kernel
can take up to 8 minutes of CPU time in our infrastructure. Besides the poor user
experience of having to wait several minutes before you can start receiving data
in a clone, a thundering herd of these operations could significantly impact the
availability of our platform. This is why the Systems team made it a priority to
start researching workarounds for this issue.

Our initial attempts involved caching the result of every clone in order to
replay it to other clients asking for the same data. We quickly found out that
this approach was not elegant and definitely not scalable.

When it comes to caching clones, the repositories that matter the most are the
most active ones. This makes caching the full result of a clone pointless,
since the state of the repository changes constantly. On top of that, the
initial caching still takes the same unreasonable amount of CPU time, and the
size of the cache is about the same as that of the repository on disk. Each
cached response roughly doubles the amount of disk space needed by the
repository.

For these reasons, we decided against pursuing the caching approach.
Generally speaking, caching the specific results of queries is a weak approach to
performance in complex systems. What you want to do instead is cache
intermediate steps of the computation, so you can efficiently answer any kind
of query. In this case, we were looking for a system that would allow us to
serve not only clones efficiently, but also complex fetches. Caching responses
cannot accomplish this.

In 2013 we organized and obviously attended Git Merge in Berlin, the yearly
conference (previously known as GitTogether) where developers of all the
different Git implementations meet, mingle and attempt to cope with the fact
that it’s 2015 and we’re writing version control systems for a living.

There, we were delighted to learn about the “bitmap index” patchset that
Google’s JGit engineers had written for their open-source Java
implementation of Git. Back then the patchset was still experimental, but the
concept was sound and the performance implications were very promising, so we
decided to tread down that path and attempt to port the patchset to Git, with
the hopes of being able to run it in our infrastructure and push it upstream for
the whole community to benefit.

The idea

The idea behind bitmap indexes is not particularly novel: it is the same method
that databases (especially relational ones) have used since the 1970s to speed up
their more complex queries.

In Git’s case, the queries we’re trying to perform are reachability queries:
given a set of commits, what are all of the objects in the graph that can be
reached from those commits? This is the operation that the “Counting Objects”
phase of a fetch performs, but we want to find the set of reachable objects
without having to traverse every single object on the way.

To do so, we create indexes (stored as bitmaps) that contain the information
required to answer these queries. For any given commit, its bitmap index marks
all the objects that can be reached from it. To find the objects that can be
reached from a commit, we simply look up its bitmap and check the marked bits
on it; the graph doesn’t need to be traversed anymore, and an operation that
used to take several minutes of CPU time (loading and traversing every single
object in the graph) now takes less than 3ms.

Git index implementation
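
A minimal sketch of what such a lookup does is shown below. Bit i in a bitmap
stands for the i-th object in the packfile; the helper names are illustrative,
not Git's actual API:

/*
 * Sketch: answer "which objects do I need to send?" with bitwise
 * operations over precomputed bitmaps instead of a graph walk.
 */
struct bitmap *objects_to_send(struct bitmap_index *bi,
                               struct commit **wants, unsigned nr_wants,
                               struct commit **haves, unsigned nr_haves)
{
    struct bitmap *result = bitmap_new();
    struct bitmap *have = bitmap_new();

    for (unsigned i = 0; i < nr_wants; i++)
        bitmap_or(result, lookup_bitmap(bi, wants[i]));  /* reachable from the wants */
    for (unsigned i = 0; i < nr_haves; i++)
        bitmap_or(have, lookup_bitmap(bi, haves[i]));    /* already on the client */

    bitmap_and_not(result, have);       /* keep only what the client is missing */
    bitmap_free(have);
    return result;
}

For a fresh clone the "haves" set is empty, so the answer is simply the OR of
the bitmaps stored for the branch tips.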

Of course, a practical implementation of these indexes has many details to work
around:

First and foremost, we cannot create an index for every single commit in the
repository. That would consume too much storage! Instead, we use heuristics to
decide the set of commits that will receive bitmaps. We’re interested in
indexing the most recent commits: the tips of all branches need an index,
because those are the reachability computations that we will perform with a
fresh clone. Besides that, we also add indexes to recent commits; it is likely
that their reachability will be computed when people fetch into a clone that is
slightly out of date. As we progress deeper into the commit graph, we decrease
the frequency at which we pick commits to give indexes to. Towards the end of
the graph, we pick one of every 3000 commits.

Last, but not least, we minimize the amount of disk space used by the indexes by
compressing the bitmaps. JGit’s implementation originally used Daniel Lemire’s
EWAH Bitmaps, so we ported his C++ implementation to C to be backwards
compatible [1]. EWAH Bitmaps offer a very reasonable
amount of compression while still being extremely fast to decompress and work
with. The end result is an index that takes between 5 and 10% of the original
size of the repository — significantly less than caching the result of
fetches, whilst allowing us to speed up any fetch negotiation, not only those
of fresh clones.
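
To get a feel for why these bitmaps compress so well, consider that a
reachability bitmap is mostly long runs of identical bits. The toy program
below illustrates the word-aligned run-length idea; the actual EWAH format
packs this information more cleverly, so treat it only as an illustration:

#include <stdint.h>
#include <stdio.h>

/* Toy run-length encoder: runs of all-zero or all-one words are stored
 * as a count, and only the "dirty" words in between are kept verbatim. */
static void encode_runs(const uint64_t *words, unsigned n)
{
    unsigned i = 0;
    while (i < n) {
        if (words[i] == 0 || words[i] == ~(uint64_t)0) {
            uint64_t v = words[i];
            unsigned run = 0;
            while (i < n && words[i] == v) { run++; i++; }
            printf("run of %u %s words\n", run, v ? "all-one" : "all-zero");
        } else {
            printf("literal word: %016llx\n", (unsigned long long)words[i]);
            i++;
        }
    }
}

int main(void)
{
    /* A sparse reachability bitmap: most words are clean, few bits set. */
    uint64_t bitmap[8] = { 0, 0, 0, (uint64_t)1 << 42, 0, 0,
                           ~(uint64_t)0, ~(uint64_t)0 };
    encode_runs(bitmap, 8);
    return 0;
}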

The implementation

Once we started working on the Git implementation, we found that JGit’s design
was so dramatically different from Git’s that it was simply unfeasible to port or
translate the patchset. JGit is after all a Java library, while Git is a
collection of small binaries written in C.

What we did instead was take the idea of bitmap indexes and re-implement it
from scratch, using only the on-disk format of JGit’s bitmap indexes as
reference (so as to ensure interoperability between the two implementations).

The first iteration of the patchset took about 2 months of work, and was
drastically different from the original JGit implementation. But it seemed to
be compatible. We could open and query bitmap indexes generated by JGit. We did
several rounds of synthetic benchmarks, which showed that our implementation
was slightly faster than JGit’s (mostly because of performance differences
between the languages, not between the different designs).

So we started testing cloning real production repositories in a staging
environment. The benchmarks showed an exciting improvement of 40%… In the
wrong direction. The bitmap-based code was about 40% slower to generate the
packfile for a repository.

Packing objects run in torvalds/linux

These results raised many questions: how can a clone operation be significantly
slower than in vanilla Git, when we’ve already tested that the bitmap indexes
are being queried instantly? Did we really spend 3 months writing 5000 lines of
C that make clones slower instead of faster? Should we have majored in Math
instead of Computer Science?

Our existential crisis vanished swiftly with some in-depth profiling. If you
break down the graph by the amount of time spent during each phase of the
cloning operation you can see that, indeed, we’ve managed to reduce an operation
that took minutes to mere milliseconds, but unfortunately we’ve made another
operation — seemingly unrelated — three times slower.

Packing objects run in torvalds/linux (breakdown)

Deltas all the way down

The job of a Version Control System is tracking changes on a set of files over
time; hence, it needs to keep a snapshot of every single file in the repository
at every changeset: thousands of versions of thousands of different files which,
if stored individually, would take up an incredible amount of space on disk.

Because of this, all version control systems use smarter techniques to store
all these different versions of files. Usually, files are stored as the delta
(i.e. difference) between the previous version of a file and its current one.
This delta is then compressed to further reduce the amount of space it takes on
disk.

In Git’s case, every snapshot of a file (which we call a blob) is stored
together with all the other objects (the tags, commits and trees that link them
together) in what we call a “packfile”.

When trying to minimize the size of a packfile, however, Git mostly disregards
history and considers blobs in isolation. To find a delta base for a given
blob, Git blindly scans for blobs that look similar to it. Sometimes it will
choose the previous version of the same file, but very often it’s a blob from
an unrelated part of the history.

In general, Git’s aggressive approach of delta-ing blobs against anything is
very efficient and will often be able to find smaller deltas. Hence the size of
a Git repository on-disk will often be smaller than its Subversion or Mercurial
equivalent.
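
In spirit, the base search is a sliding-window comparison: sort the candidate
objects so that similar-looking blobs end up near each other, then try deltas
against the last few candidates seen. The sketch below is illustrative and its
helpers are assumed; it is not Git's actual pack-objects code:

#define DELTA_WINDOW 10

/* For each object, try the previous DELTA_WINDOW objects as delta bases
 * and keep whichever produces the smallest result. */
static void choose_delta_bases(struct packed_obj **objs, unsigned nr)
{
    for (unsigned i = 0; i < nr; i++) {
        unsigned long best = object_size(objs[i]);   /* start from "no delta" */
        unsigned start = i > DELTA_WINDOW ? i - DELTA_WINDOW : 0;

        for (unsigned j = start; j < i; j++) {
            unsigned long dsize = delta_size(objs[j], objs[i]);
            if (dsize < best) {                      /* smaller than what we had? */
                best = dsize;
                objs[i]->delta_base = objs[j];
            }
        }
    }
}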

This same process is also performed when serving a fetch negotiation. Once Git
has the full list of objects it needs to send, it needs to compress them into a
single packfile (Git always sends packfiles over the network). If the packfile
on disk has a given object X stored as a delta against an object Y, it could
very well be the case that object X needs to be sent, but object Y doesn’t. In
this case, object X needs to be decompressed and delta’ed against an object
that does need to be sent in the resulting packfile.

This is a pretty frequent occurrence when negotiating a small fetch, because
we’re sending few objects, all from the tip of the repository, and these objects
will usually be delta’ed against older objects that won’t be sent. Therefore,
Git tries to find new delta bases for these objects.
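
The decision the server makes for each object boils down to a few lines; the
sketch below uses assumed helper names rather than Git's real code:

/* Reuse an on-disk delta only if its base is also being sent; otherwise
 * the object has to be inflated and delta'ed all over again. */
static void write_object(struct packed_obj *obj, struct object_set *to_send)
{
    if (obj->delta_base && set_contains(to_send, obj->delta_base)) {
        copy_delta_from_disk(obj);               /* cheap: reuse the stored bytes */
    } else {
        void *full = inflate_object(obj);        /* expensive: undo the old delta... */
        find_new_delta_base(obj, full, to_send); /* ...and search for a new base */
    }
}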

The issue here is that we were actually serving full clones, and our
benchmarks were showing that we were recompressing almost every single object
in the clone, which really makes no sense. The point of a clone is sending all
the objects in a repository! When we send all objects, we shouldn’t need to
find new delta bases for them, because their existing delta bases will be sent
too!

And yet the profiling did not lie. There was something fishy going on in the
compression phase.

Your very own fork of Rails

Very early on we figured out that storing each fork as a full copy of its
repository was not sustainable. For instance, there are almost 11,000 forks of Rails
hosted on GitHub: if each one of them were its own copy of the repository, that
would imply an incredible amount of redundant disk space, requiring several
times more fileservers than the ones we have in our infrastructure.

That’s why we decided to use a feature of Git called alternates.
When you fork a repository on GitHub, we create a shallow copy of it. This copy
has no objects of its own, but it has access to all the objects of an alternate,
a root repository we call network.git, which contains the objects for all
the forks in the network. When you push to your repository, eventually we move
over the objects you pushed to the network.git root, so they’re already
available in case you decide to create a Pull Request against the original
repository.

With all the repositories in the network sharing the same pool of objects, we
can keep all the objects inside of a single packfile, and hence minimize the
on-disk size of the repository network by not storing all the duplicated
objects between the forks.
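
In stock Git the mechanism is a plain text file: objects/info/alternates inside
a repository lists other object directories to borrow objects from, one path per
line. A fork's alternates file would contain a single entry pointing at the
shared store; the path below is hypothetical:

/data/repositories/some-network/network.git/objects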

Here’s the issue though: when we delta two objects inside this packfile (a
packfile that is shared between all the forks of a repository), Git uses the
straightforward, aggressive heuristic of “finding an object that looks alike”.
As stated earlier, this is very effective for minimizing the size of a packfile
on disk, but like all simple things in life, this comes back to bite us.

What happens when you delta an object of a fork against an object of another
fork? Well, when you clone that fork, that object cannot be sent as a delta,
because its base is not going to be sent as part of the clone. The object
needs to be blown up (recomposed from its original delta), and then delta’ed on
the fly against an object that is actually going to be sent on the clone.

This is the reason why the “Compressing Objects” phase of our fetches was
recompressing all the objects and taking so long to run: yes, we were sending a
whole repository in the clone, but the objects in that repository were delta’ed
against objects of a different fork altogether. They did need to be
recompressed on the fly. Ouch!

The missing optimization

We had a solid explanation of why we were seeing all these objects being
recompressed, but something was still off. The “compressing objects” phase is
completely independent of the “counting objects” phase, and our bitmap changes
didn’t affect it at all.

The behavior of this compression phase is dictated by the way we store forks
jointly on disk, so it made no sense that it would take twice as long after our
patch, again considering these code paths were not touched. Why was object
compression suddenly so slow?

Here’s what happens: when traversing the graph during the “Counting Objects”
phase, traditional Git collected metadata about blobs, like their filenames,
sizes and paths. During the compression phase, it used this metadata as part of
its heuristics for finding good delta bases.

However, after three months of work in our bitmap index changes, we could now
find the list of all the objects to send through the bitmap index, without
having to actually load a single one of them. This was great for removing the
intense CPU requirements of the “counting objects” phase, but it had unintended
side effects when it was time to compress the objects.

When Git noticed that none of the objects on the list could be sent because they
were delta’ed against other objects that were not in the list, it was forced
to re-delta these objects. But without any metadata to hint at the size or
contents of the objects, it had to randomly compare them against each other
(obviously by loading them and forfeiting all the benefits that the bitmap
index had in the first place) until it was able to find similar objects to
delta against and piece together a packfile that wasn’t outrageously large.

This is how we were losing all the benefits of our optimization, and in fact
making the process of generating a packfile 40% slower than it was before,
despite the fact that the most expensive phase of the process was essentially
optimized away.

The performance fix

The performance issue dooming our patchset was caused, once again, by the fact
that we merge several forks of a repository into the same packfile. Unfortunately,
splitting these forks into individual packs was not an option, because it would
multiply the storage cost of repositories several times over.

After some thought, we agreed that the bitmap index approach was too promising
to give up on, despite the fact it could not run properly with our current
infrastructure. Therefore, we started looking for a workaround in the way we
store the different forks of repositories on disk.

The idea of “forks” is foreign to Git’s storage layer: when creating the
packfile for a network of forks, Git only sees hundreds of thousands of objects
and is not aware of the fact that they actually come from different forks of the
same repository.

However, we are aware of the source of the objects as we add them to the
packfile for delta’ing, and we can influence the way that Git will delta these
objects.

What we did was “paint” the graph of all forks, starting from the roots, with a
different color for each fork. The “painting” was performed by marking a bit on
each object for each fork that contained the object.

Git repository repack marking

With this, it’s trivial to propagate the markings to the descendant objects, and
to change the delta-base heuristic so it only considers an object as a delta base
if its markings are a superset of the markings of the child object (i.e. if
sending the child object always implies that the base is sent too, for every
fork).

As the markings trickle down, the final packfile will only have objects that can
be sent as-is, because when cloning any of the forks in the network, every
object will only be delta’ed against another object that will also be sent in
the clone.
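
With every object carrying a small bitmask of the forks that contain it, the new
heuristic boils down to a superset check, roughly like this sketch (the names
are ours, not Git's):

#include <stdint.h>

/* An object may serve as a delta base only if every fork that contains
 * the child also contains the base. Each object carries a bitmask with
 * one bit per fork in the network. */
static int usable_as_delta_base(uint64_t base_fork_marks, uint64_t child_fork_marks)
{
    return (child_fork_marks & ~base_fork_marks) == 0;
}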

The results

With this new heuristic, we repacked the original repository for our benchmarks
(which in our case meant repacking the whole network of forks that contained the
repository), and ran the packfile generation test again.

Packing objects run in torvalds/linux (with perf fix)

The first thing we verified is that our heuristics worked, and that we had to
re-compress exactly 0 objects during the “Compressing Objects” phase. That was
the case. As for the performance of the pack generation phase… Well, let’s
just say we hit our performance target.

It took another couple of months to work out the kinks in our implementation,
but we eventually deployed the patchset to production, where the average CPU
time spent on Git network operations was reduced by more than 90% — an order of
magnitude faster than the old implementation.

Shortly after our initial deploy, we also started the process of upstreaming
the changes to Git so the whole community could benefit from them. The
bitmap-index patchset [2] finally shipped in the much
anticipated Git 2.0 release, and now Git network operations from most
major Git hosts are enjoying the overhaul.

Since the initial deployment of our patchset more than a year ago, it has
already saved our users roughly a century of waiting for their fetches to
complete (and the equivalent amount of CPU time in our fileservers).

But this is only the beginning: now that bitmap indexes are widely available,
they can be used to accelerate many other Git operations. We’ve implemented a
number of such improvements which we already run on production, and we are
preparing them for submission to the open source Git project.

The story behind these second generation optimizations, however, will have to
wait until another day and another blog post.


1. Daniel Lemire was extremely helpful assisting us with questions regarding
his EWAH compression algorithm, and allowed us to relicense our C port under
the GPLv2 to make it compatible with Git. His state-of-the-art research on
bitmap and array compression is used in a lot of open-source and commercial
software.

2. The final series that made it upstream was composed of 14 commits. They
include thorough documentation of the whole implementation in their commit
messages. The key parts of the design are explained in "compressed bitmap
implementation", "add documentation for the bitmap format", "add support for
bitmap indexes", "use bitmaps when packing objects" and "implement bitmap
writing".
