Highlights from Git 2.31
The open source Git project just released Git 2.31 with features and bug fixes from 85 contributors, 23 of them new. Last time we caught up with you, Git 2.29 had just been released. Two versions later, let’s take a look at the most interesting features and changes that have happened since.
Introducing git maintenance
Picture this: you’re at your terminal, writing commits, pulling from another repository, and pushing up the results when all of a sudden, you’re greeted by this unfriendly message:
Auto packing the repository for optimum performance. You may also run "git gc" manually. See "git help gc" for more information.
…and, you’re stuck. Now you’ve got to wait for Git to finish running git gc --auto before you can get back to work.
What happened here? In the course of normal use, Git writes lots of data: objects, packfiles, references, and the like. Some of those paths are optimized for write performance. For example, it’s much quicker to write a single “loose” object, but it’s faster to read a packfile.
To keep you productive, Git makes a trade-off: in general, it optimizes for the write path while you’re working, pausing every so often to reorganize its internal data structures into a form that is more efficient to read.
Git has its own heuristics about when is a good time to perform this “pause,” but sometimes those heuristics trigger a blocking git gc at the worst possible time. You could manage these data structures yourself, but you might not want to invest the time figuring out when and how to do that.
Starting in Git 2.31, you can get the best of both worlds with background maintenance. This cross-platform feature allows Git to keep your repository healthy without blocking any of your interactions. In particular, this will improve your git fetch times by pre-fetching the latest objects from your remotes once an hour.
Getting started with background maintenance couldn’t be easier. Simply navigate your terminal to any repository you want to enable background maintenance on, and run the following:
$ git maintenance start
…and Git will take care of the rest. Besides pre-fetching the latest objects once an hour, Git will make sure that its own data is organized, too. It will update its commit-graph file once an hour, and pack any loose objects (as well as incrementally repack packed objects) nightly.
Read more about this feature in the git maintenance documentation and learn how to customize it with maintenance.* config options. If you have any trouble, you can check the troubleshooting documentation.
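If you’d like to see what the individual tasks do before handing things over to a scheduler, here is a minimal sketch that runs two of them once, in the foreground, and then sets a couple of maintenance.* options. The throwaway repository, the placeholder identity, and the particular schedule chosen are assumptions made for illustration, not recommendations.

```shell
#!/bin/sh
set -e

# Hypothetical throwaway repository so nothing here touches real work.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git -c user.name=Example -c user.email=example@example.com \
    commit -q --allow-empty -m "initial commit"

# Run two maintenance tasks once, right now, without any scheduler.
git maintenance run --task=commit-graph --task=loose-objects

# Illustrative settings: use the "incremental" strategy, but only
# schedule the full gc task weekly.
git config maintenance.strategy incremental
git config maintenance.gc.schedule weekly

# Show every maintenance.* option now set for this repository.
maintenance_config=$(git config --get-regexp '^maintenance\.')
printf '%s\n' "$maintenance_config"
```

Unlike git maintenance start, this sketch does not register anything with your system’s scheduler, so it is safe to try in a scratch directory.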
[source, source, source, source]
On-disk reverse indexes
You may know that Git stores all data as “objects”: commits, trees, and blobs which store the contents of individual files. For efficiency, Git puts many objects into packfiles, which are essentially a concatenated stream of objects (this same stream is also how objects are transferred by git fetch and git push). In order to efficiently access individual objects, Git generates an index for each packfile. Each of these .idx files allows quick conversion of an object id into its byte offset within the packfile.
What happens when you want to go in the other direction? In particular, if all Git knows is what byte it’s looking at in some packfile, how does it go about figuring out which object that byte is part of?
To accomplish this, Git uses an aptly-named reverse index: an opaque mapping between locations in a packfile, and the object each location is a part of. Prior to Git 2.31, there was no on-disk format for reverse indexes (like there is for the .idx file), and so it had to generate and store the reverse index in memory each time. This roughly boils down to generating an array of object-position pairs, and then sorting that array by position (for the curious, the exact details can be found here).
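To get a feel for that “array of object-position pairs,” you can approximate it from the outside. The sketch below is not how Git builds its reverse index internally; it simply dumps a pack’s forward index with git show-index and sorts the entries by offset. The scratch repository, file names, and placeholder identity are assumptions for illustration.

```shell
#!/bin/sh
set -e

# Hypothetical scratch repository with a few objects in a single pack.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
for i in 1 2 3; do
    echo "file $i" >"file$i.txt"
    git add "file$i.txt"
    git -c user.name=Example -c user.email=example@example.com \
        commit -q -m "commit $i"
done
git repack -adq

idx=$(ls .git/objects/pack/pack-*.idx)

# show-index prints "<offset> <object id> (<crc>)" in object-id order.
# Sorting numerically by offset gives the pack order, which is the
# ordering a reverse index captures.
by_offset=$(git show-index <"$idx" | sort -n | awk '{ print $1, $2 }')
printf '%s\n' "$by_offset"
```

Each output line pairs a byte offset with the object that starts there, so looking up “which object owns this byte?” becomes a search over the sorted offsets.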
But this takes time. In the case of repositories with large packfiles, this can take a lot of time. To better understand the scale, consider an experiment which compares the time it takes to print the size of an object, versus the time it takes to print that object’s contents. To simply print an object’s contents, Git uses the forward index to locate the desired object in a pack, and then it reassembles and prints out its contents. But to print an object’s size in a packfile, Git needs to locate not just the object we want to measure, but the object immediately following it, and then subtract the two to find out how much space it’s using. To find the position of the first byte in the adjacent object, Git needs to use the reverse index.
Comparing the two, it is more than 62 times slower to print the size of an object than it is to print that entire object’s contents. You can try this at home with hyperfine by running:
$ git rev-parse HEAD >tip
$ hyperfine --warmup=3 \
'git cat-file --batch <tip' \
'git cat-file --batch-check="%(objectsize:disk)" <tip'
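The subtraction behind %(objectsize:disk) can be reproduced by hand. As a rough sketch (assuming a SHA-1 repository, where the pack ends with a 20-byte checksum, and using a hypothetical scratch repository), the following computes each object’s on-disk size from adjacent offsets and compares the result with what Git reports:

```shell
#!/bin/sh
set -e

# Hypothetical scratch repository with everything in one pack.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo "hello" >greeting.txt
git add greeting.txt
git -c user.name=Example -c user.email=example@example.com \
    commit -q -m "add greeting"
git repack -adq

idx=$(ls .git/objects/pack/pack-*.idx)
pack=${idx%.idx}.pack
pack_size=$(wc -c <"$pack")
# Assumption: SHA-1 repository, so the pack ends with a 20-byte trailer
# and the last object stops just before it.
limit=$((pack_size - 20))

# Sort index entries by offset, then print "<oid> <next offset - offset>".
by_subtraction=$(git show-index <"$idx" | sort -n |
    awk -v limit="$limit" '
        NR > 1 { print oid, $1 - off }
        { off = $1; oid = $2 }
        END { print oid, limit - off }' | sort)

# Ask Git for the same numbers.
by_git=$(git cat-file --batch-all-objects \
    --batch-check='%(objectname) %(objectsize:disk)' | sort)

printf '%s\n' "$by_subtraction"
```

If the two lists agree, the size of every object really is just the gap between it and its neighbor in the pack.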
In 2.31, Git gained the ability to serialize the reverse index into a new, on-disk format with the .rev extension. After generating an on-disk reverse index and repeating the above experiment, our results now show that it takes roughly the same amount of time to print an object’s contents as it does its size.
Observant readers may ask themselves why Git even needs to bother using a reverse index. After all, if you can print the contents of an object, then surely printing that object’s size is no more difficult than knowing how many bytes you wrote when printing the contents. But, this depends on the size of the object. If it’s enormous, then counting up all of its bytes is much more expensive than simply subtracting.
Reverse indexes can help beyond synthetic experiments like these: when sending objects for a fetch or push, the reverse index is used to send object bytes directly from disk. Having a reverse index computed ahead of time makes this process run faster.
Git doesn’t generate .rev files by default yet, but you can experiment with them yourself by running git config pack.writeReverseIndex true, and then repacking your repository (with git repack -Ad). We have been using these at GitHub for the past couple of months to enable dramatic improvements in many different Git operations.
Tidbits
- We’ve talked on this blog before about the commit-graph file. It’s an incredibly useful serialization of common information about commits, like which parents they have, what their root tree is, and so on. (For a more detailed exposition, see the blog post series beginning here.) Commit graphs also store information about a commit’s generation number, which can be used to accelerate many kinds of commit walks. Git 2.31 introduces a new kind of generation number, which can improve performance further in certain situations. These patches were contributed by Abhishek Kumar, a Google Summer of Code student. [source]
- In recent versions of Git, it has become easier to change the default name for the main branch in a new repository with the init.defaultBranch configuration. Git has always tried to check out the branch at the HEAD of your remote (i.e., if the remote’s default branch was “foo”, then git clone would try to check out foo locally), but this hasn’t worked with empty repositories. In Git 2.31, this now works with empty repositories, too. Now if you are cloning a newly-created repository locally to start writing the first patches, your local copy will respect the default branch name set by the remote, even if there aren’t any commits yet. [source]
- On the topic of renaming things, Git 2.30 makes it easier to change the name of another default: a repository’s first remote. When git clone-ing a repository, the first remote initialized is always named “origin”. Prior to Git 2.30, your options for renaming this were limited to running git remote rename origin <newname>. Git 2.30 allows you to configure a different name to be chosen by default, instead of always using “origin”. To give it a try for yourself, set the clone.defaultRemoteName configuration. [source]
- When a repository grows large, it can be hard to figure out which branches are responsible. In Git 2.31, git rev-list now has a --disk-usage option which is both simpler and faster than using the existing tools to sum up object sizes. The examples section of the rev-list manual shows off some uses (and check out the source link below for timings and to see the “old” way of doing it). [source]
- You may have used Git’s -G<regex> option to find commits which modified a line that mentions a particular string (e.g., git log -G'foo\(' will look for changes that added, removed, or modified calls to the foo() function). But you may also want to ignore lines matching a certain pattern. Git 2.30 introduces -I<regex>, which lets you ignore changes in lines matching a regular expression. For instance, git log -p -I'//' would show the patch for each commit, but omit any hunks that only touched comment lines (those containing //). [source]
- In preparation for replacing the merge backend, rename detection has been substantially optimized. You can read more about these changes from their author in Optimizing git’s merge machinery, #1, and Optimizing git’s merge machinery, #2.
That’s just a sample of changes from the last couple of releases. For more, check out the release notes for 2.30 and 2.31, or any previous version in the Git repository.