Highlights from Git 2.37
The open source Git project just released Git 2.37, with features and bug fixes from over 75 contributors, 20 of them new. We last caught up with you on the latest in Git back when 2.36 was released.
To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.
Before we get into the details of Git 2.37.0, we first wanted to let you know that Git Merge is returning this September. The conference features talks, workshops, and more all about Git and the Git ecosystem. There is still time to submit a proposal to speak. We look forward to seeing you there!
A new mechanism for pruning unreachable objects
In Git, we often talk about classifying objects as either “reachable” or “unreachable”. An object is “reachable” when there is at least one reference (a branch or a tag) from which you can start an object walk (traversing from commits to their parents, from trees into their sub-trees, and so on) and end up at your destination. Similarly, an object is “unreachable” when no such reference exists.
A Git repository needs all of its reachable objects to ensure that the repository is intact. But it is free to discard unreachable objects at any time. And it is often desirable to do just that, particularly when many unreachable objects have piled up, you’re running low on disk space, or similar. In fact, Git does this automatically when running garbage collection.
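To make the distinction concrete, here is a minimal sketch that builds a throwaway repository, writes a blob that nothing points to, and asks git fsck to list unreachable objects. The temporary directory, commit message, and identity values are placeholders chosen for illustration, not anything from the release itself.

```shell
set -eu
# Hypothetical scratch repository; the identity values are placeholders.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.name=Example -c user.email=example@example.com \
    commit -q --allow-empty -m "initial commit"

# Write a blob that no commit, tree, or reference points to.
orphan=$(printf 'orphaned content\n' | git -C "$repo" hash-object -w --stdin)

# List every object that cannot be reached from a reference.
unreachable=$(git -C "$repo" fsck --unreachable --no-reflogs 2>/dev/null)
printf '%s\n' "$unreachable"
```

The blob written by hash-object should appear in the fsck listing, while the commit stays off the list because the current branch still points at it.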
But observant readers will notice the gc.pruneExpire configuration. This setting defines a “grace period” during which unreachable objects that are not yet old enough to be removed from the repository are left alone. This mitigates a race condition in which an unreachable object that is about to be deleted becomes reachable through some other process (like an incoming reference update or a push) before it is deleted, leaving the repository in a corrupt state.
Setting a small, non-zero grace period makes it much less likely to encounter this race in practice. But it leads us to another problem: how do we keep track of the age of the unreachable objects which didn’t leave the repository? We can’t pack them together into a single packfile; since all objects in a pack share the same modification time, updating any object drags them all forward. Instead, prior to Git 2.37, each surviving unreachable object was written out as a loose object, and the mtime of the individual objects was used to store their age. This can lead to serious problems when there are many unreachable objects which are too new and can’t be pruned.
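The grace period is easy to observe with a loose object. The sketch below is illustrative only: it uses a scratch repository in a temporary directory, writes one unreferenced blob, and runs git prune twice with different expiry values. It assumes nothing beyond long-standing Git commands, so it behaves the same before and after 2.37.

```shell
set -eu
# Hypothetical scratch repository used only for this demonstration.
repo=$(mktemp -d)
git init -q "$repo"

# A freshly written, unreferenced blob is stored as a loose object,
# and the mtime of its file records the object's age.
oid=$(printf 'too new to prune\n' | git -C "$repo" hash-object -w --stdin)

# With a one-day grace period the object is too young to delete.
git -C "$repo" prune --expire=1.day.ago
if git -C "$repo" cat-file -e "$oid" 2>/dev/null; then kept=yes; else kept=no; fi

# With no grace period at all it is removed immediately.
git -C "$repo" prune --expire=now
if git -C "$repo" cat-file -e "$oid" 2>/dev/null; then gone=no; else gone=yes; fi

echo "kept with a 1-day grace period: $kept"
echo "removed with --expire=now: $gone"
```

Every unreachable object that survives the first prune stays behind as its own file, which is exactly the situation that becomes painful when there are many of them.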
Git 2.37 introduces a new concept, cruft packs, which allow unreachable objects to be stored together in a single packfile by writing the ages of individual objects in an auxiliary table stored in an *.mtimes file alongside the pack.
While cruft packs don’t eliminate the data race we described earlier, in practice they can help make it much less likely by allowing repositories to prune with a much longer grace period, without worrying about the potential to create many loose objects. To try it out yourself, you can run:
$ git gc --cruft --prune=1.day.ago
and notice that your $GIT_DIR/objects/pack directory will have an additional .mtimes file, storing the ages of each unreachable object written within the last 24 hours:
$ ls -1 .git/objects/pack
pack-243103d0f640e0096edb3ef0c842bc5534a9f9a4.idx
pack-243103d0f640e0096edb3ef0c842bc5534a9f9a4.mtimes
pack-243103d0f640e0096edb3ef0c842bc5534a9f9a4.pack
pack-5a827af6f1a793a45c816b05d40dfd4d5f5edf28.idx
pack-5a827af6f1a793a45c816b05d40dfd4d5f5edf28.pack
There’s a lot of detail we haven’t yet covered on cruft packs, so expect a more comprehensive technical overview in a separate blog post soon.
[source]
A builtin filesystem monitor for Windows and macOS
As we have discussed often before, one of the factors that significantly impact Git’s performance is the size of your working directory. When you run git status, for example, Git has to crawl your entire working directory (in the worst case) in order to figure out which files have been modified.
Git has its own cached understanding of the filesystem to avoid this whole-directory traversal in many cases. But it can be expensive for Git to update its cached understanding of the filesystem with the actual state of the disk while you work.
In the past, Git has made it possible to integrate with tools like Watchman via a hook, making it possible to replace Git’s expensive refreshing process with a long-running daemon which tracks the filesystem state more directly.
But setting up this hook and installing a third-party tool can be cumbersome. In Git 2.37, this functionality is built into Git itself on Windows and macOS, removing the need to install an external tool and configure the hook.
You can turn this on for your repository by setting the core.fsmonitor config option:
$ git config core.fsmonitor true
After setting up the config, an initial git status will take the normal amount of time, but subsequent commands will take advantage of the monitored data and run significantly faster.
The full implementation is impossible to describe completely in this post. Interested readers can follow along later this week with a blog post written by Jeff Hostetler for more information. We’ll be sure to add a link here when that post is published.
[source, source, source, source]
The sparse index is ready for wide use
We previously announced Git’s sparse index feature, which helps speed up Git commands when using the sparse-checkout feature in a large repository.
In case you haven’t seen our earlier post, here’s a brief refresher. Often when working in an extremely large repository, you don’t need the entire contents of your repository present locally in order to contribute. For example, if your company uses a single monorepo, you may only be interested in the parts of that repository that correspond to the handful of products you work on.
Partial clones make it possible for Git to only download the objects that you care about. The sparse index is an equally important component of the equation. The sparse index makes it possible for the index (a key data structure which tracks the content of your next commit, which files have been modified, and more) to only keep track of the parts of your repository that you’re interested in.
When we originally announced the sparse index, we explained how different Git subcommands would have to be updated individually to take advantage of the sparse index. With Git 2.37.0, all of those integrations are now included in the core Git project and available to all users.
In this release, the final integrations were for git show, git sparse-checkout, and git stash. In particular, git stash has the largest performance boost of all of the integrations so far because of how the command reads and writes indexes multiple times in a single process, achieving a near 80% speed-up in certain cases (though see this thread for all of the details).
Tidbits
Now that we have looked at some of the bigger features in detail, let’s turn to a handful of smaller topics from this release.
- Speaking of sparse checkouts, this release deprecates the non-cone-mode style of sparse checkout definitions (that is, patterns defined without --cone).
  For the uninitiated, the git sparse-checkout command supports two kinds of patterns which dictate which parts of your repository should be checked out: “cone” mode and “non-cone” mode. The latter, which allows specifying individual files with a .gitignore-style syntax, can be confusing to use correctly and has performance problems (namely that in the worst case every pattern must be tried against every file, leading to slow-downs). Most importantly, it is incompatible with the sparse index, which brings the performance enhancements of using a sparse checkout to all of the Git commands you’re familiar with.
  For these reasons (and more!), the non-cone-mode style of patterns is discouraged, and users are instead encouraged to use --cone mode.
  [source]
- In our highlights from the last Git release, we talked about more flexible fsync configuration, which made it possible to more precisely define what files Git would explicitly synchronize with fsync() and what strategy it would use to do that synchronization.
  This release brings a new strategy to the list supported by core.fsyncMethod: “batch”, which can provide significant speed-ups on supported filesystems when writing many individual files. This new mode works by staging many updates to the disk’s writeback cache before performing a single fsync(), which causes the disk to flush its writeback cache. Files are then atomically moved into place, guaranteeing that they are fsync()-durable by the time they enter the object directory.
  For now, this mode only supports batching loose object writes, and will only be enabled when core.fsync includes the loose-object value. On a synthetic test of adding 500 files to the repository with git add (each resulting in a new loose object), the new batch mode imposes only a modest penalty over not calling fsync() at all.
  On Linux, for example, adding 500 files takes 0.06 seconds without any fsync() calls, 1.88 seconds with an fsync() after each loose object write, and only 0.15 seconds with the new batched fsync(). Other platforms display similar speed-ups, with a notable example being Windows, where the numbers are 0.35 seconds, 11.18 seconds, and just 0.41 seconds, respectively.
  [source]
- If you’ve ever wondered, “what’s changed in my repository since yesterday?”, one way you can figure that out is with the --since option, which is supported by all standard revision-walking commands, like log and rev-list.
  This option works by starting with the specified commits and walking recursively along each commit’s parents, stopping the traversal as soon as it encounters a commit older than the --since date. But in occasional circumstances (particularly when there is clock skew) this can produce confusing results.
  For example, suppose you have three commits, C1, C2, and C3, where C2 is the parent of C3, and C1 is the parent of C2. If both C1 and C3 were written in the last hour, but C2 is a day old (perhaps because the committer’s clock is running slow), then a traversal with --since=1.hour.ago will only show C3, since seeing C2 causes Git to halt its traversal.
  If you expect your repository’s history has some amount of clock skew, then you can use --since-as-filter in place of --since, which only prints commits newer than the specified date, but does not halt its traversal upon seeing an older one.
  [source]
- If you work with partial clones, and have a variety of different Git remotes, it can be confusing to remember which partial clone filter is attached to which remote.
  Even in a simple example, trying to remember what object filter was used to clone your repository requires this incantation:
  $ git config remote.origin.partialCloneFilter
  In Git 2.37, you can now access this information much more readily behind the -v flag of git remote, like so:
  $ git remote -v
  origin git@github.com:git/git.git (fetch) [tree:0]
  origin git@github.com:git/git.git (push)
  Here, you can easily see between the square brackets that the remote origin uses a tree:0 filter.
  This work was contributed by Abhradeep Chakraborty, a Google Summer of Code student, who is one of three students participating this year and working on Git.
  [source]
- Speaking of remote configuration, Git 2.37 ships with support for warning or exiting when it encounters plain-text credentials stored in your configuration with the new transfer.credentialsInUrl setting.
  Storing credentials in plain text in your repository’s configuration is discouraged, since it forces you to ensure you have appropriately restrictive permissions on the configuration file. Aside from storing the data unencrypted at rest, Git often passes the full URL (including credentials) to other programs, exposing them on systems where other processes can read the argument lists of sensitive processes. In most cases, it is encouraged to use Git’s credential mechanism, or tools like GCM.
  This new setting allows Git to either warn or halt execution when it sees one of these credentials by setting transfer.credentialsInUrl to “warn” or “die” respectively. The default, “allow”, does nothing.
- If you’ve ever used git add -p to stage the contents of your working tree incrementally, then you may be familiar with git add’s “interactive mode”, or git add -i, of which git add -p is a sub-mode.
  In addition to “patch” mode, git add -i supports “status”, “update”, “revert”, “add untracked”, and “diff”. Until recently, git add -i was actually written in Perl. This command has been the most recent subject of a long-running effort to port Git commands written in Perl into C. This makes it possible to use Git’s libraries without spawning sub-processes, which can be prohibitively expensive on certain platforms.
  The C reimplementation of git add -i has shipped in releases of Git as early as v2.25.0. In more recent versions, this reimplementation has been in “testing” mode behind an opt-in configuration. Git 2.37 makes the C reimplementation the default, so Windows users should notice a speed-up when using git add -p.
- Last but not least, there is a lot of exciting work going on for Git developers, too, like improving the localization workflow, improving CI output with GitHub Actions, and reducing memory leaks in internal APIs.
  If you’re interested in contributing to Git, now is a more exciting time than ever to start. Check out this guide for some tips on getting started.
The rest of the iceberg
That’s just a sample of changes from the latest release. For more, check out the release notes for 2.37 or any previous version in the Git repository.