Highlights from Git 2.43

The last Git release of 2023 is here! Take a look at some of our highlights on what’s new in Git 2.43.

November 20, 2023

The open source Git project just released Git 2.43 with features and bug fixes from over 80 contributors, 17 of them new. We last caught up with you on the latest in Git back when 2.42 was released.

To celebrate this most recent release, here is GitHub’s look at some of the most interesting features and changes introduced since last time.

New features in `git repack`

In Git 2.43, git repack learned a couple of new tricks. If you’re unfamiliar, git repack is used to reorganize the packs in your repository. It has a number of different modes, many of which we’ve discussed in this series before:

Combine all unpacked objects into a single pack, and then delete the unpacked copies (git repack -d).
Repack new objects into packs which form a geometric progression of object counts (with git repack --geometric=<n>) [source].
Generate cruft packs to store unreachable objects, or move expired objects to a separate directory (with git repack --cruft or --expire-to) [source].

In this release, that list got a little longer, with two major new features being added to git repack. In Git 2.43, git repack now supports working with multiple cruft packs, as well as splitting the contents of repositories by an object filter. For more details, read on!

Multiple cruft packs

Long-time readers may be familiar with our discussion of cruft packs. But if you’re new around here, or could use a brief refresher, here’s a quick overview to get you oriented. Cruft packs are used to store groups of unreachable objects together while they wait to be removed, or “pruned” from a repository.

In the past, Git would perform garbage collection (via git gc) and split a repository’s objects into three different categories:

Reachable objects, which are the set of objects you could collect by starting at each of a repository’s references (for example, its branches and tags) and recursively exploring object links (moving from a commit to its parent(s), a tree to its sub-tree(s), etc.). These objects must remain in a repository following any garbage collection operation.
Stale unreachable objects, which are the non-reachable objects (that is, any object that you couldn’t get to using the above procedure) that have not been re-written or added to the repository recently (that is, not since a configurable cut-off time).
Fresh unreachable objects, which are the remaining objects not grouped into the other two categories.
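
To make the reachable/unreachable split concrete, here is a small sketch that builds a throwaway repository and asks Git which objects fall on each side. The repository, file name, and identity are placeholders made up for this demo; it only shows the reachability distinction, not the stale/fresh split.

```shell
set -e
# Throwaway repository; the identity below is a placeholder for this demo only.
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.com'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.com'

echo 'hello' >file.txt
git add file.txt
git commit -q -m 'initial commit'

# Write a blob that no commit, tree, or reference points at.
orphan="$(echo 'never referenced' | git hash-object -w --stdin)"

# Reachable: everything found by walking from all references
# (here: one commit, one tree, and one blob).
reachable="$(git rev-list --objects --all | cut -d' ' -f1)"

# Unreachable: whatever fsck cannot reach from the references (ignoring reflogs).
unreachable="$(git fsck --unreachable --no-reflogs 2>/dev/null |
  awk '$1 == "unreachable" { print $3 }')"

echo "reachable objects: $(echo "$reachable" | wc -l)"
echo "unreachable object: $unreachable"
```

The orphaned blob shows up only in the second list, which is exactly the kind of object that garbage collection has to decide whether to keep or prune.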

Historically, the “fresh unreachable objects” group was left in the repository, with each such object being stored individually instead of packed. This was done so that Git could use the mtime of each loose object as a proxy for tracking the last time that object was written. If an object is only written once, its mtime will be the time that it was added to the repository. If an object is written again, and Git realizes that it already has that object, it will simply update that object’s mtime to the current timestamp with utime(). Only objects with sufficiently old mtime values are eligible to be pruned from the repository.
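
The mtime rule can be observed directly with git prune. The following is a rough sketch in a throwaway repository: it writes an unreachable loose object, shows that it survives pruning while its mtime is recent, and then backdates the file so that the same prune removes it. The one-day cut-off and the backdated timestamp are arbitrary values chosen for the demo.

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

# Write an unreachable loose object and locate its file on disk.
oid="$(echo 'unreachable content' | git hash-object -w --stdin)"
path=".git/objects/$(echo "$oid" | cut -c1-2)/$(echo "$oid" | cut -c3-)"

# Freshly written: the mtime is newer than the cut-off, so the object is kept.
fresh_survived=no
git prune --expire=1.day.ago
if git cat-file -e "$oid"; then fresh_survived=yes; fi

# Backdate the mtime (to 2000-01-01 00:00) so the object now looks stale.
touch -t 200001010000 "$path"
stale_survived=no
git prune --expire=1.day.ago
if git cat-file -e "$oid" 2>/dev/null; then stale_survived=yes; fi

echo "fresh object survived: $fresh_survived"
echo "stale object survived: $stale_survived"
```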

If there are many unreachable objects which were modified too recently to be removed, then Git can run into trouble by creating too many loose objects, leading to performance degradation. To combat this, Git introduced “cruft packs” to store the collection of unreachable objects which were modified too recently to be pruned together in a single pack, instead of individually as loose objects.

GitHub has used this feature to eliminate a large class of problems that arise from repositories in this state (curious readers can learn more about GitHub’s deployment of cruft packs in our post Scaling Git’s garbage collection).

But there was a remaining drawback of using cruft packs to manage unreachable objects: all of the unreachable objects had to be stored together in a single cruft pack. That means that if a repository has many unreachable objects (especially if it is pruned infrequently), git repack has to spend many I/O cycles rewriting a large cruft pack over and over again, each time producing similar results.

In Git 2.43, this drawback was eliminated with native support for multiple cruft packs.

In particular, Git learned a new --max-cruft-size option to limit the maximum size (in bytes) of each individual cruft pack, allowing you to split the set of unreachable objects in your repository across multiple packs:

```
$ git repack -d --cruft --max-cruft-size=10M
Enumerating objects: 538262, done.
Counting objects: 100% (538262/538262), done.
Delta compression using up to 20 threads
Compressing objects: 100% (103507/103507), done.
Writing objects: 100% (538262/538262), done.
Total 538262 (delta 432204), reused 538262 (delta 432204), pack-reused 0
Enumerating cruft objects: 538362, done.
Counting objects: 100% (100/100), done.
Delta compression using up to 20 threads
Compressing objects: 100% (100/100), done.
Writing objects: 100% (100/100), done.
Total 100 (delta 0), reused 0 (delta 0), pack-reused 0

$ ls -la .git/objects/pack/pack-*.mtimes
-r--r--r-- 1 ttaylorr ttaylorr  88 Nov 14 11:51 .git/objects/pack/pack-01d70a911d700e0344252ba5ab7ac5fa3771d774.mtimes
-r--r--r-- 1 ttaylorr ttaylorr 104 Nov 14 11:51 .git/objects/pack/pack-0cc21d689139a9e69eb51ee62dcbbe3829e2cef8.mtimes
-r--r--r-- 1 ttaylorr ttaylorr 104 Nov 14 11:51 .git/objects/pack/pack-10840deb9d008097e8ed3dcc837a47afc2229d8b.mtimes
-r--r--r-- 1 ttaylorr ttaylorr 104 Nov 14 11:51 .git/objects/pack/pack-1b8ea5945b67ce16403d3e9c7f98a31b0a19050e.mtimes
-r--r--r-- 1 ttaylorr ttaylorr 104 Nov 14 11:51 .git/objects/pack/pack-2417efa0e79eb87692c2247ae366ce3e5c1c805d.mtimes
-r--r--r-- 1 ttaylorr ttaylorr 104 Nov 14 11:51 .git/objects/pack/pack-3c38ef91837c7e71c24756706adf509075afaf89.mtimes
-r--r--r-- 1 ttaylorr ttaylorr 104 Nov 14 11:51 .git/objects/pack/pack-819cdbeed7701ba2ed23f551058ad3ee7932d101.mtimes
-r--r--r-- 1 ttaylorr ttaylorr 104 Nov 14 11:51 .git/objects/pack/pack-a6a0275bf2e18f819b5d23900d033ce274d24ddf.mtimes
```

The option works by combining existing cruft packs together in order from smallest to largest, keeping track of their combined size at each step. If the current size can grow to accommodate the next largest cruft pack while still staying below the threshold value, the cruft packs are combined, along with any new unreachable objects not yet packed into a cruft pack. If the resulting cruft pack happens to be too large (for example, because there were a large number of unreachable objects not yet packed into cruft packs), any spill-over will be split into a separate pack and combined in the next git repack invocation.
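
As a toy model of the selection step described above (not Git’s actual code), the following sketch walks a list of made-up cruft pack sizes from smallest to largest and combines them for as long as the running total stays below the threshold. The sizes and the threshold are hypothetical, and the sketch leaves out new unreachable objects and the spill-over handling.

```shell
# Hypothetical sizes (in MiB) of existing cruft packs, and a hypothetical threshold.
sizes='7 1 4 2 9'
threshold=10

combined=0
merged=''
untouched=''
for size in $(printf '%s\n' $sizes | sort -n); do
  # Keep combining only while every smaller pack so far has fit
  # and adding this one still stays below the threshold.
  if [ -z "$untouched" ] && [ $((combined + size)) -lt "$threshold" ]; then
    combined=$((combined + size))
    merged="${merged:+$merged }$size"
  else
    untouched="${untouched:+$untouched }$size"
  fi
done

echo "combine: $merged (total $combined MiB)"   # combine: 1 2 4 (total 7 MiB)
echo "leave as-is: $untouched"                  # leave as-is: 7 9
```

In this model the two largest packs are never rewritten, which is where the I/O savings come from.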

This new option will save significant I/O time when repacking repositories with large numbers of unreachable objects by no longer requiring Git to rewrite the entire cruft pack during each repack operation. To try it out for yourself, run the following in your repository today:

```
$ git repack -d --cruft --cruft-expiration=<date> --max-cruft-size=<N>
```

[source, source]

Split repositories with object filters

Another frequently mentioned feature in this series is Git’s “partial clone” mechanism, which allows interacting with a repository containing a limited subset of its objects.

For instance, you can ask for a “tree-less” clone of the git/git repository by running the following:

```
$ git clone --filter=tree:0 git@github.com:git/git.git
```

The resulting clone will contain only the necessary blobs and trees to check out the most recent commit in the repository (along with all of the historical commit objects, which are relatively small by comparison). This allows you to get started working in a large repository quickly by only asking for the parts that you need. Git will fault in any missing objects on demand in the future instead of loading them all up front. If you are unfamiliar with partial clones or want to learn more about their internals, you can read the guide, Get up to speed with partial clone and shallow clone.

But what if you want to adjust the filter you used to clone your repository? Say, for instance, that you want to remove all large blobs from your local copy and off-load them elsewhere? Previously, the only way to do this was by re-cloning your repository from the remote with the new filter, and then bringing over all of your changes.

In Git 2.43, git repack learned a pair of new options, --filter and --filter-to, which repack your repository according to an object filter specification and optionally move the filtered-out objects elsewhere.

For example, let’s say you’re working with a large repository, and want to filter it down to only blobs smaller than 1MiB. You can now do this easily with git repack’s new options, like so:

```
$ git init --bare ../backup.git
$ git repack -ad --filter='blob:limit=1m' \
   --filter-to=../backup.git/objects/pack/pack
```

and your repository will only retain blobs which are smaller than the specified threshold. As long as your repository was initialized via a partial clone, any missing objects will be faulted in as normal, allowing you to easily off-load or remove unwanted objects as your needs (and filter specification) change.
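
Before running a filtering repack, it can be handy to preview which blobs sit at or above the threshold. The sketch below is one rough way to do that with git cat-file in a throwaway repository; the two blobs are made up for the demo, and the 1048576-byte limit mirrors blob:limit=1m, which omits blobs of at least that many bytes.

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

# Two made-up blobs: a tiny one, and a 2 MiB one filled with zero bytes.
small="$(echo 'small file' | git hash-object -w --stdin)"
large="$(head -c 2097152 /dev/zero | git hash-object -w --stdin)"

# List every object with its type and size, and keep blobs of at least 1 MiB.
limit=1048576
would_filter="$(git cat-file --batch-all-objects \
    --batch-check='%(objecttype) %(objectsize) %(objectname)' |
  awk -v limit="$limit" '$1 == "blob" && $2 >= limit { print $3 }')"

echo "blobs at or above the limit: $would_filter"
```

This only inspects objects already present in the local repository, so treat it as an estimate of what the filter would move rather than a guarantee.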

[source]

Have you ever found yourself spelunking through a Git repository’s history, and noticed commits like these? When you want to revert a commit, you can do so by running git revert <commit>, and Git will apply the opposite of whatever changes were found in <commit> in a new commit with the subject Revert "<commit>".
But what happens if you decide to revert that? In historical versions of Git, the same rule would be applied, resulting in a commit message like Revert "Revert "fix bug"" (producing all >1M commits from the search above!). Though technically correct, these double-reverts produce commit messages that are somewhat cumbersome to read.

Beginning in Git 2.43, Git will realize when it’s about to perform a double-revert, and instead produce the much more pleasing message:
```
$ git revert --no-edit HEAD >/dev/null
$ git revert --no-edit HEAD >/dev/null
$ git log --oneline
a300922 (HEAD -> main) Reapply "fix bug"
0050730 Revert "fix bug"
b290810 fix bug
[...]
```
If you decide to revert for a third time, Git will produce a commit message like Revert "Reapply "fix bug"", causing the length of the commit message to grow at a much more reasonable rate over many reverts.
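
If you would like to try this without touching a real repository, here is a minimal sketch that sets up a throwaway repository and reverts the same change twice. The file name, commit messages, and identity are placeholders. On Git 2.43 or newer the top subject should read Reapply "fix bug"; older versions will print the nested Revert form instead.

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
# Placeholder identity so that commits work in a fresh environment.
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.com'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.com'

echo 'base' >file.txt
git add file.txt
git commit -q -m 'initial commit'

echo 'fixed' >file.txt
git commit -q -am 'fix bug'

# Revert the fix, then revert the revert.
git revert --no-edit HEAD >/dev/null
git revert --no-edit HEAD >/dev/null

# Print the subject lines, newest first.
subjects="$(git log --format=%s)"
echo "$subjects"
```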

[source]

If you’ve worked with Git in a mailing-list workflow, you are likely aware of Git’s format-patch tool that prepares patches for submission via e-mail.
A lesser-known feature of format-patch is its --subject-prefix option, which allows changing the standard [PATCH N/M]: subject line to instead begin with an arbitrary prefix. In effect, this allows users to replace “PATCH” with a string of their choice. Kernel developers sometimes use this feature to designate which sub-system their patch is for, for example with --subject-prefix="PATCH bpf-next".

But what if you also pass the --rfc option, which changes the subject to instead begin with [RFC PATCH N/M] ...? In previous versions of Git, this would overwrite any custom subject prefix you wrote earlier with --subject-prefix, meaning that git format-patch --subject-prefix="PATCH bpf-next" --rfc would confusingly produce a patch whose subject begins with [RFC PATCH ...]. The only way to get the desired effect was to invoke format-patch with --subject-prefix="RFC PATCH bpf-next".

In Git 2.43, the --rfc and --subject-prefix options work together, meaning that you can now type:
```
$ git format-patch --subject-prefix="PATCH bpf-next" --rfc
```
and get an e-mail whose subject begins with [RFC PATCH bpf-next ...] as you had intended.
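
To check how your own Git combines the two options, a quick sketch like the following prints just the Subject: header of a generated patch. The repository, commits, and identity are placeholders for the demo; on Git 2.43 or newer the custom prefix should be preserved alongside RFC, while older versions fall back to plain [RFC PATCH].

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
# Placeholder identity so that commits work in a fresh environment.
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.com'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.com'

echo 'base' >file.txt
git add file.txt
git commit -q -m 'initial commit'

echo 'feature' >file.txt
git commit -q -am 'add feature'

# Format the most recent commit to stdout and keep only its Subject: header.
subject="$(git format-patch -1 --stdout --subject-prefix='PATCH bpf-next' --rfc |
  grep -m1 '^Subject:')"
echo "$subject"
```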

[source]

You may have noticed that modern versions of Git have additional "decorations" when viewing the output of git log, annotating each commit with the branches and tags that refer to it, like so:
```
$ git log --oneline
e0939bec27 (HEAD -> master, origin/master, origin/HEAD) RelNotes: minor wording fixes in 2.43.0 release notes
dadef801b3 (tag: v2.43.0-rc1) Git 2.43-rc1
8ed4eb7538 Merge branch 'tb/rev-list-unpacked-fix'
[...]
```
where the text between the parentheses is the set of decorations.

If you've ever scripted around the output of git log, you are likely familiar with its --format option, which allows you to customize the output of each line. For instance, the --oneline option in the above example causes Git to format each commit using an abbreviated portion of its hash, as well as the title line of each commit message.

But what about the decorations? In previous versions of Git, it was not possible to specify custom format options that simulate the decorations when using a custom format specifier, like git log --format='%cr (%h) %s'.

Now, you can add decorations when using custom git log formats with the new %(decorate) placeholder, producing output like this:
```
$ git log --format='%cr%(decorate) (%h) %s'
3 days ago (HEAD -> master, origin/master, origin/HEAD) (e0939bec27) RelNotes: minor wording fixes in 2.43.0 release notes
7 days ago (tag: v2.43.0-rc1) (dadef801b3) Git 2.43-rc1
7 days ago (8ed4eb7538) Merge branch 'tb/rev-list-unpacked-fix'
```
For those who want to spruce up their terminal output even further, the %(decorate) placeholder accepts a handful of optional modifiers for specifying the prefix, suffix, separators, and more.
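
As a sketch of what those modifiers look like in practice, the example below swaps the surrounding parentheses for square brackets and separates ref names with a vertical bar, using the prefix, suffix, and separator options from the git log documentation. The repository, tag, and identity are placeholders; Git versions older than 2.43 do not know the placeholder and will print it literally.

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
# Placeholder identity so that commits work in a fresh environment.
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.com'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.com'

echo 'hello' >file.txt
git add file.txt
git commit -q -m 'tagged commit'
git tag v1.0

# Decorations wrapped in [ ... ] and separated by " | " instead of ", ".
decorated="$(git log -1 \
  --format='%h%(decorate:prefix= [,suffix=],separator= | ) %s')"
echo "$decorated"
```

Note that option values cannot contain a literal comma or closing parenthesis; the documentation suggests %x2C and %x29 for those.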

[source]

While we're talking about custom formats, git for-each-ref also learned some new --format-related tricks. For custom format specifiers like %(authorname), %(committeremail), and so on, you can now ask git for-each-ref to apply any .mailmap rules you have specified in your repository.

This makes it possible to apply any email or name changes specified in the .mailmap ahead of time when formatting output from for-each-ref instead of having to post-process the results.
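
Here is a small sketch of what that looks like, using the :mailmap option described in the for-each-ref documentation. The names, addresses, and repository are made up for the demo. On Git 2.43 or newer the output should show the mapped name and address; on older versions the command fails, so the sketch prints a short note instead.

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
# Commit under a made-up "old" identity.
export GIT_AUTHOR_NAME='Old Name' GIT_AUTHOR_EMAIL='old@example.com'
export GIT_COMMITTER_NAME='Old Name' GIT_COMMITTER_EMAIL='old@example.com'

echo 'hello' >file.txt
git add file.txt
git commit -q -m 'initial commit'

# Map the old address to a new name and address.
printf '%s\n' 'New Name <new@example.com> <old@example.com>' >.mailmap

# Ask for-each-ref to apply the mailmap while formatting each branch.
mapped="$(git for-each-ref \
    --format='%(authorname:mailmap) %(authoremail:mailmap)' refs/heads 2>/dev/null ||
  echo 'this Git does not understand :mailmap (2.43 or newer is needed)')"
echo "$mapped"
```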

This new feature was implemented by Kousik Sanagavarapu, who was one of the Git project's Google Summer of Code students this past year. To learn more about the new format specifiers, you can check out the updated documentation. Thanks, Kousik!

[source]

Last but not least, Git's CI system has evolved in a couple of important ways useful for both existing and new developers interested in working on Git.

First, Git learned to cancel in-progress CI runs when new pushes are made to branches which currently have CI checks in progress. This can save a significant amount of CI usage and runtime, particularly in scenarios where there may be frequent force-pushing.

Git also learned how to use and report results to Coverity, a static analysis tool built by Synopsys. Individuals can configure their repository to scan and report results to their personal Coverity account, providing a detailed analysis of any potential bugs or security vulnerabilities. As a result, developers now have more tools to ensure that new features in Git are more secure from the moment they are introduced.

[source, source]

The rest of the iceberg

That's just a sample of changes from the latest release. For more, check out the release notes for 2.43, or any previous version in the Git repository.

Written by

Taylor Blau is a Staff Software Engineer at GitHub where he works on Git.
