To celebrate this most recent release, here is GitHub’s look at some of the most interesting features and changes introduced since last time.
In Git 2.43, git repack learned a couple of new tricks. If you’re unfamiliar, git repack is used to reorganize the packs in your repository. It has a number of different modes, many of which we’ve discussed in this series before:
- Combine all unpacked objects into a single pack, and then delete the unpacked copies (git repack -d).
- Repack new objects into packs which form a geometric progression of object counts (with git repack --geometric=<n>).
- Generate cruft packs to store unreachable objects, or move expired objects to a separate directory (with git repack --cruft or a related option).
In this release, that list got a little longer: git repack now supports working with multiple cruft packs, as well as splitting the contents of a repository by an object filter. For more details, read on!
Long-time readers may be familiar with our discussion of cruft packs. But if you’re new around here, or could use a brief refresher, here’s a quick overview to get you oriented. Cruft packs are used to store groups of unreachable objects together while they wait to be removed, or “pruned” from a repository.
In the past, Git would perform garbage collection (via git gc) and split a repository’s objects into three different categories:
- Reachable objects, which are the set of objects you could collect by starting at each of a repository’s references (for example, its branches and tags) and recursively exploring object links (moving from a commit to its parent(s), a tree to its sub-tree(s), etc.). These objects must remain in a repository following any garbage collection operation.
- Stale unreachable objects, which are the non-reachable objects (that is, any object that you couldn’t get to using the above procedure) that have not been re-written or added to the repository recently (after a configurable cut-off window).
- Fresh unreachable objects, which are the remaining objects not grouped into the other two categories.
Historically, the “fresh unreachable objects” group was left in the repository, with each such object being stored individually instead of packed. This was done so that Git could use the mtime of each loose object as a proxy for tracking the last time that object was written. If an object is only written once, its mtime will be the time that it was added to the repository. If an object is written again, and Git realizes that it already has that object, it will simply update that object’s mtime to the current timestamp with utime(). Only objects with sufficiently old mtime values are eligible to be pruned from the repository.
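The three categories and the mtime rule can be sketched in a few lines of Python. This is an illustrative model rather than Git's implementation: the function name classify_objects, the dictionary of mtimes, and the two-week grace period are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def classify_objects(mtimes, reachable, now, grace=timedelta(weeks=2)):
    """Split object IDs into reachable, stale unreachable, and fresh unreachable.

    mtimes:    dict mapping object ID -> datetime of last write (hypothetical input)
    reachable: set of object IDs found by walking from the references
    grace:     cut-off window; unreachable objects older than this count as stale
    """
    cutoff = now - grace
    result = {"reachable": set(), "stale": set(), "fresh": set()}
    for oid, mtime in mtimes.items():
        if oid in reachable:
            result["reachable"].add(oid)  # must always be kept
        elif mtime < cutoff:
            result["stale"].add(oid)      # old enough to be pruned
        else:
            result["fresh"].add(oid)      # kept until its mtime ages out
    return result


now = datetime(2023, 11, 20)
mtimes = {
    "a1": datetime(2023, 1, 1),    # old, but reachable
    "b2": datetime(2023, 1, 1),    # old and unreachable
    "c3": datetime(2023, 11, 19),  # recently written and unreachable
}
groups = classify_objects(mtimes, reachable={"a1"}, now=now)
print(groups)
```

Rewriting an object that already exists corresponds to bumping its entry in mtimes, which moves it from the stale group back to the fresh one.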
If there are many unreachable objects which were modified too recently to be removed, then Git can run into trouble by creating too many loose objects, leading to performance degradation. To combat this, Git introduced “cruft packs”, which store the unreachable objects that were modified too recently to be pruned together in a single pack, instead of individually as loose objects.
GitHub has used this feature to eliminate a large class of problems that arise from repositories in this state (curious readers can learn more about GitHub’s deployment of cruft packs in our post Scaling Git’s garbage collection).
But there was a remaining drawback of using cruft packs to manage unreachable objects: all of the unreachable objects had to be stored together in a single cruft pack. That means that if a repository has many unreachable objects (especially if it is pruned infrequently), git repack has to spend many I/O cycles rewriting a large cruft pack over and over again, each time producing similar results.
In Git 2.43, this drawback was eliminated with native support for multiple cruft packs.
In particular, Git learned a new --max-cruft-size option to limit the maximum size (in bytes) of each individual cruft pack, allowing you to split the set of unreachable objects in your repository across multiple packs:
$ git repack -d --cruft --max-cruft-size=10M
Enumerating objects: 538262, done.
Counting objects: 100% (538262/538262), done.
Delta compression using up to 20 threads
Compressing objects: 100% (103507/103507), done.
Writing objects: 100% (538262/538262), done.
Total 538262 (delta 432204), reused 538262 (delta 432204), pack-reused 0
Enumerating cruft objects: 538362, done.
Counting objects: 100% (100/100), done.
Delta compression using up to 20 threads
Compressing objects: 100% (100/100), done.
Writing objects: 100% (100/100), done.
Total 100 (delta 0), reused 0 (delta 0), pack-reused 0
$ ls -la .git/objects/pack/pack-*.mtimes
-r--r--r-- 1 ttaylorr ttaylorr 88 Nov 14 11:51 .git/objects/pack/pack-01d70a911d700e0344252ba5ab7ac5fa3771d774.mtimes
-r--r--r-- 1 ttaylorr ttaylorr 104 Nov 14 11:51 .git/objects/pack/pack-0cc21d689139a9e69eb51ee62dcbbe3829e2cef8.mtimes
-r--r--r-- 1 ttaylorr ttaylorr 104 Nov 14 11:51 .git/objects/pack/pack-10840deb9d008097e8ed3dcc837a47afc2229d8b.mtimes
-r--r--r-- 1 ttaylorr ttaylorr 104 Nov 14 11:51 .git/objects/pack/pack-1b8ea5945b67ce16403d3e9c7f98a31b0a19050e.mtimes
-r--r--r-- 1 ttaylorr ttaylorr 104 Nov 14 11:51 .git/objects/pack/pack-2417efa0e79eb87692c2247ae366ce3e5c1c805d.mtimes
-r--r--r-- 1 ttaylorr ttaylorr 104 Nov 14 11:51 .git/objects/pack/pack-3c38ef91837c7e71c24756706adf509075afaf89.mtimes
-r--r--r-- 1 ttaylorr ttaylorr 104 Nov 14 11:51 .git/objects/pack/pack-819cdbeed7701ba2ed23f551058ad3ee7932d101.mtimes
-r--r--r-- 1 ttaylorr ttaylorr 104 Nov 14 11:51 .git/objects/pack/pack-a6a0275bf2e18f819b5d23900d033ce274d24ddf.mtimes
The option works by combining existing cruft packs together in order from smallest to largest, keeping track of their combined size at each step. If the current size can grow to accommodate the next-largest cruft pack while still staying below the threshold value, the cruft packs are combined, along with any new unreachable objects not yet packed into a cruft pack. If the resulting cruft pack happens to be too large (for example, because there were a large number of unreachable objects not yet packed into cruft packs), any spill-over will be split into a separate pack and combined in the next git repack invocation.
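The combining step described above can be modeled roughly as follows. This sketch only covers the choice of which existing cruft packs get merged; it ignores new loose objects and spill-over splitting, and the function name and the plain list of pack sizes are assumptions for illustration rather than Git's internals.

```python
def plan_cruft_repack(existing_sizes, max_cruft_size):
    """Choose which existing cruft packs to combine, smallest first.

    existing_sizes: sizes in bytes of the current cruft packs (hypothetical input)
    max_cruft_size: threshold that the combined pack must stay below
    Returns (sizes_to_combine, sizes_left_alone).
    """
    combined, left_alone = [], []
    total = 0
    for size in sorted(existing_sizes):
        if total + size < max_cruft_size:
            combined.append(size)    # still below the threshold: fold it in
            total += size
        else:
            left_alone.append(size)  # too big to absorb: leave the pack as-is
    return combined, left_alone


MiB = 1024 * 1024
combined, left_alone = plan_cruft_repack(
    [4 * MiB, 1 * MiB, 8 * MiB, 2 * MiB], max_cruft_size=10 * MiB
)
print(combined, left_alone)
```

In this example the 1, 2, and 4 MiB packs are rewritten into one 7 MiB pack, while the 8 MiB pack is left untouched, which is where the I/O savings come from.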
This new option will save significant I/O time when repacking repositories with large numbers of unreachable objects by no longer requiring Git to rewrite the entire cruft pack during each repack operation. To try it out for yourself, run the following in your repository today:
$ git repack -d --cruft --cruft-expiration=<date> --max-cruft-size=<N>
Another frequently mentioned feature in this series is Git’s “partial clone” mechanism, which allows interacting with a repository containing a limited subset of its objects.
For instance, you can ask for a “tree-less” clone of the git/git repository by running the following:
$ git clone --filter=tree:0 git@github.com:git/git.git
The resulting clone will contain only the blobs and trees necessary to check out the most recent commit in the repository (along with all of the historical commit objects, which are relatively small by comparison). This allows you to get started working in a large repository quickly by only asking for the parts that you need. Git will fault in any missing objects on demand in the future instead of loading them all up front. If you are unfamiliar with partial clones or want to learn more about their internals, you can read the guide Get up to speed with partial clone and shallow clone.
But what if you want to adjust the filter you used to clone your repository? Say, for instance, that you want to remove all large blobs from your local copy and off-load them elsewhere? Previously, the only way to do this was by re-cloning your repository from the remote with the new filter, and then bringing over all of your changes.
In Git 2.43, git repack learned a pair of new options, --filter and --filter-to, to repack your repository according to an object filter specification and optionally move the filtered objects elsewhere.
For example, let’s say you’re working with a large repository, and want to filter it down to only blobs smaller than 1MiB. You can now do this easily with git repack’s new options, like so:
$ git init --bare ../backup.git
$ git repack -ad --filter='blob:limit=1m' \
    --filter-to=../backup.git/objects/pack/pack
and your repository will only retain blobs which are smaller than the specified threshold. As long as your repository was initialized via a partial clone, any missing objects will be faulted in as normal, allowing you to easily off-load or remove unwanted objects as your needs (and filter specification) change.
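To make the effect of the blob:limit=1m filter concrete, here is a small Python model of the split. It is a sketch under stated assumptions: the (id, type, size) tuples and both function names are invented for the example, and the real work is of course done by git repack itself.

```python
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_limit(text):
    """Turn a size such as '1m' or '512' into a number of bytes."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def split_by_blob_limit(objects, limit):
    """Partition objects the way a blob size filter would.

    objects: iterable of (object_id, object_type, size_in_bytes) tuples
             (a hypothetical stand-in for a repository's object list)
    Returns (kept_ids, filtered_ids); only blobs at or above the limit
    are filtered out, everything else is kept.
    """
    threshold = parse_limit(limit)
    kept, filtered = [], []
    for oid, kind, size in objects:
        if kind == "blob" and size >= threshold:
            filtered.append(oid)
        else:
            kept.append(oid)
    return kept, filtered


objects = [
    ("c1", "commit", 300),
    ("t1", "tree", 120),
    ("small", "blob", 4096),
    ("big", "blob", 5 * 1024 * 1024),
    ("edge", "blob", 1024 * 1024),
]
kept, filtered = split_by_blob_limit(objects, "1m")
print(kept, filtered)
```

In terms of the command above, the kept list corresponds to what stays in your repository and the filtered list to what ends up in the pack written under ../backup.git.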
- Have you ever found yourself spelunking through a Git repository’s history, and noticed commits like these?
When you want to revert a commit, you can do so by running git revert <commit>, and Git will apply the opposite of whatever changes were found in <commit> in a new commit with a subject of the form Revert "<original subject>".
But what happens if you decide to revert that? In historical versions of Git, the same rule would be applied, resulting in a commit message like Revert "Revert "fix bug"" (producing all >1M commits from the search above!). Though technically correct, these double-reverts produce commit messages that are somewhat cumbersome to read.
Beginning in Git 2.43, Git will realize when it’s about to perform a double-revert, and instead produce the much more pleasing message:
$ git revert --no-edit HEAD >/dev/null
$ git revert --no-edit HEAD >/dev/null
$ git log --oneline
a300922 (HEAD -> main) Reapply "fix bug"
0050730 Revert "fix bug"
b290810 fix bug
[...]
If you decide to revert for a third time, Git will produce a commit message like Revert "Reapply "fix bug"", causing the length of the commit message to grow at a much more reasonable rate over many reverts.
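The naming rule can be written down as a short Python function. This is a sketch of the behavior described above, not a copy of Git's own logic, and the function name revert_subject is made up for the example.

```python
def revert_subject(subject):
    """Return the subject line for a commit that reverts `subject`."""
    prefix = 'Revert "'
    if subject.startswith(prefix) and subject.endswith('"'):
        # Reverting a revert: swap the leading Revert for Reapply
        # and keep the rest of the subject, including its closing quote.
        return 'Reapply "' + subject[len(prefix):]
    # Anything else, including a Reapply, gets wrapped in Revert "...".
    return 'Revert "' + subject + '"'


subject = "fix bug"
history = [subject]
for _ in range(3):
    subject = revert_subject(subject)
    history.append(subject)
print("\n".join(history))
```

Running the loop reproduces the sequence from the example: the original subject, then its revert, then the Reapply form, then a revert of that.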
- If you’ve worked with Git in a mailing-list workflow, you are likely aware of Git’s format-patch tool that prepares patches for submission via e-mail.
A lesser-known feature of format-patch is its --subject-prefix option, which allows changing the standard [PATCH N/M] subject line to instead begin with an arbitrary prefix. In effect, this allows users to replace “PATCH” with a string of their choice. Kernel developers sometimes use this feature to designate which sub-system their patch is for, for example with --subject-prefix="PATCH bpf-next".
But what if you also pass the --rfc option, which changes the subject to instead begin with [RFC PATCH N/M] ...? In previous versions of Git, this would overwrite any custom subject prefix you wrote earlier with --subject-prefix, meaning that git format-patch --subject-prefix="PATCH bpf-next" --rfc would confusingly produce a patch that begins with [RFC PATCH ...]. The only way to get the desired effect was to invoke format-patch with --subject-prefix="RFC PATCH bpf-next".
In Git 2.43, the --rfc and --subject-prefix options work together, meaning that you can now type:
$ git format-patch --subject-prefix="PATCH bpf-next" --rfc
and get an e-mail whose subject begins with [RFC PATCH bpf-next ...] as you had intended.
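The difference between the old and new behavior comes down to whether --rfc replaces the prefix or is prepended to it. The following Python sketch models the new, combined behavior; the function and parameter names are assumptions for illustration and do not correspond to anything in Git's source.

```python
def subject_tag(subject_prefix="PATCH", rfc=False, number=None, total=None):
    """Build the bracketed tag at the start of a patch e-mail subject."""
    # New behavior: RFC is added in front of whatever prefix was requested,
    # instead of discarding a custom prefix.
    tag = ("RFC " if rfc else "") + subject_prefix
    if number is not None and total is not None:
        tag += f" {number}/{total}"
    return f"[{tag}]"


print(subject_tag())
print(subject_tag(rfc=True, number=1, total=3))
print(subject_tag("PATCH bpf-next", rfc=True, number=1, total=3))
```

The last call yields a tag that carries both the RFC marker and the custom sub-system prefix.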
You may have noticed that modern versions of Git have additional "decorations" when viewing the output of git log, annotating each commit with the branches and tags that refer to it, like so:
$ git log --oneline
e0939bec27 (HEAD -> master, origin/master, origin/HEAD) RelNotes: minor wording fixes in 2.43.0 release notes
dadef801b3 (tag: v2.43.0-rc1) Git 2.43-rc1
8ed4eb7538 Merge branch 'tb/rev-list-unpacked-fix'
[...]
where the text between the parentheses is the decoration.
If you've ever scripted around the output of git log, you are likely familiar with its --format option, which allows you to customize the output of each line. For instance, the --oneline option in the above example causes Git to format each commit using an abbreviated portion of its hash, as well as the title line of each commit message.
But what about the decorations? In previous versions of Git, it was not possible to specify custom format options that simulate the decorations when using a custom format specifier, like git log --format='%cr (%h) %s'.
Now, you can add decorations when using custom git log formats with the new %(decorate) placeholder, producing output like this:
$ git log --format='%cr%(decorate) (%h) %s'
3 days ago (HEAD -> master, origin/master, origin/HEAD) (e0939bec27) RelNotes: minor wording fixes in 2.43.0 release notes
7 days ago (tag: v2.43.0-rc1) (dadef801b3) Git 2.43-rc1
7 days ago (8ed4eb7538) Merge branch 'tb/rev-list-unpacked-fix'
For those who want to spruce up their terminal output even further, the %(decorate) placeholder has a handful of optional modifiers for specifying the prefix, suffix, separators, and more.
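To show what the prefix, suffix, and separator modifiers control, here is a small Python model of how a decoration string is assembled. It only mimics the shape of the output shown above; the function name and keyword arguments are illustrative assumptions and are not Git's format syntax.

```python
def decorate(refs, prefix=" (", suffix=")", separator=", "):
    """Join ref names into a decoration; commits without refs get nothing."""
    if not refs:
        return ""
    return prefix + separator.join(refs) + suffix


# Defaults chosen to match the output of the example above.
tagged = "7 days ago" + decorate(["tag: v2.43.0-rc1"]) + " (dadef801b3) Git 2.43-rc1"
plain = "7 days ago" + decorate([]) + " (8ed4eb7538) Merge branch 'tb/rev-list-unpacked-fix'"
print(tagged)
print(plain)

# A customized variant with square brackets and a different separator.
custom = decorate(["HEAD -> master", "origin/master"], prefix=" [", suffix="]", separator=" | ")
print(custom)
```

Note that the leading space belongs to the prefix, which is why a commit with no decorations does not end up with a stray gap in its line.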
While we're talking about custom formats, git for-each-ref also learned some new --format-related tricks. For custom format specifiers such as %(committeremail), you can now ask git for-each-ref to apply any .mailmap rules you have specified in your repository.
This makes it possible to apply any email or name changes specified in the .mailmap ahead of time when formatting output from for-each-ref instead of having to post-process the results.
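As a rough picture of what applying mailmap rules means, the following Python sketch rewrites a name and email pair using a lookup table. The dictionary layout, the function name, and the lower-casing of the lookup key are simplifications chosen for this example; it does not parse a real .mailmap file.

```python
def apply_mailmap(name, email, rules):
    """Map a recorded identity to its preferred name and email.

    rules: dict mapping a recorded email to a (proper_name, proper_email)
           pair; either element may be None to leave that field unchanged.
           This layout is a hypothetical stand-in for .mailmap entries.
    """
    proper_name, proper_email = rules.get(email.lower(), (None, None))
    return proper_name or name, proper_email or email


rules = {"old@example.com": ("Jane Doe", "jane@example.com")}
print(apply_mailmap("jdoe", "Old@Example.com", rules))
print(apply_mailmap("Sam", "sam@example.com", rules))
```

Doing this mapping while formatting is what removes the need for the post-processing step mentioned above.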
This new feature was implemented by Kousik Sanagavarapu, who was one of the Git project's Google Summer of Code students this past year. To learn more about the new format specifiers, you can check out the updated documentation. Thanks, Kousik!
Last but not least, Git's CI system has evolved in a couple of important ways useful for both existing and new developers interested in working on Git.
First, Git learned to cancel in-progress CI runs when new pushes are made to branches which currently have CI checks in progress. This can save a significant amount of CI usage and runtime, particularly in scenarios where there may be frequent force-pushing.
Git also learned how to use and report results to Coverity, a static analysis tool built by Synopsys. Individuals can configure their repository to scan and report results to their personal Coverity account, providing a detailed analysis of any potential bugs or security vulnerabilities. As a result, developers now have more tools to ensure that new features in Git are more secure from the moment they are introduced.