Highlights from Git 2.21

Image of Taylor Blau

The open source Git project just released Git 2.21 with features and bug fixes from over 60 contributors. We last caught up with you on the latest Git releases when 2.19 was released. Here’s a look at some of the most interesting features and changes introduced since then.

Human-readable dates with --date=human

As part of its output, git log displays the date each commit was authored. Without making an alternate selection, timestamps will display in Git’s “default” format (for example, “Tue Feb 12 09:00:33 2019 -0800”).

That’s very precise, but a lot of those details are things that you might already know, or don’t care about. For example, if the commit happened today, you already know what year it is. Likewise, if a commit happened seven years ago, you don’t care which second it was authored. So what do you do? You could use --date=relative, which would give you output like 6 days ago, but often, you want to relate “six days ago” to a specific event, like “my meeting last Wednesday.” So, was six days ago Tuesday, or was it Wednesday that you were interested in?

Git 2.21 introduces a new date format that tells you exactly when something occurred with just the right amount of detail: --date=human. Here’s how git log looks with the new format:

git log --date=human example

That’s both more accurate than --date=relative and easier to consume than the full weight of --date=default.

But what about when you’re scripting? Here, you might want to frequently switch between the human and machine-readable formats while putting together a pipeline. Git 2.21 has an option suitable for this setting, too: --date=auto:human. When printing output to a pager, if Git is given this option it will act as if it had been passed --date=human. When otherwise printing output to a non-pager, Git will act as if no format had been given at all. If human isn’t quite your speed, you can combine auto with any other format of your choosing, like --date=auto:relative.

git log --date=auto:human example[source]

Detecting case-insensitive path collisions

One commonly asked Git question is, “After cloning a repository, why does git status report some of the files as modified?” Quite often, the answer is that the repository contains a tree which cannot be represented on your file system. For instance, if it contains both file as well as FILE and your file system is case-insensitive, Git can only checkout one of those files. Worse, Git doesn’t actually detect this case during the clone; it simply writes out each path, unaware that the file system considers them to be the same file. You only find out something has gone wrong when you see a mystery modification.

The exact rules for when this occurs will vary from system to system. In addition to “folding” what we normally consider upper and lowercase characters in English, you may also see this from language-specific conversions, non-printing characters, or Unicode normalization.

In Git 2.20, git clone now detects and reports colliding groups during the initial checkout, which should remove some of the confusion. Unfortunately, Git can’t actually fix the problem for you. What the original committer put in the repository can’t be checked out as-is on your file system. So if you’re thinking about putting files into a multi-platform project that differ only in case, the best advice is still: don’t.

git
[source]

Performance improvements and other bits

Behind the scenes, a lot has changed over the last couple of Git releases, too. We’re dedicating this section to overview a few of these changes. Not all of them will impact your Git usage day-to-day, but some will, and all of the changes are especially important for server administrators.

Multi-pack indexes

Git stores objects (e.g., representations of the files, directories, and more that make up your Git repository) in both the “loose” and “packed” formats. A “loose” object is a compressed encoding of an object, stored in a file. A “packed” object is stored in a packfile, which is a collection of objects, written in terms of deltas of one another.

Because it can be costly to rewrite these packs every time a new object is added to the repository, repositories tend to accumulate many loose objects or individual packs over time. Eventually, these are reconciled during a “repack” operation. However, this reconciliation is not possible for larger repositories, like the Windows repository.

Instead of repacking, Git can now create a multi-pack index file, which is a listing of objects residing in multiple packs, removing the need to perform expensive repacks (in many cases).

[source]

Delta islands

An important optimization for Git servers is that the format for transmitted objects is the same as the heavily-compressed on-disk packfiles. That means that in many cases, Git can serve repositories to clients by simply copying bytes off disk without having to inflate individual objects.

But sometimes this assumption breaks down. Objects on disk may be stored as “deltas” against one another. When two versions of a file have similar content, we might store the full contents of one version (the “base”), but only the differences against the base for the other version. This creates a complication when serving a fetch. If object A is stored as a delta against object B, we can only send the client our on-disk version of A if we are also sending them B (or if we know they already have B). Otherwise, we have to reconstruct the full contents of A and re-compress it.

This happens rarely in many repositories where clients clone all of the objects stored by the server. But it can be quite common when multiple distinct but overlapping sets of objects are stored in the same packfile (for example, due to repository forks or unmerged pull requests). Git may store a delta between objects found only in two different forks. When someone clones one of the forks, they want only one of the objects, and we have to discard the delta.

Git 2.20 solves this by introducing the concept of “delta islands. Repository administrators can partition the ref namespace into distinct “islands”, and Git will avoid making deltas between islands. The end result is a repository which is slightly larger on disk but is still able to serve client fetches much more cheaply.

[source 1, source 2]

Delta reuse with bitmaps

We already discussed the importance of reusing on-disk deltas when serving fetches, but how do we know when the other side has the base object they’ need to use the delta we send them? If we’re sending them the base, too, then the answer is easy. But if we’re not, how do we know if they have it?

That answer is deceptively simple: the client will have already told us which commits it has (so that we don’t bother sending them again). If they claim to have a commit which contains the base object, then we can re-use the delta. But there’s one hitch: we not only need to know about the commit they mentioned, but also the entire object graph. The base may have been part of a commit hundreds or thousands of commits deep in the history of the project.

Git doesn’t traverse the entire object graph to check for possible bases because it’s too expensive to do so. For instance, walking the entire graph of a Linux kernel takes roughly 30 seconds.

Fortunately, there’s already a solution within Git: reachability bitmaps. Git has an optional on-disk data structure to record the sets of objects “reachable” from each commit. When this data is available, we can query it to quickly determine whether the client has a base object. This results in the server generating smaller packs that are produced more quickly for an overall faster fetch experience.

[source]

Custom alternates reference advertisement

Repository alternates are a tool that server administrators have at their disposal to reduce redundant information. When two repositories are known to share objects (like a fork and its parent), the fork can list the parent as an “alternate”, and any objects the fork doesn’t have itself, it can look for in its parent. This is helpful since we can avoid storing twice the vast number of objects shared between the fork and parent.

Likewise, a repository with alternates advertises “tips” it has when receiving a push. In other words, before writing from your computer to a remote, that remote will tell you what the tips of its branches are, so you can determine information that is already known by the remote, and therefore use less bandwidth. When a repository has alternates, the tips advertisement is the union of all local and alternate branch tips.

But what happens when computing the tips of an alternate is more expensive than a client sending redundant data? It makes the push so slow that we have disabled this feature for years at GitHub. In Git 2.20, repositories can hook into the way that they enumerate alternate tips, and make the corresponding transaction much faster.

[source]

Tidbits

Now that we’ve highlighted a handful of the changes in the past two releases, we want to share a summary of a few other interesting changes. As always, you can learn more by clicking the “source” link, or reading the documentation or release notes.

  • Have you ever tried to run git cherry-pick on a merge commit only to have it fail? You might have found that the fix involves passing -m1 and moved on. In fact, -m1 says to select the first parent as the mainline, and it replays the relevant commits. Prior to Git 2.21, passing this option on a non-merge commit caused an error, but now it transparently does what you meant. [source]

  • Veteran Git users from our last post might recall that git branch -l establishes a reflog for a newly created branch, instead of listing all branches. Now, instead of doing something you almost certainly didn’t mean, git branch -l will list all of your repository’s branches, keeping in line with other commands that accept -l. [source]

  • If you’ve ever been stuck or forgotten what a certain command or flag does, you might have run git --help (or git -h) to learn more. In Git 2.21, this invocation now follows aliases, and shows the aliased command’s helptext. [source]

  • In repositories with large on-disk checkouts, git status can take a long time to complete. In order to indicate that it’s making progress, the status command now displays a progress bar. [source]

  • Many parts of Git have historically been implemented as shell scripts, calling into tools written in C to do the heavy lifting. While this allowed rapid prototyping, the resulting tools could often be slow due to the overhead of running many separate programs. There are continuing efforts to move these scripts into C, affecting git submodule, git bisect, and git rebase. You may notice rebase in particular being much faster, due to the hard work of the Summer of Code students, Pratik Karki and Alban Gruin.

  • The -G option tells git log to only show commits whose diffs match a particular pattern. But until Git 2.21, it was searching binary files as if they were text, which made things slower and often produced confusing results. [source]

That’s all for now

We went through a few of the changes that have happened over the last couple of versions, but there’s a lot more to discover. Read the release notes for 2.21, or review the release notes for previous versions in the Git repository.

Catch early bird pricing for GitHub Satellite

Sync up with us and leading developers from around the world in Berlin, May 22-23, and get €100 off regular-priced tickets until April 11.

Get tickets

Join us at Maintainerati

Calling all maintainers: Unite at Maintainerati, a one-day unconference to gather, present, and discuss the day after GitHub Satellite.

RSVP