Measuring the many sizes of a Git repository
Is your Git repository bursting at the seams? git-sizer is a new open source tool that can tell you when your repo is getting too big. git-sizer computes various Git…
Is your Git repository bursting at the seams? git-sizer
is a new open source tool that can tell you when your repo is getting too big. git-sizer
computes various Git repository size metrics and alerts you to any that might cause problems or inconvenience.
What is “big”?
When people talk about the size of a Git repository, they often talk about the total size needed by Git to store the project’s history in its internal, highly-compressed format—basically, the amount of disk space used by the .git
directory. This number is easy to measure. It’s also useful, because it indicates how long it takes to clone the repository and how much disk space it will use.
At GitHub we host over 78 million Git repositories, so we’ve seen it all. What we find is that many of the repositories that tax our servers the most are not unusually big. The most challenging repositories to host are often those that have an unusual internal layout that Git is not optimized for.
Many properties aside from overall size can make a Git repository unwieldy. For example:
-
It could contain an astronomical number of Git objects (which are used to store the repository’s history)
-
The total size of the Git objects could be huge when uncompressed (even though their size is reasonable when compressed)
-
When the repository is checked out, the size of the working copy might be gigantic
-
The repository could have an unreasonable number of commits in its history
-
It could include enormous individual files or directories
-
It could contain large files/directories that have been modified very many times
-
It could contain too many references (branches, tags, etc)
Any of these properties, if taken to an extreme, can cause certain Git operations to perform poorly. And surprisingly, a repository can be grossly oversized in almost any of these ways without using a worrying amount of disk space.
It also makes sense to consider whether the size of your repository is commensurate with the type and scope of your project. The Linux kernel has been developed over 25 years by thousands of contributors, so it is not at all alarming that it has grown to 1.5 GB. But if your weekend class assignment is already 1.5 GB, that’s probably a strong hint that you could be using Git more effectively!
Sizing up your repository
You can use git-sizer
to measure many size-related properties of your repository, including all of those listed above. To do so, you’ll need a local clone of the repository and a copy of the Git command-line client installed and in your execution PATH
. Then:
- Install
git-sizer
- Change to the directory containing your repository
- Run
git-sizer
. You can learn about its command-line options by runninggit-sizer --help
, but no options are required
git-sizer
will gather statistics about all of the references and reachable Git objects in your repository and output a report. For example, here is the verbose output for the Linux kernel repository:
$ git-sizer --verbose
Processing blobs: 1652370
Processing trees: 3396199
Processing commits: 722647
Matching commits to trees: 722647
Processing annotated tags: 534
Processing references: 539
| Name | Value | Level of concern |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size | | |
| * Commits | | |
| * Count | 723 k | * |
| * Total size | 525 MiB | ** |
| * Trees | | |
| * Count | 3.40 M | ** |
| * Total size | 9.00 GiB | **** |
| * Total tree entries | 264 M | ***** |
| * Blobs | | |
| * Count | 1.65 M | * |
| * Total size | 55.8 GiB | ***** |
| * Annotated tags | | |
| * Count | 534 | |
| * References | | |
| * Count | 539 | |
| | | |
| Biggest objects | | |
| * Commits | | |
| * Maximum size [1] | 72.7 KiB | * |
| * Maximum parents [2] | 66 | ****** |
| * Trees | | |
| * Maximum entries [3] | 1.68 k | |
| * Blobs | | |
| * Maximum size [4] | 13.5 MiB | * |
| | | |
| History structure | | |
| * Maximum history depth | 136 k | |
| * Maximum tag depth [5] | 1 | * |
| | | |
| Biggest checkouts | | |
| * Number of directories [6] | 4.38 k | ** |
| * Maximum path depth [7] | 14 | * |
| * Maximum path length [8] | 134 B | * |
| * Number of files [9] | 62.3 k | * |
| * Total size of files [9] | 747 MiB | |
| * Number of symlinks [10] | 40 | |
| * Number of submodules | 0 | |
[1] 91cc53b0c78596a73fa708cceb7313e7168bb146
[2] 2cde51fbd0f310c8a2c5f977e665c0ac3945b46d
[3] 4f86eed5893207aca2c2da86b35b38f2e1ec1fc8 (refs/heads/master:arch/arm/boot/dts)
[4] a02b6794337286bc12c907c33d5d75537c240bd0 (refs/heads/master:drivers/gpu/drm/amd/include/asic_reg/vega10/NBIO/nbio_6_1_sh_mask.h)
[5] 5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c (refs/tags/v2.6.11)
[6] 1459754b9d9acc2ffac8525bed6691e15913c6e2 (589b754df3f37ca0a1f96fccde7f91c59266f38a^{tree})
[7] 78a269635e76ed927e17d7883f2d90313570fdbc (dae09011115133666e47c35673c0564b0a702db7^{tree})
[8] ce5f2e31d3bdc1186041fdfd27a5ac96e728f2c5 (refs/heads/master^{tree})
[9] 532bdadc08402b7a72a4b45a2e02e5c710b7d626 (e9ef1fe312b533592e39cddc1327463c30b0ed8d^{tree})
[10] f29a5ea76884ac37e1197bef1941f62fda3f7b99 (f5308d1b83eba20e69df5e0926ba7257c8dd9074^{tree})
The git-sizer
project page explains the output in detail. The most interesting thing to look at is the “level of concern” column, which gives a rough indication of which parameters are high compared with a typical, modest-sized Git repository. A lot of asterisks would suggest that your repository is stretching Git beyond its sweet spot, and that some Git operations might be noticeably slower than usual. If you see exclamation marks instead of asterisks in this column, then you likely have a problem that needs addressing.
As you can see from the output, even though the Linux kernel is a big project by most standards, it is fairly well-balanced and none of its parameters have extreme values. Some Git operations will certainly take longer than they would in a small repository, but not unreasonably, and not out of proportion to the scope of the project. The kernel project is comfortably manageable in Git.
If the git-sizer
analysis flags up any problems in your repository, we suggest referring again to the git-sizer
project page, where you will find many suggestions and resources for improving the structure of your Git repository. Please note that by far the easiest time to improve your repository structure is when you are just beginning to use Git, for example when migrating a repository from another version control system, before a lot of developers have started cloning and contributing to the repository. And keep in mind that repositories only grow over time, so it is preferable to establish good practices early.
Summary
Git is famous for its speed and ability to deal with even quite large development projects. But every system has its limits, and if you push its limits too hard, your experience might suffer. git-sizer
can help you evaluate whether your Git repository will live happily within Git, or whether it would be advisable to slim it down to make your Git experience as delightful as it can be.
Getting involved: git-sizer
is open source! If you’d like to report bugs or contribute new features, head over to the project page.
Written by
Related posts
Introducing Annotated Logger: A Python package to aid in adding metadata to logs
We’re open sourcing Annotated Logger, a Python package that helps make logs searchable with consistent metadata.
Boost your CLI skills with GitHub Copilot
Want to know how to take your terminal skills to the next level? Whether you’re starting out, or looking for more advanced commands, GitHub Copilot can help us explain and suggest the commands we are looking for.
Beginner’s guide to GitHub: Setting up and securing your profile
As part of the GitHub for Beginners guide, learn how to improve the security of your profile and create a profile README. This will let you give your GitHub account a little more personality.