Measuring the many sizes of a Git repository

Is your Git repository bursting at the seams? git-sizer is a new open source tool that can tell you when your repo is getting too big. git-sizer computes various Git repository size metrics and alerts you to any that might cause problems or inconvenience.

What is “big”?

When people talk about the size of a Git repository, they usually mean the total space Git needs to store the project’s history in its internal, highly compressed format: basically, the amount of disk space used by the .git directory. This number is easy to measure. It’s also useful, because it indicates how long it takes to clone the repository and how much disk space it will use.
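As a rough illustration of how easy this number is to measure, the sketch below prints the on-disk size of the .git directory alongside Git's own object statistics. The function name `repo_disk_usage` is made up for this example. It relies only on the standard `du`, `git rev-parse --git-dir`, and `git count-objects -v -H` commands.

```shell
# Hypothetical helper: report how much disk space a repository's history uses.
# Argument: path to a local clone (defaults to the current directory).
repo_disk_usage() (
    cd "${1:-.}" || exit 1
    # Size of the .git directory as seen by the filesystem.
    du -sh "$(git rev-parse --git-dir)"
    # Git's own view: loose objects, packs, and their sizes (human-readable).
    git count-objects -v -H
)
```

Calling `repo_disk_usage /path/to/your/clone` (a placeholder path) prints one line from `du`, followed by fields such as `count`, `size`, and `size-pack`.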

At GitHub we host over 78 million Git repositories, so we’ve seen it all. What we find is that many of the repositories that tax our servers the most are not unusually big. The most challenging repositories to host are often those that have an unusual internal layout that Git is not optimized for.

Many properties aside from overall size can make a Git repository unwieldy. For example:

  • It could contain an astronomical number of Git objects (which are used to store the repository’s history)

  • The total size of the Git objects could be huge when uncompressed (even though their size is reasonable when compressed)

  • When the repository is checked out, the size of the working copy might be gigantic

  • The repository could have an unreasonable number of commits in its history

  • It could include enormous individual files or directories

  • It could contain large files/directories that have been modified very many times

  • It could contain too many references (branches, tags, etc.)

Any of these properties, if taken to an extreme, can cause certain Git operations to perform poorly. And surprisingly, a repository can be grossly oversized in almost any of these ways without using a worrying amount of disk space.
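A few of these properties can be spot-checked by hand with ordinary Git commands. The sketch below is not how git-sizer works internally; it is only a quick approximation, and the function name `repo_quick_counts` is invented for this example. It assumes the repository has at least one commit, so that `HEAD` resolves.

```shell
# Hypothetical helper: rough counts for a few of the properties listed above.
# Argument: path to a local clone (defaults to the current directory).
repo_quick_counts() (
    cd "${1:-.}" || exit 1
    # Commits reachable from any reference.
    echo "commits: $(git rev-list --all --count)"
    # Reachable objects of every kind (commits, trees, blobs, tags).
    echo "objects: $(git rev-list --all --objects | wc -l | tr -d ' ')"
    # Branches, tags, and any other references.
    echo "references: $(git for-each-ref | wc -l | tr -d ' ')"
    # Files that a checkout of HEAD would contain.
    echo "files: $(git ls-tree -r --name-only HEAD | wc -l | tr -d ' ')"
)
```

These counts say nothing about uncompressed sizes or about how often large files have been modified, which is where a dedicated tool is more useful.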

It also makes sense to consider whether the size of your repository is commensurate with the type and scope of your project. The Linux kernel has been developed over 25 years by thousands of contributors, so it is not at all alarming that it has grown to 1.5 GB. But if your weekend class assignment is already 1.5 GB, that’s probably a strong hint that you could be using Git more effectively!

Sizing up your repository

You can use git-sizer to measure many size-related properties of your repository, including all of those listed above. To do so, you’ll need a local clone of the repository and a copy of the Git command-line client installed and in your execution PATH. Then:

  1. Install git-sizer
  2. Change to the directory containing your repository
  3. Run git-sizer. You can learn about its command-line options by running git-sizer --help, but no options are required
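Steps 2 and 3 can be wrapped in a small script. The sketch below assumes git-sizer has already been installed (step 1 is not covered here) and checks that both programs are on the PATH before running. The function name `run_git_sizer` is hypothetical. The only git-sizer option it uses is `--verbose`, which appears in the example below.

```shell
# Hypothetical wrapper: check prerequisites, then run git-sizer in a clone.
# Argument: path to a local clone (defaults to the current directory).
run_git_sizer() (
    # git-sizer needs the Git command-line client on the PATH.
    command -v git >/dev/null 2>&1 || { echo "git not found on PATH" >&2; exit 1; }
    command -v git-sizer >/dev/null 2>&1 || { echo "git-sizer not found on PATH" >&2; exit 1; }
    # Step 2: change to the directory containing the repository.
    cd "${1:-.}" || exit 1
    # Step 3: run git-sizer; --verbose prints every metric, not only the notable ones.
    git-sizer --verbose
)
```

A call such as `run_git_sizer /path/to/your/clone` (a placeholder path) either prints the report or stops with a message naming the missing program.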

git-sizer will gather statistics about all of the references and reachable Git objects in your repository and output a report. For example, here is the verbose output for the Linux kernel repository:

$ git-sizer --verbose
Processing blobs: 1652370
Processing trees: 3396199
Processing commits: 722647
Matching commits to trees: 722647
Processing annotated tags: 534
Processing references: 539
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    |   723 k   | *                              |
|   * Total size               |   525 MiB | **                             |
| * Trees                      |           |                                |
|   * Count                    |  3.40 M   | **                             |
|   * Total size               |  9.00 GiB | ****                           |
|   * Total tree entries       |   264 M   | *****                          |
| * Blobs                      |           |                                |
|   * Count                    |  1.65 M   | *                              |
|   * Total size               |  55.8 GiB | *****                          |
| * Annotated tags             |           |                                |
|   * Count                    |   534     |                                |
| * References                 |           |                                |
|   * Count                    |   539     |                                |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size         [1] |  72.7 KiB | *                              |
|   * Maximum parents      [2] |    66     | ******                         |
| * Trees                      |           |                                |
|   * Maximum entries      [3] |  1.68 k   |                                |
| * Blobs                      |           |                                |
|   * Maximum size         [4] |  13.5 MiB | *                              |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      |   136 k   |                                |
| * Maximum tag depth      [5] |     1     | *                              |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [6] |  4.38 k   | **                             |
| * Maximum path depth     [7] |    14     | *                              |
| * Maximum path length    [8] |   134 B   | *                              |
| * Number of files        [9] |  62.3 k   | *                              |
| * Total size of files    [9] |   747 MiB |                                |
| * Number of symlinks    [10] |    40     |                                |
| * Number of submodules       |     0     |                                |

[1]  91cc53b0c78596a73fa708cceb7313e7168bb146
[2]  2cde51fbd0f310c8a2c5f977e665c0ac3945b46d
[3]  4f86eed5893207aca2c2da86b35b38f2e1ec1fc8 (refs/heads/master:arch/arm/boot/dts)
[4]  a02b6794337286bc12c907c33d5d75537c240bd0 (refs/heads/master:drivers/gpu/drm/amd/include/asic_reg/vega10/NBIO/nbio_6_1_sh_mask.h)
[5]  5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c (refs/tags/v2.6.11)
[6]  1459754b9d9acc2ffac8525bed6691e15913c6e2 (589b754df3f37ca0a1f96fccde7f91c59266f38a^{tree})
[7]  78a269635e76ed927e17d7883f2d90313570fdbc (dae09011115133666e47c35673c0564b0a702db7^{tree})
[8]  ce5f2e31d3bdc1186041fdfd27a5ac96e728f2c5 (refs/heads/master^{tree})
[9]  532bdadc08402b7a72a4b45a2e02e5c710b7d626 (e9ef1fe312b533592e39cddc1327463c30b0ed8d^{tree})
[10] f29a5ea76884ac37e1197bef1941f62fda3f7b99 (f5308d1b83eba20e69df5e0926ba7257c8dd9074^{tree})

The git-sizer project page explains the output in detail. The most interesting thing to look at is the “level of concern” column, which gives a rough indication of which parameters are high compared with a typical, modest-sized Git repository. A lot of asterisks would suggest that your repository is stretching Git beyond its sweet spot, and that some Git operations might be noticeably slower than usual. If you see exclamation marks instead of asterisks in this column, then you likely have a problem that needs addressing.
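For a long report it can help to pull out only the rows that stand out. The sketch below reads a git-sizer table on standard input and prints the rows whose "level of concern" column contains exclamation marks or at least a given number of asterisks. The function name `flag_concerns` and the default threshold of 4 are arbitrary choices for this example, not something git-sizer defines, and the parsing assumes the pipe-separated layout shown above.

```shell
# Hypothetical filter: print table rows with a high level of concern.
# Argument: minimum number of asterisks to report (defaults to 4).
flag_concerns() {
    awk -F'|' -v min="${1:-4}" '
        NF >= 4 {
            # The fourth pipe-separated field is the "Level of concern" column.
            level = $4
            gsub(/[ \t]/, "", level)
            # Exclamation marks always count; asterisks count at or above the threshold.
            if (level ~ /^!+$/ || (level ~ /^\*+$/ && length(level) >= min))
                print
        }'
}
```

For example, `git-sizer --verbose | flag_concerns 4` applied to the kernel report above would keep only the four rows marked with four or more asterisks.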

As you can see from the output, even though the Linux kernel is a big project by most standards, it is fairly well-balanced and none of its parameters have extreme values. Some Git operations will certainly take longer than they would in a small repository, but not unreasonably, and not out of proportion to the scope of the project. The kernel project is comfortably manageable in Git.

If the git-sizer analysis flags any problems in your repository, we suggest referring again to the git-sizer project page, where you will find many suggestions and resources for improving the structure of your Git repository. By far the easiest time to improve your repository structure is when you are just beginning to use Git (for example, when migrating a repository from another version control system), before many developers have started cloning it and contributing to it. And keep in mind that repositories only grow over time, so it is preferable to establish good practices early.

Summary

Git is famous for its speed and ability to deal with even quite large development projects. But every system has its limits, and if you push its limits too hard, your experience might suffer. git-sizer can help you evaluate whether your Git repository will live happily within Git, or whether it would be advisable to slim it down to make your Git experience as delightful as it can be.

Getting involved: git-sizer is open source! If you’d like to report bugs or contribute new features, head over to the project page.
