Update on the future stability of source code archives and hashes

A look at what happened on January 30, what measures we’re putting in place to prevent surprises, and how we’ll handle future changes.

On January 30, 2023, GitHub deployed a change which slightly altered the compression settings on source code downloads. This change had unforeseen consequences for a number of communities, and after they let us know, we rolled the change back. We’d like to explain what happened, what measures we’re putting in place to prevent surprises, and how we’ll handle future changes.

Here’s what happened

Source downloads on GitHub depend on Git’s archive command. Because of the volume of data on GitHub, we don’t keep git archive results permanently. They’re cached for a time, then deleted and recreated if requested again. This strikes a good balance between making releases, tags, and even arbitrary commits available without ballooning our storage needs to an unsustainable level.

On January 30, we deployed Git 2.38 to the service that powers source downloads. This version of Git changed the default compression method used for git archive generation from the external gzip program to an internal implementation. Although the files contained in the archive were identical, the resulting small differences in compression output meant that the byte layout of the archive itself changed. This in turn meant that any hash or checksum (think SHA256, CRC64, etc.) of the archive also changed.
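To make the effect concrete, here is a minimal sketch in Python. It is not what Git or GitHub runs: the file names and contents are made up, and two gzip compression levels stand in for "two different compressors". It packs the same files into one tar stream, compresses that stream twice, and compares both the decompressed contents and the SHA256 of the compressed bytes.

```python
import gzip
import hashlib
import io
import tarfile


def make_tar(files):
    """Build an uncompressed tar with fixed metadata so the bytes are repeatable."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = 0  # fixed timestamp keeps the tar itself deterministic
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def sha256_hex(data):
    """Return the SHA256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical repository contents, generated so the data is compressible.
source = "".join(f"line {i}: value {i * i % 97}\n" for i in range(2000)).encode()
tar_bytes = make_tar({"project/README": b"demo\n", "project/main.txt": source})

# Same input, two compressor settings (mtime=0 keeps the gzip header timestamp fixed).
archive_a = gzip.compress(tar_bytes, compresslevel=1, mtime=0)
archive_b = gzip.compress(tar_bytes, compresslevel=9, mtime=0)

same_contents = gzip.decompress(archive_a) == gzip.decompress(archive_b)
same_hash = sha256_hex(archive_a) == sha256_hex(archive_b)
print("identical contents:", same_contents)
print("identical archive hash:", same_hash)
```

Both archives decompress to the same tar stream, yet a hash taken over the compressed file only matches if the compressor produced exactly the same bytes.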

As it turned out, many communities had built assumptions about source downloads and their hashes. To help ensure reproducibility and/or security, many systems download archives once centrally and record the hash in their own repository. When a user later downloads the archive from GitHub, their client automatically checks the hash of the archive against what was recorded earlier. If there’s a mismatch, the client refuses to proceed (on the assumption that if something has changed, a human needs to determine whether it was tampering, a corrupted download, or something else).
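The check these clients perform can be sketched roughly as follows. This is an illustrative Python version, not the code of any particular package manager; `verify_archive` and `HashMismatchError` are hypothetical names.

```python
import hashlib
import hmac


class HashMismatchError(Exception):
    """Raised when a downloaded archive does not match the recorded hash."""


def verify_archive(data: bytes, recorded_sha256: str) -> None:
    """Refuse to proceed unless the archive bytes match the recorded SHA256."""
    actual = hashlib.sha256(data).hexdigest()
    if not hmac.compare_digest(actual, recorded_sha256.strip().lower()):
        raise HashMismatchError(f"expected {recorded_sha256}, got {actual}")


# The hash is recorded once, centrally, when the archive is first downloaded.
recorded = hashlib.sha256(b"archive bytes as first downloaded").hexdigest()

# Later downloads are checked against it; this call passes silently.
verify_archive(b"archive bytes as first downloaded", recorded)
```

A check like this cannot tell tampering apart from a harmless change in compression: any difference in the archive bytes raises the same error, which is why the January 30 change stopped builds.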

Was this a surprise?

Yes and no. We were aware of the change of default in the git archive command. What we didn’t expect was the broad impact this would have on a number of communities.

Internally, we’ve long believed that we shouldn’t guarantee the byte-for-byte stability of git archive. The defaults and even the available options are controlled by the Git project, which similarly doesn’t make such a guarantee. We work to minimize the differences between our fork of Git and upstream Git, so we want to avoid carrying permanent patches to the archive code.

We learned during this incident that we haven’t always been clear about this stance. We now need to commit to one publicly, and the next section sets out that commitment. We’re also adding testing to our development cycle to detect any future changes before they hit GitHub.com. (GitHub Docs will be updated shortly to reflect this commitment.)

Future stability of archives and hashes

  1. GitHub will hold the source downloads byte-for-byte stable for no less than a year from today (February 21, 2023). This covers both tarball (.tar.gz) and zipball (.zip) formats.
  2. In the future, if we intend to change either archive format, we’ll provide six months’ notice in documentation, and on the blog and changelog. (If we discover a critical vulnerability in the compression path, we reserve the right to shorten or omit the notice period in order to protect our systems and our customers. We don’t expect this outcome, but you never know.)
  3. We presently have no intent to change either format, as we have a new appreciation for the magnitude of the impact such a change would have. In full transparency, there are a few deficiencies we wish we could fix (timestamps embedded in zipballs; dependency on system gzip for tarballs), but for the foreseeable future, we’ll engineer around these minor problems.

If you rely on stable archives only for reproducibility (ensuring you always get identical files inside your archive), then we recommend you download source archives using the source archives REST API with a commit ID for the :ref parameter. There is no need to record the hash, since the commit ID ensures you’ll always get the same file contents inside the archive. Git and GitHub both guarantee this by the nature of how commit IDs are generated. By using a commit ID, you’ll be immune to repositories rewriting tags or moving branch heads. The tarball and zipball formats have built-in protections against truncation, and TLS (by way of HTTPS) protects against corruption of the archive.
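As a rough sketch of that recommendation, the Python below builds a source archive URL from a commit ID and downloads it. It assumes the REST API path `/repos/{owner}/{repo}/tarball/{ref}` (or `zipball`). The helper names and the User-Agent string are hypothetical. Requiring a full 40-character lowercase hexadecimal ID is a choice made for this example, so that a tag or branch name can't be passed by mistake.

```python
import re
import urllib.request

API_ROOT = "https://api.github.com"
FULL_COMMIT_ID = re.compile(r"[0-9a-f]{40}")


def archive_url(owner, repo, commit_id, kind="tarball"):
    """Build the source archive URL for a specific commit."""
    if kind not in ("tarball", "zipball"):
        raise ValueError("kind must be 'tarball' or 'zipball'")
    if not FULL_COMMIT_ID.fullmatch(commit_id):
        raise ValueError("expected a full 40-character commit ID, not a tag or branch")
    return f"{API_ROOT}/repos/{owner}/{repo}/{kind}/{commit_id}"


def download_archive(owner, repo, commit_id, kind="tarball"):
    """Fetch the archive bytes over HTTPS; urllib follows the API's redirect."""
    request = urllib.request.Request(
        archive_url(owner, repo, commit_id, kind),
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "archive-example",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Because a commit ID always names the same tree, the files inside the downloaded archive stay the same even if the compressed bytes around them change.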

If you rely on stable archives for security (ensuring you don’t accidentally trigger a tarbomb, for example), we recommend you switch to release assets instead of using source downloads. On the Releases page, these are the assets which were uploaded to GitHub and appear with their file size. Files can be added to a release manually through the web interface or automatically, for example with a third-party GitHub Action. You can later use the Release Assets REST API to retrieve them. If relying on release assets isn’t possible, we urge you to consider designs that can accommodate (infrequent) future hash changes.
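For illustration, here is one way a client might pick an uploaded asset out of a release object once it has been fetched from the REST API and parsed as JSON. The payload below is made up and heavily trimmed; only the `assets`, `name`, `size`, and `browser_download_url` fields are used. `find_asset` is a hypothetical helper.

```python
def find_asset(release, name):
    """Return the uploaded asset with the given file name from a release object."""
    for asset in release.get("assets", []):
        if asset.get("name") == name:
            return asset
    raise LookupError(f"release has no uploaded asset named {name!r}")


# Made-up, trimmed example of a parsed release object.
example_release = {
    "tag_name": "v1.0.0",
    "assets": [
        {
            "name": "project-1.0.0.tar.gz",
            "size": 12345,
            "browser_download_url": "https://example.invalid/project-1.0.0.tar.gz",
        }
    ],
}

asset = find_asset(example_release, "project-1.0.0.tar.gz")
print(asset["name"], asset["size"])
```

Since a release asset is a file that was uploaded rather than generated on request, a hash recorded for it isn't affected by changes to how source archives are compressed.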
