
Developer experience: what is it and why should you care?
Explore how investing in a better developer experience frees developers to do what matters most: building great software.
Edit: DGit is now called Spokes GitHub hosts over 35 million repositories and over 30 million Gists on hundreds of servers. Over the past year, we've built DGit, a new…
Edit: DGit is now called Spokes
GitHub hosts over 35 million repositories and over 30 million Gists on
hundreds of servers. Over the past year, we’ve built DGit, a new
distributed storage system that dramatically improves the availability,
reliability, and performance of serving and storing Git content.
DGit is short for “Distributed Git.” As many readers already know, Git
itself is distributed—any copy of a Git repository contains every file,
branch, and commit in the project’s entire history. DGit uses
this property of Git to keep three copies of every repository,
on three different servers. The design of DGit keeps repositories fully
available without interruption even if one of those servers goes down. Even
in the extreme case that two copies of a repository become unavailable at
the same time, the repository remains readable; i.e., fetches, clones, and
most of the web UI continue to work.
DGit performs replication at the application layer, rather than at the disk
layer. Think of the replicas as three loosely-coupled Git repositories kept
in sync via Git protocols, rather than identical disk images full of
repositories. This design gives us great flexibility to decide where to
store the replicas of a repository and which replica to use for read
operations.
If a file server needs to be taken offline, DGit automatically determines
which repositories are left with fewer than three replicas and creates new
replicas of those repositories on other file servers. This “healing” process
uses all remaining servers as both sources and destinations. Since healing
throughput is N-by-N, it is quite fast. And all this happens
without any downtime.
Most end users store their Git repositories as objects, pack files, and
references in a single .git
directory. They access the repository using
the Git command-line client or using graphical clients like GitHub Desktop
or the built-in support for Git in IDEs like Visual Studio. Perhaps it’s
surprising that GitHub’s repository-storage tier, DGit, is built using the same
technologies. Why not a SAN? A distributed file system? Some other magical
cloud technology that abstracts away the problem of storing bits durably?
The answer is simple: it’s fast and it’s robust.
Git is very sensitive to latency. A simple git log
or git blame
might
require thousands of Git objects to be loaded and traversed sequentially. If there’s any
latency in these low-level disk accesses, performance suffers dramatically.
Thus, storing the repository in a distributed file system is not viable. Git
is optimized to be fast when accessing fast disks, so the DGit file servers
store repositories on fast, local SSDs.
At a higher level, Git is also optimized to exchange updates between Git
repositories (e.g., pushes and fetches) over efficient protocols.
So we use these protocols to keep the DGit replicas in sync.
Git is a mature and well-tested technology. Why reinvent the wheel
when there is a Formula One racing car already available?
It has always been GitHub’s philosophy to use Git on our servers in a manner
that is as close as possible to how Git is used by our users. DGit continues
this tradition. If we find performance bottlenecks or other problems, we
have several core Git and libgit2 contributors
on staff who fix the problems and contribute the fixes to the open-source
project for everybody to use. Our level of experience and expertise with Git
helped make it the obvious choice to use for DGit replication.
Until recently, we kept copies of repository data using off-the-shelf,
disk-layer replication technologies—namely, RAID and
DRBD. We organized our file servers in
pairs. Each active file server had a dedicated, online spare
connected by a cross-over cable. Each disk had four copies: two
copies on the main file server, using RAID, and another two copies on that
file server’s hot spare, using DRBD. If anything went wrong with a file
server—e.g., hardware failure, software crash, or an overload situation—a
human would confirm the fault and order the spare to take over. Thus, there
was a good level of redundancy, but the failover process required manual
intervention and inevitably caused a little bit of downtime for the
repositories on the failed server. To make such incidents as rare as
possible, we have always stored repositories on specialized, highly reliable
servers.
Now with DGit, each repository is stored on three servers chosen
independently in locations distributed around our large pool of
file servers. DGit automatically selects the servers to host each repository,
keeps those replicas in sync, and picks the best server to handle each
incoming read request. Writes are synchronously streamed to all three
replicas and are only committed if at least two replicas confirm success.
GitHub now stores repositories in a cluster called github-dfs
—dfs
is short
for “DGit file server.” The repositories are stored on
local disks on these file servers and are served up by Git and libgit2.
The clients of this cluster include the web front end and the proxies that
speak to users’ Git clients.
DGit delivers many advantages both to GitHub users and to the
internal GitHub infrastructure team. It is also a key foundation that will
enable more upcoming innovations.
git
processes on old file servers (blue) and on DGit servers (green).DGit is a big change, and so we’ve been rolling it out gradually. The most
complicated aspect of DGit is that replication is no longer transparent:
every repository is now stored explicitly on three servers,
rather than on one server with an automatically synchronized hot spare.
Thus, DGit must implement its own
serializability, locking, failure detection, and resynchronization,
rather then relying on DRBD and the RAID controller to keep the copies in
sync. Those are rich topics that we’ll explore in later posts; suffice it to
say, we wanted to test them all thoroughly before relying on DGit to store
customer data. Our deployment progressed over many steps:
During the rollout phase we routinely powered down servers, sometimes
several at once, while they were serving live production traffic. User
operations were not affected.
As of this writing, 58% of repositories and 96% of Gists, representing 67% of
Git operations, are in DGit. We are moving the rest as quickly as we
can turn pre-DGit file server pairs into DGit servers.
GitHub always strives to make fetching, pushing, and viewing repositories quick
and reliable. DGit is how our repository-storage tier will meet those
goals for years to come while allowing us to scale horizontally and increase fault tolerance.
Over the next month we will be following up with in-depth posts on the technology behind DGit.