Revamping GitHub’s Subversion Bridge
One of GitHub’s niche features is the ability to access a Git repository on GitHub using Subversion clients. Last year we re-architected a large portion of the Subversion bridge to work with our changing infrastructure.
The Problem
A key part of the Subversion bridge is a mapping between Git commits and Subversion revision numbers. The mapping is persisted so that we can produce a consistent view of the repository. A bit of the mapping is exposed via a custom SVN property, git-commit. For example, you can see that revision 2504 of phantomjs is an SVN representation of Git commit 2837f28.
$ svn propget --revprop -r 2504 git-commit https://github.com/ariya/phantomjs
2837f28c739f823f2eff061c8e41cf47654b8016
During the initial development of the Subversion bridge, we chose to store the mapping data in the repository directory as a serialized Ruby data structure. This worked well as it co-located the mapping information with the target of the mapping.
Storing the mapping this way had some disadvantages. The mapping file required special treatment in the tools that manage our infrastructure. For example, backup scripts couldn’t simply git clone a repository. As our infrastructure evolved, these special cases made it impractical to store ad-hoc files in Git repositories.
The Solution
So, in 2015, we undertook an effort to move the Subversion mapping into the Git repository’s object database. This keeps the mapping data co-located with the repository, and it also means that our tools no longer need special cases to handle the mapping data. The result of this effort is that the mapping data is now just an ordinary Git commit:
$ git show refs/__gh__/svn/v4
Author: Vitaly Slobodin <Vitallium@users.noreply.github.com>
Date: Sat Dec 19 10:15:10 2015 +0300
---
yaml
---
r: 2504
b: refs/heads/gh-pages
c: 2837f28c739f823f2eff061c8e41cf47654b8016
h: refs/heads/master
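The key/value lines in the commit message above are easy to read back from a script. The following is a minimal sketch, assuming the message body is a series of `key: value` lines; `parse_mapping_entry` is a hypothetical helper, not part of the Subversion bridge. The `r` and `c` values appear to match the revision and commit from the earlier `svn propget` example, while the meaning of `b` and `h` is not stated here, so the sketch leaves every value as an uninterpreted string.

```ruby
# Hypothetical helper: turns "key: value" lines into a Hash of strings.
# Separator lines ("---") and lines without a colon are skipped.
def parse_mapping_entry(message)
  message.each_line.with_object({}) do |line, entry|
    key, value = line.strip.split(/:\s*/, 2)
    next if value.nil? || key.empty?
    entry[key] = value
  end
end

message = <<~MSG
  ---
  r: 2504
  b: refs/heads/gh-pages
  c: 2837f28c739f823f2eff061c8e41cf47654b8016
  h: refs/heads/master
MSG

entry = parse_mapping_entry(message)
revision = Integer(entry.fetch("r")) # assumed to be the Subversion revision
puts "r#{revision} -> #{entry.fetch('c')}"
```

Values are kept as strings on purpose: a generic YAML loader could interpret some all-digit or exponent-like SHAs as numbers, so the caller converts only the fields it knows to be numeric.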
In addition to moving the mapping data to Git, our goals were to maintain feature parity and to avoid hurting performance for our end users.
To provide a seamless rollover from the old mapping to the new mapping, we used our Scientist library to run both mappings in parallel.
Using Scientist
The Scientist library helps you take two or more implementations, run the same inputs through them, and then compare the output of each in production. This helps build confidence that the new implementation is equivalent to the old. Testing accomplishes some of this. But in complex systems, real use provides a scope of testing that is simply not possible in a reasonable amount of engineering time.
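To make the idea concrete, here is a hand-rolled sketch of the control/candidate pattern in plain Ruby. It is not the Scientist gem and does not use its API; `Observation`, `observe`, and `compare` are hypothetical names chosen for illustration. It runs both code paths, records the value, error, and duration of each, reports whether they agree, and always hands the control's outcome back to the caller.

```ruby
# A hand-rolled sketch of the control/candidate pattern. This is NOT the
# Scientist gem; every name here is hypothetical.
Observation = Struct.new(:name, :value, :error, :duration)

# Runs the block, capturing its value or exception and how long it took.
def observe(name)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  value = error = nil
  begin
    value = yield
  rescue StandardError => e
    error = e
  end
  duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  Observation.new(name, value, error, duration)
end

# Runs both callables, yields the observations plus a mismatch flag, and
# returns (or re-raises) the control's outcome so callers never see
# candidate behavior.
def compare(control:, candidate:)
  ctrl = observe(:control, &control)
  cand = observe(:candidate, &candidate)
  mismatched = ctrl.value != cand.value || ctrl.error.class != cand.error.class
  yield ctrl, cand, mismatched if block_given?
  raise ctrl.error if ctrl.error
  ctrl.value
end

result = compare(control: -> { 2435 }, candidate: -> { 2436 }) do |ctrl, cand, mismatched|
  puts "mismatch: #{ctrl.value.inspect} vs #{cand.value.inspect}" if mismatched
end
puts result
```

The important design choice is that a failing or disagreeing candidate never changes what the caller receives, which is what makes it safe to run such comparisons against production traffic.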
The first step of this project was to extract a MsgpackMapping class. The new class encapsulated all of the storage needs of the SVN bridge. It was an interface that we could re-implement in terms of the new Git-backed mapping.
The MsgpackMapping class has a fairly wide interface (30 methods!) and the old and new mapping implementations are different enough that replacing one method at a time wasn’t possible. With Scientist we were able to incrementally implement and refine the new implementation. The new implementation could run alongside the old implementation. We could compare the accuracy and performance of each new method as we implemented it.
Next we created a new GitMapping class with empty methods that matched the methods in MsgpackMapping.
class GitMapping
  # Returns current svn revision.
  def current_version; end

  # Returns a list of paths and the svn revisions where they were modified.
  def path_history(ref, path); end

  # Returns the git commit sha for an svn revision at a specific git ref.
  def sha_for_ref_version(ref, version); end

  # Updates the svn revision and git mapping.
  def update_mapping; end

  # ...
end
Then we created the ScientificMapping class that uses Scientist to run the experiments. This class let us enable and disable experiments for each method as we implemented it.
require "forwardable"
require "scientist"

class ScientificMapping
  # Include Scientist's API for running experiments.
  include ::Scientist

  # Original mapping class passed into the initialize method.
  attr_reader :msgpack_mapping

  # New mapping class passed into the initialize method.
  attr_reader :git_mapping

  # Class method for enabling experiments per method. See examples below.
  def self.experimental_methods(*names)
    names.each do |name|
      define_method(name) do |*args|
        science name.to_s do |experiment|
          experiment.context :args => args
          experiment.use { msgpack_mapping.send(name, *args) }
          experiment.try { git_mapping.send(name, *args) }
          experiment.run_if { run_experiment?(experiment) }
        end
      end
    end
  end

  # Enable the current_version and update_mapping method experiments.
  experimental_methods(
    :current_version,
    :update_mapping
  )

  # Class method for disabling experiments per method. See examples below.
  def self.disabled_experiments(*methods)
    extend Forwardable
    def_delegators :msgpack_mapping, *methods
  end

  # Disable the path_history and sha_for_ref_version method experiments.
  disabled_experiments(
    :path_history,
    :sha_for_ref_version
  )

  # Enable or disable experiments based on specifics to this instance of the
  # experiment. In the SvnApp::Experiment class below are examples of broader
  # ways to determine whether an experiment is run.
  def run_experiment?(experiment); end
end
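The per-method switching in the class above rests on two pieces of Ruby metaprogramming: define_method to generate a comparing wrapper, and Forwardable's def_delegators to pass a call straight through to the old implementation. The sketch below isolates that pattern without depending on the Scientist gem. The class and method names (`OldMapping`, `NewMapping`, `ToggledMapping`, `compared_methods`, `delegated_methods`) and the stubbed return values are all hypothetical.

```ruby
require "forwardable"

# Stand-ins for the two mapping implementations (hypothetical stubs).
class OldMapping
  def current_version; 2435; end
  def path_history(ref, path); [[path, 2330]]; end
end

class NewMapping
  # path_history is deliberately not implemented yet.
  def current_version; 2435; end
end

class ToggledMapping
  extend Forwardable

  attr_reader :old_mapping, :new_mapping, :mismatches

  def initialize(old_mapping, new_mapping)
    @old_mapping = old_mapping
    @new_mapping = new_mapping
    @mismatches = []
  end

  # Methods listed here run against both mappings and record disagreements;
  # the old mapping's result is always the one returned.
  def self.compared_methods(*names)
    names.each do |name|
      define_method(name) do |*args|
        expected = old_mapping.public_send(name, *args)
        actual = new_mapping.public_send(name, *args)
        mismatches << [name, args] unless expected == actual
        expected
      end
    end
  end

  # Methods listed here go straight to the old mapping.
  def self.delegated_methods(*names)
    def_delegators :old_mapping, *names
  end

  compared_methods :current_version
  delegated_methods :path_history
end

mapping = ToggledMapping.new(OldMapping.new, NewMapping.new)
puts mapping.current_version
puts mapping.path_history("refs/heads/master", "README.md").inspect
```

Moving a method name from one list to the other is the only change needed to start or stop comparing it, which is what allows a wide interface to be migrated one method at a time.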
Finally we created a class to represent experiments. This class controls how often experiments are run. It records the results of the experiments.
# Tell Scientist how to create a new experiment record.
Scientist::Experiment.module_eval do
  def self.new(name)
    ::SvnApp::Experiment.new(name)
  end
end

class SvnApp
  class Experiment
    include ::Scientist::Experiment

    # Experiment name passed into initialize method. For example:
    # - current_version
    # - path_history
    attr_reader :name

    # Override Scientist's default implementation to only run experiments a
    # certain percentage of the time.
    def enabled?
      percent_enabled > 0 && rand(100) < percent_enabled
    end

    # Only run experiments 10% of the time. In our case this method had a couple
    # of conditionals and returned different percentages based on the situation.
    def percent_enabled
      10
    end

    # This is the scientist method that you use to do something with the results
    # of the experiment. In our case we logged result metrics to statsd, and
    # mismatches and slow candidates to our raw log and error reporting systems.
    def publish(result)
      $stats.increment "experiments.#{name}.total"

      if result.mismatched?
        $stats.increment "experiments.#{name}.mismatch"
        log_mismatch result
      end

      result.observations.each do |observation|
        $stats.timing "experiments.#{name}.#{observation.name}", observation.duration
        if observation.raised?
          $stats.increment "experiments.#{name}.#{observation.name}_raised"
        end
      end

      result.candidates.each do |candidate|
        if candidate.duration > 10.0 # 10 seconds
          log_slow_candidate result, candidate
        end
      end
    end

    # ...
  end
end
One thing to note in the code above is how we stored results. We counted results with Statsd. We stored details about experiments in log files at first, and later switched to storing them in our exception reporting system. None of these were new to the Subversion bridge. They’ve all been in use for a long time, and we have good tooling for querying them. This highlights one of the boons of using Scientist: it makes no assumptions about how you want to store your results. For example, other apps at GitHub use Redis and/or MySQL to store Scientist’s results.
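The sampling check and the counting of results can be tried out in isolation. The following sketch assumes nothing about our real stats client: `MemoryStats` is a hypothetical in-memory stand-in that only counts, and `Sampler` is a hypothetical wrapper around the same `rand(100) < percent` comparison used in `enabled?` above, with the random number generator injected so a run can be made repeatable.

```ruby
# In-memory stand-in for a stats client (hypothetical; counts only).
class MemoryStats
  attr_reader :counters

  def initialize
    @counters = Hash.new(0)
  end

  def increment(key)
    @counters[key] += 1
  end
end

# Decides whether to sample a call, using the same comparison as the
# enabled? method above. The rng is injectable so runs can be repeated.
class Sampler
  def initialize(percent, rng: Random.new)
    @percent = percent
    @rng = rng
  end

  def sample?
    @percent > 0 && @rng.rand(100) < @percent
  end
end

stats = MemoryStats.new
sampler = Sampler.new(10, rng: Random.new(42))

1_000.times do
  next unless sampler.sample?
  stats.increment "experiments.current_version.total"
end

puts stats.counters["experiments.current_version.total"]
```

With a 10% sample rate, roughly one call in ten is counted; the exact number depends on the seed.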
With Scientist configured and our new mapping classes in place we started the process of implementing each mapping method.
Implementation process
Our process for implementing the GitMapping class did not change much over the course of the project. However, we frequently made small changes that shortened the feedback loop for each step. The core of our process looked something like this:
1. Write a naive implementation that satisfies the existing unit and integration tests.
2. Enable the experiment in the ScientificMapping class.
3. Deploy to production and watch our graphs and error reporting system for mismatches.
4. Try to replicate a mismatch in development and add a new unit or integration test to cover the scenario.
5. Make the new test pass.
6. Repeat steps 3-5 until there are no mismatches in production.
We relied heavily on graphs, logs, and scripts for identifying mismatches, measuring performance, and tightening our feedback loop.
Graphs and logs
At the beginning we had a dashboard that summarized experiment mismatches, performance, and response times to ensure our experiments weren’t negatively impacting customers.
With a general sense of how things were going, we dug into specific mismatches with our raw log and error reporting systems.
This is an expanded view of a mismatch in our error reporting system.
Further into the project, as mismatches became less of an issue and maintaining performance became more of a concern, we added a new dashboard that split out each method into its own graph and gave us a quick visual of how the candidate was performing against the control. The new graphs enabled us to track down some significant performance regressions.
We used a handful of caching strategies to fix performance regressions in addition to implementation changes.
Scripts
The process for replicating issues in development quickly became a bottleneck. Each repository is a special snowflake, made up of a set of commits from a variety of committers with a variety of native languages and a variety of Git versions. Each repository has a mapping file that was built up by many versions of the Subversion bridge. It was hard to reproduce bugs in development with contrived repositories. So we added script/clone. This script cloned open source repositories and pulled a copy of the msgpack mapping file. This allowed us to reproduce, test, and debug problems locally.
$ script/clone ariya/phantomjs
Cloning into bare repository repositories/ariya/phantomjs.git...
remote: Counting objects: 76355, done.
remote: Total 76355 (delta 0), reused 0 (delta 0), pack-reused 76355
Receiving objects: 100% (76355/76355), 138.71 MiB | 6.05 MiB/s, done.
Resolving deltas: 100% (39951/39951), done.
Checking connectivity... done.
x svn.history.msgpack
The next script we wrote opened a console for our app with instances of the MsgpackMapping, GitMapping, and ScientificMapping classes already initialized. It’s not like doing this manually took a long time. But we were doing it so often that scripting it saved time in the long run. It also made construction of the objects more consistent.
$ script/console ariya/phantomjs
repo = #<SvnApp::Repo:0x007fa0baa83358>
msgpack_mapping = #<MsgpackMapping:0x007fa0baa831a0>
git_mapping = #<GitMapping:0x007fa0baa913b8>
science_mapping = #<ScientificMapping:0x007fa0baa90c10>
irb(main):001:0> msgpack_mapping.current_version
=> 2435
irb(main):002:0> git_mapping.current_version
=> 2435
As performance tuning became more of a concern, we added script/benchmark to help us quickly iterate on a single repository in development without having to deploy and then wait for performance data to be collected in production.
$ script/benchmark ariya/phantomjs
--------------------------------
repositories/ariya/phantomjs.git
--------------------------------
level: 1, current_version: 2435, ref: refs/heads/master, file_path: src/qt/src/3rdparty/webkit/Source/JavaScriptCore/runtime/PutPropertySlot.h, sha: bb3df8057037aa3e49dd1818ee73967e2ea72487
Benchmarking branches_at(2330)
msgpack: 7.378ms => ["refs/heads/1.0", "refs/heads/1.1", ...
git: 6.147ms => ["refs/heads/1.0", "refs/heads/1.1", ...
Benchmarking refs_at(2330)
msgpack: 3.399ms => ["refs/heads/1.0", "refs/heads/1.1", ...
git: 2.736ms => ["refs/heads/1.0", "refs/heads/1.1", ...
Benchmarking source_branch("refs/heads/master")
msgpack: 3.427ms => nil
git: 0.218ms => nil
Benchmarking tags_at(2330)
msgpack: 3.401ms => ["refs/tags/1.0.0", "refs/tags/1.1.0"...
git: 3.814ms => ["refs/tags/1.0.0", "refs/tags/1.1.0"...
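A comparison like the one above can be approximated with Ruby's standard Benchmark module. This is a minimal sketch and not the actual script/benchmark: `compare_timings` is a hypothetical helper, and the two lambdas are stubs standing in for calls on the msgpack and Git mapping objects.

```ruby
require "benchmark"

# Hypothetical helper: times each implementation of one call and prints the
# results side by side, loosely in the style of the output above.
def compare_timings(label, implementations)
  puts "Benchmarking #{label}"
  implementations.map do |name, callable|
    result = nil
    seconds = Benchmark.realtime { result = callable.call }
    puts format("%-8s %.3fms => %s", "#{name}:", seconds * 1000, result.inspect[0, 40])
    [name, { seconds: seconds, result: result }]
  end.to_h
end

# Stub lookups standing in for the two mapping classes.
old_lookup = -> { ["refs/heads/1.0", "refs/heads/1.1"] }
new_lookup = -> { ["refs/heads/1.0", "refs/heads/1.1"] }

timings = compare_timings("branches_at(2330)", msgpack: old_lookup, git: new_lookup)
puts "results match" if timings[:msgpack][:result] == timings[:git][:result]
```

A single timed call is noisy; repeating each call several times and comparing the medians would give a steadier picture.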
Wrapping Up
In the end, we were able to swap out the msgpack-based mapping for the new Git-backed mapping in production for thousands of customers. Our Git infrastructure team was able to continue making improvements without the Subversion mapping file in the way.
To learn more about how we use Scientist read Scientist: Measure Twice, Cut Over Once by @jesseplusplus and Move Fast and Fix Things by @vmg.