Revamping GitHub’s Subversion Bridge

One of GitHub’s niche features is the ability to access a Git repository on GitHub using Subversion clients. Last year we re-architected a large portion of the Subversion bridge to work with our changing infrastructure.


The Problem

A key part of the Subversion bridge is a mapping between Git commits and Subversion revision numbers. The mapping is persisted so that we can produce a consistent view of the repository. A bit of the mapping is exposed via a custom SVN property, git-commit. For example, you can see that revision 2504 of phantomjs is an SVN representation of Git commit 2837f28.

$ svn propget --revprop -r 2504 git-commit https://github.com/ariya/phantomjs
2837f28c739f823f2eff061c8e41cf47654b8016

During the initial development of the Subversion bridge, we chose to store the mapping data in the repository directory as a serialized Ruby data structure. This worked well as it co-located the mapping information with the target of the mapping.

Storing the mapping this way had some disadvantages. The mapping file required special treatment in the tools that manage our infrastructure. For example, backup scripts couldn’t simply git clone a repository. As our infrastructure evolved, these special cases made it impractical to keep storing ad-hoc files in Git repositories.

The Solution

So, in 2015, we undertook an effort to move the Subversion mapping into the Git repository’s object database. This keeps the mapping data co-located with the repository, and it means the mapping no longer needs special-case handling. As a result, the mapping data is now just an ordinary Git commit:

$ git show refs/__gh__/svn/v4
Author: Vitaly Slobodin <Vitallium@users.noreply.github.com>
Date:   Sat Dec 19 10:15:10 2015 +0300

    ---
    yaml
    ---
    r: 2504
    b: refs/heads/gh-pages
    c: 2837f28c739f823f2eff061c8e41cf47654b8016
    h: refs/heads/master
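
Because the commit message is YAML, a record like this can be read back with Ruby's standard library. The following is a minimal sketch, not the bridge's actual reader: the helper name `parse_mapping_message` is hypothetical, and reading `r` as the SVN revision and `c` as the Git commit is an assumption based on the `svn propget` example above. The other keys are left uninterpreted.

```ruby
require "yaml"

# Commit message body as printed by `git show` above, indentation removed.
MESSAGE = <<~MSG
  ---
  yaml
  ---
  r: 2504
  b: refs/heads/gh-pages
  c: 2837f28c739f823f2eff061c8e41cf47654b8016
  h: refs/heads/master
MSG

# The message contains two YAML documents: the scalar "yaml" and a mapping.
# Hypothetical helper: keep the last document that parsed to a Hash.
def parse_mapping_message(message)
  YAML.load_stream(message).reverse.find { |doc| doc.is_a?(Hash) } || {}
end

entry = parse_mapping_message(MESSAGE)

# Assumption: "r" is the SVN revision and "c" is the Git commit sha.
puts "r#{entry["r"]} -> #{entry["c"]}"
```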

In addition to moving the mapping data to Git, our goals were to maintain feature parity and to avoid degrading performance for our end users.

To provide a seamless rollover from the old mapping to the new mapping, we used our Scientist library to run both mappings in parallel.

Using Scientist

The Scientist library helps you take two or more implementations, run the same inputs through them, and then compare the output of each in production. This helps build confidence that the new implementation is equivalent to the old. Testing accomplishes some of this. But in complex systems, real use provides a scope of testing that is simply not possible in a reasonable amount of engineering time.
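
The control/candidate idea can be illustrated without the library. The sketch below is hand-rolled and deliberately simplified. It is not Scientist's implementation, and `observe`, `run_experiment`, `Observation`, and `Result` are hypothetical names. It runs both code paths, hands the comparison to a publish block, and always returns the control's value, so a wrong or raising candidate never reaches the caller.

```ruby
# Hand-rolled illustration of the control/candidate pattern.
# Not Scientist's implementation; all names here are hypothetical.
Observation = Struct.new(:value, :error, :duration)

Result = Struct.new(:control, :candidate) do
  # A mismatch is a different value or a different kind of failure.
  def mismatched?
    control.value != candidate.value ||
      control.error.class != candidate.error.class
  end
end

# Run a block, capturing its value (or exception) and how long it took.
def observe
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  value = yield
  Observation.new(value, nil, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started)
rescue StandardError => e
  Observation.new(nil, e, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started)
end

# Run both implementations, publish the comparison, return the control value.
def run_experiment(control:, candidate:)
  result = Result.new(observe(&control), observe(&candidate))
  yield result if block_given? # publish step: log, count, alert, ...
  raise result.control.error if result.control.error
  result.control.value
end

published = []
value = run_experiment(
  control:   -> { [1, 2, 3].sum },
  candidate: -> { [1, 2, 3].sum + 1 } # deliberately wrong
) { |result| published << result }

puts value
puts published.first.mismatched?
```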

The first step of this project was to extract a MsgpackMapping class. The new class encapsulated all of the storage needs of the SVN bridge. It was an interface that we could re-implement in terms of the new Git-backed mapping.

The MsgpackMapping class has a fairly wide interface (30 methods!) and the old and new mapping implementations are different enough that replacing one method at a time wasn’t possible. With Scientist we were able to incrementally implement and refine the new implementation. The new implementation could run alongside the old implementation. We could compare the accuracy and performance of each new method as we implemented it.

Next we created a new GitMapping class with empty methods that matched the methods in MsgpackMapping.

class GitMapping
  # Returns current svn revision.
  def current_version; end

  # Returns a list of paths and the svn revisions where they were modified.
  def path_history(ref, path); end

  # Returns the git commit sha for an svn revision at a specific git ref.
  def sha_for_ref_version(ref, version); end

  # Updates the svn revision and git mapping.
  def update_mapping; end

  # ...
end

Then we created the ScientificMapping class that uses Scientist to run the experiments. This class let us enable and disable experiments for each method as we implemented it.

class ScientificMapping
  # Include Scientist's API for running experiments.
  include ::Scientist

  # Original mapping class passed into the initialize method.
  attr_reader :msgpack_mapping

  # New mapping class passed into the initialize method.
  attr_reader :git_mapping

  # Class method for enabling experiments per method. See examples below.
  def self.experimental_methods(*names)
    names.each do |name|
      define_method(name) do |*args|
        science name.to_s do |experiment|
          experiment.context :args => args
          experiment.use { msgpack_mapping.send(name, *args) }
          experiment.try { git_mapping.send(name, *args) }
          experiment.run_if { run_experiment?(experiment) }
        end
      end
    end
  end

  # Enable the current_version and update_mapping method experiments.
  experimental_methods(
    :current_version,
    :update_mapping
  )

  # Class method for disabling experiments per method. See examples below.
  def self.disabled_experiments(*methods)
    extend Forwardable
    def_delegators :msgpack_mapping, *methods
  end

  # Disable the path_history and sha_for_ref_version method experiments.
  disabled_experiments(
    :path_history,
    :sha_for_ref_version
  )

  # Enable or disable experiments based on specifics to this instance of the
  # experiment. In the SvnApp::Experiment class below are examples of broader
  # ways to determine whether an experiment is run.
  def run_experiment?(experiment); end
end
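
To show how the two class-level helpers let a half-finished candidate coexist with the old mapping, here is a self-contained sketch that swaps Scientist for a plain comparison. `FakeMsgpackMapping`, `FakeGitMapping`, `SketchMapping`, and the values they return are made up for the demonstration. Note that the fake Git mapping does not implement `path_history` at all, yet the wrapper still answers it by delegating to the old mapping.

```ruby
require "forwardable"

# Stand-ins for the real mappings; return values are invented for the demo.
class FakeMsgpackMapping
  def current_version; 2435; end
  def path_history(ref, path); [[ref, path, 1]]; end
end

class FakeGitMapping
  def current_version; 2435; end
  # path_history is intentionally not implemented yet.
end

class SketchMapping
  extend Forwardable

  attr_reader :msgpack_mapping, :git_mapping, :mismatches

  def initialize(msgpack_mapping, git_mapping)
    @msgpack_mapping = msgpack_mapping
    @git_mapping = git_mapping
    @mismatches = []
  end

  # Methods under experiment call both mappings and return the control value.
  # A plain comparison stands in for Scientist's `science` block here.
  def self.experimental_methods(*names)
    names.each do |name|
      define_method(name) do |*args|
        control = msgpack_mapping.public_send(name, *args)
        candidate = git_mapping.public_send(name, *args)
        mismatches << name unless control == candidate
        control
      end
    end
  end

  # Methods not yet under experiment go straight to the old mapping.
  def self.disabled_experiments(*names)
    def_delegators :msgpack_mapping, *names
  end

  experimental_methods :current_version
  disabled_experiments :path_history
end

mapping = SketchMapping.new(FakeMsgpackMapping.new, FakeGitMapping.new)
puts mapping.current_version
puts mapping.path_history("refs/heads/master", "README").inspect
```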

Finally we created a class to represent experiments. This class controls how often experiments run and records their results.

# Tell Scientist how to create a new experiment record.
Scientist::Experiment.module_eval do
  def self.new(name)
    ::SvnApp::Experiment.new(name)
  end
end

class SvnApp
  class Experiment
    include ::Scientist::Experiment

    # Experiment name passed into initialize method. For example:
    # - current_version
    # - path_history
    attr_reader :name

    # Override Scientist's default implementation to only run experiments a
    # certain percentage of the time.
    def enabled?
      percent_enabled > 0 && rand(100) < percent_enabled
    end

    # Only run experiments 10% of the time. In our case this method had a couple
    # of conditionals and returned different percentages based on the situation.
    def percent_enabled
      10
    end

    # This is the scientist method that you use to do something with the results
    # of the experiment. In our case we logged result metrics to statsd, and
    # mismatches and slow candidates to our raw log and error reporting systems.
    def publish(result)
      $stats.increment "experiments.#{name}.total"

      if result.mismatched?
        $stats.increment "experiments.#{name}.mismatch"
        log_mismatch result
      end

      result.observations.each do |observation|
        $stats.timing "experiments.#{name}.#{observation.name}", observation.duration

        if observation.raised?
          $stats.increment "experiments.#{name}.#{observation.name}_raised"
        end
      end

      result.candidates.each do |candidate|
        if candidate.duration > 10.0 # 10 seconds
          log_slow_candidate result, candidate
        end
      end
    end

    # ...
  end
end
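
The sampling check in `enabled?` is easy to verify in isolation. The sketch below assumes a hypothetical `Sampler` class with an injectable random source, so a seeded generator makes the run repeatable. It is an illustration of the `rand(100) < percent` technique, not code from the bridge.

```ruby
# Percentage-based sampling with an injectable random source.
# `Sampler` is a hypothetical name used only for this illustration.
class Sampler
  def initialize(percent_enabled, rng: Random.new)
    @percent_enabled = percent_enabled
    @rng = rng
  end

  # rand(100) yields 0..99, so exactly `percent_enabled` of the 100
  # possible outcomes count as "run the experiment".
  def enabled?
    @percent_enabled > 0 && @rng.rand(100) < @percent_enabled
  end
end

sampler = Sampler.new(10, rng: Random.new(42))
runs = 10_000.times.count { sampler.enabled? }
puts "ran #{runs} of 10000 calls"
```

With 10,000 calls the count should land near 1,000. A percentage of 0 never runs the experiment and 100 always does.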

One thing to note in the code above is how we stored results. We counted results with statsd. We stored details about experiments in log files at first, and later switched to storing them in our exception reporting system. None of these were new to the Subversion bridge. They’ve all been in use for a long time, and we have good tooling for querying them. This highlights one of the boons of using Scientist: it makes no assumptions about how you want to store your results. For example, other apps at GitHub use Redis and/or MySQL to store Scientist’s results.

With Scientist configured and our new mapping classes in place we started the process of implementing each mapping method.

Implementation process

Our process for implementing the GitMapping class did not change much over the course of the project. However, we frequently made small changes that shortened the feedback loop for each step. The core of our process looked something like this:

  1. Write a naive implementation that satisfies the existing unit and integration tests.
  2. Enable the experiment in the ScientificMapping class.
  3. Deploy to production and watch our graphs and error reporting system for mismatches.
  4. Try to replicate a mismatch in development and add a new unit or integration test to cover the scenario.
  5. Make the new test pass.
  6. Repeat steps 3-5 until there are no mismatches in production.

We relied heavily on graphs, logs, and scripts for identifying mismatches, measuring performance, and tightening our feedback loop.

Graphs and logs

At the beginning we had a dashboard that summarized experiment mismatches, performance, and response times to ensure our experiments weren’t negatively impacting customers.

[Image: science dashboard]

With a general sense of how things were going, we dug into specific mismatches with our raw log and error reporting systems.

jonmagic: /splunk -1h production app=svnapp at=mismatch | tail 5
Hubot:
2015-08-12T23:24:21-07:00 experiment=path_version control_value=1358 candidate_value=11 args='[1358, "branches"]'
2015-08-12T22:59:54-07:00 experiment=sha_for_ref_version control_value=67ac52c329bd04c29e84099a45bf8b763e181557 candidate_value=d561c00c2695fa46d93b5a8e88eeb10b5b256e39 args='["refs/heads/master", 20]'
2015-08-12T19:38:29-07:00 experiment=sha_for_ref_version control_value=nil candidate_value=8555c1e64383c286592f84cb7bf16e3a370a4358 args="[nil, nil]"
2015-08-12T18:32:31-07:00 experiment=path_version control_value=283 candidate_value=287 args='[287, "branches/master"]'
2015-08-12T17:37:34-07:00 experiment=commit_at control_value=67ac52c329bd04c29e84099a45bf8b763e181557 candidate_value=d561c00c2695fa46d93b5a8e88eeb10b5b256e39 args=[20]
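
Log lines in this key=value shape are straightforward to summarize. As a rough sketch, assuming the lines have been saved as text, the snippet below tallies mismatches per experiment. The log lines are copied from the output above. The tallying approach is an illustration, not a tool described in this post.

```ruby
# Tally mismatch log lines per experiment name.
LOG = <<~'LOG'
  2015-08-12T23:24:21-07:00 experiment=path_version control_value=1358 candidate_value=11 args='[1358, "branches"]'
  2015-08-12T22:59:54-07:00 experiment=sha_for_ref_version control_value=67ac52c329bd04c29e84099a45bf8b763e181557 candidate_value=d561c00c2695fa46d93b5a8e88eeb10b5b256e39 args='["refs/heads/master", 20]'
  2015-08-12T19:38:29-07:00 experiment=sha_for_ref_version control_value=nil candidate_value=8555c1e64383c286592f84cb7bf16e3a370a4358 args="[nil, nil]"
  2015-08-12T18:32:31-07:00 experiment=path_version control_value=283 candidate_value=287 args='[287, "branches/master"]'
  2015-08-12T17:37:34-07:00 experiment=commit_at control_value=67ac52c329bd04c29e84099a45bf8b763e181557 candidate_value=d561c00c2695fa46d93b5a8e88eeb10b5b256e39 args=[20]
LOG

counts = Hash.new(0)
LOG.each_line do |line|
  # Pull out the value of the experiment= field, if the line has one.
  name = line[/\bexperiment=(\S+)/, 1]
  counts[name] += 1 if name
end

# Most frequent first, then alphabetical.
counts.sort_by { |name, count| [-count, name] }.each do |name, count|
  puts "#{count} #{name}"
end
```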

This is an expanded view of a mismatch in our error reporting system.

[Image: haystack needle]

Further into the project, as mismatches became less of an issue and maintaining performance became more of a concern, we added a new dashboard that split each method out into its own graph. It gave us a quick visual of how the candidate was performing against the control. The new graphs enabled us to track down some significant performance regressions.

[Image: experiment dashboard]

We fixed performance regressions with implementation changes and a handful of caching strategies.

Scripts

The process for replicating issues in development quickly became a bottleneck. Each repository is a special snowflake, made up of a set of commits from a variety of committers with a variety of native languages and a variety of Git versions. Each repository has a mapping file that was built up by many versions of the Subversion bridge. It was hard to reproduce bugs in development with contrived repositories. So we added script/clone. This script cloned open source repositories and pulled a copy of the msgpack mapping file. This allowed us to reproduce, test, and debug problems locally.

$ script/clone ariya/phantomjs
Cloning into bare repository repositories/ariya/phantomjs.git...
remote: Counting objects: 76355, done.
remote: Total 76355 (delta 0), reused 0 (delta 0), pack-reused 76355
Receiving objects: 100% (76355/76355), 138.71 MiB | 6.05 MiB/s, done.
Resolving deltas: 100% (39951/39951), done.
Checking connectivity... done.
x svn.history.msgpack

The next script we wrote opened a console for our app with instances of the MsgpackMapping, GitMapping, and ScientificMapping classes already initialized. Doing this manually didn’t take long, but we did it so often that scripting it saved time in the long run. It also made construction of the objects more consistent.

$ script/console ariya/phantomjs
repo            = #<SvnApp::Repo:0x007fa0baa83358>
msgpack_mapping = #<MsgpackMapping:0x007fa0baa831a0>
git_mapping     = #<GitMapping:0x007fa0baa913b8>
science_mapping = #<ScientificMapping:0x007fa0baa90c10>
irb(main):001:0> msgpack_mapping.current_version
=> 2435
irb(main):002:0> git_mapping.current_version
=> 2435

As performance tuning became more of a concern we added script/benchmark to help us quickly iterate on a single repository in development without having to deploy and then wait for performance data to be collected in production.

$ script/benchmark ariya/phantomjs
--------------------------------
repositories/ariya/phantomjs.git
--------------------------------

level: 1, current_version: 2435, ref: refs/heads/master, file_path: src/qt/src/3rdparty/webkit/Source/JavaScriptCore/runtime/PutPropertySlot.h, sha: bb3df8057037aa3e49dd1818ee73967e2ea72487
Benchmarking branches_at(2330)
msgpack:       7.378ms  => ["refs/heads/1.0", "refs/heads/1.1", ...
    git:       6.147ms  => ["refs/heads/1.0", "refs/heads/1.1", ...
Benchmarking refs_at(2330)
msgpack:       3.399ms  => ["refs/heads/1.0", "refs/heads/1.1", ...
    git:       2.736ms  => ["refs/heads/1.0", "refs/heads/1.1", ...
Benchmarking source_branch("refs/heads/master")
msgpack:       3.427ms  => nil
    git:       0.218ms  => nil
Benchmarking tags_at(2330)
msgpack:       3.401ms  => ["refs/tags/1.0.0", "refs/tags/1.1.0"...
    git:       3.814ms  => ["refs/tags/1.0.0", "refs/tags/1.1.0"...
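
A comparison like this can be approximated with Ruby's standard Benchmark module. The sketch below is not the real script/benchmark. `compare` is a hypothetical helper, and the two lambdas are stand-ins for calls into MsgpackMapping and GitMapping. It times one call per implementation and prints the timing next to a truncated result.

```ruby
require "benchmark"

# Hypothetical helper: time one call per implementation and print each
# timing next to a truncated view of the returned value.
def compare(label, implementations)
  puts "Benchmarking #{label}"
  implementations.map do |name, callable|
    value = nil
    seconds = Benchmark.realtime { value = callable.call }
    puts format("%7s: %10.3fms  => %s", name, seconds * 1000, value.inspect[0, 40])
    [name, value]
  end.to_h
end

# Stand-in lambdas; a real script would call the two mapping classes.
results = compare(
  "tags_at(2330)",
  "msgpack" => -> { ["refs/tags/1.0.0", "refs/tags/1.1.0"] },
  "git"     => -> { ["refs/tags/1.0.0", "refs/tags/1.1.0"] }
)

puts "results match: #{results["msgpack"] == results["git"]}"
```

Returning the values as well as printing them lets the same run double as a quick equivalence check.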

Wrapping Up

In the end, we were able to swap out the msgpack-based mapping for the new Git-backed mapping in production for thousands of customers. Our Git infrastructure team was able to continue making improvements without the Subversion mapping file in the way.

To learn more about how we use Scientist, read Scientist: Measure Twice, Cut Over Once by @jesseplusplus and Move Fast and Fix Things by @vmg.
