Orchestrator at GitHub
GitHub uses MySQL to store its metadata: Issues, Pull Requests, comments, organizations, notifications and so forth. While git
repository data does not need MySQL to exist and persist, GitHub’s service does. Authentication, API, and the website itself all require the availability of our MySQL fleet.
Our replication topologies span multiple data centers and this poses a challenge not only for availability but also for manageability and operations.
Automated failovers
We use a classic MySQL master-replicas setup, where the master is the single writer, and replicas are mainly used for read traffic. We expect our MySQL fleet to be available for writes. Placing a review, creating a new repository, adding a collaborator, all require write access to our backend database. We require the master to be available.
To that effect we employ automated master failovers. The time it would take a human to wake up and fix a failed master exceeds our availability expectations, and operating such a failover is sometimes non-trivial. We expect master failures to be automatically detected and recovered within 30 seconds or less, and we expect failover to result in minimal loss of available hosts.
We also expect to avoid false positives and false negatives. Failing over when there’s no failure is wasteful and should be avoided. Not failing over when failover should take place means an outage. Flapping is unacceptable. And so there must be a reliable detection mechanism that makes the right choice and takes a predictable course of action.
orchestrator
We employ orchestrator to manage our MySQL failovers. orchestrator is an open source MySQL replication management and high availability solution. It observes MySQL replication topologies, auto-detects topology layout and changes, understands replication rules across configurations and versions, detects failure scenarios, and recovers from master and intermediate master failures.
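As a rough sketch of how such a setup might begin (the hostname, credentials, and poll interval below are placeholder values, not our production configuration), orchestrator is given topology credentials and pointed at any one server, from which it discovers the rest of the topology:

```bash
# Illustrative minimal configuration; hostnames and credentials are placeholders.
cat > /etc/orchestrator.conf.json <<'EOF'
{
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "change-me",
  "InstancePollSeconds": 5
}
EOF

# Point orchestrator at any one member; it crawls the replication topology from there.
orchestrator -c discover -i some-mysql-host.example.com:3306
```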
Failure detection
orchestrator takes a different approach to failure detection than common monitoring tools. The common way to detect master failure is by observing the master: via ping, via a simple port scan, via a simple SELECT query. These tests all suffer from the same problem: What if there’s an error?
Network glitches can happen; the monitoring tool itself may be network partitioned. The naive solution is along the lines of “try several times at fixed intervals, and on the n-th successive failure, assume the master has failed”. While repeated polling works, it tends to produce both false positives and longer outages: the smaller n is (or the shorter the interval), the more likely a short network glitch is to trigger an unjustified failover; the larger n is (or the longer the poll interval), the longer a true failure goes undetected.
A better approach employs multiple observers, all of whom, or the majority of whom must agree that the master has failed. This reduces the danger of a single observer suffering from network partitioning.
orchestrator uses a holistic approach, utilizing the replication cluster itself. The master is not an isolated entity. It has replicas, which continuously poll the master for incoming changes, copy those changes and replay them. They have their own retry count/interval setup. When orchestrator looks for a failure scenario, it looks at the master and at all of its replicas. It knows which replicas to expect because it continuously observes the topology, and has a clear picture of how it looked the moment before failure.
orchestrator seeks agreement between itself and the replicas: if orchestrator cannot reach the master, but all replicas are happily replicating and making progress, there is no failure scenario. But if the master is unreachable to orchestrator and all replicas say: “Hey! Replication is broken, we cannot reach the master”, our conclusion becomes very powerful: we haven’t just gathered input from multiple hosts. We have identified that the replication cluster is de-facto broken. The master may be alive, it may be dead, it may be network partitioned; it does not matter: the cluster does not receive updates and for all practical purposes does not function.
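The same reasoning can be illustrated by hand. The sketch below is not orchestrator’s code; it is a rough manual approximation of the agreement check, with hostnames as placeholders and credentials omitted. If the master does not answer and every replica reports a broken IO thread, the cluster is effectively down.

```bash
# Manual approximation of the agreement check (illustrative only; not orchestrator's code).
MASTER=mysql-master.example.com
REPLICAS="replica-1.example.com replica-2.example.com replica-3.example.com"

# Can this observer reach the master at all?
mysqladmin --host="$MASTER" ping || echo "master unreachable from this observer"

# Do the replicas agree? A replica that cannot reach its master reports Slave_IO_Running: No.
for replica in $REPLICAS; do
  echo "--- $replica"
  mysql --host="$replica" -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Last_IO_Error'
done
```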
Masters are not the only subject of failure detection: orchestrator employs similar logic for intermediate masters, i.e. replicas which happen to have further replicas of their own.
Furthermore, orchestrator also considers more complex cases, such as unreachable replicas or other scenarios where decision making becomes fuzzier. In some such cases it is still confident enough to proceed with a failover; in others, it settles for a detection notification only.
We observe that orchestrator’s detection algorithm is very accurate. We spent a few months testing its decision making before switching on auto-recovery.
Failover
Once the decision to fail over has been made, the next step is to choose where to fail over to. That decision, too, is non-trivial.
In semi-sync replication environments, which orchestrator supports, one or more designated replicas are guaranteed to be the most up-to-date, which gives you one or more servers that are ideal candidates for promotion. Enabling semi-sync is on our roadmap; at this time we use asynchronous replication. Some updates made to the master may never make it to any replica, and there is no guarantee as to which replica will have the most recent updates. Choosing the most up-to-date replica means you lose the least data. However, in the world of operations not all replicas are created equal: at any given time we may be experimenting with a recent MySQL release that we’re not yet ready to put into production; or we may be transitioning from STATEMENT based replication to ROW based; or we may have servers in a remote data center that preferably wouldn’t take writes. Or you may have a designated server with stronger hardware that you’d like to promote no matter what.
orchestrator understands all of these replication rules and picks the replica that makes the most sense to promote, based on a set of rules and on the set of available servers, their configuration, their physical location and more. Depending on the servers’ configuration, it is able to do a two-step promotion: first healing the topology in whatever setup is easiest, then promoting a designated or otherwise best server as master.
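Operators can also state such preferences explicitly. As an illustrative sketch (hostnames are placeholders, and the exact flag spelling should be checked against the CLI help of the orchestrator version in use), a promotion rule can be attached to an instance so that recoveries prefer, or avoid, promoting it:

```bash
# Hint that this well-provisioned host is a preferred promotion candidate.
orchestrator -c register-candidate -i strong-hardware-host.example.com:3306 --promotion-rule prefer

# Conversely, a remote-DC server that must not become the master.
orchestrator -c register-candidate -i remote-dc-host.example.com:3306 --promotion-rule must_not
```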
We build trust in the failover procedure by continuously testing failovers. We intend to write more on this in a later post.
Anti-flapping and acknowledgements
Flapping is strictly unacceptable. To that effect, orchestrator is configured to only perform one automated failover for any given cluster in a preconfigured time period. Once a failover takes place, the failed cluster is marked as “blocked” from further failovers. This mark is cleared after, say, 30 minutes, or until a human says otherwise.
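In orchestrator’s JSON configuration this block period is a single setting. The fragment below is illustrative; the value matches the 30 minutes mentioned above and is not necessarily what we run with:

```bash
# Relevant fragment of the orchestrator configuration file: after an automated
# failover, block further automated failovers on that cluster for 30 minutes.
cat <<'EOF'
{
  "RecoveryPeriodBlockSeconds": 1800
}
EOF
```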
To clarify, an automated master failover in the middle of the night does not mean stakeholders get to sleep through it. Pages will arrive even as the failover takes place. A human will observe the state, and may or may not acknowledge the failover as justified. Once acknowledged, orchestrator forgets about that failover and is free to proceed with further failovers on that cluster should the need arise.
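Acknowledgement can come from chat, from the web interface, or from the command line. The sketch below is indicative only; the command and flag names should be verified against the orchestrator version in use, and the cluster member and reason are placeholders:

```bash
# Acknowledge the recovery on a cluster, clearing the block on further automated failovers.
orchestrator -c ack-cluster-recoveries -i some-member-of-the-cluster.example.com:3306 --reason "failover was justified"
```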
Topology management
There’s more to orchestrator than failovers. It allows for simplified topology management and visualization.
We have multiple clusters of differing sizes that span multiple data centers (DCs). Consider a topology that spans three DCs, with a different color marking each DC’s servers. Cross-DC network calls have higher latency and are more expensive than intra-DC calls, so we typically group a DC’s servers under a designated intermediate master, a.k.a. the local DC master, to reduce cross-DC traffic. In such a topology, instance-64bb could replicate from instance-6b44, which sits in the same (blue) DC, and free up some cross-DC traffic.
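That kind of move is a one-liner with orchestrator. The sketch below assumes the standard 3306 port and uses the CLI; the same operation is available from the web interface and from chatops:

```bash
# Relocate instance-64bb so it replicates from instance-6b44, in the same DC
# (ports are assumed for illustration).
orchestrator -c relocate -i instance-64bb:3306 -d instance-6b44:3306
```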
This design leads to more complex topologies: replication trees that go deeper than one or two levels. There are more use cases for such topologies:
- Experimenting with a newer version: to test, say, MySQL 5.7, we create a subtree of 5.7 servers, with one acting as an intermediate master. This allows us to test 5.7 replication flow and speed.
- Migrating from STATEMENT based replication to ROW based replication: we again migrate slowly, by creating subtrees and adding more and more nodes to those trees until they consume the entire topology.
- By way of simplifying automation: a newly provisioned host, or a host restored from backup, is set to replicate from the backup server whose data was used to restore it.
- Data partitioning is achieved by incubating and splitting out new clusters, which originally dangle as sub-clusters and then become independent.
Deeply nested replication topologies introduce management complexity:
- Every intermediate master becomes a point of failure for its nested subtree.
- Recoveries in mixed-version or mixed-format topologies are subject to cross-version or cross-format replication constraints: not every server can replicate from every other.
- Maintenance requires careful refactoring of the topology: you can’t just take down a server to upgrade its hardware; if it serves as a local/intermediate master, taking it offline would break replication for its own replicas.
orchestrator allows for easy and safe refactoring and management of such complex topologies:
- It can fail over dead intermediate masters, eliminating the “point of failure” problem.
- Refactoring (moving replicas around the topology) is made easy via GTID or Pseudo-GTID (an application-level injection of sparse GTID-like entries).
- orchestrator understands replication rules and will refuse to place, say, a 5.6 server below a 5.7 server.
orchestrator also serves as the de-facto topology state/inventory indicator. It complements puppet or service discovery configuration, which implies desired state, by actually observing the existing state. State is queryable at various levels, and we employ orchestrator in some of our automation tasks.
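For example, the currently observed topology of a cluster can be pulled from the command line; the hostname below is a placeholder for any member of the cluster in question:

```bash
# Print an ASCII rendering of the replication topology the given instance belongs to,
# as currently observed by orchestrator.
orchestrator -c topology -i some-member-of-the-cluster.example.com:3306
```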
Chatops integration
We love our chatops as they make our operations visible and accessible to our greater group of engineers.
While the orchestrator service provides a web interface, we rarely use it; one’s browser is her own private command center, with no visibility to others and no history.
We rely on chatops for most operations. As a quick example of the visibility chatops gives us, we can list a cluster’s topology right in chat. Say we wanted to upgrade instance-fadf to 5.6.31-77.0-log. It has two replicas attached that we don’t want to be affected, so we first move them elsewhere in the topology:
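In chat this is a single command; the sketch below shows an equivalent orchestrator CLI invocation. The destination host and the ports are placeholders, and the exact command name (relocate-replicas) should be checked against the orchestrator version in use:

```bash
# Move all of instance-fadf's replicas so they replicate from another server in the
# cluster, leaving instance-fadf free for maintenance (destination is a placeholder).
orchestrator -c relocate-replicas -i instance-fadf:3306 -d another-instance.example.com:3306
```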
The instance is now free to be taken out of the pool.
Other actions are available to us via chatops. We can force a failover, acknowledge recoveries, query topology structure and so on. orchestrator further communicates with us on chat, and notifies us in the event of a failure or recovery.
orchestrator also runs as a command-line tool, and the orchestrator service exposes a web API, so it can easily participate in automated tasks.
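As an illustration (the hostname and port are placeholders, and the endpoint paths should be verified against the documentation of the orchestrator version in use), automation can ask the service which clusters it knows about and what a given cluster looks like:

```bash
# List the clusters orchestrator currently knows about.
curl -s http://orchestrator.example.com:3000/api/clusters

# List the instances of one cluster, identified by an alias or by any member instance.
curl -s http://orchestrator.example.com:3000/api/cluster/my-cluster-alias
```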
orchestrator @ GitHub
GitHub has adopted orchestrator and will continue to improve and maintain it. The GitHub repo will serve as the new upstream and will accept issues and pull requests from the community.
orchestrator continues to be free and open source, and is released under the Apache License 2.0.
Migrating the project to the GitHub repo had the unfortunate result of diverging from the original Outbrain repo, due to the way import paths are coupled with the repo URI in golang. The two diverged repositories will not be kept in sync, and we took the opportunity to make some further diverging changes, though we made sure to keep the API and command line spec compatible. We’ll keep an eye out for incoming Issues on the Outbrain repo.
Outbrain
It is our pleasure to acknowledge Outbrain as the original author of orchestrator. The project originated at Outbrain while seeking to manage a growing fleet of servers across three data centers. It began as a means to visualize the existing topologies, with minimal support for refactoring, and came at a time when massive hardware upgrades and data center changes were taking place. orchestrator was used as the tool for refactoring and for ensuring topology setups went as planned and without interruption to service, even as servers were being provisioned or retired.
Later on, Pseudo-GTID was introduced to overcome the problems of unreachable/crashing/lagging intermediate masters, and shortly afterwards recoveries came into being. orchestrator was put into production at a very early stage and worked on busy and sensitive systems.
Outbrain was happy to develop orchestrator as a public open source project and provided the resources to allow its development, not only for the specific benefit of the company but also for the wider community. Outbrain authors many more open source projects, which can be found on the Outbrain engineering page on GitHub.
We’d like to thank Outbrain for their contributions to orchestrator, as well as for their openness to having us adopt the project.
Further acknowledgements
orchestrator was later developed at Booking.com, where it was brought in to improve on the existing high availability scheme. orchestrator’s flexibility allowed for simpler hardware setups and faster failovers. It was fortunate to enjoy the large MySQL deployment Booking.com employs, managing various MySQL vendors, versions and configurations, running on clusters ranging from a single master to many hundreds of MySQL servers and Binlog Servers across multiple data centers. Booking.com continuously contributes to orchestrator.
We’d like to further acknowledge major community contributions made by Google/Vitess (orchestrator is the failover mechanism used by Vitess), and by Square, Inc.
Related projects
We’ve released a public Puppet module for orchestrator, authored by @tomkrouper. This module sets up the orchestrator service, config files, logging and so on. We use this module within our own puppet setup, and actively maintain it.
Chef users, please consider this Chef cookbook by @silviabotros.