Our MySQL infrastructure is a critical component of GitHub. MySQL serves GitHub.com, GitHub’s API, authentication and more. Every git request touches MySQL in some way. We are tasked with keeping the data available and maintaining its integrity. Even while our MySQL clusters serve traffic, we need to be able to perform tasks such as heavy-duty cleanups, ad-hoc updates, online schema migrations, cluster topology refactoring, pooling and load balancing, and more. We have the infrastructure to automate away such operations; in this post we share a few examples of how we build trust in our infrastructure through continuous testing. It is essentially how we sleep well at night.
It is incredibly important to take backups of your data. If you are not taking backups of your database, it is likely only a matter of time before this becomes an issue. Percona Xtrabackup is the tool we have been using to take full backups of our MySQL databases. If there is data that we need to be certain is saved, we have a server backing it up.
In addition to the full binary backups, we run logical backups several times a day. These backups allow our engineers to get a copy of recent data. There are times when they would like a complete set of data from a table so they can test an index change on a production-sized table or see data from a certain point in time. Hubot allows us to restore a backed-up table and will ping us when the table is ready to use.
The data is loaded onto a non-production database which is accessible to the engineer requesting the restore.
The last way we keep a “backup” of data around is by using delayed replicas. This is less of a backup and more of a safeguard. For each production cluster we have a host whose replication is delayed by 4 hours. If a query is run that shouldn’t have been, we can run mysql panic in chatops. This causes all of our delayed replicas to stop replication immediately, and also pages the on-call DBA. From there we can use the delayed replica to verify there is an issue, and then fast forward the binary logs to the point right before the error. We can then restore this data to the master, thus recovering data to that point.
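To make the delayed-replica mechanics concrete, here is a minimal sketch of the SQL statements such a workflow might issue. It assumes classic MySQL replication syntax (`CHANGE MASTER TO MASTER_DELAY`, `STOP SLAVE`, `START SLAVE UNTIL`); the helper names and the binlog coordinates passed in are hypothetical, and this is not the actual chatops implementation.

```python
from typing import List

DELAY_HOURS = 4  # from the text: delayed replicas lag production by 4 hours


def delay_statement(hours: int) -> str:
    """SQL that makes a replica apply events `hours` behind its master."""
    return f"CHANGE MASTER TO MASTER_DELAY = {hours * 3600};"


def panic_statement() -> str:
    """SQL that freezes a replica so a bad query is not applied to it."""
    return "STOP SLAVE;"


def fast_forward_statements(log_file: str, log_pos: int) -> List[str]:
    """SQL that removes the delay and replays events until the given position.

    The coordinates are hypothetical: an operator would first locate the
    offending statement in the binary logs and pass the position before it.
    """
    return [
        "STOP SLAVE;",
        "CHANGE MASTER TO MASTER_DELAY = 0;",
        f"START SLAVE UNTIL MASTER_LOG_FILE = '{log_file}', "
        f"MASTER_LOG_POS = {log_pos};",
    ]
```

The functions only build statements, so they can be reviewed or logged before anything touches a server.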
Backups are great, however they are worthless if some unknown or uncaught error occurs corrupting the backup. A benefit of having a script to restore backups is it allows us to automate the verification of backups via cron. We have set up a dedicated host for each cluster that runs a restore of the latest backup. This ensures that the backup ran correctly and that we are able to retrieve the data from the backup.
Depending on dataset size, we run several restores per day. Restored servers are expected to join the replication stream and to be able to catch up with replication. This tests not only that we took a restorable backup, but also that we correctly identified the point in time at which it was taken and can further apply changes from that point in time. We are alerted if anything goes wrong in the restore process.
We furthermore track the time the restore takes, so we have a good idea of how long it will take to build a new replica or restore in cases of emergency.
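The restore-and-verify loop described above can be sketched as a small runner that executes the restore steps in order, measures total duration, and alerts on the first failure. The step names and the `alert` callback are assumptions for illustration; the real script's steps are not shown in this post.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def timed_restore(steps: List[Step], alert: Callable[[str], None]) -> Tuple[bool, float]:
    """Run restore steps in order; return (succeeded, elapsed_seconds).

    Stops at the first step that raises and reports it through `alert`.
    """
    started = time.monotonic()
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            alert(f"restore failed at step '{name}': {exc}")
            return False, time.monotonic() - started
    return True, time.monotonic() - started


if __name__ == "__main__":
    # Hypothetical steps; each would wrap a real command in practice.
    demo_steps: List[Step] = [
        ("fetch latest backup", lambda: None),
        ("prepare and start MySQL", lambda: None),
        ("join replication and catch up", lambda: None),
    ]
    ok, seconds = timed_restore(demo_steps, alert=print)
    print(f"restore ok={ok}, took {seconds:.3f}s")
```

Recording the elapsed time on every run is what gives a realistic estimate for building a new replica in an emergency.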
The following is an output from an automated restore process, written by Hubot in our robots chat room.
One thing we use backups for is adding a new replica to an existing set of MySQL servers. We initiate the build of a new server, and once we are notified it is ready, we can start a restore of the latest backup for that particular cluster. We have a script in place that runs all of the restore commands that we would otherwise have to run by hand. Our automated restore system essentially uses the same script. This simplifies the system build process and allows us to have a host up and running with a handful of chat commands as opposed to dozens of manual steps. Shown below is a restore kicked off manually in chat:
We use orchestrator to perform automated failovers for masters and intermediate masters. We expect orchestrator to correctly detect master failure, designate a replica for promotion, heal the topology under said designated replica, and make the promotion. We expect VIPs to change, pools to change, clients to reconnect, puppet to run essential components on the promoted master, and more. A failover is a complex task that touches many aspects of our infrastructure.
To build trust in our failovers we set up a production-like test cluster, and we continuously crash it to observe failovers.
The production-like cluster is a replication setup that is identical in all aspects to our production clusters: types of hardware, operating systems, MySQL versions, network environments, VIP, puppet configurations, haproxy setup, etc. The only difference is that this cluster doesn’t send or receive production traffic.
We emulate a write load on the test cluster, while avoiding replication lag. The write load is not too heavy, but has queries that intentionally contend to write on the same datasets. This isn’t too interesting in normal times, but proves to be useful upon failovers, as we will shortly describe.
Our test cluster has representative servers from three data centers. We would like the failover to promote a replacement replica from within the same data center. We would like to be able to salvage as many replicas as possible under such constraint. We require that both apply whenever possible.
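The two preferences above (same data center, salvage as many replicas as possible) can pull in different directions. The sketch below is a deliberately simplified model, not orchestrator's actual promotion logic: it assumes each replica exposes a single comparable `applied_position`, and that only replicas at or behind the promoted server can be re-attached under it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Replica:
    host: str
    data_center: str
    applied_position: int  # simplified: how far the replica has applied changes


def choose_promotion_candidate(replicas: List[Replica], failed_master_dc: str) -> Optional[Replica]:
    """Prefer a replica in the failed master's DC, then the most advanced one."""
    if not replicas:
        return None
    return max(
        replicas,
        key=lambda r: (r.data_center == failed_master_dc, r.applied_position),
    )


def salvageable(replicas: List[Replica], candidate: Replica) -> List[Replica]:
    """Replicas that are not ahead of the candidate (simplifying assumption)."""
    return [
        r for r in replicas
        if r != candidate and r.applied_position <= candidate.applied_position
    ]
```

In this model, promoting the same-DC server may leave a more advanced replica in another data center unsalvageable, which is the kind of trade-off a failover test needs to exercise.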
orchestrator makes no prior assumptions about the topology; it must react to whatever the state was at the time of the crash.
We, however, are interested in creating complex and varying scenarios for failovers. Our failover testing script prepares the grounds for the failover:
- It identifies the existing master
- It refactors the topology to have representatives of all three data centers under the master. Different DCs have different network latencies and are expected to react with different timing to the master’s crash.
- It chooses a crash method. We choose from shooting the master (kill -9) or network partitioning it: iptables -j REJECT (nice-ish) or iptables -j DROP (unresponsive).
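The crash-method step of the list above could look roughly like this. The sketch only builds the commands and does not run them; the port number, the `INPUT` chain and the rule shape are assumptions, since the post only names `kill -9`, `iptables -j REJECT` and `iptables -j DROP`.

```python
import random
from typing import Dict, List, Optional, Tuple

MYSQL_PORT = 3306  # assumption: MySQL listens on its default port


def crash_commands(mysqld_pid: int) -> Dict[str, List[str]]:
    """Map each crash method to the command that would be run on the master."""
    block = ["iptables", "-A", "INPUT", "-p", "tcp", "--dport", str(MYSQL_PORT), "-j"]
    return {
        "kill": ["kill", "-9", str(mysqld_pid)],  # abrupt process death
        "reject": block + ["REJECT"],             # nice-ish: peers get an error
        "drop": block + ["DROP"],                 # unresponsive: peers time out
    }


def choose_crash(mysqld_pid: int, rng: Optional[random.Random] = None) -> Tuple[str, List[str]]:
    """Pick one crash method at random and return (method, command)."""
    rng = rng or random.Random()
    commands = crash_commands(mysqld_pid)
    method = rng.choice(sorted(commands))
    return method, commands[method]
```

Passing a seeded `random.Random` makes a particular scenario reproducible when investigating a failed test.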
The script proceeds to crash the master by the chosen method, and waits for orchestrator to reliably detect the crash and to perform failover. While we expect detection and promotion to both complete within 30 seconds, the script relaxes this expectation a bit, and sleeps for a designated time before looking into failover results. It will then:
- Check that a new (different) master is in place
- Check that there is a good number of replicas in the cluster
- Check that the master is writable
- Check that writes to the master are visible on the replicas
- Check that internal service discovery entries are updated (identity of new master is as expected; old master removed)
- Other internal checks
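The checks listed above can be sketched as a function that inspects an observed cluster state and returns every failed expectation. The `ClusterState` fields and the `min_replicas` threshold are hypothetical; how the state is actually collected (queries, service discovery lookups) is outside this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterState:
    master: str                             # host currently acting as master
    master_writable: bool                   # e.g. read_only is off
    replicas: List[str]                     # healthy replicas under the master
    replicas_seeing_probe_write: List[str]  # replicas where a test write showed up
    discovery_master: str                   # master according to service discovery


def verify_failover(old_master: str, state: ClusterState, min_replicas: int) -> List[str]:
    """Return a list of failed expectations; an empty list means success."""
    failures = []
    if state.master == old_master:
        failures.append("master did not change")
    if len(state.replicas) < min_replicas:
        failures.append(f"only {len(state.replicas)} replicas, expected {min_replicas}")
    if not state.master_writable:
        failures.append("new master is not writable")
    missing = sorted(set(state.replicas) - set(state.replicas_seeing_probe_write))
    if missing:
        failures.append(f"writes not visible on: {', '.join(missing)}")
    if state.discovery_master != state.master:
        failures.append("service discovery does not point at the new master")
    return failures
```

Collecting all failures, rather than stopping at the first, gives the person investigating a fuller picture of what went wrong.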
These tests confirm that the failover was successful, not only MySQL-wise but also on our larger infrastructure scope. A VIP has been assumed; specific services have been started; information got to where it was supposed to go.
The script further proceeds to restore the failed server:
- Restoring it from backup, thereby implicitly testing our backup/restore procedure
- Verifying server configuration is as expected (the server no longer believes it’s the master)
- Returning it to the replication cluster, expecting to find data written on the master
Consider the following visualization of a scheduled failover test: from having a well-running cluster, to seeing problems on some replicas, to diagnosing that the master (7136) is dead, to choosing a server to promote (a79d), to refactoring the topology below that server, to promoting it (failover successful), to restoring the dead master and placing it back into the cluster.
Our testing script uses a stop-the-world approach. A single failure in any of the failover components fails the entire test, disabling any future automated tests until a human resolves the matter. We get alerted and proceed to check the status and logs.
The script would fail on an unacceptable detection or failover time; on backup/restore issues; on losing too many servers; on unexpected configuration following the failover; etc.
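One simple way to get stop-the-world behaviour is a marker file: a failed test writes it, every later run refuses to start while it exists, and only a human removes it. This is an illustrative sketch under that assumption; the post does not say how the real script persists its disabled state, and `FailoverGate` is a hypothetical name.

```python
from pathlib import Path
from typing import Callable


class FailoverGate:
    """Blocks further automated tests after a failure until a human clears it."""

    def __init__(self, marker: Path) -> None:
        self.marker = marker

    def allowed(self) -> bool:
        return not self.marker.exists()

    def block(self, reason: str) -> None:
        self.marker.write_text(reason + "\n")

    def clear(self) -> None:
        # Intended to be called by a human after investigating the failure.
        if self.marker.exists():
            self.marker.unlink()


def run_if_allowed(gate: FailoverGate, test: Callable[[], None]) -> str:
    """Run `test` unless a previous failure is still unresolved."""
    if not gate.allowed():
        return "skipped"
    try:
        test()
    except Exception as exc:
        gate.block(str(exc))
        return "failed"
    return "passed"
```

Because the marker stores the failure reason, whoever picks up the alert can see what tripped the gate before clearing it.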
We need to be certain orchestrator connects the servers correctly. This is where the contending write load comes in useful: if a replica is set up incorrectly, replication is likely to break. We would get DUPLICATE KEY or other errors to suggest something went wrong.
This is particularly important as we make improvements and introduce new behavior to orchestrator, and allows us to test such changes in a safe environment.
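To illustrate why contending writes expose a mis-connected replica, here is a small simulation. It uses an in-memory SQLite database as a stand-in for a MySQL replica and a list of statements as a stand-in for the binary log; both are assumptions made purely for illustration. Resuming from the wrong position re-applies an insert and fails with a uniqueness error, the analogue of MySQL's DUPLICATE KEY.

```python
import sqlite3

# Stand-in for a binary log: writes that contend on the same few rows.
EVENTS = [
    "INSERT INTO counters (id, hits) VALUES (1, 1)",
    "UPDATE counters SET hits = hits + 1 WHERE id = 1",
    "INSERT INTO counters (id, hits) VALUES (2, 1)",
    "UPDATE counters SET hits = hits + 1 WHERE id = 2",
]


def new_replica(applied: int) -> sqlite3.Connection:
    """Create a replica that has already applied the first `applied` events."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE counters (id INTEGER PRIMARY KEY, hits INTEGER)")
    for statement in EVENTS[:applied]:
        conn.execute(statement)
    return conn


def resume(conn: sqlite3.Connection, start_from: int) -> str:
    """Resume applying events from `start_from`; report whether it broke."""
    try:
        for statement in EVENTS[start_from:]:
            conn.execute(statement)
    except sqlite3.IntegrityError as exc:
        return f"replication broke: {exc}"
    return "ok"
```

For a replica that has applied three events, `resume(replica, 3)` succeeds while `resume(replica, 2)` breaks immediately. With a write load that never touched the same rows twice, the wrong position could go unnoticed for much longer.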
The testing procedure illustrated above will catch (and has caught) problems on many parts of our infrastructure. Is it enough?
In a production environment there’s always something else. Something about the particular test method that won’t apply to our production clusters. They don’t share the same traffic and traffic manipulation, nor the exact same set of servers. The types of failure can vary.
We are designing chaos testing for our production clusters. Chaos testing would literally destroy pieces of our production environment, but on an expected schedule and in a sufficiently controlled manner. Chaos testing introduces a higher level of trust in the recovery mechanism and affects (thus tests) larger parts of our infrastructure and application.
This is delicate work: while we acknowledge the need for chaos testing, we also wish to avoid unnecessary impact to our service. Different tests will differ in risk level and impact, and we will work to ensure availability of our service.
We use gh-ost to run live schema migrations. gh-ost is stable, but also under active development, with major new features being added or planned.
gh-ost migrates tables by copying data onto a ghost table and applying ongoing changes, intercepted from the binary logs, onto the ghost table, even as the original table is being written to. It then swaps the ghost table in place of the original table. At migration completion GitHub proceeds to work with a table generated and populated by gh-ost.
At this time almost all of GitHub’s MySQL data has been recreated by gh-ost, and most of it multiple times. We must have high trust in gh-ost to let it tamper with our data over and over again, even in the face of active development. Here’s how we gain this trust.
gh-ost provides a testing-in-production capability. It supports running a migration on a replica, in much the same way as it would run on the master: gh-ost would connect to the replica and treat it as if it were the master. It would parse its binary logs the same way it would for a real master migration. However, it would copy rows and apply binlog events to the replica, and avoid making writes onto the master.
We run gh-ost-dedicated replicas in production. These replicas do not serve production traffic. Each such replica retrieves the current list of production tables and iterates them in random order. One by one it picks a table and performs a replica-migration on that table. The migration doesn’t actually modify table structure, but instead runs a trivial ENGINE=InnoDB. The test runs the migration even as the table is being used in production, thus copying real production data and applying true production traffic off the binary logs.
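The iteration described above can be sketched as follows. The `gh-ost` flags used (`--host`, `--database`, `--table`, `--alter`, `--test-on-replica`, `--execute`) are real, but the command is intentionally incomplete: credentials and other options a real invocation needs are omitted, and the function names and host name are hypothetical.

```python
import random
from typing import Iterable, List, Optional, Tuple


def gh_ost_test_command(replica_host: str, schema: str, table: str) -> List[str]:
    """Build a no-op test migration for one table, to be run against a replica."""
    return [
        "gh-ost",
        f"--host={replica_host}",
        f"--database={schema}",
        f"--table={table}",
        "--alter=ENGINE=InnoDB",  # trivial change: rebuilds the table as-is
        "--test-on-replica",      # migrate on the replica, never on the master
        "--execute",
    ]


def plan_test_run(
    tables: Iterable[Tuple[str, str]],
    replica_host: str,
    rng: Optional[random.Random] = None,
) -> List[List[str]]:
    """Return one test command per (schema, table), in random order."""
    rng = rng or random.Random()
    shuffled = list(tables)
    rng.shuffle(shuffled)
    return [gh_ost_test_command(replica_host, s, t) for s, t in shuffled]
```

Shuffling on every pass means no table is systematically tested first or last, so a run that is interrupted partway still spreads coverage across the whole set over time.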
These migrations can be audited. Here’s how we can inspect status of running tests from chat:
When a test migration completes copying of table data it stops replication and performs the cut-over, replacing the original table with the ghost table, and then swaps back. We’re not interested in actually replacing the data. Instead we are left with both the original table and the ghost table, which should both be identical. We verify that by checksumming the entire table data for both tables.
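The comparison step can be illustrated with a minimal checksum over table rows. This sketch assumes both tables are read in the same order (for example, by primary key) while replication is stopped; the hashing scheme is our own choice for illustration and not necessarily what the test harness uses.

```python
import hashlib
from typing import Iterable, Sequence


def table_checksum(rows: Iterable[Sequence[object]]) -> str:
    """Fold every row, in order, into a single SHA-256 digest."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(tuple(row)).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def tables_match(original_rows: Iterable[Sequence[object]],
                 ghost_rows: Iterable[Sequence[object]]) -> bool:
    """True when both tables contain identical rows in identical order."""
    return table_checksum(original_rows) == table_checksum(ghost_rows)
```

Because rows are streamed into the digest one at a time, the full table never needs to be held in memory, which matters for production-sized tables.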
A test can complete with:
- success: All went well and checksum is identical. We expect to see this.
- failure: Execution problem. This can occasionally happen due to the migration process being killed, a replication issue etc., and is typically unrelated to gh-ost.
- checksum failure: table data inconsistency. For a tested branch, this calls for fixes. For an ongoing master branch test, this would imply immediate blocking of production migrations. We don’t get the latter.
Test results are audited, sent to robot chatrooms, and sent as events to our metrics systems. Each vertical line in the following graph represents a successful migration test:
These tests run continuously. We are notified by alerts in case of failures. And of course we can always visit the robots chatroom to know what’s going on.
We continuously improve gh-ost. Our development flow is based on git branches, which we then offer to merge via pull requests.
A gh-ost pull request goes through Continuous Integration (CI), which runs basic compilation and unit tests. Once past this, the PR is technically eligible for merging, but even more interestingly it is eligible for deployment via Heaven. Being the sensitive component in our infrastructure that it is, we take care to deploy gh-ost branches for intensive testing before merging into master.
Some PRs are small and do not affect the data itself. Changes to status messages, interactive commands etc. are of lesser impact to the gh-ost app. Others pose significant changes to the migration logic and operation. We test these rigorously, running through our production tables fleet until satisfied these changes pose no data corruption threat.
Throughout testing we build trust in our systems. By automating these tests, in production, we get repetitive confirmation that everything is working as expected. As we continue to develop our infrastructure we also follow up by adapting tests to cover the newest changes.
Production always surprises us with scenarios not covered by tests. The more we test in the production environment, the more input we get on our app’s expectations and our infrastructure’s capabilities.