Three new Campus Experts are joining the fall 2022 batch of the MLH Fellowship to work with open source maintainers and get real-world experience.
Ever since we made the leap at GitHub to upgrade off our fork of Rails and worked hard to stay up to date with the latest releases, we’ve consistently looked for ways to improve the Rails framework upstream. We do this in many ways – running GitHub off of Rails main, reporting and fixing bugs we find, and most importantly pushing functionality upstream that the entire Ruby community can benefit from.
Most recently, we extracted internal functionality to disable joining queries when an association crossed multiple databases. Prior to our work in this area, Rails had no support for handling associations that spanned across clusters; teams had to write SQL to achieve this.
At GitHub, we have 30 databases configured in our Rails monolith—15 primaries and 15 replicas. We use “functional partitioning” to split up our data, which means that each of those 15 primaries has a different schema. In contrast, a “horizontal sharding” approach would have 15 shards with the same schema.
While there are some workarounds for joining across clusters in MySQL, they are usually not performant or else they require additional setup. Without these workarounds, attempting to join from a table in cluster A to a table in cluster B would result in an error. To work around this limitation, teams had to write SQL, selecting IDs from the first table to then use in the second query to find the appropriate records. This was extra work and could be error-prone. We had an opportunity to make this process smoother by implementing non-join queries in Rails.
Let’s look at some code to see how this works:
Let’s say we have three models:
# table dogs in database animals class Dog < AnimalsRecord has_many: treats, through: :humans has_many :humans end # table humans in database people class Human < PeopleRecord has_many :treats has_many :dogs end # table treats in database default class Treat < ApplicationRecord has_many :dogs, through: :humans has_many :humans end
If our Rails application code loaded the
dog.treats association, usually that would automatically perform a join query:
SELECT treats.* FROM treats INNER JOIN humans ON treats.human_id = humans.id WHERE humans.dog_id = 2
Looking at the inheritance chain, we can see that
Human all inherit from different base classes. Each of these base classes belongs to a different database connection, which means that records for all three models are stored in different databases.
Since the data is stored across multiple primaries, when the join on
dog.treats is run we’ll see an application error:
ActiveRecord::StatementInvalid (Table 'people_db_cluster.humans' doesn't exist)
One of the best features Rails provides out of the box is generating SQL for you. But since GitHub’s data lives in different databases, we could no longer take advantage of this. We had an opportunity to improve Rails in a way that benefited our engineers and everyone else in the Rails community who uses multiple databases.
Prior to our work in this area, engineers working on any associations that crossed database boundaries would be forced to manually query IDs rather than using Active Record’s association APIs. Writing SQL can be error prone and defeats the purpose of Active Record’s convenience methods like
A little over two years ago, we started experimenting with an internal gem to disable joins for cross-database associations. We chose to implement this outside of Rails first so that we could work out the majority of bugs before merging to Rails. We wanted to be sure that we could use it successfully in production and that it didn’t cause any significant friction in development or any performance concerns in production. This is how many of Rails’ popular features get developed. We often extract implementations from large production applications – if it’s something we need and something a lot of applications can benefit from, we make it stable first, then upstream it to Rails.
The overall implementation is relatively small. To accomplish disabling joins, we added an option to
has_many :through associations called
disable_joins. When set to
true for an association, Rails will generate separate queries for each database rather than a join query.
This needed to be an option on the association rather than performed at runtime because Rails associations are lazily loaded – the SQL is generated when the association objects are created, which means that by the time Rails runs the SQL to load
dog.treats the join will already be generated. After adding the option in Rails, we implemented a new scoping class that would handle the order, limit, scopes, and other options.
Now applications can add the following to their associations to make sure Rails generates two or more queries instead of joins:
class Dog < AnimalsRecord has_many: treats, through: :humans, disable_joins: true has_many :humans end
And that’s all that’s needed to disable generating joins for associations that cross database servers!
Now, calls to
dog.treats will generate the following SQL:
SELECT "humans"."id" FROM "humans" WHERE "humans"."dog_id" = ? [["dog_id", 1]] SELECT "treats".* FROM "treats" WHERE "treats"."human_id" IN (?, ?, ?) [["human_id", 1], ["human_id", 2], ["human_id", 3]]
There are a couple of important caveats to keep in mind when using this new feature. Applications that need to disable joins may see that those associations have slower database performance. This would be true whether you wrote the SQL manually or use Rails’ new
disable_joins feature. Fundamentally, if you’re performing multiple queries across multiple databases, that can be slower than performing a single join query on one database. It’s really important to make sure that your queries are efficient and that proper indexes are in place before using this feature. And as always, when making major changes to your database queries, it’s important to benchmark and understand how those changes will affect your application.
Additionally, if your queries rely on an order and limit from the join database you may see a performance impact on requests. When two queries are joined, MySQL can perform the order based on the table that’s joined (i.e.,
order by humans.human_id which would order the returned treats by the human ID). However, when you’re splitting queries the order can’t be applied by the database. To solve this, Rails orders the records in-memory based on the order they would have been returned if there was a join. This preserves the expected behavior but since the order and limit are performed in-memory, you’ll usually want to avoid performing these actions on hundreds of thousands of records.
Four years ago, contributing to Rails at this level was just something we’d hoped to be able to do one day. We were so far behind in our upgrades that it was difficult to contribute changes to the framework. This might look like a small change but it’s a clear demonstration of the hard work we’ve done to improve the technical debt in our application and ensure that we’re giving back to the community whenever we can.
By adding support for handling associations across databases, we help empower other applications to scale as their traffic and data grow. Additionally, by pushing this code into Rails and out of our private internal gem we’ll find even more improvements and edge cases in applications that aren’t GitHub. As we continue to grow and improve Rails for us at GitHub, we’ll continue to improve it for the entire community. This pull request is just one example of how we intend to do that for years to come.