About a year ago, we migrated an old rate limiter in order to serve more traffic and accommodate a more resilient platform architecture. We adopted a replicated Redis backend with client-side sharding. In the end, it worked out great, but we learned some lessons along the way.
We had an old rate limiter that was simple enough:
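It was, roughly speaking, a fixed-window counter in Memcached. A minimal sketch of that shape (illustrative names and numbers, not the original code) might look like:

```ruby
# Illustrative sketch only: a fixed-window counter in a memcached-style cache,
# keyed by client and window. The limit, window size, and cache API are assumptions.
LIMIT          = 5000
WINDOW_SECONDS = 3600

def rate_limited?(cache, client_id)
  window_start = Time.now.to_i / WINDOW_SECONDS
  key = "rate-limit:#{client_id}:#{window_start}"

  cache.add(key, 0, WINDOW_SECONDS)  # no-op if this window's counter already exists
  count = cache.incr(key, 1)         # count this request
  count > LIMIT
end
```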
(There might have been more nuance to it, but that’s the main idea.)
However, this limiter had two problems:
After some discussion, we decided on a new design for the rate limiter:
One option we considered but decided against was using our MySQL-backed KV store (GitHub::KV) for storage. We didn’t want to add traffic to already-busy MySQL primaries: usually, we use replicas for GET requests, but rate limit updates would require write access to a primary. By choosing a different storage backend, we could avoid the additional (and substantial) write traffic to MySQL.
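Since the new backend was a replicated Redis deployment with client-side sharding, routing a client to its shard can be as simple as hashing the client ID over a fixed host list. A rough sketch (hostnames, hashing, and setup here are assumptions, not details from our deployment):

```ruby
require "redis"
require "zlib"

# Each shard is a small, replicated Redis deployment. A given client id always
# hashes to the same shard, so all of its counters live in one place.
RATE_LIMIT_SHARDS = [
  Redis.new(url: "redis://rate-limit-1:6379"),
  Redis.new(url: "redis://rate-limit-2:6379"),
  Redis.new(url: "redis://rate-limit-3:6379"),
].freeze

def redis_for(client_id)
  RATE_LIMIT_SHARDS[Zlib.crc32(client_id.to_s) % RATE_LIMIT_SHARDS.size]
end
```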
Another advantage to using Redis is that it’s a well-traveled path. We could take inspiration from two excellent existing resources:
To roll out this change, we isolated the current persistence logic into a MemcachedBackend class, and built a new RedisBackend class for the rate limiter. We used a feature flag to gate access to the new backend. This allowed us to gradually increase the percentage of clients using the new backend. We could change the percentage without a deploy, which meant, if something went wrong, we could quickly switch back to the old implementation.
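Roughly, the gating looked something like this (a sketch assuming a Flipper-style percentage-of-actors flag; the flag library and names here are illustrative):

```ruby
require "flipper"

# Hypothetical sketch: choose a persistence backend per client based on a feature flag.
# `client` is assumed to be an object usable as a feature-flag actor.
class Throttler
  def backend_for(client)
    if Flipper.enabled?(:redis_rate_limiter, client)
      RedisBackend.new
    else
      MemcachedBackend.new
    end
  end
end

# The rollout percentage can be changed at runtime, without a deploy:
#   Flipper.enable_percentage_of_actors(:redis_rate_limiter, 25)
#   Flipper.disable(:redis_rate_limiter)  # quick rollback if something goes wrong
```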
The release went smoothly, and when it was done, we removed the feature flag and the MemcachedBackend class, and integrated RedisBackend directly with the Throttler class that delegated to it.
Then, the bug reports started flowing in….
A lot of integrators watch their rate limit usage very closely. We got two really interesting bug reports in the weeks following our release:
1. The X-RateLimit-Reset header value “wobbled” – it might show 2020-01-01 10:00:00 for one request, but 2020-01-01 10:00:01 on another request (a one-second difference).
2. Some requests were rejected even though the response included X-RateLimit-Remaining: 5000. That didn’t make sense: if they had a full rate limit window ahead of them, why was the request rejected?
We were optimistic about using Redis’s built-in time-to-live (TTL) to implement our “reset at” feature. But it turns out, my implementation caused the “wobble” described above.
The Lua script returned the TTL of the client’s rate limit value, and then, in Ruby, it was added to Time.now.to_i to get a timestamp for the X-RateLimit-Reset header. The problem was, time passes between the call to TTL (in Redis) and Time.now.to_i (in Ruby). Depending on exactly how much time passed, and where it fell on the clock’s second boundary, the resulting timestamp might be different. For example, consider the following calls:
(The original table compared two requests, column by column: when the Redis call began, the latency, the TTL returned by Redis, more latency, when Time.now returned, and the sum of TTL and Time.now. The second request came a half-second after the first.)
In that case, since the second boundary happened between the call to TTL and Time.now, the resulting timestamp was one second bigger than the previous ones.
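In Ruby terms, the wobble came down to adding values read from two different clocks (variable names here are illustrative):

```ruby
# The reset timestamp was assembled from two different clocks:
ttl_seconds = redis.ttl(rate_limit_key)    # whole seconds, measured when Redis runs the command
reset_at    = Time.now.to_i + ttl_seconds  # Ruby's clock, read a few milliseconds later

# If the wall clock ticks over a second boundary between those two reads,
# Time.now.to_i is one second larger, so reset_at comes out one second later
# than it did for a previous request, even though nothing changed in Redis.
```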
We could have tried increasing the precision of this operation (e.g., Redis PTTL), but there would still have been some wobble, even if it was greatly reduced.
Another possibility was to calculate the time using only Redis, instead of mixing Ruby and Redis calls to create it. Redis’s TIME command could have been used as the source of truth. (Old Redis versions didn’t allow TIME in Lua scripts, but Redis 5+ does.) We avoided this design because it would have been harder to test: by using Ruby’s time as the source of truth, I could time-travel in my tests with Timecop, asserting that expired keys were handled correctly without actually waiting for Redis’s calls to the system clock to return genuinely future times. (I still had to wait on Redis to test the EXPIRE-based database cleanup, but since expires_at came from Ruby-land, I could inject very short expiration windows to simplify testing.)
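For example, with Ruby’s clock as the source of truth, a test can time-travel past the window instead of sleeping (a hypothetical API, just to show the shape):

```ruby
require "timecop"

limiter = RateLimiter.new(limit: 5000, window_seconds: 3600)  # hypothetical API
limiter.increment("client-1")

# Jump past the end of the window. Because the current time is passed in from Ruby,
# the next call treats the old window as expired immediately; there's no need to
# wait for Redis to expire the keys on its own clock.
Timecop.travel(Time.now + 3601) do
  limiter.remaining("client-1")  # => back to the full limit
end
```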
Instead, we decided to persist the “reset at” time from Ruby in the database. That way, we could be sure it wouldn’t wobble. (Wobbling was an effect of the calculation – but reading from the database would guarantee a stable value.) Instead of reading TTL from Redis, we stored another value in the database (effectively doubling our storage footprint, but OK).
We still applied a TTL to rate limit keys, but they were set for one second after the “reset at” time. That way, we could use Redis’s own semantics to clean up “dead” rate limit windows.
Weirdly, many clients reported rejections that included X-RateLimit-Remaining: 5000 headers. What’s going on!?
We found another problem that worked like this:

1. The application read the client’s current rate limit state from Redis to decide whether the request should be allowed.
2. The application recorded the request in Redis and used the returned state to build the X-RateLimit-* response headers.
Well, it turned out that Step 1 above hit a Redis replica, since it was a read operation. The read operation returned information about the client’s previous window, and the application prepared a rejection response.
Then, Step 2 would hit a Redis primary. During that database call, Redis would expire the previous window data and return data for a fresh rate limit. This is a known limitation of Redis: replicas don’t expire data until they receive instructions to do so from their primaries, and primaries don’t expire keys until they’re accessed (GitHub issue). (In fact, primaries do randomly sample keys from time to time, expiring them as appropriate, see “How Redis Expires Keys”.)
Addressing this issue required two things:
Here are the Lua scripts we ended up with for implementing this pattern:
```lua
-- RATE_SCRIPT:
-- count a request for a client
-- and return the current state for the client

-- rename the inputs for clarity below
local rate_limit_key = KEYS[1]
local increment_amount = tonumber(ARGV[1])
local next_expires_at = tonumber(ARGV[2])
local current_time = tonumber(ARGV[3])

local expires_at_key = rate_limit_key .. ":exp"
local expires_at = tonumber(redis.call("get", expires_at_key))

if not expires_at or expires_at < current_time then
  -- the previous window has ended (or never existed), so start a fresh one,
  -- even if Redis hasn't cleaned up the old keys yet
  expires_at = next_expires_at
  redis.call("set", expires_at_key, expires_at)
  redis.call("set", rate_limit_key, 0)
  -- let Redis clean these keys up one second _after_ the window ends
  redis.call("expireat", expires_at_key, expires_at + 1)
  redis.call("expireat", rate_limit_key, expires_at + 1)
end

-- count this request and return the current state
local current = redis.call("incrby", rate_limit_key, increment_amount)
return { current, expires_at }
```
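Calling a script like this from Ruby might look roughly like the following (an illustrative sketch using redis-rb’s eval; the key name and window math are assumptions):

```ruby
require "redis"

redis = Redis.new(url: "redis://rate-limit-1:6379")

now = Time.now.to_i
count, reset_at = redis.eval(
  RATE_SCRIPT,                      # the Lua source shown above
  keys: ["rate-limit:client-1"],
  argv: [1, now + 3600, now]        # increment amount, next window end, current time
)

remaining = 5000 - count
```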
We’ve learned a lot from this new approach, but there’s still one shortcoming we’re considering: the current implementation doesn’t increment the “current” rate limit value until after the request is finished. We do this because we don’t charge clients for 304 Not Modified responses (this can happen when the client provides an ETag). A better implementation might increment the value when the request starts, then refund the client if the response is a 304. That would prevent some edge cases where a client can exceed its limit when the final allowed request is still being processed.
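Sketched out, that alternative could look something like this (hypothetical helper names, not the current implementation):

```ruby
# Charge the client when the request starts, then refund if the response
# turns out to be a 304 Not Modified (which isn't charged today).
def with_rate_limiting(client_id)
  state = rate_limiter.increment(client_id)          # hypothetical limiter API
  return rate_limited_response(state) if state.over_limit?

  response = yield
  rate_limiter.refund(client_id) if response.status == 304
  response
end
```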
After working out the issues described in this post, the new rate limiter has worked great. It has improved reliability, fixed issues for clients, and reduced our support load (eventually 😉), and the architecture is ready for our next wave of platform improvements.