
GitHub Availability Report: August 2023
In August, we experienced two incidents that resulted in degraded performance across GitHub services.
About a year ago, we migrated an old rate limiter in order to serve more traffic and accommodate a more resilient platform architecture. We adopted a replicated Redis backend with…
About a year ago, we migrated an old rate limiter in order to serve more traffic and accommodate a more resilient platform architecture. We adopted a replicated Redis backend with client-side sharding. In the end, it worked out great, but we learned some lessons along the way.
We had an old rate limiter that was simple enough:
#{key}:reset_at
“)(There might have been more nuance to it, but that’s the main idea.)
However, this limiter had two problems:
After some discussion, we decided on a new design for the rate limiter:
One option we considered but decided against was using our MySQL-backed KV store (GitHub::KV) for storage. We didn’t want to add traffic to already-busy MySQL primaries: usually, we use replicas for GET requests, but rate limit updates would require write access to a primary. By choosing a different storage backend, we could avoid the additional (and substantial) write traffic to MySQL.
Another advantage to using Redis is that it’s a well-traveled path. We could take inspiration from two excellent existing resources:
To roll out this change, we isolated the current persistence logic into a MemcachedBackend
class, and built a new RedisBackend
class for the rate limiter. We used a feature flag to gate access to the new backend. This allowed us to gradually increase the percentage of clients using the new backend. We could change the percentage without a deploy, which meant, if something went wrong, we could quickly switch back to the old implementation.
The release went smoothly, and when it was done, we removed the feature flag and the MemcachedBackend
class, and integrated RedisBackend
directly with the Throttler
class that delegated to it.
Then, the bug reports started flowing in….
A lot of integrators watch their rate limit usage very closely. We got two really interesting bug reports in the weeks following our release:
X-RateLimit-Reset
header value “wobbled” – it might show 2020-01-01 10:00:00
for one request, but 2020-01-01 10:00:01
on another request (with one second difference).X-RateLimit-Remaining: 5000
. That didn’t make sense: if they had a full rate limit window ahead of them, why was the request rejected?How odd!
We were optimistic about using Redis’s built-in time-to-live (TTL) to implement our “reset at” feature. But it turns out, my implementation caused the “wobble” described above.
The Lua script returned the TTL of the client’s rate limit value, and then in Ruby, it was added to Time.now.to_i
to get a timestamp for the X-RateLimit-Reset
header. The problem was, time passes between the call to TTL (in Redis) and Time.now.to_i
(in Ruby). Depending exactly how much time, and where it fell on the clock’s second boundary, the resulting timestamp might be different. For example, consider the following calls:
Redis call begins | latency | TTL (Redis) | latency | Time.now returns | sum of TTL and Time.now |
---|---|---|---|---|---|
10:00:04.2 | 0.1 | 5 | 0.1 | 10:00:05.4 | 10:00:10.1 |
(then, a half-second later) | |||||
10:00:05.9 | 0.05 | 5 | 0.1 | 10:00:06.05 | 10:00:11.05 |
In that case, since the second boundary happened between the call to TTL and Time.now
, the resulting timestamp was one second bigger than the previous ones.
We could have tried increasing the precision of this operation (eg, Redis PTTL), but there would still have been some wobble, even if it was greatly reduced.
Another possibility was to calculate the time using only Redis, instead of mixing Ruby and Redis calls to create it. Redis’s TIME
command could have been used as the source of truth. (Old Redis versions didn’t allow TIME
in Lua scripts, but Redis 5+ does.) We avoided this design because it would have been harder to test: by using Ruby’s time as the source of truth, I could time-travel in my tests with Timecop, asserting that expired keys were handled correctly without actually waiting for Redis’s calls to the system clock to return true, future times. (I still had to wait on Redis to test the EXPIRE
-based database cleanup, but since expires_at
came from Ruby-land, I could inject very short expiration windows to simplify testing.)
Instead, we decided to persist the “reset at” time from Ruby in the database. That way, we could be sure it wouldn’t wobble. (Wobbling was an effect of the calculation – but reading from the database would guarantee a stable value.) Instead of reading TTL from Redis, we stored another value in the database (effectively doubling our storage footprint, but OK).
We still applied a TTL to rate limit keys, but they were set for one second after the “reset at” time. That way, we could use Redis’s own semantics to clean up “dead” rate limit windows.
Weirdly, many clients reported rejections that included X-RateLimit-Remaining: 5000
headers. What’s going on!?
We found another problem that worked like this:
X-RateLimit-...
headers.Well, it turned out that Step 1 above hit a Redis replica, since it was a read operation. The read operation returned information about the client’s previous window, and the application prepared a rejection response.
Then, Step 2 would hit a Redis primary. During that database call, Redis would expire the previous window data and return data for a fresh rate limit. This is a known limitation of Redis: replicas don’t expire data until they receive instructions to do so from their primaries, and primaries don’t expire keys until they’re accessed (GitHub issue). (In fact, primaries do randomly sample keys from time to time, expiring them as appropriate, see “How Redis Expires Keys”.)
Addressing this issue required two things:
X-RateLimit-...
headers.Here are the Lua scripts we ended up with for implementing this pattern:
RATE_SCRIPT:
-- count a request for a client
-- and return the current state for the client
-- rename the inputs for clarity below
local rate_limit_key = KEYS[1]
local increment_amount = tonumber(ARGV[1])
local next_expires_at = tonumber(ARGV[2])
local current_time = tonumber(ARGV[3])
local expires_at_key = rate_limit_key .. ":exp"
local expires_at = tonumber(redis.call("get", expires_at_key))
if not expires_at or expires_at < current_time then
-- this is either a brand new window,
-- or this window has closed, but redis hasn't cleaned up the key yet
-- (redis will clean it up in one more second)
-- initialize a new rate limit window
redis.call("set", rate_limit_key, 0)
redis.call("set", expires_at_key, next_expires_at)
-- tell Redis to clean this up _one second after_ the expires-at time.
-- that way, clock differences between Ruby and Redis won't cause data to disappear.
-- (Redis will only clean up these keys "long after" the window has passed)
redis.call("expireat", rate_limit_key, next_expires_at + 1)
redis.call("expireat", expires_at_key, next_expires_at + 1)
-- since the database was updated, return the new value
expires_at = next_expires_at
end
-- Now that the window is either known to already exist _or_ be freshly initialized,
-- increment the counter (`incrby` returns a number)
local current = redis.call("incrby", rate_limit_key, increment_amount)
return { current, expires_at }
-- CHECK_SCRIPT:
-- Getting both the value and the expiration
-- of key as needed by our algorithm needs to be ran
-- in an atomic way, hence the script.
-- rename the inputs for clarity below
local rate_limit_key = KEYS[1]
local expires_at_key = rate_limit_key .. ":exp"
local current_time = tonumber(ARGV[1])
local tries = tonumber(redis.call("get", rate_limit_key))
local expires_at = nil -- maybe overridden below
if not tries then
-- this client hasn't initialized a window yet
-- let this fall through to returning {nil, nil},
-- where the application will provide defaults
else
-- we found a number of tries, now check
-- if this window is actually expired
expires_at = tonumber(redis.call("get", expires_at_key))
if not expires_at or expires_at < current_time then
-- this window hasn't been cleaned up by Redis yet, but it has closed.
-- (maybe it was _partly_ cleaned up, if we found `tries` but not `expires_at`)
-- ignore the data in the database; return a fresh window instead
tries = nil
expires_at = nil
end
end
-- Maybe {nil, nil} if the window is brand new (or expired)
return { tries, expires_at }
We’ve learned a lot from this new approach, but there’s still one shortcoming we’re considering: the current implementation doesn’t increment the “current” rate limit value until after the request is finished. We do this because we don’t charge clients for 304 Not Modified
responses (this can happen when the client provides an E-Tag). A better implementation might increment the value when the request starts, then refunds the client if the response is 304
. That would prevent some edge cases where a client can exceed its limit when the final allowed request is still being processed.
After working out the issues described in this post, the new rate limiter has worked great. It has improved reliability, fixed issues for clients, and reduced our support load (eventually 😉) and the architecture is ready for our next wave of platform improvements.