Add support for fill lock with lock wait to avoid thundering herd problem #373
Conversation
```ruby
else # hit
  return value
end
new_lock.cache_value # take lock
```
Currently the code converts the lock values to primitives so that they can be serialized and deserialized with `msgpack`. However, an alternative would be to add a special value for `msgpack` to handle the hydration of these values instead of doing it explicitly in the code. With `Marshal` as the serializer, it can handle this for us as long as we are careful about not changing the class name or instance variable names in an incompatible way.
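A minimal sketch of the extension-type alternative, assuming the msgpack gem's `MessagePack::Factory#register_type` API, that `FillLock` is nested under `IdentityCache::CacheFetcher`, and the keyword initializer suggested later in this thread; the type id is arbitrary:

```ruby
require "msgpack"

factory = MessagePack::Factory.new
factory.register_type(
  0x01, # arbitrary extension type id, chosen for illustration
  IdentityCache::CacheFetcher::FillLock,
  packer: ->(lock) { MessagePack.pack([lock.client_id, lock.data_version]) },
  unpacker: ->(data) {
    client_id, data_version = MessagePack.unpack(data)
    IdentityCache::CacheFetcher::FillLock.new(client_id: client_id, data_version: data_version)
  }
)

# factory.pack(lock) / factory.unpack(bytes) would then round-trip lock
# objects without converting them to primitives by hand.
```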
```diff
@@ -100,9 +101,11 @@ def should_use_cache? # :nodoc:
   # == Parameters
   # +key+ A cache key string
   #
-  def fetch(key)
+  def fetch(key, cache_fetcher_options = {})
```
nit, but I think you could splat the options hash all the way through. This way you won't have to update any assertions/expectations and everything will keep working as expected, no?
Capturing keyword arguments using `**` has the performance disadvantage of duplicating the hash for each method call.

The other annoying thing with the test expectations is that some of them use the spy gem, which doesn't yet support keyword arguments (although I've opened ryanong/spy#12 for that issue). This prevents me from explicitly passing options through without a splat for any method that is referenced by `Spy.on` in the test suite.
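To illustrate the trade-off being discussed, a self-contained sketch with hypothetical names (under pre-Ruby-3 keyword semantics):

```ruby
class Fetcher
  def fetch(key, options = {})
    [key, options]
  end
end

FETCHER = Fetcher.new

# Capturing with ** allocates a fresh hash on every call while forwarding:
def fetch_splat(key, **options)
  FETCHER.fetch(key, options)
end

# Passing a plain options hash through forwards the same object, with no
# per-call duplication, at the cost of a less explicit signature:
def fetch_hash(key, options = {})
  FETCHER.fetch(key, options)
end

fetch_splat("key", fill_lock_duration: 0.1) # keyword hash duplicated by **
fetch_hash("key", fill_lock_duration: 0.1)  # same hash object passed through
```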
My upstream spy PR has been merged and released, so I'll see what I can clean up as a result of that.
I'm generally good with the solution, also thanks for the explanations! 🙏 I prefer sleeping in the unicorn to hammering redis. I've noticed a few branches that seem to be untested. 👍
Still 👀
lib/identity_cache/cache_fetcher.rb
```ruby
  data
end

def fetch_with_fill_lock(key, fill_lock_duration:, lock_wait_limit: 2)
```
Control flow in this method is ... complicated. It's a case of "not obviously wrong" instead of "obviously not wrong", which really makes me hesitate in approving this PR. If the problem is the block-structured `cas` API in the memcached gem (and other layers above it), we should expose the API we need to make this simple. Effectively `single_get` and `single_cas`. Modelling this on the Dalli API for `get_cas` and `set_cas` might help with future migration.
I don't think I would be able to reduce the number of code paths, so there is a certain amount of complexity that won't go away. So this still might make it hard to say the code is "obviously not wrong".

The block-structured `cas` API definitely puts constraints on how the code is written and makes it more awkward to write, so I agree that we should try to get it modelled like Dalli upstream in the memcached_store and memcached gems. For instance, this might be helpful for implementing fill lock support for `fetch_multi`. The change to memcached_store should be easier since we maintain it. I'll have to see how easy it is to get a change upstream for the memcached gem.
For reading the code, I find the following two things awkward about the `cas` method:

- the cas set operation is more implicit by just returning a value from a block. This has also led to `assert_memcache_operations` not even counting these set operations
- `break` or `return` are needed to avoid the cas set operation.

This can be seen by comparing this code using `cas`:

```ruby
@cache_backend.cas(key, **expiration_options) do |value|
  break unless FillLock.cache_value?(value)
  lock = FillLock.from_cache(value)
  lock.mark_failed
  lock.cache_value
end
```

with equivalent code using `get_cas` and `set_cas`:

```ruby
value, cas = @cache_backend.get_cas(key)
if FillLock.cache_value?(value)
  lock = FillLock.from_cache(value)
  lock.mark_failed
  @cache_backend.set_cas(key, lock.cache_value, cas, **expiration_options)
end
```
lib/identity_cache/cache_fetcher.rb
```ruby
  yield
rescue Exception
  @cache_backend.cas(key, **expiration_options) do |value|
    break unless FillLock.cache_value?(value)
```
IIRC this will exit the `cas` method above and jump to the `raise` on line 127 below. What happens if we just `raise` here instead?
Right, it will break out of the `cas` and then `raise`. I could change the `break` here to a `raise` and it would effectively do the same thing. Mostly I just wrote it this way because I was thinking of the `cas` to mark the failure independently of the re-raising, so this break line just says to only do that if it is a fill lock.

However, we can't just `raise` here, since we still need the `raise` that comes after the `cas` block.
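A condensed sketch of the control flow under discussion, stitched together from the diff hunks quoted above (not the exact PR code):

```ruby
begin
  yield # run the cache fill
rescue Exception
  @cache_backend.cas(key, **expiration_options) do |value|
    # `break` exits the cas block without sending the cas set operation,
    # so only fill lock values get marked as failed
    break unless FillLock.cache_value?(value)
    lock = FillLock.from_cache(value)
    lock.mark_failed
    lock.cache_value # returning a value from the block sends the cas set
  end
  raise # re-raise whether or not we broke out of the cas block
end
```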
I just have a couple of small comments, but overall this looks good to me.
@dylanahsmith would you be able to rebase this, and I'll try to find another reviewer?
lib/identity_cache/cache_fetcher.rb
```diff
-def add(key, value)
-  @cache_backend.write(key, value, :unless_exist => true) if IdentityCache.should_fill_cache?
+def add(key, value, **expiration_options)
+  return unless IdentityCache.should_fill_cache?
```
Should this explicitly return `false`?
```ruby
end

def test_fetch_without_lock_miss
  assert_memcache_operations(2) do # get, add
```
A nice future enhancement might be to have `assert_memcache_operations` actually assert the operation types.
Agreed. It should also distinguish between `@cache_backend.cas` calls that break in the block (i.e. only send a `get` operation) and calls where the block returns (i.e. `get` and `cas` operations get sent).
lib/identity_cache/cache_fetcher.rb
```ruby
expires_in = expires_in.ceil

# memcached TTL only gets part of the first second, so increase TTL by 1 to compensate
expires_in + 1
```
This line does nothing.
Thanks for catching that. Fixed it to be `+=`, which is what I intended.
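Putting the two adjustments together, the fixed computation presumably looks something like this (a sketch; the helper name is hypothetical):

```ruby
# Round a fractional TTL up (the cache store rounds down with `to_i`), then
# add one second because memcached's whole-second granularity means the key
# only gets part of the first second (memcached/memcached#307).
def normalized_expires_in(expires_in)
  expires_in = expires_in.ceil
  expires_in += 1
  expires_in
end

normalized_expires_in(0.5) # => 2
```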
lib/identity_cache/cache_fetcher.rb
```ruby
# the cache store round down with `to_i`
expires_in = expires_in.ceil

# memcached TTL only gets part of the first second, so increase TTL by 1 to compensate
```
Is this trying to say the same thing as the comment on line 238? Is the code on line 243 redundant?
No, that other comment was about conversion to an integer. This one is about the second granularity memcached uses client side, which can cause a TTL of `1` to expire nearly immediately (memcached/memcached#307). I'll link to that issue to help clarify.
lib/identity_cache/query_api.rb
```ruby
# to another client taking the lock first. If another lock wait would be needed after
# reaching the limit, then a `LockWaitTimeout` exception is raised. Default is 2. Use
# this to control the maximum total lock wait duration
# (`lock_wait_limit * fill_lock_duration`).
```
The API here is awkward and the abstraction is leaky. From a user's perspective, what we care about is "how long do I want to wait", and we should really specify that directly. We want to specify a minimum, based on expected refill time, and a maximum we're willing to tolerate. The `lock_wait_limit` leaks an implementation detail that users shouldn't have to care about (and which complicates tweaking the algorithm later).
Specifying a number of times to retry doesn't seem like a leaky abstraction, and seems like it is how we normally handle retries. In fact, I can't think of a place where we do retries for a certain amount of time instead of having a retry limit.

As you mentioned, we need to know how long to wait based on the expected refill time, so it seems like `fill_lock_duration` is necessary.

What type of tweaks did you have in mind that the current configuration doesn't provide a suitable abstraction for? It seems to me that we could use the same configuration if we switched to redis to store the lock or if we just did it in-memory in a multi-threaded process.
Exponential backoff is a pretty common approach that would map awkwardly to this API.
My point about leaky abstractions is that users want to specify their desired constraints under contention. It's awkward to expose the knobs as "how often should I try" and "how many times should I try" when the user really wants to tell us "I expect the refill to take X seconds" and "I don't want to wait longer than Y seconds".
What you are proposing is just a different abstraction. That doesn't make the configuration I proposed a leaky abstraction, it just isn't an abstraction around the change you want to be able to make.

As a user of the API, I would hate for it to silently change to exponential backoff without explicitly opting into that. Since the sleep is synchronous, we are wasting the thread's time, which could eat into capacity. We don't want to wait for a problem to be resolved in the current process; we want to fail fast if we know there is a problem and rely on semian to back off in our attempts to retry accessing the resource.

Also, as a user of the API, I find the configuration API you are describing awkward for expressing what I actually want. Since this is specifically to solve the thundering herd problem, I really do care about how long the fill lock is taken, since that limits the frequency of attempts to fill the cache key under contention. I want it to do at most 2 lock waits, both because that gracefully handles cache fills that take up to double the typically allowed time and because it allows the fill to be interrupted once by a cache invalidation. If I try to express the same thing as a duration (i.e. `lock_wait_limit * fill_lock_duration`), then it is no longer precise due to floating point math. Depending on the behaviour chosen for the remaining time, it could either lose the second attempt if there is just under the fill lock duration remaining, or perform a last attempt immediately after the second-last attempt if there is just over the fill lock duration remaining.

I think if we build something that works well for us, then it is more likely to be suitable for other users that encounter the same thundering herd problem.
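For a concrete illustration of the floating point imprecision (the numbers are mine, not from the thread):

```ruby
fill_lock_duration = 0.1
total_wait = 0.3 # intended to allow 3 lock waits

# 0.3 / 0.1 is 2.9999999999999996 in binary floating point, so deriving the
# attempt count back from a duration silently drops an attempt:
(total_wait / fill_lock_duration).floor # => 2
```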
lib/identity_cache/cache_fetcher.rb
```ruby
def initialize(cache_value)
  @cache_value = cache_value
  _, @client_id, @data_version = cache_value
```
Is there a significant performance benefit to initializing this with an array and storing both the array and its components? It seems like it'd be cleaner to:

```ruby
attr_reader :client_id, :data_version

def initialize(client_id:, data_version:)
  @client_id = client_id
  @data_version = data_version
end

def cache_value
  [FILL_LOCKED, @client_id, @data_version]
end
```
We already need to support initializing it from an array when it comes from the cache, and storing the array simplified equality checking.

I don't think there would be a significant performance impact from not storing the array, so I'll change it to what you suggested.
```ruby
    return result
  end
end
raise "unexpected number of loop iterations"
```
It seems a bit rough to have a non-specific exception raised here. What can the caller do about it?

I can see two paths that reach here: if the last iteration of the loop was interrupted by cache invalidation (`fetch_or_take_lock` returned `IdentityCache::DELETED` and we have a `lock` and we didn't previously try using the fallback key), or if we `oren`ed on the last iteration of the loop. I feel like we should not `raise` here, and we should instead try to handle both those paths in a sane way.
> What can the caller do about it?

The caller isn't meant to do anything about it, because getting here means there is a bug in the code. The `(lock_wait_limit + 2).times do` loop could be replaced with an infinite loop, but I wanted to make it clear that it shouldn't have an infinite loop here, so this check prevents that from happening.

Cache invalidation interrupting the fill would be the fallback case that is mentioned in the `# +2 is for first attempt and retry with fallback key` comment at the start of the loop. What does `oren` stand for? I don't know what you mean by that path.
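The guard pattern being defended, as a minimal self-contained sketch (the helper and block here are hypothetical, not the PR code):

```ruby
def fetch_with_bounded_attempts(lock_wait_limit)
  # +2 is for the first attempt and the retry with the fallback key
  (lock_wait_limit + 2).times do
    result = yield
    return result unless result.nil? # every finished attempt returns here
  end
  # Every path through the loop body should return or retry a bounded number
  # of times, so reaching this line indicates a bug rather than a caller error.
  raise "unexpected number of loop iterations"
end

attempts = 0
fetch_with_bounded_attempts(2) do
  attempts += 1
  attempts == 3 ? "value" : nil # succeeds on the third attempt
end
# => "value"
```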
> What does `oren` stand for?

Old #webscale joke that should probably be retired: `sleep`.
Rebased and pushed changes to address feedback.

@shuhaowu I feel like TLA+ would be helpful here, to increase confidence in the correctness of this approach. I think you're the expert in that. Can you please take a look at this & see if you think it'd be applicable?

I quickly read through the problem/solution description, but I feel that I don't have enough context to understand what's actually happening. Is there any supplementary material I can read? It's difficult for me to assess how TLA+ can help without further context, although I do think TLA+ can help define the "correctness" of the system more rigorously, if there are some uncertainties about that.

I wouldn't mind giving that a shot. I had learned a bit about TLA+, but haven't actually used it for anything. There are certainly some invariants that we could formalize to ensure that the lock wait time is bounded, that clients read their own writes, and that stale entries in the cache always get expired in the absence of errors on the invalidation request. One thing that I don't think TLA+ will help with is making sure the code corresponds to the TLA+ model that is being checked. @shuhaowu did you have any tips to help with doing that? Is the best approach to just try to structure the code such that the correspondence between the model and the code is clearer?
It depends on what you mean by "correspondence". Since you're already somewhat familiar with TLA+, you know that TLA+ specifications cannot be compiled to code directly. Thus, we cannot get a 1-to-1 correspondence. Further, my experiences with TLA+ show that it is most useful when it does not try to mirror the code, but when it mirrors the RFC document describing the problem and solution. This may be confusing (it is even for me to explain it), but using the Ghostferry TLA spec as an example may make it more clear:
In essence, TLA+ models are for computer programs what 2D truss-based models are for bridges. They serve as idealized models that we can more easily manipulate in our heads. There will be differences, but this is OK as long as we have some understand on where the differences lie and if/how these differences can cause failures. I'm not sure if these made any sense so maybe the best way to do this is to just try it. Of course, I'm very interested in how this turns out and am willing to lend a hand if we move forward with it. |
Yeah, that is in line with what I was thinking was the limitation with TLA+. @fbogsany are your concerns with the approach about the abstract algorithm, which TLA+ could formally specify and check? Since you were actually the one to recommend a lock wait based approach roughly like this. Or are you actually concerned about the implementation of the approach, which it sounds like TLA+ wouldn't really help with, other than providing a specification that can be referred to.
Sorry, I missed the re-ping on this. I was mostly thinking of the type of things you listed here:

When working on a JVM implementation, we tried to think about code in the thread library (in particular) in terms of a "devil's scheduler": "what's the worst possible sequence of interleaved instructions?". That's the kind of thing I'd like to do here, but it requires a lot of time & energy, and that's in short supply. My hope was that TLA+ could provide a probably more thorough substitute. The state space when we added CAS was about at the limit of what I could hold in my head. This is well beyond that limit.
Pretty much LGTM.
```ruby
# by another client. Defaults to not setting the fill lock and trying to fill the
# cache from the database regardless of the presence of another client's fill lock.
# Set this to just above the typical amount of time it takes to do a cache fill.
# @param lock_wait_limit [Integer] Only applicable if fill_lock_duration is provided,
```
`lock_wait_times` is an alternative to consider that disambiguates between a duration limit and a count limit.
You're right, it is ambiguous about whether it is referring to a count or a duration. What about `lock_wait_tries`, in order to avoid using the word "time", which is not what we want to allude to?
```ruby
end

if !fill_with_lock(key, data, lock, expiration_options) && !using_fallback_key
  # fallback to storing data in the fallback key so it is available to clients waiting on the lock
```
> Cache invalidations could happen while a cache miss is being resolved which could prevent clients that are in a lock wait from reading the value after the lock wait.

Is the problem this is addressing that the waiting clients would have to wait even longer, and we might as well serve them stale data since they queried before it was invalidated? i.e. if the lock wait timeout is infinite and the client is okay with higher latency, this wouldn't be necessary?
This won't serve stale data from the perspective of the client. It is similar to how in MVCC the reader won't try to read a write that came in after it has started the query and chosen a data version for the query. However, the data version for a fill lock / fallback key should be thought of as a lower bound in this case.

The problem this is addressing is latency caused by cache invalidations. In the theoretical case where we didn't have to worry about latency, this wouldn't be necessary. However, in practice we don't want the fill lock to turn into a queue, because it only allows one client to try to fill at a time, and cache invalidations can overwrite or prevent the cache fill, preventing any other clients from benefiting from their lock wait.
> This won't serve stale data from the perspective of the client.

Right, that part was clear; I wanted to make sure I fully understood the reason for the extra complexity from the fallback key.
lib/identity_cache/cache_fetcher.rb
```ruby
  end
  return data
else
  raise LockWaitTimeout if lock_wait_limit <= 0
```
Is it better to fail or to yield? I suppose the worker will have waited a long time by this point, but if `lock_wait_limit * fill_lock_duration` is much less than the request timeout, falling back to the database would make sense. In Shopify's configuration I imagine that wouldn't be the case though.
It would certainly be a problem if `lock_wait_limit * fill_lock_duration` was configured to be too low. That's probably true of all timeouts, although this one we might only notice during concurrent reads of the same key, making the misconfiguration more subtle.

If this is configured appropriately, then a long delay seems indicative of database load. If we yielded here, then that would likely make the problem much worse, since it would effectively be unleashing the thundering herd this is here to protect against.
I suppose a lock wait timeout should be very rare anyway, since it would imply that several workers were unable to fill the value, but also didn't encounter an error and mark a fill failure.

Under heavy load, does it seem right that the most likely failure outcome is query execution timeouts that cause fill failures to be marked? Or is the idea to set the lock wait time shorter than the query timeout (that actually makes sense here, I suppose)? In SFR the query timeout is much shorter (2s), which changes things significantly.
The lock wait time can end up adding latency in addition to the query itself. For a single lock wait, `fill_lock_duration` ends up being a minimum time to wait, so we don't want it much higher than the typical time the query takes. In the worst case, a client could end up doing lock waits before the query, which could add up to `(lock_wait_limit - 1) * fill_lock_duration` to the query latency, so we probably want it shorter than the query timeout to avoid effectively doubling it.
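With illustrative numbers (mine, not from the thread):

```ruby
fill_lock_duration = 0.5 # seconds; just above the typical fill time
lock_wait_limit = 2

# Worst case: the client sleeps through (lock_wait_limit - 1) lock waits
# before running the query itself, on top of the query time.
added_latency = (lock_wait_limit - 1) * fill_lock_duration # => 0.5 seconds
```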
```ruby
  # loop around to retry with fallback key
else
  # cache invalidation prevented lock from being taken
  return yield
```
Why wouldn't we want to loop and attempt to acquire the fill lock again in this case? I don't see a reason for this worker to not participate in the next round of waiting/filling, since a worker that hits this case won't have done any waiting yet (or it would have a lock value).

Worth noting that it's impossible for this case to happen when using_fallback_key is true, since the fallback key can never have the value `IdentityCache::DELETED`.
The lock wait limit isn't really configured as a lock fill retry limit, which seems like it would be to handle a different scenario. The lock wait limit is mostly about tolerating cache fill delays from the client that has the fill lock. Interruption from a cache invalidation doesn't take away from the number of waits, because it is a separate concern; thus `lock_wait_limit -= 1` is not even done in the `using_fallback_key = true` code path.

I had considered whether we should have a configuration for the number of attempts to fetch or take the fill lock, and ended up deciding that it wasn't even the right solution to the problem, nor is it the right problem to focus on solving right now. Specifically, the problem comes up from frequently invalidated cache keys, which isn't really a good fit for read-through caching with cache invalidation. A better solution to this problem, if we wanted to address it, would be to add a data version to the cache invalidation value (`IdentityCache::DELETED`) so we could switch to using the fallback key at this point. Even if we wanted that solution, we would need an extra PR to handle backwards compatibility around deploys so the new cache invalidation value doesn't get interpreted as a cache hit. Either way, we would only benefit from solving this problem if there are still sufficiently more reads than cache invalidations, since we would only be grouping reads for the same data version.
I thought I left this comment before but it seems to have gotten lost: I think this context is very useful and should go into the code comment here.
The takeaway I got is that looping here (without decrementing lock_wait_limit) could end up in an unbounded loop (exceeding lock_wait_limit + 2) due to extremely frequent invalidations, and that would be bad so we'd need another limit. However that other limit doesn't really make sense, since in this scenario the cache is pretty useless anyway.
It is possible for this case to be hit randomly on a key that isn't invalidating frequently though (just happened to have 2 invalidations in a row). But that should be rare so it's probably not worth handling explicitly.
> Worth noting that it's impossible for this case to happen when using_fallback_key is true, since the fallback key can never have the value `IdentityCache::DELETED`

Good point, the `&& !using_fallback_key` condition shouldn't be necessary. I'll change that to raise if the fallback key is invalidated, to make the state for the code path clearer.
> It is possible for this case to be hit randomly on a key that isn't invalidating frequently though (just happened to have 2 invalidations in a row). But that should be rare so it's probably not worth handling explicitly.

Yeah, this is the scenario I would like to add the data version to the cache invalidation value for. Mostly to avoid worrying about the scenario from a performance perspective.

However, we should see the benefits even in this scenario, since it should leave a smaller window of time for a thundering herd of reads to hit the database: it would just be the round trip time to memcached, rather than the longer additional time we have now of multiple queries to the database to load the record along with its embedded associations, plus the time to serialize the cache blob.
Between our discussions irl and in-channel, I've done a few cursory reviews on this PR but haven't yet found a chance for a proper one. As we've discussed, I'm aligned to ship this either way by EOW, but hold off till tomorrow and I'll do a proper review this evening + tomorrow morning, just for a third ✅ / 👀
I've gone over this a few times. Most of the major edge cases seem to have already been called out, though it's definitely difficult to have confidence in unknown unknowns where things die / go incomplete in partial states.

That said, Siberia's use case is probably as good of a state as we can possibly have to dry run this PR (single shop, read/writes are C1 flow only, sufficiently high load, known IDC -> SQL invalidation issues w/o fill lock, etc.).

Very likely, I'll still circle back and thread follow-up discussion here if there's a reason to think we've hit a particular edge case (in initial rollout or afterwards).

Let's start with the single shop and see how this goes 👌
cc @csfrancis
Problem
Currently there is a thundering herd problem when a cache key is invalidated and multiple readers get a cache miss before the first one fills the cache. All the readers after the first one are redundant, and this can cause a lot of load on the database, especially for records with a lot of embedded associations.
Solution
Add support for using a fill lock when using the `fetch` and `fetch_by_id` class methods, so the first cache miss takes a lock before filling the cache, and following clients getting a cache miss do a lock wait for a short duration for the cache to be filled so they can read from the cache rather than from the database.

This feature is opt-in through specifying the `fill_lock_duration` option to control how long to wait on the fill lock before reading the cache key again. This should be set to just over the typical amount of time it takes to do a cache fill. This way we will typically only have a single client filling a cache key at a time.

After a lock wait, if the lock is still present (e.g. hasn't been replaced by the value), then the client doing the lock wait will try to take the lock and fill the cache value. If the original lock is replaced by another client's lock, then that can cause other clients to do another lock wait, to avoid a slow cache fill from resulting in a thundering herd problem. The `lock_wait_limit` option controls how many times a client should wait for different locks before raising an `IdentityCache::LockWaitTimeout` exception. It should be set based on the maximum amount of time it would take for the cache to be filled outside of a service disruption causing the database to be excessively slow, so that we don't waste server capacity when there are unexpected delays.
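As a usage sketch, opting in might look like this at a call site (the model and the numbers are illustrative; only the `fill_lock_duration` and `lock_wait_limit` options come from this PR):

```ruby
class Item < ActiveRecord::Base
  include IdentityCache
end

# Wait up to 0.1s per lock wait, for at most 2 lock waits, before raising
# IdentityCache::LockWaitTimeout. Values here are illustrative.
item = Item.fetch(1, fill_lock_duration: 0.1, lock_wait_limit: 2)
```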
There is no performance impact for cache hits or cache invalidations, and no impact on performance for cache misses when not using the fill lock. When using the fill lock, the client getting the first cache miss will make an extra `cas` memcached operation to take the lock, and an extra `get` operation that is needed to get the CAS token as part of the cache fill. However, during a thundering herd scenario, most of the clients will only do 2 `get` memcached operations: one to read the lock, then one to read the value after the lock wait.

Cache invalidations could happen while a cache miss is being resolved, which could prevent clients that are in a lock wait from reading the value after the lock wait. In this case, a fallback key is used so that the client resolving the cache miss can store the value someplace where waiting clients can read it. A data version that is stored in the lock value is used to derive the fallback key, so that it isn't filled with stale data from the perspective of the client reading it. The fallback key isn't normally written to during a cache fill, so a cache invalidation might also overwrite the value before the lock wait finishes, but this will just cause the waiting clients to take a fill lock on the fallback key, which won't be overwritten by further cache invalidations. If a key happens to be both read and write heavy, then the number of concurrent cache fills is still limited by the frequency of cache invalidations due to the fill lock being taken on the fallback key.
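A hypothetical sketch of fallback key derivation (the key format is my invention; the PR only states that the data version stored in the lock is used to derive it):

```ruby
# Waiting clients and the filling client agree on the same fallback key
# because both derive it from the data version stored in the fill lock.
def fallback_key(key, data_version)
  "#{key}:fallback:#{data_version}" # hypothetical format
end

fallback_key("blob:item:1", 42) # => "blob:item:1:fallback:42"
```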
If memcached is down, then lock waits are avoided and the database will be used without a cache fill. If database exceptions interrupt the cache fill, then the fill failure is marked on the lock so other clients don't wait on the lock when they would otherwise fail fast by querying the database themselves.
Backwards compatibility
This could be rolled out in a backwards and forwards compatible way by first deploying a commit to bump the gem so that it can handle the lock values, then using a following deploy to start using the fill lock.
Alternatives
The lock wait was implemented by just using a sleep, so clients could sleep longer than necessary after the cache is filled. One alternative would be to use redis, which supports blocking operations (e.g. `BRPOPLPUSH`), so the value can be returned to other clients as soon as it's available. However, this could put a lot of additional load on redis to write to it for all cache fills, and it would add a new dependency that we would have to worry about failures for.

Another alternative would be to keep stale values around after a cache invalidation, so that we could fall back to stale data on a cache miss when the fill lock is taken. This is what we do in the cacheable gem. However, we would need to be careful with stale reads to make sure we don't fill another cache from the stale read (e.g. page cache) and to make sure we don't do stale reads where we can't tolerate stale data. This would also reduce where we could use this solution. There would also be a performance trade-off on cache invalidations if we needed to preserve the stale data while also marking it as stale.
Write-through caching is another alternative. However, this could add a lot of overhead to writes in order to fill the cache, especially when there are embedded associations. If there are concurrent writes, then this could interrupt the `cas` operation needed to set the new value in the cache, which would need to be resolved by retrying using freshly queried database data. This could have a significant impact on models that are sometimes written heavily, even if in other cases they are read-heavy. The significantly different trade-offs seem like they would reduce where we could use this solution. Write-through caching seems like something that could be done a lot more efficiently by using the mysql binlog, which is a longer term project.

Not Implemented

This PR doesn't implement support for `fetch_multi` and doesn't expose a way to provide a fill lock in all methods that do fetching. However, we could expand support for this feature on an as-needed basis.