Add support for fill lock with lock wait to avoid thundering herd problem #373

Nit, but I think you could splat the options hash all the way through. This way you won't have to update any assertion/expectation and everything will keep working as expected, no?

Capturing keyword arguments using `**` has the performance disadvantage of duplicating the hash for each method call. The other annoying thing with the test expectations is that some of them use the spy gem, which doesn't yet support keyword arguments (although I've opened ryanong/spy#12 for that issue). This prevents me from explicitly passing options through without a splat for any method that is referenced by `Spy.on` in the test suite.
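
To illustrate the trade-off, a minimal sketch with hypothetical method names (`fetch_with_splat`, `fetch_explicit`, and `cas_fetch` are not from this PR):

```ruby
# Capturing with ** allocates a fresh Hash on every call; naming each
# keyword avoids that allocation but must be updated when options change.

def cas_fetch(key, fill_lock_duration: nil, lock_wait_limit: nil)
  # hypothetical stand-in for the underlying cache fetch
  [key, fill_lock_duration, lock_wait_limit]
end

def fetch_with_splat(key, **options)
  # `options` is a new Hash allocated for this call
  cas_fetch(key, **options)
end

def fetch_explicit(key, fill_lock_duration: nil, lock_wait_limit: nil)
  # no intermediate Hash, but each new option must be threaded through by hand
  cas_fetch(key, fill_lock_duration: fill_lock_duration, lock_wait_limit: lock_wait_limit)
end

fetch_with_splat("blob:1", fill_lock_duration: 0.1)
fetch_explicit("blob:1", fill_lock_duration: 0.1)
```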

My upstream spy PR has been merged and released, so I'll see what I can clean up as a result of that.

Is the problem this is addressing that the waiting clients would have to wait even longer, and we might as well serve them stale data since they queried before it was invalidated? i.e. if the lock wait timeout is infinite and the client is okay with higher latency, this wouldn't be necessary?

This won't serve stale data from the perspective of the client. It is similar to how, in MVCC, a reader won't try to read a write that came in after it started the query and chose a data version for it. However, the data version for a fill lock / fallback key should be thought of as a lower bound in this case.

The problem this is addressing is latency caused by cache invalidations. In the theoretical case where we didn't have to worry about latency, this wouldn't be necessary. However, in practice we don't want the fill lock to turn into a queue, because it only allows one client to try to fill at a time, and the cache invalidations can overwrite or prevent the cache fill, preventing any other clients from benefiting from their lock wait.
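
For readers following the thread, here is a deliberately simplified, in-memory sketch of the general fill-lock-with-lock-wait idea. The real implementation works against memcached and also handles lock expiry, CAS on the fill, data versions, and the fallback key, none of which are modeled here:

```ruby
require "securerandom"

class FillLockCache
  Lock = Struct.new(:client_id)

  def initialize
    @mutex = Mutex.new
    @store = {}
  end

  # like memcached ADD: only succeeds if the key is absent
  def add(key, value)
    @mutex.synchronize do
      return false if @store.key?(key)
      @store[key] = value
      true
    end
  end

  def read(key)
    @mutex.synchronize { @store[key] }
  end

  def write(key, value)
    @mutex.synchronize { @store[key] = value }
  end

  def fetch(key, fill_lock_duration:, lock_wait_limit:)
    lock_wait_limit.times do
      value = read(key)
      return value if value && !value.is_a?(Lock)

      if value.nil? && add(key, Lock.new(SecureRandom.uuid))
        filled = yield # only the lock owner queries the database
        write(key, filled)
        return filled
      end

      sleep(fill_lock_duration) # another client holds the fill lock; wait
    end
    raise "lock wait limit exceeded"
  end
end

cache = FillLockCache.new
cache.fetch("record:1", fill_lock_duration: 0.1, lock_wait_limit: 2) { "filled from db" }
```
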
Right, that part was clear; I wanted to make sure I fully understood the reason for the extra complexity from the fallback key.

Is it better to fail or to yield? I suppose the worker will have waited a long time by this point, but if `lock_wait_limit * fill_lock_duration` is much less than the request timeout, falling back to the database would make sense. In Shopify's configuration I imagine that wouldn't be the case, though.

It would certainly be a problem if `lock_wait_limit * fill_lock_duration` was configured to be too low. That's probably true of all timeouts, although this one we might only notice during concurrent reads of the same key, making the misconfiguration more subtle.

If this is configured appropriately, then a long delay seems indicative of database load. If we yielded here, that would likely make the problem much worse, since it would effectively be unleashing the thundering herd this is here to protect against.

I suppose a lock wait timeout should be very rare anyway, since it would imply that several workers were unable to fill the value, but also didn’t encounter an error and mark a fill failure.

Under heavy load, does it seem right that the most likely failure outcome is query execution timeouts that cause fill failures to be marked? Or is the idea to set the lock wait time shorter than the query timeout (that actually makes sense here, I suppose)? In SFR the query timeout is much shorter (2s), which changes things significantly.

The lock wait time can end up adding latency in addition to the query itself. For a single lock wait, `fill_lock_duration` ends up being a minimum time to wait, so we don't want it much higher than the typical time the query takes. In the worst case, a client could end up doing lock waits before the query, which could add up to `(lock_wait_limit - 1) * fill_lock_duration` on top of the query latency, so we probably want it shorter than the query timeout to avoid effectively doubling it.
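
To make the arithmetic concrete, a worked example with assumed values (0.1s and 3 are purely illustrative, not recommendations):

```ruby
fill_lock_duration = 0.1 # seconds (assumed for illustration)
lock_wait_limit = 3      # (assumed for illustration)

# total time a client can spend waiting on other clients' fill locks
max_total_wait = lock_wait_limit * fill_lock_duration         # => 0.3

# worst-case delay added before the client runs its own query
worst_case_added = (lock_wait_limit - 1) * fill_lock_duration # => 0.2

puts max_total_wait
puts worst_case_added
```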

Why wouldn't we want to loop and attempt to acquire the fill lock again in this case? I don't see a reason for this worker not to participate in the next round of waiting/filling, since a worker that hits this case won't have done any waiting yet (or it would have a lock value).

Worth noting that it's impossible for this case to happen when `using_fallback_key` is true, since the fallback key can never have the value `IdentityCache::DELETED`.

The lock wait limit isn't really configured as a lock fill retry limit, which seems like it would be there to handle a different scenario. The lock wait limit is mostly about tolerating cache fill delays from the client that has the fill lock. Interruption from a cache invalidation doesn't take away from the number of waits, because it is a separate concern, thus `lock_wait_limit -= 1` isn't even done in the `using_fallback_key = true` code path.

I had considered whether we should have a configuration for the number of attempts to fetch or take the fill lock and ended up deciding that it wasn't even the right solution to the problem, nor is it the right problem to focus on solving right now. Specifically, the problem comes up from frequently invalidated cache keys, which aren't really a good fit for read-through caching with cache invalidation. A better solution to this problem, if we wanted to address it, would be to add a data version to the cache invalidation value (`IdentityCache::DELETED`) so we could switch to using the fallback key at this point. Even if we wanted that solution, we would need an extra PR to handle backwards compatibility around deploys so the new cache invalidation value doesn't get interpreted as a cache hit. Either way, we would only benefit from solving this problem if there are still enough more reads than cache invalidations, since we would only be grouping reads for the same data version.

I thought I left this comment before but it seems to have gotten lost: I think this context is very useful and should go into the code comment here.

The takeaway I got is that looping here (without decrementing `lock_wait_limit`) could end up in an unbounded loop (exceeding `lock_wait_limit + 2` iterations) due to extremely frequent invalidations, and that would be bad, so we'd need another limit. However, that other limit doesn't really make sense, since in this scenario the cache is pretty useless anyway.

It is possible for this case to be hit randomly on a key that isn't being invalidated frequently, though (it just happened to have 2 invalidations in a row). But that should be rare, so it's probably not worth handling explicitly.

Good point, the `&& !using_fallback_key` condition shouldn't be necessary. I'll change that to raise if the fallback key is invalidated, to make the state for the code path clearer.

Yeah, this is the scenario I would like to add the data version to the cache invalidation value for. Mostly to avoid worrying about the scenario from a performance perspective.

However, we should see benefits even in this scenario, since it leaves a smaller window of time for a thundering herd of reads to hit the database: the window would just be the round trip time to memcached, rather than the longer time we have now of multiple database queries to load the record along with its embedded associations, plus the time to serialize the cache blob.

It seems a bit rough to have a non-specific exception raised here. What can the caller do about it?

I can see two paths that reach here: if the last iteration of the loop was interrupted by cache invalidation (`fetch_or_take_lock` returned `IdentityCache::DELETED` and we have a `lock` and we didn't previously try using the fallback key), or if we `oren`ed on the last iteration of the loop. I feel like we should not `raise` here, and we should instead try to handle both those paths in a sane way.

The caller isn't meant to do anything about it, because getting here means there is a bug in the code. The `(lock_wait_limit + 2).times do` loop could be replaced with an infinite loop, but I wanted to make it clear that it shouldn't have an infinite loop here, so this check prevents that from happening.

A cache invalidation interrupting the loop would be the fallback case that is mentioned in the `# +2 is for first attempt and retry with fallback key` comment at the start of the loop. What does `oren` stand for? I don't know what you mean by that path.
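
A rough sketch of the bounded-loop pattern being described (`try_fetch_or_take_lock` is a hypothetical stand-in, not the method from this PR):

```ruby
# Hypothetical stand-in for the real fetch-or-take-lock step, which would
# either produce a value or signal that the iteration should be retried.
def try_fetch_or_take_lock(key)
  :some_value
end

def fetch_with_bounded_loop(key, lock_wait_limit:)
  # +2 is for first attempt and retry with fallback key
  (lock_wait_limit + 2).times do
    result = try_fetch_or_take_lock(key)
    return result unless result == :retry
  end
  # Every iteration must either return or retry, so reaching this line
  # means a bug; raising here documents that the loop is intentionally
  # bounded rather than infinite.
  raise "unexpected number of loop iterations"
end
```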

Old #webscale joke that should probably be retired: `sleep`.

Currently the code converts the lock values to primitives so that they can be serialized and deserialized with msgpack. However, an alternative would be to add a special value for `msgpack` to handle the hydration of these values instead of doing it explicitly in the code. With `Marshal` as the serializer, it can handle this for us as long as we are careful about not changing the class name or instance variable names in an incompatible way.
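
A minimal sketch of the two approaches (requires the msgpack gem; the `FillLock` fields here are illustrative and may not match the PR exactly):

```ruby
require "msgpack"

# Illustrative lock value; the real class and fields may differ.
FillLock = Struct.new(:client_id, :data_version)

lock = FillLock.new("client-1", 42)

# Explicit approach: dehydrate to primitives msgpack can serialize
# directly, and rehydrate explicitly on the way out.
packed = MessagePack.pack(lock.to_a)
restored = FillLock.new(*MessagePack.unpack(packed))

# Marshal round-trips the object itself, but ties the cache format to the
# class name and instance variable names.
restored_via_marshal = Marshal.load(Marshal.dump(lock))
```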