
Add support for fill lock with lock wait to avoid thundering herd problem #373

Merged

6 commits merged on Jan 15, 2021

Conversation

Contributor

@dylanahsmith dylanahsmith commented Oct 4, 2018

cc @csfrancis

Problem

Currently there is a thundering herd problem when a cache key is invalidated and multiple readers get a cache miss before the first one fills the cache. All the readers after the first one are redundant and this can cause a lot of load on the database, especially for records with a lot of embedded associations.

Solution

Add support for using a fill lock with the fetch and fetch_by_id class methods, so that the client with the first cache miss takes a lock before filling the cache, and clients that subsequently get a cache miss do a lock wait for a short duration while the cache is being filled, allowing them to read from the cache rather than from the database.

This feature is opted into by specifying the fill_lock_duration option, which controls how long to wait on the fill lock before reading the cache key again. It should be set to just over the typical amount of time it takes to do a cache fill. This way, typically only a single client will be filling a cache key at a time.

After a lock wait, if the lock is still present (e.g. hasn't been replaced by the value), then the client doing the lock wait will try to take the lock and fill the cache value itself. If the original lock is replaced by another client's lock, then that can cause other clients to do another lock wait, so that a slow cache fill doesn't itself result in a thundering herd problem. The lock_wait_limit option controls how many times a client should wait for different locks before raising an IdentityCache::LockWaitTimeout exception. It should be set so that the resulting maximum wait (lock_wait_limit * fill_lock_duration) covers the longest a cache fill would take outside of a service disruption causing the database to be excessively slow, so that we don't waste server capacity when there are unexpected delays.
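For illustration, a call might look like this (a minimal sketch; the model name and values are hypothetical, not taken from this PR):

  # assumes a typical cache fill takes a bit under 0.1s; tune to the workload
  record = Product.fetch(product_id, fill_lock_duration: 0.1, lock_wait_limit: 2)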

There is no performance impact for cache hits or cache invalidations. There is also no impact on cache miss performance when not using the fill lock. When using the fill lock, the client getting the first cache miss will make an extra cas memcached operation to take the lock and an extra get operation to obtain the CAS token as part of the cache fill. However, during a thundering herd scenario, most of the clients will only do 2 get memcached operations: one to read the lock and one to read the value after the lock wait.

Cache invalidations could happen while a cache miss is being resolved which could prevent clients that are in a lock wait from reading the value after the lock wait. In this case, a fallback key is used so that the client resolving the cache miss can store the value someplace where waiting clients can read it. A data version that is stored in the lock value is used to derive the fallback key so that it isn't filled with stale data from the perspective of the client reading it. The fallback key isn't normally written to during a cache fill, so a cache invalidation might also overwrite the value before the lock wait finishes, but this will just cause the waiting clients to take a fill lock on the fallback key which won't be overwritten by further cache invalidations. If a key happens to be both read and write heavy, then the number of concurrent cache fills is still limited by the frequency of cache invalidations due to the fill lock being taken on the fallback key.
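As a rough sketch of the idea (the key format and helper name below are assumptions for illustration, not necessarily the exact implementation in this PR):

  # derive the fallback key from the data version stored in the lock, so waiting
  # clients only read data that is at least as new as the version they saw
  def fallback_key(key, data_version) # hypothetical helper
    "#{key}:fill_lock_failover:#{data_version}"
  end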

If memcached is down, then lock waits are avoided and the database will be used without a cache fill. If database exceptions interrupt the cache fill, then the fill failure is marked on the lock so other clients don't wait on the lock when they would otherwise fail fast by querying the database themselves.

Backwards compatibility

This could be rolled out in a backwards and forwards compatible way by first deploying a commit to bump the gem so that it can handle the lock values, then using a following deploy to start using the fill lock.

Alternatives

The lock wait was implemented by just using a sleep, so clients could sleep longer than necessary after the cache is filled. One alternative would be to use redis, which supports blocking operations (e.g. BRPOPLPUSH), so the value could be returned to other clients as soon as it's available. However, this could put a lot of additional load on redis by writing to it for all cache fills, and would add a new dependency whose failures we would have to worry about.

Another alternative would be to keep stale values around after a cache invalidation so that we could fallback to stale data on a cache miss when the fill lock is taken. This is what we do in the cacheable gem. However, we would need to be careful with stale reads to make sure we don't fill another cache from the stale read (e.g. page cache) and to make sure we don't do stale reads where we can't tolerate stale data. This would also reduce where we could use this solution. There would also be a performance trade-off on cache invalidations if we needed to preserve the stale data while also marking it as stale.

Write-through caching is another alternative. However, this could add a lot of overhead to writes in order to fill the cache, especially when there are embedded associations. If there are concurrent writes, then this could interrupt the cas operation needed to set the new value in the cache, which would need to be resolved by retrying using freshly queried database data. This could have significant impact on models that are sometimes written heavily, even if in other cases they are read-heavy. The significantly different trade-offs seem like they would reduce where we could use this solution. Write-through caching seems like something that could be done a lot more efficiently by using the mysql binlog, which is a longer term project.

Not Implemented

This PR doesn't implement support for fetch_multi and doesn't expose a way to provide a fill lock in all methods that do fetching. However, we could expand support for this feature on an as-needed basis.

else # hit
return value
end
new_lock.cache_value # take lock
Contributor Author

@dylanahsmith dylanahsmith Oct 4, 2018

Currently the code converts the lock values to primitives so that they can be serialized and deserialized with msgpack. However, an alternative would be to add a special value for msgpack to handle the hydration of these values instead of doing it explicitly in the code. With Marshal as the serializer, it can handle this for us as long as we are careful about not changing the class name or instance variable names in an incompatible way.

@@ -100,9 +101,11 @@ def should_use_cache? # :nodoc:
# == Parameters
# +key+ A cache key string
#
def fetch(key)
def fetch(key, cache_fetcher_options = {})


Nit, but I think you could splat the options hash all the way through. This way you won't have to update any assertions/expectations and everything will keep working as expected, no?

Contributor Author

Capturing keyword arguments using ** has the performance disadvantage of duplicating the hash for each method call.

The other annoying thing with the test expectations is that some of them use the spy gem, which doesn't yet support keyword arguments (although I've opened ryanong/spy#12 for that issue). This prevents me from explicitly passing options through without a splat for any method that is referenced by Spy.on in the test suite.
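A small, self-contained illustration of the performance point about ** above (sketch only; method names are made up):

  # ** collects keyword args into a new Hash on every call,
  # whereas an options-hash parameter passes the same object through.
  def with_splat(**options)
    options.object_id   # a fresh Hash each call
  end

  def with_hash(options = {})
    options.object_id   # the caller's Hash, unchanged
  end

  opts = { fill_lock_duration: 0.1 }
  with_splat(**opts) == opts.object_id  # => false (duplicated)
  with_hash(opts) == opts.object_id     # => true  (same object)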

Contributor Author

My upstream spy PR has been merged and released, so I'll see what I can clean up as a result of that.

@maximebedard

I'm generally good with the solution, also thanks for the explanations! 🙏 I prefer sleeping in the unicorn to hammering redis with b* commands, that's for sure.

I've noticed a few branches seem to be untested in the #fill_with_lock method. I think there are a few things that could be simplified, but it comes down to personal preference at this point.

👍

Member

@fbogsany fbogsany left a comment

Still 👀

data
end

def fetch_with_fill_lock(key, fill_lock_duration:, lock_wait_limit: 2)
Member

Control flow in this method is ... complicated. It's a case of "not obviously wrong" instead of "obviously not wrong", which really makes me hesitate in approving this PR. If the problem is the block-structured cas API in the memcached gem (and other layers above it), we should expose the API we need to make this simple. Effectively single_get and single_cas. Modelling this on the Dalli API for get_cas and set_cas might help with future migration.

Contributor Author

I don't think I would be able to reduce the number of code paths, so there is a certain amount of complexity that won't go away. That still might make it hard to say the code is "obviously not wrong".

The block-structured cas API definitely puts constraints on how the code is written and makes it more awkward to write, so I agree that we should try to get it modelled like Dalli upstream in the memcached_store and memcached gems. For instance, this might be helpful for implementing fill lock support for fetch_multi. The change to memcached_store should be easier since we maintain it. I'll have to see how easy it is to get a change upstream for the memcached gem.

For reading the code, I find the following two things awkward about the cas method:

  • the cas set operation is implicit: the block's return value is what gets set. This has also led to assert_memcache_operations not even counting these set operations.
  • break or return are needed to avoid the cas set operation.

This can be seen by comparing this code using cas:

  @cache_backend.cas(key, **expiration_options) do |value|
    break unless FillLock.cache_value?(value)
    lock = FillLock.from_cache(value)
    lock.mark_failed
    lock.cache_value
  end

with equivalent code using get_cas and set_cas

  value, cas = @cache_backend.get_cas(key)
  if FillLock.cache_value?(value)
    lock = FillLock.from_cache(value)
    lock.mark_failed
    @cache_backend.set_cas(key, lock.cache_value, **expiration_options)
  end

yield
rescue Exception
@cache_backend.cas(key, **expiration_options) do |value|
break unless FillLock.cache_value?(value)
Member

IIRC this will exit the cas method above and jump to the raise on line 127 below. What happens if we just raise here instead?

Contributor Author

Right, it will break out of the cas and then raise. I could change the break here to a raise and it would effectively do the same thing. Mostly I just wrote it this way because I was thinking of the cas that marks the failure independently of the re-raising, so this break line just says to only do that if it is a fill lock.

However, we can't just raise here, since we still need the raise that comes after the cas block.

Contributor

@csfrancis csfrancis left a comment

I just have a couple of small comments, but overall this looks good to me.

@dylanahsmith would you be able to rebase this, and I'll try to find another reviewer?

def add(key, value)
@cache_backend.write(key, value, :unless_exist => true) if IdentityCache.should_fill_cache?
def add(key, value, **expiration_options)
return unless IdentityCache.should_fill_cache?
Contributor

Should this explicitly return false?

end

def test_fetch_without_lock_miss
assert_memcache_operations(2) do # get, add
Contributor

A nice future enhancement might be to have assert_memcache_operations actually assert the operation types.

Contributor Author

Agreed. It should also distinguish between @cache_backend.cas calls that break in the block (i.e. only send a get operation) and calls where the block returns (i.e. a get and a cas operation get sent).

expires_in = expires_in.ceil

# memcached TTL only gets part of the first second, so increase TTL by 1 to compensate
expires_in + 1
Member

This line does nothing.

Contributor Author

Thanks for catching that. Fixed it to be +=, which is what I intended.

# the cache store round down with `to_i`
expires_in = expires_in.ceil

# memcached TTL only gets part of the first second, so increase TTL by 1 to compensate
Member

Is this trying to say the same thing as the comment on line 238? Is the code on line 243 redundant?

Contributor Author

No, that other comment was about conversion to an integer. This one is about the one-second granularity memcached uses client side, which can cause a TTL of 1 to expire nearly immediately (memcached/memcached#307). I'll link to that issue to help clarify.
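Putting the two adjustments together, the helper being discussed looks roughly like this (a sketch; the method name is assumed):

  def effective_expires_in(expires_in)
    # the cache store rounds down with `to_i`, so round up to whole seconds first
    expires_in = expires_in.ceil
    # memcached only counts part of the first second client side (memcached/memcached#307),
    # so add 1 to compensate and avoid a TTL of 1 expiring almost immediately
    expires_in += 1
    expires_in
  end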

# to another client taking the lock first. If another lock wait would be needed after
# reaching the limit, then a `LockWaitTimeout` exception is raised. Default is 2. Use
# this to control the maximum total lock wait duration
# (`lock_wait_limit * fill_lock_duration`).
Member

The API here is awkward and the abstraction is leaky. From a user's perspective, what we care about is "how long do I want to wait", and we should really specify that directly. We want to specify a minimum, based on expected refill time, and a maximum we're willing to tolerate. The lock_wait_limit leaks an implementation detail that users shouldn't have to care about (and which complicates tweaking the algorithm later).

Contributor Author

Specifying a number of times to retry doesn't seem like a leaky abstraction and seems like it is how we normally handle retries. In fact, I can't think of a place where we do retries for a certain amount of time instead of having a retry limit.

As you mentioned, we need to know how long to wait based on the expected refill time, so it seems like fill_lock_duration is necessary.

What type of tweaks did you have in mind that the current configuration doesn't provide a suitable abstraction for? It seems to me that we could use the same configuration if we switched to redis to store the lock or if we just did it in-memory in a multi-threaded process.

Member

Exponential backoff is a pretty common approach that would map awkwardly to this API.

My point about leaky abstractions is that users want to specify their desired constraints under contention. It's awkward to expose the knobs as "how often should I try" and "how many times should I try" when the user really wants to tell us "I expect the refill to take X seconds" and "I don't want to wait longer than Y seconds".

Contributor Author

What you are proposing is just a different abstraction. That doesn't make the configuration I proposed a leaky abstraction, it just isn't an abstraction around the change you want to be able to make.

As a user of the API, I would hate for it to silently change to exponential backoff without explicitly opting into that. Since the sleep is synchronous, we are wasting the thread's time, which could eat into capacity. We don't want to wait for a problem to be resolved in the current process; we want to fail fast if we know there is a problem and rely on semian to back off in our attempts to retry accessing the resource.

Also, as a user of the API, I find the configuration API you are describing awkward for expressing what I actually want. Since this is specifically to solve the thundering herd problem, I really do care about how long the fill lock is taken, since that limits the frequency of attempts to fill the cache key under contention. I want it to do at most 2 lock waits, both because that gracefully handles cache fills that take up to double the typically allowed time and because it allows the fill to be interrupted once by a cache invalidation. If I try to express the same thing as a duration (i.e. lock_wait_limit * fill_lock_duration), then it is no longer precise due to floating point math. Depending on the behaviour chosen for the remaining time, it could either lose the second attempt if there is just under the fill lock duration remaining, or perform a last attempt immediately after the second-last attempt if there is just over the fill lock duration remaining.
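A quick illustration of the floating point issue:

  3 * 0.1    # => 0.30000000000000004
  0.3 / 0.1  # => 2.9999999999999996, so deriving a retry count from a duration can lose an attempt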

I think if we build something that works well for us, then it is more likely to be suitable for other users that encounter the same thundering herd problem.


def initialize(cache_value)
@cache_value = cache_value
_, @client_id, @data_version = cache_value
Member

Is there a significant performance benefit to initializing this with an array and storing both the array and its components? It seems like it'd be cleaner to:

attr_reader :client_id, :data_version

def initialize(client_id:, data_version:)
  @client_id = client_id
  @data_version = data_version
end

def cache_value
  [FILL_LOCKED, @client_id, @data_version]
end

Contributor Author

We already need to support initializing it from an array when it comes from the cache and storing the array simplified equality checking.

I don't think there would be a significant performance impact from not storing the array, so I'll change it to what you suggested.
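For context, the hydration side might then look roughly like this (a sketch; the initializer and FILL_LOCKED marker are as in the suggestion above, the rest is assumed):

  def self.from_cache(cache_value)
    _marker, client_id, data_version = cache_value
    new(client_id: client_id, data_version: data_version)
  end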

return result
end
end
raise "unexpected number of loop iterations"
Member

It seems a bit rough to have a non-specific exception raised here. What can the caller do about it?

I can see two paths that reach here: if the last iteration of the loop was interrupted by cache invalidation (fetch_or_take_lock returned IdentityCache::DELETED and we have a lock and we didn't previously try using the fallback key), or if we orened on the last iteration of the loop. I feel like we should not raise here, and we should instead try to handle both those paths in a sane way.

Contributor Author

What can the caller do about it?

The caller isn't meant to do anything about it, because getting here means there is a bug in the code. The (lock_wait_limit + 2).times do loop could be replaced with an infinite loop, but I wanted to make it clear that it shouldn't have an infinite loop here, so this check prevents that from happening.
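For readers following along, the bounded-loop structure under discussion is roughly this (a simplified sketch; fetch_or_take_lock and the :retry convention are assumptions for illustration):

  def fetch_with_bounded_waits(key, lock_wait_limit:)
    # +2 is for the first attempt and a retry with the fallback key
    (lock_wait_limit + 2).times do
      result = fetch_or_take_lock(key)
      return result unless result == :retry
      # otherwise loop around to wait on the lock or retry with the fallback key
    end
    # only reachable if the loop bookkeeping has a bug; fail loudly rather than loop forever
    raise "unexpected number of loop iterations"
  end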

The cache fill being interrupted by a cache invalidation would be the fallback case that is mentioned in the # +2 is for first attempt and retry with fallback key comment at the start of the loop. What does oren stand for? I don't know what you mean by that path.

Member

@fbogsany fbogsany Jul 11, 2019

What does oren stand for?

Old #webscale joke that should probably be retired: sleep.

@dylanahsmith dylanahsmith force-pushed the fill-lock-with-lock-wait branch from aee0620 to b4ac8bb on July 10, 2019 21:07
@dylanahsmith
Contributor Author

Rebased and pushed changes to address feedback.

@fbogsany
Member

@shuhaowu I feel like TLA+ would be helpful here, to increase confidence in the correctness of this approach. I think you're the expert in that. Can you please take a look at this & see if you think it'd be applicable?

@shuhaowu

I quickly read through the problem/solution description, but I feel that I don't have enough context to understand what's actually happening. Is there any supplementary material I can read? It's difficult for me to assess how TLA+ can help without further context, although I do think TLA+ can help define the "correctness" of the system more rigorously, if there are some uncertainties about that.

@dylanahsmith
Contributor Author

I feel like TLA+ would be helpful here, to increase confidence in the correctness of this approach.

I wouldn't mind giving that a shot. I had learned a bit about TLA+, but haven't actually used it for anything. There are certainly some invariants that we could formalize to ensure that the lock wait time is bounded, clients read their own writes and stale entries in the cache always get expired in the absence of errors on the invalidation request.

One thing that I don't think TLA+ will help with is making sure the code corresponds to the TLA+ model that is being checked. @shuhaowu did you have any tips to help with doing that? Is the best approach to just try to structure the code such that the correspondence between the model and the code is clearer?

@shuhaowu

One thing that I don't think TLA+ will help with is making sure the code corresponds to the TLA+ model that is being checked. @shuhaowu did you have any tips to help with doing that? Is the best approach to just try to structure the code such that the correspondence between the model and the code is clearer?

It depends on what you mean by "correspondence". Since you're already somewhat familiar with TLA+, you know that TLA+ specifications cannot be compiled to code directly. Thus, we cannot get a 1-to-1 correspondence. Further, my experience with TLA+ shows that it is most useful when it does not try to mirror the code, but when it mirrors the RFC document describing the problem and solution. This may be confusing (it is confusing even for me to explain), but using the Ghostferry TLA spec as an example may make it clearer:

  • Only half of the specification pertains to Ghostferry's code. The other half of the specification models Shopify core. Without the Shopify core model, Ghostferry would be almost meaningless.
    • Naturally, since the Shopify core code is not within Ghostferry, there's no correspondence between that part of the model and the code.
  • There are about 3 + n goroutines in Ghostferry's code, where n is configurable and designed to speed up the table copy process. This is only modeled as 4 goroutines via induction-type logic.
  • Ghostferry iterates over the entire database in its code via a Cursor class, having to deal with things like auto_increment>1 and filtering based on shop_id. This is entirely neglected in the specification and replaced with only 2-3 lines of spec. Once again, one can prove equivalence between the two scenarios.
    • A lot of simplifications like the previous two cases are made to keep the model tractable, both for a human to reason about and for TLC (the TLA+ state space checker) to confirm.

In essence, TLA+ models are for computer programs what 2D truss-based models are for bridges. They serve as idealized models that we can more easily manipulate in our heads. There will be differences, but this is OK as long as we have some understanding of where the differences lie and if/how these differences can cause failures.

I'm not sure if these made any sense so maybe the best way to do this is to just try it. Of course, I'm very interested in how this turns out and am willing to lend a hand if we move forward with it.

@dylanahsmith
Contributor Author

Yeah, that is in line with what I was thinking was the limitation with TLA+.

@fbogsany are your concerns with the approach about the abstract algorithm, which TLA+ could formally specify and check, since you were actually the one to recommend a lock-wait-based approach roughly like this? Or are you actually concerned about the implementation of the approach, which it sounds like TLA+ wouldn't really help with, other than providing a specification that can be referred to?

@fbogsany
Member

@fbogsany are your concerns with the approach about the abstract algorithm, which TLA+ could formally specify and check, since you were actually the one to recommend a lock-wait-based approach roughly like this? Or are you actually concerned about the implementation of the approach, which it sounds like TLA+ wouldn't really help with, other than providing a specification that can be referred to?

Sorry, I missed the re-ping on this. I was mostly thinking of the type of things you listed here:

There are certainly some invariants that we could formalize to ensure that the lock wait time is bounded, clients read their own writes and stale entries in the cache always get expired in the absence of errors on the invalidation request.

When working on a JVM implementation, we tried to think about code in the thread library (in particular) in terms of a "devil's scheduler": "what's the worst possible sequence of interleaved instructions?". That's the kind of thing I'd like to do here, but it requires a lot of time & energy, and that's in short supply. My hope was that TLA+ could provide a probably more thorough substitute.

The state space when we added CAS was about at the limit of what I could hold in my head. This is well beyond that limit.

@dylanahsmith dylanahsmith force-pushed the fill-lock-with-lock-wait branch from 3df9bc4 to 8c1f918 on November 26, 2019 22:02
@dylanahsmith dylanahsmith mentioned this pull request Apr 3, 2020
@dylanahsmith dylanahsmith force-pushed the fill-lock-with-lock-wait branch 2 times, most recently from f51d597 to 214efcd on July 2, 2020 19:58
@dylanahsmith dylanahsmith force-pushed the fill-lock-with-lock-wait branch 2 times, most recently from 51702d1 to 837c7a7 on September 15, 2020 16:16
@dylanahsmith dylanahsmith changed the base branch from master to drop-ruby-2.4-support September 15, 2020 16:16
Base automatically changed from drop-ruby-2.4-support to master September 15, 2020 17:30
@dylanahsmith dylanahsmith force-pushed the fill-lock-with-lock-wait branch from 837c7a7 to 9ddfa3a on September 15, 2020 17:31
@dylanahsmith dylanahsmith force-pushed the fill-lock-with-lock-wait branch 2 times, most recently from a2c8ddd to ce176ab on January 8, 2021 19:49
@dylanahsmith dylanahsmith force-pushed the fill-lock-with-lock-wait branch from ce176ab to 4cc0e2c on January 11, 2021 14:19
Contributor

@pushrax pushrax left a comment

Pretty much LGTM.

# by another client. Defaults to not setting the fill lock and trying to fill the
# cache from the database regardless of the presence of another client's fill lock.
# Set this to just above the typical amount of time it takes to do a cache fill.
# @param lock_wait_limit [Integer] Only applicable if fill_lock_duration is provided,
Contributor

lock_wait_times is an alternative to consider that disambiguates between a duration limit and a count limit.

Contributor Author

You're right, it is ambiguous about whether it is referring to a count or a duration. What about lock_wait_tries, to avoid using the word "time", which is not what we want to allude to?

end

if !fill_with_lock(key, data, lock, expiration_options) && !using_fallback_key
# fallback to storing data in the fallback key so it is available to clients waiting on the lock
Contributor

Cache invalidations could happen while a cache miss is being resolved which could prevent clients that are in a lock wait from reading the value after the lock wait.

Is the problem this is addressing that the waiting clients would have to wait even longer, and we might as well serve them stale data since they queried before it was invalidated? i.e. if the lock wait timeout is infinite and the client is okay with higher latency, this wouldn't be necessary?

Contributor Author

This won't serve stale data from the perspective of the client. It is similar to how in MVCC the reader won't try to read a write that came in after it has started the query and chosen a data version for the query. However, the data version for a fill lock / fallback key should be thought of as a lower bound in this case.

The problem this is addressing is latency caused by cache invalidations. In the theoretical case where we didn't have to worry about latency, this wouldn't be necessary. However, in practice we don't want the fill lock to turn into a queue, because it only allows one client to try to fill at a time, and cache invalidations can overwrite or prevent the cache fill, preventing any other clients from benefiting from their lock wait.

Contributor

This won't serve stale data from the perspective of the client.

Right, that part was clear; I wanted to make sure I fully understood the reason for the extra complexity from the fallback key.

end
return data
else
raise LockWaitTimeout if lock_wait_limit <= 0
Contributor

Is it better to fail or to yield? I suppose the worker will have waited a long time by this point, but if lock_wait_limit * fill_lock_duration is much less than the request timeout, falling back to the database would make sense. In Shopify configuration I imagine that wouldn't be the case though.

Contributor Author

It would certainly be a problem if the lock_wait_limit * fill_lock_duration was configured to be too low. That's probably true of all timeouts, although this one we might only notice during concurrent reads of the same key, making the misconfiguration more subtle.

If this is configured appropriately, then a long delay seems indicative of database load. If we yielded here, then that would likely make the problem much worse, since it would effectively be unleashing the thundering herd this is here to protect against.

Contributor

@pushrax pushrax Jan 13, 2021

I suppose a lock wait timeout should be very rare anyway, since it would imply that several workers were unable to fill the value, but also didn’t encounter an error and mark a fill failure.

Under heavy load, does it seem right that the most likely failure outcome is query execution timeouts that cause fill failures to be marked? Or is the idea to set the lock wait time shorter than the query timeout (that actually makes sense here I suppose). In SFR the query timeout is much shorter (2s), which changes things significantly.

Contributor Author

The lock wait time can end up adding latency in addition to the query itself. For a single lock wait, fill_lock_duration ends up being a minimum time to wait, so we don't want it much higher than the typical time the query takes. In the worst case, a client could end up doing lock waits before the query, which could add up to (lock_wait_limit - 1) * fill_lock_duration to the query latency, so we probably want it shorter than the query timeout to avoid effectively doubling it.
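As a hypothetical worked example: with fill_lock_duration = 0.1s and lock_wait_limit = 2, a waiting client could add up to (2 - 1) * 0.1 = 0.1s of lock waits on top of its own query time, which stays comfortably below even a 2s query timeout like the one mentioned for SFR.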

# loop around to retry with fallback key
else
# cache invalidation prevented lock from being taken
return yield
Contributor

Why wouldn't we want to loop and attempt to acquire the fill lock again in this case? I don't see a reason for this worker to not participate in the next round of waiting/filling, since a worker that hits this case won't have done any waiting yet (or it would have a lock value).

Worth noting that it's impossible for this case to happen when using_fallback_key is true, since the fallback key can never have the value IdentityCache::DELETED

Contributor Author

The lock wait limit isn't really configured as a lock fill retry limit, which seems like it would be to handle a different scenario. The lock wait limit is mostly about tolerating cache fill delays from the client that has the fill lock. Interruption from a cache invalidation doesn't take away from the number of waits, because it is a separate concern, thus lock_wait_limit -= 1 is not even done in the using_fallback_key = true code path.

I had considered whether we should have a configuration for the number of attempts to fetch or take the fill lock, and ended up deciding that it wasn't even the right solution to the problem, nor is it the right problem to focus on solving right now. Specifically, the problem comes up from frequently invalidated cache keys, which isn't really a good fit for read-through caching with cache invalidation. A better solution to this problem, if we wanted to address it, would be to add a data version to the cache invalidation value (IdentityCache::DELETED) so we could switch to using the fallback key at this point. Even if we wanted that solution, we would need an extra PR to handle backwards compatibility around deploys so the new cache invalidation value doesn't get interpreted as a cache hit. Either way, we would only benefit from solving this problem if there are still enough more reads than cache invalidations, since we would only be grouping reads for the same data version.

Contributor

@pushrax pushrax Jan 13, 2021

I thought I left this comment before but it seems to have gotten lost: I think this context is very useful and should go into the code comment here.

The takeaway I got is that looping here (without decrementing lock_wait_limit) could end up in an unbounded loop (exceeding lock_wait_limit + 2) due to extremely frequent invalidations, and that would be bad so we'd need another limit. However that other limit doesn't really make sense, since in this scenario the cache is pretty useless anyway.

It is possible for this case to be hit randomly on a key that isn't invalidating frequently though (just happened to have 2 invalidations in a row). But that should be rare so it's probably not worth handling explicitly.

Contributor Author

Worth noting that it's impossible for this case to happen when using_fallback_key is true, since the fallback key can never have the value IdentityCache::DELETED

Good point, the && !using_fallback_key condition shouldn't be necessary. I'll change that to raise if the fallback key is invalidated to make the state for the code path clearer.

Contributor Author

It is possible for this case to be hit randomly on a key that isn't invalidating frequently though (just happened to have 2 invalidations in a row). But that should be rare so it's probably not worth handling explicitly.

Yeah, this is the scenario I would like to add the data version to the cache invalidation value for. Mostly to avoid worrying about the scenario from a performance perspective.

However, we should see the benefits even in this scenario, since it should leave a smaller window of time for a thundering herd of reads to hit the database: just the round-trip time to memcached, rather than the longer time we have now of multiple queries to the database to load the record along with its embedded associations, plus the time to serialize the cache blob.


@martelogan
Member

martelogan commented Jan 13, 2021

Between our discussions irl and in-channel, I've done a few cursory reviews on this PR but haven't yet found a chance for a proper one. As we've discussed, I'm aligned to ship this either way by EOW, but hold off till tomorrow and I'll do a proper review this evening + tomorrow morning, just for a third ✅ / 👀

Member

@martelogan martelogan left a comment

I've gone over this a few times. Most of the major edge cases seem to have already been called out - though it's definitely difficult to have confidence in unknown unknowns where things die / go incomplete in partial states.

That said - Siberia's use case is probably as good of a state as we can possibly have to dry run this PR (single shop, read/writes are C1 flow only, sufficiently high load, known IDC -> SQL invalidation issues w/o fill lock, etc.).

Very likely, I'll still circle back and thread follow-up discussion here if there's a reason to think we've hit a particular edge case (in initial rollout or afterwards).

Let's start with the single shop and see how this goes 👌

@dylanahsmith dylanahsmith force-pushed the fill-lock-with-lock-wait branch 3 times, most recently from cac2055 to b2917a7 on January 15, 2021 13:05