-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LRU cache ensureFits() tweaks WRT deadlocking #796
Conversation
This could potentially spin forever if the size of that item is bigger than the max size so add this special case to handle it instead of spinning forever. In this case, the oldest one will not be removed in this function but later an item might be evicted when we will add an item to the LRU with c.lru.Add() so it will be OK either way -- just that we will not spin here forever.
* ensureFits() now returns true if the passed items will fit in the cache * Wire up the results to a new counter metric thanos_store_index_cache_items_overflowed_total * Add cache tests to check the edge cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good from the first glance, thanks!
Can we make remoteOlder bit smarter as in suggesiton?
pkg/store/cache.go
Outdated
@@ -127,7 +139,11 @@ func (c *indexCache) setPostings(b ulid.ULID, l labels.Label, v []byte) { | |||
c.mtx.Lock() | |||
defer c.mtx.Unlock() | |||
|
|||
c.ensureFits(v) | |||
fits := c.ensureFits(v) | |||
if !fits { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can inline this.
func (c *indexCache) ensureFits(b []byte) bool { | ||
if uint64(len(b)) > c.maxSize { | ||
return false | ||
} | ||
for c.curSize+uint64(len(b)) > c.maxSize { | ||
c.lru.RemoveOldest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could do more aggressive removal if the slice is big enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but how the implementation would be like? I have two ideas:
- If the slice size exceeds a certain threshold then we could remove 2 or more items at a time
- Get all keys, calculate their sizes, and remove the top X items to make sure there's enough size for the item
b
?
Either way, the scope would definitely blow up since I do not think it is worth making these kinds of changes without scrupulous benchmarking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way, maybe lets leave it for another day? I am mainly interested now in making sure that the loop doesn't go forever
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
No need to create an extra variable to hold a boolean.
Inc() is specifically designed for this case so use it instead of Add(1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Well done and tested. But would be nice to follow up on more aggressive removal. (:
Related to #651. An initial attempt to fix this because I am always nervous that some person could unintentionally make our Thanos Store instance hang even if the index cache is quite big :(
Changes
thanos_store_index_cache_items_overflowed_total
metric added which shows how many times we have tried to add an item to the cache which was simply too big according to the--index-cache-size
parameter.ensureFits()
now returns false when we cannot possibly ensure that the passed slice could fit into the cache. The original problem still exists in some form thatsetPostings
orsetSeries
could block for a longer time if there are a lot of small items in the cache however at least it shouldn't dead-lock.Verification
Wrote unit tests for it and they pass. It is kind of hard to reproduce this unfortunately in a live environment with the changes but I will see what I can do.