
Next steps with groupcache (galaxycache) #5037

Open
3 of 7 tasks
GiedriusS opened this issue Jan 6, 2022 · 12 comments
Labels
component: store · dont-go-stale (important issue, tells the stalebot not to close it) · feature request/improvement · help wanted

Comments

@GiedriusS
Member

GiedriusS commented Jan 6, 2022

Now that #4818 has been merged, here is a list of TODOs:

  • Make a pull request that pushes the changes (Thanos update) back into Cortex
  • Write documentation
  • Switch back to the upstream galaxycache version by implementing the epoch strategy on keys suggested in vimeo/galaxycache#25 ("*: add TTL support") and in https://medium.com/vimeo-engineering-blog/video-metadata-caching-at-vimeo-a54b25f0b304 (a sketch follows at the end of this comment)
  • Add microbenchmarks for fetching keys via HTTP.
  • Enable H2C on our HTTP server so that we get multiplexing even without TLS
  • Add e2e tests with TLS on Thanos Store w/ galaxycache. Currently galaxycache's HTTP client doesn't use TLS even though the server supports it.
  • Implement fetching multiple keys over one HTTP request. This isn't trivial since we copy data directly from memory into the response body :/ it needs some kind of encoding scheme with minimal overhead.

For implementing the last two, perhaps we could even switch to gRPC?

Maybe I have missed something?
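
To make item 3 a bit more concrete, here is a minimal sketch of the epoch-on-key idea from the linked PR and blog post, assuming we build the cache key ourselves before handing it to galaxycache (the epochKey helper is hypothetical, not existing API):

```go
// Hypothetical sketch only: this is not galaxycache API, just the
// key-construction idea suggested in the linked PR and blog post.
package main

import (
	"fmt"
	"time"
)

// epochKey appends a coarse time bucket to the caller's key. Every peer
// computes the same epoch for the same wall-clock window, so they agree
// on which key to fetch; once the window rolls over, the old key is
// simply never requested again and eventually falls out of the LRU.
func epochKey(key string, ttl time.Duration) string {
	epoch := time.Now().Unix() / int64(ttl.Seconds())
	return fmt.Sprintf("%s@%d", key, epoch)
}

func main() {
	// For a 1h TTL this prints something like "chunks/01ABC@473530".
	fmt.Println(epochKey("chunks/01ABC", time.Hour))
}
```

A real implementation would probably also want some per-key jitter so that the whole cache does not roll over at the same instant.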

@yeya24
Contributor

yeya24 commented Jan 8, 2022

Looking forward to the documentation. Can't wait to try it 👍

@GiedriusS
Member Author

#5068 fixes items [4] and [5] from the list above.
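
For context on item [5], enabling h2c (cleartext HTTP/2) on a Go net/http server generally looks like the sketch below, using golang.org/x/net/http2/h2c. This is only the general shape, not the actual wiring from #5068; the path and handler are placeholders.

```go
package main

import (
	"io"
	"net/http"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/h2c"
)

func main() {
	mux := http.NewServeMux()
	// Placeholder handler standing in for the cache's HTTP fetch path.
	mux.HandleFunc("/_galaxycache/", func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	})

	// h2c.NewHandler lets plaintext connections speak HTTP/2, which gives
	// us stream multiplexing over a single TCP connection without TLS.
	srv := &http.Server{
		Addr:    ":8080",
		Handler: h2c.NewHandler(mux, &http2.Server{}),
	}
	_ = srv.ListenAndServe()
}
```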

@stale

stale bot commented Apr 16, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Apr 16, 2022
@stale

stale bot commented Apr 30, 2022

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Apr 30, 2022
@GiedriusS GiedriusS reopened this Apr 30, 2022
@stale stale bot removed the stale label Apr 30, 2022
@GiedriusS
Member Author

In our experience, groupcache is still slow because of the last point: we need to fetch multiple keys in one request.
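
One low-overhead option for the multi-key fetch, assuming we get to define the wire format ourselves, is to length-prefix each value in the response body so the client can slice the values back out without any per-byte decoding. The sketch below is hypothetical (writeValues/readValues are not existing galaxycache API):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeValues frames each value as <uint32 little-endian length><raw bytes>.
// On the server this could write straight into the HTTP response body.
func writeValues(w io.Writer, values [][]byte) error {
	for _, v := range values {
		if err := binary.Write(w, binary.LittleEndian, uint32(len(v))); err != nil {
			return err
		}
		if _, err := w.Write(v); err != nil {
			return err
		}
	}
	return nil
}

// readValues reverses writeValues on the client side.
func readValues(r io.Reader) ([][]byte, error) {
	var out [][]byte
	for {
		var n uint32
		if err := binary.Read(r, binary.LittleEndian, &n); err == io.EOF {
			return out, nil
		} else if err != nil {
			return nil, err
		}
		v := make([]byte, n)
		if _, err := io.ReadFull(r, v); err != nil {
			return nil, err
		}
		out = append(out, v)
	}
}

func main() {
	var buf bytes.Buffer
	_ = writeValues(&buf, [][]byte{[]byte("chunk-a"), []byte("chunk-b")})
	vals, _ := readValues(&buf)
	fmt.Println(len(vals), string(vals[0]), string(vals[1]))
}
```

If we switched to gRPC as suggested above, this would simply become a repeated bytes field in the response message.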

@stale

stale bot commented Aug 13, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Aug 13, 2022
@GiedriusS GiedriusS added the dont-go-stale label and removed the stale label Aug 16, 2022
@pulkitent

Hey @GiedriusS, I would like to pick up one of the sub-tasks of this open issue: Make pull request that pushes changes (Thanos update) back into Cortex.

As I am new to open-source contribution and to the Thanos codebase, could someone share some docs to help me understand the requirement better?

Are there any acceptance criteria for this feature? And which flow/component in the Thanos codebase should I focus on? :)

@bwplotka
Member

Amazing, thanks! Regarding Make pull request that pushes changes (Thanos update) back into Cortex: I wonder if this is still relevant, given that we vendor the Cortex code. Could you double-check, @GiedriusS? I also think it is less important, since we are not sure whether groupcache can be a better solution than Redis. We first need to figure out whether groupcache is worth pursuing, so perhaps your ideas (or any new ones) around making it faster, plus a benchmark, should be the next step here.

cc @yeya24 @GiedriusS, does that sound right?

@GiedriusS
Member Author

Thanks for picking this up! I am personally doubtful whether it is worth pursuing this further; I might even be leaning towards deleting groupcache before 1.0 😢

As far as I understand, the problem is that we serialize access to a single map on every key retrieval: https://github.com/vimeo/galaxycache/blob/master/singleflight/singleflight.go#L41-L64. In Thanos, a bigger query can easily produce thousands of requests at the same time. I saw this once I tried deploying groupcache to prod: the latency actually jumped after groupcache was enabled 😱 I think the performance issue must be the contention. This blog post is pretty illuminating: https://dgraph.io/blog/post/introducing-ristretto-high-perf-go-cache/. I wonder if there's a better strategy we could use for the map 🤔
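
For what it's worth, one way to cut that contention without giving up on singleflight entirely would be to shard the group by key hash, so that each shard has its own mutex and map. A rough sketch, using golang.org/x/sync/singleflight purely as a stand-in for galaxycache's own implementation (shardedFlight is hypothetical):

```go
package cache

import (
	"hash/fnv"

	"golang.org/x/sync/singleflight"
)

// shardedFlight spreads singleflight bookkeeping across independent
// groups, each with its own mutex and map, so thousands of concurrent
// key fetches no longer serialize on a single lock.
type shardedFlight struct {
	shards [64]singleflight.Group
}

func (s *shardedFlight) Do(key string, fn func() (interface{}, error)) (interface{}, error) {
	h := fnv.New32a()
	_, _ = h.Write([]byte(key))
	v, err, _ := s.shards[h.Sum32()%uint32(len(s.shards))].Do(key, fn)
	return v, err
}
```

This keeps the per-key deduplication (which is the whole point of singleflight) but removes the cross-key contention on one shared map.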

@bwplotka
Member

Got it. I think it's a nice piece of work to pick up, but best effort. Still something to consider if the blockers are addressed.

What would be a benchmark that could reproduce the problem? What requirements are we looking for in terms of efficiency here?
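
As a starting point, a microbenchmark that hammers one singleflight group from many goroutines with mostly-distinct keys should surface the contention described above. A sketch, again using golang.org/x/sync/singleflight as a stand-in for galaxycache's copy:

```go
package cache_test

import (
	"strconv"
	"testing"

	"golang.org/x/sync/singleflight"
)

// The callback does trivial work on purpose, so the shared mutex and map
// inside the group dominate the profile rather than the fetch itself.
func BenchmarkSingleflightContention(b *testing.B) {
	var g singleflight.Group
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			key := strconv.Itoa(i % 4096)
			_, _, _ = g.Do(key, func() (interface{}, error) {
				return []byte("payload"), nil
			})
			i++
		}
	})
}
```

Comparing the numbers at different -cpu values, and against a sharded variant like the sketch in the previous comment, would be a first data point on the efficiency question.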

@mfoldenyi

Hey @GiedriusS and @bwplotka !

Over the past few days I have been trying to tune the query path on our dev system to see what it can do, and I traced our current bottleneck to our stores taking very long to return data to the queriers. In short, the problem was that we had no bucket_cache enabled at all, so everything was re-downloaded from S3 every time it was needed. I did some research on the available options and came away with the following conclusions:

  • IN-MEMORY
    -- works out of the box (in our case it's actually 40x faster than GROUPCACHE as-is)
    -- a little short on features
    -- scaling out stores is extremely expensive, as ultimately every store would need to keep its own copy of everything (not to mention that it is much harder to keep warm caches on all stores, making query performance all over the place)

  • GROUPCACHE (somewhat better than not caching at all, but obviously way worse than it could be)
    -- sounds great in theory, but as described above it has crippling performance issues
    -- scaling out the stores is reasonably cheap since they share part of their memory, and it requires minimal configuration

  • REDIS/MEMCACHED
    -- as described in #6939 (Thanos stores are unable to share the same redis database (bucket_cache)), scaling out the stores does not currently work with either, so at the moment just using IN-MEMORY instead seems like a good idea
    -- hard to set up since it needs additional components (compared to adding a 3-line config for the IN-MEMORY alternative)

I have read the discussion above about abandoning groupcache in favor of Redis, and I was a little confused considering the praise in the docs. Since that discussion is over two years old now, I wanted to ask whether it still reflects your opinion. Looking at the options, all of these implementations seem to be broken to some degree, and I wonder which direction would be the best one to go in.

So far I am sold on the GROUPCACHE approach, and I wonder if it is a good idea to try to dust off the implementation, put in the enhancements it was missing, and come out with the best theoretical solution at the other end.

  • What are your opinions?
  • Has it since been decided that Redis/Memcached is still the better alternative, contrary to what's written in the docs?
  • Is fixing groupcache unreasonably complicated to do?

@SuperQ
Contributor

SuperQ commented Jan 9, 2025

I don't think any decision has been made, but nobody has put any effort into debugging why the groupcache approach is not performing well.

I don't think anyone would object to you putting in time to make improvements.
