Thanos stores are unable to share the same redis database (bucket_cache) #6939

Chr0my · 2023-11-29T11:18:19Z

Hello guys,

I’m currently working on the redis cache & bucket_cache implementation on some of our store gateways and I’m facing some weird behaviors when enabling redis bucket_cache.

The redis cluster setup is a 3 node cluster running version 7 :

Redis server v=7.0.11 sha=00000000:0 malloc=jemalloc-5.3.0 bits=64 build=c4e7f6bf175a885b

Please find below the cache config that I use :

Index cache config :

config:
  addr: thns-cache1.domain:6379,thns-cache2.domain:6379,thns-cache3.domain:6379
  db: 0
  dial_timeout: 10s
  read_timeout: 5s
  write_timeout: 5s
  max_get_multi_concurrency: 100
  get_multi_batch_size: 100
  max_set_multi_concurrency: 100
  set_multi_batch_size: 100
  tls_enabled: false
  cache_size: 2GB
  expiration: 24h0m0s
  username: default
  password: SuperPassword

Bucket_cache config :

type: REDIS
config:
  addr: thns-cache1.domain:6379,thns-cache2.domain:6379,thns-cache3.domain:6379
  db: 0
  dial_timeout: 10s
  read_timeout: 5s
  write_timeout: 5s
  max_get_multi_concurrency: 100
  get_multi_batch_size: 100
  max_set_multi_concurrency: 100
  set_multi_batch_size: 100
  tls_enabled: false
  cache_size: 2GB
  expiration: 24h0m0s
  username: default
  password: SuperPassword

chunk_subrange_size: 16000
max_chunks_get_range_requests: 3
chunk_object_attrs_ttl: 24h
chunk_subrange_ttl: 24h
blocks_iter_ttl: 5m
metafile_exists_ttl: 2h
metafile_doesnt_exist_ttl: 15m
metafile_content_ttl: 24h
metafile_max_size: 1MiB

Here's the initial status of the loaded blocks before enabling cache & bucket_cache :

Here's what I see after enabling it :

... and getting the following logs :

Logs

Nov 29 10:16:21 thns-store.domain thanos[305695]: ts=2023-11-29T10:16:21.355740843Z caller=bucket.go:681 level=warn msg="loading block failed" elapsed=11.963466ms id=01HGCQ5GH923J1A0FKTW3FGC22 err="create index header reader: write index header: new index reader: get object attributes of 01HGCQ5GH923J1A0FKTW3FGC22/index: The specified key does not exist."
Nov 29 10:16:21 thns-store.domain thanos[305695]: ts=2023-11-29T10:16:21.355784695Z caller=bucket.go:681 level=warn msg="loading block failed" elapsed=11.53174ms id=01HGBDZ543XFFB4VQDP9MJREYD err="create index header reader: write index header: new index reader: get object attributes of 01HGBDZ543XFFB4VQDP9MJREYD/index: The specified key does not exist."
Nov 29 10:16:21 thns-store.domain thanos[305695]: ts=2023-11-29T10:16:21.355849129Z caller=bucket.go:681 level=warn msg="loading block failed" elapsed=11.693688ms id=01HGCG9SB4Q62KZ1YGY4TRY77D err="create index header reader: write index header: new index reader: get object attributes of 01HGCG9SB4Q62KZ1YGY4TRY77D/index: The specified key does not exist."
Nov 29 10:16:21 thns-store.domain thanos[305695]: ts=2023-11-29T10:16:21.355894902Z caller=bucket.go:681 level=warn msg="loading block failed" elapsed=11.84724ms id=01HGCY17NT22Z59X03JTE5AZ69 err="create index header reader: write index header: new index reader: get object attributes of 01HGCY17NT22Z59X03JTE5AZ69/index: The specified key does not exist.

Just by curiosity, I wanted to see if I was able to reproduce it exactly so I did the same operation one more time:

remove all redis cache config (index & bucket) and restart the store
re-apply the exact same configuration and restart the store

This time, when the store has restarted, everything looks just fine 🤷‍♂️

Thanos version :

thanos, version 0.32.4 (branch: HEAD, revision: fcd5683e3049924ae26a680e166ae6f27a344896)
  build user:       root@afb5016d2fc4
  build date:       20231002-07:45:12
  go version:       go1.20.8
  platform:         linux/amd64
  tags:             netgo

What you expected to happen:

I expect that all the blocks are loaded without issue when enabling the bucket_cache

Second point, I don't know if it can be linked, or not (if not, my apologies and I can create another issue), I've then tried to load some metrics (last 3d) and in the logs :

Nov 29 11:00:35 thns-store.domain thanos[306234]: ts=2023-11-29T11:00:35.149732061Z caller=memcached.go:164 level=error msg="failed to cache series in memcached" err="the async buffer is full"
Nov 29 11:00:35 thns-store.domain thanos[306234]: ts=2023-11-29T11:00:35.149828754Z caller=memcached.go:164 level=error msg="failed to cache series in memcached" err="the async buffer is full"
Nov 29 11:00:35 thns-store.domain thanos[306234]: ts=2023-11-29T11:00:35.149927293Z caller=memcached.go:164 level=error msg="failed to cache series in memcached" err="the async buffer is full"
Nov 29 11:00:35 thns-store.domain thanos[306234]: ts=2023-11-29T11:00:35.15002694Z caller=memcached.go:164 level=error msg="failed to cache series in memcached" err="the async buffer is full"
Nov 29 11:00:35 thns-store.domain thanos[306234]: ts=2023-11-29T11:00:35.150109701Z caller=memcached.go:164 level=error msg="failed to cache series in memcached" err="the async buffer is full"

I'm not using any memcache in the configs, so I tried to find where this can come from, and I discovered that it might be related to NewRemoteIndexCache function that is called from the factory.go, which is the one from memcache.go.

Is it really trying to call memcache even when using redis ? Or is it just a matter or printing and this is the async buffer of the redis implementation that need to be increased ? In both scenarios (if I'm not mistaken), a small explanation / clarification is needed here, because it's confusing. 🤔

Environment:

OS : Debian 12 bookworm (e.g. from /etc/os-release):
Kernel : Linux 6.1.0-12-amd64 Debian 6.1.52-1 (2023-09-07) x86_64 GNU/Linux

Thanks a lot ! 🙏

The text was updated successfully, but these errors were encountered:

MichaHoffmann · 2023-11-29T12:29:20Z

Same for us

MichaHoffmann · 2023-11-29T12:34:21Z

I think the memcached thing is only cosmetic because the remote index cache is situated in a file called memcached.go

MichaHoffmann · 2023-11-29T12:36:03Z

Redis cache should have a config option "max_async_buffer_size" which does feel related. Why blocks are not loading because of it is a mystery though.

Chr0my · 2023-11-30T08:36:50Z

Thanks for your answer @MichaHoffmann.

I think the memcached thing is only cosmetic because the remote index cache is situated in a file called memcached.go

Yes that was my first feeling but wasn't 100% sure, thank you.

Why blocks are not loading because of it is a mystery though.

I'm not sure the logs and the fact that the blocks are not loaded are linked finally, as I only launched the queries on the stores that were successfully started with all blocks.

Redis cache should have a config option "max_async_buffer_size" which does feel related.

Exactly, I played with it quickly and didn't found the good setup ATM. I can try to tweak the conf, but I wont go deeper right now until the initial issue is fixed, we have currently ~40 different stores to setup and I cannot afford to loose them randomly 😄

I'd be glad to do any tests you need to help finding out this issue 👍

MichaHoffmann · 2023-11-30T08:41:18Z

We also face the same issue ( though sporadically ); I wonder if we just hit the redis cache too hard and because of network latency the buffer fills too quickly.

Chr0my · 2023-12-04T10:38:24Z

Hi,

I did another set of tests increasing the log level (debug) and focused on a specific block.

Dec 04 10:08:08 thns-store-domain thanos[309001]: ts=2023-12-04T10:08:08.597954706Z caller=bucket.go:672 level=debug msg="loading new block" id=01HGK3X3JE5XYB796KRXD4HDA9
Dec 04 10:08:08 thns-store.domain thanos[309001]: ts=2023-12-04T10:08:08.598146159Z caller=binary_reader.go:536 level=debug msg="failed to read index-header from disk; recreating" path=/mnt/space1/thanos/store/01HGK3X3JE5XYB796KRXD4HDA9/index-header err="try lock file: open /mnt/space1/thanos/store/01HGK3X3JE5XYB796KRXD4HDA9/index-header: no such file or directory"

Dec 04 10:08:08 thns-storedomain thanos[309001]: ts=2023-12-04T10:08:08.743357367Z caller=bucket.go:681 level=warn msg="loading block failed" elapsed=145.403965ms id=01HGK3X3JE5XYB796KRXD4HDA9 err="create index header reader: write index header: new index reader: get object attributes of 01HGK3X3JE5XYB796KRXD4HDA9/index: The specified key does not exist."

Dec 04 10:08:08 thns-store.domain thanos[309001]: ts=2023-12-04T10:08:08.786764973Z caller=bucket.go:672 level=debug msg="loading new block" id=01HGK3X3JE5XYB796KRXD4HDA9
Dec 04 10:08:08 thns-store.domain thanos[309001]: ts=2023-12-04T10:08:08.786894177Z caller=binary_reader.go:536 level=debug msg="failed to read index-header from disk; recreating" path=/mnt/space1/thanos/store/01HGK3X3JE5XYB796KRXD4HDA9/index-header err="try lock file: open /mnt/space1/thanos/store/01HGK3X3JE5XYB796KRXD4HDA9/index-header: no such file or directory"

Could it be more global than just redis ? I've read this issue and we can observe the same kind of logs using memcached 🤔
Sadly it has been closed without real answer.

Chr0my · 2023-12-04T11:31:28Z

Thanks to the issue mentioned above ☝️ I realized that the failing blocks comes from another stores.
After ensuring only 1 store use the redis cluster, it restarted without any issue.

@MichaHoffmann Can you tell me if you are in the same situation with more than 1 store using the same cluster ?

Is it a known / wanted behavior not be able to share the cache system across different stores or can it be marked as a bug?
That's not very practical if we have to deploy a cluster for each store 😕
Edit : I'll try to do some tests using different redis db / store
Edit 2 : I'm unable to select a different database as mentioned here 😞

MichaHoffmann · 2023-12-04T14:00:37Z

The cache should be shared across different deployments; this is for sure a bug i think

MichaHoffmann · 2023-12-06T18:53:26Z

Ah so we are using redis only for index cache thats why i dont see the bucket meta sync issues i think! For bucket cache we use groupcache which works fine as far as i can tell!

Chr0my · 2023-12-08T10:02:10Z

Yes it looks that only bucket_cache is concerned, I didn't notice any trouble using only index cache, but my tests didn't last long because of the issue mentioned.

A potential solution could be be to add kind of storeID that would prefix the keys in the cache so each store would be able to retrieve only the keys that belong to it 🤔 And this would work for either memcache or redis, no?

On our side, right now, we will go back and deploy simple redis instances and use a dedicated databases for each stores.

About groupcache, if I understand well, a group should only contains stores that point to the same object storage; on our side it would mean multiplying at least by 2 or 3 the number of stores (from ~40 to 80 or 120). I'm not sure it's worth it

GiedriusS · 2024-04-15T09:27:46Z

Added path to the hash here #7158 so this should help given that you have a separate file for specifying the bucket cache configuration. Long-term fix would be to add a field like name to the cache config.

gjshao44 · 2024-09-12T16:38:21Z

I ran into a similar issue with memcached, two or more stores would eventually make long term metrics not usable. Revert to in-memory cache for bucket for now. Definitily would appreciate prioritize this fix.

kaiohenricunha · 2024-10-08T13:27:39Z

I'm having the same issues on a single instance Elasticache Redis with index cache on db 0 and bucket cache on db 1. So, the problem is not that they're sharing the same database.

Changing the bucket cache to in-memory, as mentioned by @gjshao44 , solved it.

Chr0my changed the title ~~Thanos Store randomly not loading blocks using redis bucket_cache~~ Thanos stores are unable to share the same redis database (bucket_cache) Jan 5, 2024

GiedriusS mentioned this issue Sep 13, 2024

proposal: have a list of live blocks in object storage #7710

Open

2 tasks

kaiohenricunha mentioned this issue Oct 7, 2024

[Thanos Store] Guidance on best way to scale sharded and autoscaled Thanos Store with persistence enabled #7797

Open

mfoldenyi mentioned this issue Jan 9, 2025

Next steps with groupcache (galaxycache) #5037

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Thanos stores are unable to share the same redis database (bucket_cache) #6939

Thanos stores are unable to share the same redis database (bucket_cache) #6939

Chr0my commented Nov 29, 2023

MichaHoffmann commented Nov 29, 2023

MichaHoffmann commented Nov 29, 2023

MichaHoffmann commented Nov 29, 2023 •

edited

Loading

Chr0my commented Nov 30, 2023 •

edited

Loading

MichaHoffmann commented Nov 30, 2023

Chr0my commented Dec 4, 2023

Chr0my commented Dec 4, 2023 •

edited

Loading

MichaHoffmann commented Dec 4, 2023

MichaHoffmann commented Dec 6, 2023

Chr0my commented Dec 8, 2023

GiedriusS commented Apr 15, 2024

gjshao44 commented Sep 12, 2024

kaiohenricunha commented Oct 8, 2024

Thanos stores are unable to share the same redis database (bucket_cache) #6939

Thanos stores are unable to share the same redis database (bucket_cache) #6939

Comments

Chr0my commented Nov 29, 2023

MichaHoffmann commented Nov 29, 2023

MichaHoffmann commented Nov 29, 2023

MichaHoffmann commented Nov 29, 2023 • edited Loading

Chr0my commented Nov 30, 2023 • edited Loading

MichaHoffmann commented Nov 30, 2023

Chr0my commented Dec 4, 2023

Chr0my commented Dec 4, 2023 • edited Loading

MichaHoffmann commented Dec 4, 2023

MichaHoffmann commented Dec 6, 2023

Chr0my commented Dec 8, 2023

GiedriusS commented Apr 15, 2024

gjshao44 commented Sep 12, 2024

kaiohenricunha commented Oct 8, 2024

MichaHoffmann commented Nov 29, 2023 •

edited

Loading

Chr0my commented Nov 30, 2023 •

edited

Loading

Chr0my commented Dec 4, 2023 •

edited

Loading