proposal: have a list of live blocks in object storage #7710
Conversation
Thanks, some comments!
## Why

Accessing blocks from object storage is generally orders of magnitude slower than accessing the same data from sources that can serve it from disk or from memory. If the source that originally uploaded the blocks is still alive and still has access to them, the expensively obtained object storage data would be thrown away during deduplication anyway. Note that this will not put additional pressure on the source components, since they would get queried during fan-out anyway. As an example, imagine a Sidecar next to a Prometheus server that has 3 months of retention and a Store Gateway that is configured to serve the whole range. Right now we don't have a great way to deal with this dynamically; this proposal aims to address that.
There is somewhat of a way to deal with this through the `--min-time` and `--max-time` flags, but it is not ideal:
- What if some Sidecar/Receiver fails to upload blocks? Receiver doesn't delete blocks until they are uploaded, so it might accidentally fall out of the configured min/max time.
- It almost requires all components (Receive/Sidecar/Ruler) to have equal retention to make `--max-time` work properly; otherwise, you might have to have multiple Thanos Store replicas with different `--max-time` values and selector labels.

We would like to improve this user experience. I would add this to the Why section.
### Solution

Each source uses the `shipper` component to upload blocks into object storage; consequently, the `shipper` also has a complete picture of the blocks that the source owns. We propose extending the `shipper` to also own a register in object storage named `thanos/sources/<uuid>/live_blocks.json` that contains a plain list of the live blocks that this source owns. We can deduce when it was last updated by checking its attributes in object storage. When the Store Gateway syncs its list of block metas, it can also iterate the `thanos/sources` directory and see which `live_blocks.json` files have been updated recently enough to assume that their sources are still alive. It can subsequently build an index of live block IDs and proceed to prune them when it handles Store API requests (Series, LabelValues, LabelNames RPCs). In theory this should not lead to gaps, since the pruned blocks are still owned by other live sources. Care needs to be taken to make sure that the blocks are within the `--min-time`/`--max-time` bounds of the source. The UUID of the source should be propagated in the `Info` gRPC call to the Querier and through the `Series` gRPC call into the Store Gateway. This ensures we only prune blocks whose sources are still alive and registered with the Querier. Note that this is not a breaking change; it is only an opt-in optimization.
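To make the register format more concrete, here is a minimal sketch in Go, assuming the `thanos-io/objstore` bucket client and a hypothetical `LiveBlocks` schema (the field names `source_uuid`, `updated_at`, and `blocks`, as well as the write path, are illustrative only, not part of the proposal):

```go
package liveblocks

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/oklog/ulid"
	"github.com/thanos-io/objstore"
)

// LiveBlocks is a hypothetical on-object-storage format for
// thanos/sources/<uuid>/live_blocks.json.
type LiveBlocks struct {
	SourceUUID string      `json:"source_uuid"`
	UpdatedAt  time.Time   `json:"updated_at"` // written into the file so no extra HEAD call is needed
	Blocks     []ulid.ULID `json:"blocks"`     // block IDs this source still owns locally
}

// Upload serializes the list and writes it to the per-source register.
func Upload(ctx context.Context, bkt objstore.Bucket, lb LiveBlocks) error {
	b, err := json.Marshal(lb)
	if err != nil {
		return err
	}
	path := fmt.Sprintf("thanos/sources/%s/live_blocks.json", lb.SourceUUID)
	return bkt.Upload(ctx, path, bytes.NewReader(b))
}
```

Whether the update timestamp lives in the object's attributes or inside the file itself is discussed further below.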
I would suggest creating a separate subsystem/entity that would use the Shipper to upload `thanos/sources/<uuid>/live_blocks.json`, instead of hooking the logic inside of the Shipper. The reason is that we have external projects using the Shipper.
1. How do we obtain a stable UUID to claim ownership of `thanos/sources/<uuid>/live_blocks.json`?

I propose that we extend Sidecar, Ruler, and Receiver to read the file `./thanos-id` on startup. This file should contain a UUID that identifies the instance. If the file does not exist, we generate a random UUID and write the file, which should give us a reasonably stable UUID for the lifetime of this service.
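A minimal sketch of that startup logic, assuming `github.com/google/uuid` for generation; the function name and file-handling details are illustrative only:

```go
package main

import (
	"os"
	"strings"

	"github.com/google/uuid"
)

// loadOrCreateSourceID returns the UUID stored in ./thanos-id,
// creating the file with a fresh random UUID if it does not exist.
func loadOrCreateSourceID(path string) (string, error) {
	if b, err := os.ReadFile(path); err == nil {
		return strings.TrimSpace(string(b)), nil
	} else if !os.IsNotExist(err) {
		return "", err
	}
	id := uuid.New().String()
	if err := os.WriteFile(path, []byte(id+"\n"), 0o644); err != nil {
		return "", err
	}
	return id, nil
}
```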
👍 this would also help us solve #6939. I propose adding this to Thanos Store too.
Yeah, let's just add it to all components!
3. What happens if a block was deleted due to retention in Prometheus but the shipper has not uploaded a new `live_blocks.json` file yet?

We shall only filter blocks from the `live_blocks.json` list with a small buffer depending on the last-updated timestamp. Since this list is essentially a snapshot of the blocks on disk, any block deleted because of retention will have been deleted after this timestamp. Any block whose range overlaps `oldest_live_block_start - (now - last_updated)` could have been deleted because of retention, so it should not be pruned. Example: if the list was updated 1 hour ago, we should not filter the oldest block from it; if it was updated 3 hours ago, we should not filter the oldest 2 blocks, and so on.
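A rough sketch of that buffer, following the worked example above and assuming uniformly sized blocks (e.g. 2h Prometheus blocks); `keepOldest` and its signature are illustrative only:

```go
package liveblocks

import (
	"math"
	"time"
)

// keepOldest returns how many of the oldest entries in live_blocks.json
// should NOT be pruned by the Store Gateway, because retention in the
// source may have deleted them since the snapshot was last written.
// With 2h blocks: 1h of staleness keeps the oldest block, 3h keeps the
// oldest two, and so on.
func keepOldest(lastUpdated time.Time, blockRange time.Duration) int {
	staleness := time.Since(lastUpdated)
	if staleness <= 0 {
		return 0
	}
	return int(math.Ceil(float64(staleness) / float64(blockRange)))
}
```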
Should we include the timestamp inside of that file to avoid having to do an extra HEAD call? AFAICT that would be needed.
I thought maybe HEAD would be more accurate because of possible clock skew, but let's just write the timestamp into the file.
* We tend to do tons of `Series` calls, and a bloom filter for a decently sized bucket of 10k blocks could be enormous. Additionally, the live blocks don't really change often, so updating it on every `Series` call seems unnecessarily expensive.

2. Shared Redis/Memcached instance
Probably also worth mentioning why we don't use a filter here and opt for a JSON format with no compression - risk of false positives, UUIDs are essentially random so they don't compress well.