proposal: have a list of live blocks in object storage #7710
Conversation
Thanks, some comments!
## Why

Accessing blocks from object storage is generally orders of magnitude slower than accessing the same data from sources that can serve it from disk or from memory. If the source that originally uploaded the blocks is still alive and still has access to them, the expensively obtained object storage data would be thrown away during deduplication anyway. Note that this will not put additional pressure on the source components, since they would get queried during fan-out anyway. As an example, imagine a Sidecar next to a Prometheus server that has 3 months of retention and a Store Gateway that is configured to serve the whole range. Right now we don't have a great way to deal with this dynamically; this proposal aims to address that.
There is somewhat of a way to deal with this through the `--min-time` and `--max-time` flags, but it is not ideal:
- What if some Sidecar/Receiver fails to upload blocks? Receiver doesn't delete blocks until they are uploaded, so it might accidentally fall out of the configured min/max time.
- It almost requires all components (Receive/Sidecar/Ruler) to have equal retention to make `--max-time` work properly; otherwise, you might have to have multiple Thanos Store replicas with different `--max-time` values and selector labels.

We would like to improve this user experience. I would add this to the Why section.
### Solution

Each source uses the `shipper` component to upload blocks into object storage; consequently, the `shipper` also has a complete picture of the blocks that the source owns. We propose extending the `shipper` to also own a register in object storage named `thanos/sources/<uuid>/live_blocks.json` that contains a plain list of the live blocks that this source owns. We can deduce when it was last updated by checking its attributes in object storage. When the Store Gateway syncs its list of block metas, it can also iterate the `thanos/sources` directory and see which `live_blocks.json` files have been updated recently enough to assume that their sources are still alive. It can subsequently build an index of live block IDs and proceed to prune them when it handles Store API requests (Series, LabelValues, LabelNames RPCs). In theory this should not lead to gaps, since the pruned blocks are still owned by other live sources. Care needs to be taken to make sure that the blocks are within the `--min-time`/`--max-time` bounds of the source. The UUID of the source should be propagated in the `Info` gRPC call to the Querier and through the `Series` gRPC call into the Store Gateway. This ensures we only prune blocks whose sources are still alive and registered with the Querier. Note that this is not a breaking change; it is only an opt-in optimization.
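To make the register format more concrete, here is a minimal sketch in Go, assuming the `thanos-io/objstore` bucket client and a hypothetical `LiveBlocks` schema (the field names `source_uuid`, `updated_at`, and `blocks`, as well as the write path, are illustrative only, not part of the proposal):

```go
package liveblocks

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/oklog/ulid"
	"github.com/thanos-io/objstore"
)

// LiveBlocks is a hypothetical on-object-storage format for
// thanos/sources/<uuid>/live_blocks.json.
type LiveBlocks struct {
	SourceUUID string      `json:"source_uuid"`
	UpdatedAt  time.Time   `json:"updated_at"` // written into the file so no extra HEAD call is needed
	Blocks     []ulid.ULID `json:"blocks"`     // block IDs this source still owns locally
}

// Upload serializes the list and writes it to the per-source register.
func Upload(ctx context.Context, bkt objstore.Bucket, lb LiveBlocks) error {
	b, err := json.Marshal(lb)
	if err != nil {
		return err
	}
	path := fmt.Sprintf("thanos/sources/%s/live_blocks.json", lb.SourceUUID)
	return bkt.Upload(ctx, path, bytes.NewReader(b))
}
```

Whether the update timestamp lives in the object's attributes or inside the file itself is discussed further below.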
I would suggest creating a separate subsystem/entity that would use the Shipper to upload `thanos/sources/<uuid>/live_blocks.json`, instead of hooking the logic inside of the Shipper. The reason is that we have external projects using the Shipper.
1. How do we obtain a stable UUID to claim ownership of `thanos/sources/<uuid>/live_blocks.json`?

I propose that we extend Sidecar, Ruler, and Receiver to read the file `./thanos-id` on startup. This file should contain a UUID that identifies the instance. If the file does not exist, we generate a random UUID and write the file, which should give us a reasonably stable UUID for the lifetime of this service.
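A minimal sketch of that startup logic, assuming `github.com/google/uuid` for generation; the function name and file-handling details are illustrative only:

```go
package main

import (
	"os"
	"strings"

	"github.com/google/uuid"
)

// loadOrCreateSourceID returns the UUID stored in ./thanos-id,
// creating the file with a fresh random UUID if it does not exist.
func loadOrCreateSourceID(path string) (string, error) {
	if b, err := os.ReadFile(path); err == nil {
		return strings.TrimSpace(string(b)), nil
	} else if !os.IsNotExist(err) {
		return "", err
	}
	id := uuid.New().String()
	if err := os.WriteFile(path, []byte(id+"\n"), 0o644); err != nil {
		return "", err
	}
	return id, nil
}
```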
👍 this would also help us solve #6939. I propose adding this to Thanos Store too.
Yeah, let's just add it to all components!
3. What happens if a block was deleted due to retention in Prometheus but the shipper has not uploaded a new `live_blocks.json` file yet?

We shall only filter blocks from the `live_blocks.json` list with a small buffer depending on the last-updated timestamp. Since this list is essentially a snapshot of the blocks on disk, any block deleted because of retention will have been deleted after this timestamp. Any block whose range overlaps `oldest_live_block_start - (now - last_updated)` could have been deleted because of retention, so it should not be pruned. Example: if the list was updated 1 hour ago, we should not filter the oldest block from it; if it was updated 3 hours ago, we should not filter the oldest 2 blocks, and so on.
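A rough sketch of that buffer, following the worked example above and assuming uniformly sized blocks (e.g. 2h Prometheus blocks); `keepOldest` and its signature are illustrative only:

```go
package liveblocks

import (
	"math"
	"time"
)

// keepOldest returns how many of the oldest entries in live_blocks.json
// should NOT be pruned by the Store Gateway, because retention in the
// source may have deleted them since the snapshot was last written.
// With 2h blocks: 1h of staleness keeps the oldest block, 3h keeps the
// oldest two, and so on.
func keepOldest(lastUpdated time.Time, blockRange time.Duration) int {
	staleness := time.Since(lastUpdated)
	if staleness <= 0 {
		return 0
	}
	return int(math.Ceil(float64(staleness) / float64(blockRange)))
}
```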
Should we include the timestamp inside of that file to avoid having to do an extra HEAD call? AFAICT that would be needed.
I thought maybe HEAD would be more accurate because of possible clock skew, but let's just write the timestamp into the file.
* We tend to do tons of `Series` calls, and a bloom filter for a decently sized bucket of 10k blocks could be enormous. Additionally, the live blocks don't really change often, so updating it on every `Series` call seems unnecessarily expensive.

2. Shared Redis/Memcached instance
Probably also worth mentioning why we don't use a filter here and opt for a JSON format with no compression - risk of false positives, UUIDs are essentially random so they don't compress well.