
Compacting TSDB head on every ingester shutdown to speed up startup #98

Closed
grafanabot opened this issue Aug 10, 2021 · 23 comments

@grafanabot
Contributor

Describe the solution you'd like
The typical use case for an ingester shutdown is a rolling update. We currently close the TSDB and, at the subsequent startup, we replay the WAL before the ingester is ready. Replaying the WAL is slow, and we recently found out that compacting the head and shipping it to the storage on /shutdown is actually faster than replaying the WAL.

Idea: what if we always compact TSDB head and ship it to the storage at shutdown?

Question:

  • If we compact TSDB head (up until head max time) on shutdown, what's the last checkpoint created and what's actually replayed from WAL at startup?

Pros:

  • The ingester rollout may be faster
  • The scale-down wouldn't be a snowflake operation anymore (currently it requires calling the /shutdown API beforehand)

Cons (potential blockers):

  • At ingester startup, can the ingester ingest samples with timestamps < the last samples ingested before shutting down?

Let's discuss it.

Submitted by: pracucci
Cortex Issue Number: 3723
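
A minimal sketch of what the proposed shutdown path could look like, assuming the Prometheus tsdb package; the function name and the ship callback are illustrative stand-ins, not the actual ingester code:

```go
package ingester // illustrative sketch, not the real ingester package

import (
	"context"

	"github.com/prometheus/prometheus/tsdb"
)

// flushHeadOnShutdown compacts everything currently in the Head into blocks on
// disk, then uploads those blocks before the process exits, so the next
// startup does not have to rebuild them from the WAL.
func flushHeadOnShutdown(ctx context.Context, db *tsdb.DB, ship func(context.Context) error) error {
	h := db.Head()
	// Compact the whole in-memory Head range into one or more persisted blocks.
	if err := db.CompactHead(tsdb.NewRangeHead(h, h.MinTime(), h.MaxTime())); err != nil {
		return err
	}
	// Ship the freshly cut blocks to object storage (ship stands in for the shipper).
	return ship(ctx)
}
```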

@grafanabot
Contributor Author

If we compact TSDB head (up until head max time) on shutdown

This, I assume, will be similar to how the flush/shutdown endpoint works, where you cut the block at a 2h boundary.

If we compact TSDB head (up until head max time) on shutdown, what's the last checkpoint created and what's actually replayed from WAL at startup?

A checkpoint is blindly created from the first 2/3 of the WAL segments, discarding any old samples, which is basically everything here. So startup would replay the checkpoint for the series and then the last 1/3 of the WAL. But since that WAL replay skips all the samples, it will be faster than an actual replay, because building the chunks is the slow part (this was validated by the WAL replay with the m-mapped Head chunks).

At ingester startup, can the ingester ingest samples with timestamps < the last samples ingested before shutting down?

Currently it cannot if the samples overlap a block :/. But if you delete the block from disk after shipping (which I don't think is in place, because of queries, and also the WAL replay would depend on it), then it can.

Submitted by: codesome
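
To make the 2/3 rule above concrete, here is a small illustrative helper (the real segment selection in Prometheus has a few more details, e.g. the last segment is never considered for the checkpoint):

```go
package walnotes // illustrative only

// checkpointRange shows roughly how the checkpoint boundary is chosen: about
// the first two thirds of the WAL segments are folded into the checkpoint and
// the rest are replayed. With segments 0..8 on disk this returns (5, 6):
// segments 0-5 go into the checkpoint, segments 6-8 are replayed at startup.
func checkpointRange(first, last int) (checkpointUpTo, replayFrom int) {
	checkpointUpTo = first + (last-first)*2/3
	replayFrom = checkpointUpTo + 1
	return checkpointUpTo, replayFrom
}
```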

@grafanabot
Contributor Author

Currently it cannot if the samples overlap a block :/. But if you delete the block from disk after shipping (which I don't think is in place, because of queries, and also the WAL replay would depend on it), then it can.

We can't delete blocks from disk because of queries (as you mentioned). What if we delete the whole WAL after the head is compacted and the blocks are shipped? Would that change the behaviour?

Submitted by: pracucci

@grafanabot
Contributor Author

If we can guarantee no new samples come in during shutdown, then we could delete the WAL after a successful compaction of the entire Head, and the replay would be quick.

Submitted by: codesome

@grafanabot
Contributor Author

If we can guarantee no new samples come in during shutdown

We already do.

Submitted by: pstibrany

@grafanabot
Contributor Author

What if we delete the whole WAL after the head is compacted and the blocks are shipped?

Race detected :)

It will only affect the replay, but the timestamp from which samples are allowed will still be taken from the maxt of the last block, to avoid overlaps.

Submitted by: codesome

@grafanabot
Contributor Author

https://github.com/prometheus/prometheus/blob/d8c17025df163cbdb27e78974a5d6ed174fc04a0/tsdb/db.go#L669-L683

If we could split this Head initialisation out of DB.Open, we could skip taking the block maxt as the min valid time, and then the first sample would determine the acceptable range.

Submitted by: codesome
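
For reference, the linked DB.Open snippet boils down to roughly the following (a paraphrase for illustration, not the actual Prometheus source): the newest block's maxt becomes the head's minimum valid time, which is exactly what rejects "old" samples after a restart.

```go
package tsdbnotes // paraphrase for illustration

import "math"

// headMinValidTime mirrors the logic in the linked DB.Open lines: with no
// blocks on disk everything is appendable, otherwise the newest block's maxt
// becomes the floor below which incoming samples are rejected.
func headMinValidTime(blockMaxTimes []int64) int64 {
	minValidTime := int64(math.MinInt64)
	if len(blockMaxTimes) > 0 {
		minValidTime = blockMaxTimes[len(blockMaxTimes)-1] // newest block's maxt
	}
	return minValidTime
}
```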

@grafanabot
Contributor Author

It will only affect the replay, but the timestamp from which samples are allowed will still be taken from the maxt of the last block, to avoid overlaps.

What if we discuss an option in TSDB to only pick the latest timestamp from WAL (not blocks) and allow overlapping blocks? What would be the downsides?

Edit: you commented on this while I was posting the same question.

Submitted by: pracucci

@grafanabot
Contributor Author

The latest sample in the WAL also determines the minValidTime: the min valid time for ingestion is either the minValidTime set or head.MaxTime() - 1h, whichever is lower. The WAL replay depends on the initial minValidTime provided to discard old samples, so breaking down DB.Open as above would be the easier refactor, letting the first sample decide the minValidTime. That said, we will need to delete the WAL, else it will end up loading the entire WAL in this case :)

Submitted by: codesome

@grafanabot
Contributor Author

There is an even simpler solution: expose a method on the Head block to set the min valid time, and set it to math.MinInt64 after loading the DB. Then there is no need to delete the WAL before shutdown.

Submitted by: codesome
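
A sketch of that workaround, assuming Head exposes a setter along the lines of a SetMinValidTime method (treated as hypothetical here; whether such a method exists, or existed at the time of this discussion, is not verified):

```go
package ingester // illustrative sketch

import (
	"math"

	"github.com/prometheus/prometheus/tsdb"
)

// allowOldSamples drops the block-derived floor after the DB has been opened,
// so samples older than the newest shipped block are accepted again following
// a clean restart. Head.SetMinValidTime is an assumed API, not a confirmed one.
func allowOldSamples(db *tsdb.DB) {
	db.Head().SetMinValidTime(math.MinInt64)
}
```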

@grafanabot
Contributor Author

There is an even simpler solution: expose a method on the Head block to set the min valid time, and set it to math.MinInt64 after loading the DB. Then there is no need to delete the WAL before shutdown.

Could it conflict with a subsequent WAL replay (think about an ingester crash)? We may end up with out of order samples in the WAL.

Submitted by: pracucci

@grafanabot
Contributor Author

Next question: wouldn't Prometheus benefit from a similar approach too, basically using the WAL only to recover from crash?

Submitted by: pracucci

@grafanabot
Contributor Author

Next question: wouldn't Prometheus benefit from a similar approach too, basically using the WAL only to recover from crash?

How would this affect remote write, which also uses WAL?

Submitted by: pstibrany

@grafanabot
Contributor Author

Could it conflict with a subsequent WAL replay (think about an ingester crash)? We may end up with out of order samples in the WAL.

If we delete the WAL on graceful shutdown, we will be able to detect a graceful shutdown and change the minValidTime only at that time.

Submitted by: codesome

@grafanabot
Contributor Author

Next question: wouldn't Prometheus benefit from a similar approach too, basically using the WAL only to recover from crash?

I was going to propose this to Prometheus just now, until I saw @pstibrany's comment. Since remote-write depends on the WAL, it's highly unlikely we would want to throw away the old WAL on restart. (I don't know where remote-write starts from in the WAL after a restart, so there is some possibility.)

Submitted by: codesome

@grafanabot
Contributor Author

See: prometheus/prometheus#8415

Submitted by: gouthamve

@grafanabot
Contributor Author

Since remote-write depends on the WAL, it's highly unlikely we would want to throw away the old WAL on restart.

Under normal conditions remote-write should be a second or so behind the latest data, so "ensure remote-write is all flushed before removing WAL" should work.

Under abnormal conditions, e.g. remote-write is an hour behind and you cannot flush in a few seconds, don't remove the WAL.

Submitted by: bboreham

@grafanabot
Contributor Author

Seems related: prometheus-junkyard/tsdb#346

Submitted by: bboreham

@grafanabot
Contributor Author

I think the idea may still be valid so I would keep the issue open.

Submitted by: pracucci

@grafanabot
Contributor Author

Something else in the same area: "Snapshot in-memory chunks on shutdown for faster restarts" prometheus/prometheus#7229

Submitted by: bboreham

@shichanglin5

If the replication factor is set to 3, there will be three copies of the monitoring data, so could the WAL be turned off? Even if one ingester goes down and the data in memory is lost, the other two copies can still handle query requests normally.

@shichanglin5

shichanglin5 commented Jan 17, 2024

Does Mimir have an option to disable the WAL?

@pracucci
Collaborator

Does Mimir have an option to disable the WAL?

No, Mimir doesn't allow disabling the WAL. The WAL is required to restore the in-memory state after every restart.

@pracucci
Collaborator

I opened the original issue a few years ago. The original idea was to compact the TSDB Head at shutdown to speed up startup, but it has side effects. In the meantime, the WAL replay has been significantly optimised and parallel ingester rollouts are supported via the rollout-operator, so the original need I had no longer exists. Going to close my issue.
