
Compacting TSDB head on every ingester shutdown to speed up startup #98

Closed
grafanabot opened this issue Aug 10, 2021 · 23 comments

@grafanabot
Contributor

Describe the solution you'd like
The typical use case for an ingester shutdown is a rolling update. We currently close the TSDB and, at the subsequent startup, we replay the WAL before the ingester is ready. Replaying the WAL is slow, and we recently found out that compacting the head and shipping it to the storage on /shutdown is actually faster than replaying the WAL.

Idea: what if we always compact TSDB head and ship it to the storage at shutdown?

Question:

  • If we compact TSDB head (up until head max time) on shutdown, what's the last checkpoint created and what's actually replayed from WAL at startup?

Pros:

  • The ingester rollout may be faster
  • The scale-down wouldn't be a snowflake operation anymore (currently it requires calling the /shutdown API beforehand)

Cons (potential blockers):

  • At ingester startup, can the ingester ingest samples with timestamps < the last samples ingested before shutting down?

Let's discuss it.

Submitted by: pracucci
Cortex Issue Number: 3723
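
A minimal sketch of what the proposed shutdown path could look like, assuming the Prometheus tsdb package; the function name and the ship callback are illustrative stand-ins, not the actual ingester code:

```go
package ingester // illustrative sketch, not the real ingester package

import (
	"context"

	"github.com/prometheus/prometheus/tsdb"
)

// flushHeadOnShutdown compacts everything currently in the Head into blocks on
// disk, then uploads those blocks before the process exits, so the next
// startup does not have to rebuild them from the WAL.
func flushHeadOnShutdown(ctx context.Context, db *tsdb.DB, ship func(context.Context) error) error {
	h := db.Head()
	// Compact the whole in-memory Head range into one or more persisted blocks.
	if err := db.CompactHead(tsdb.NewRangeHead(h, h.MinTime(), h.MaxTime())); err != nil {
		return err
	}
	// Ship the freshly cut blocks to object storage (ship stands in for the shipper).
	return ship(ctx)
}
```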

@grafanabot
Contributor Author

If we compact TSDB head (up until head max time) on shutdown

This, I assume, will be similar to how the flush/shutdown endpoint works, where you cut the block at a 2h boundary.

If we compact TSDB head (up until head max time) on shutdown, what's the last checkpoint created and what's actually replayed from WAL at startup?

A checkpoint is blindly created from the first 2/3 of the WAL segments, discarding any old samples, which is basically everything here. So startup would replay the checkpoint for the series and then the last 1/3 of the WAL. But since that WAL replay skips all the samples, it will be faster than an actual replay, because building the chunks is the slow part (this was validated by the WAL replay with the m-mapped Head chunks).

At ingester startup, can the ingester ingest samples with timestamps < the last samples ingested before shutting down?

Currently it cannot if the samples overlap a block :/. But if you delete the block from disk after shipping (which I don't think is in place, because of queries, and also the WAL replay would depend on it), then it can.

Submitted by: codesome
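
To make the 2/3 rule above concrete, here is a small illustrative helper (the real segment selection in Prometheus has a few more details, e.g. the last segment is never considered for the checkpoint):

```go
package walnotes // illustrative only

// checkpointRange shows roughly how the checkpoint boundary is chosen: about
// the first two thirds of the WAL segments are folded into the checkpoint and
// the rest are replayed. With segments 0..8 on disk this returns (5, 6):
// segments 0-5 go into the checkpoint, segments 6-8 are replayed at startup.
func checkpointRange(first, last int) (checkpointUpTo, replayFrom int) {
	checkpointUpTo = first + (last-first)*2/3
	replayFrom = checkpointUpTo + 1
	return checkpointUpTo, replayFrom
}
```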

@grafanabot
Contributor Author

Currently it cannot if the samples overlap a block :/. But if you delete the block from disk after shipping (which I don't think is in place, because of queries, and also the WAL replay would depend on it), then it can.

We can't delete blocks from disk because of queries (as you mentioned). What if we delete the whole WAL after the head is compacted and the blocks are shipped? Would that change the behaviour?

Submitted by: pracucci

@grafanabot
Contributor Author

If we can guarantee no new samples come in during shutdown, then we could delete the WAL after a successful compaction of the entire Head, and the replay would be quick.

Submitted by: codesome

@grafanabot
Contributor Author

If we can guarantee no new samples come in during shutdown

We already do.

Submitted by: pstibrany

@grafanabot
Contributor Author

What if we delete the whole WAL after the head is compacted and the blocks are shipped?

Race detected :)

It will only affect the replay, but the timestamp from which samples are allowed will still be taken from the maxt of the last block, to avoid overlaps.

Submitted by: codesome

@grafanabot
Contributor Author

https://github.com/prometheus/prometheus/blob/d8c17025df163cbdb27e78974a5d6ed174fc04a0/tsdb/db.go#L669-L683

If we could split this Head initialisation out of DB.Open, we could skip taking the block maxt as the min valid time, and then the first sample would determine the acceptable range.

Submitted by: codesome
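
For reference, the linked DB.Open snippet boils down to roughly the following (a paraphrase for illustration, not the actual Prometheus source): the newest block's maxt becomes the head's minimum valid time, which is exactly what rejects "old" samples after a restart.

```go
package tsdbnotes // paraphrase for illustration

import "math"

// headMinValidTime mirrors the logic in the linked DB.Open lines: with no
// blocks on disk everything is appendable, otherwise the newest block's maxt
// becomes the floor below which incoming samples are rejected.
func headMinValidTime(blockMaxTimes []int64) int64 {
	minValidTime := int64(math.MinInt64)
	if len(blockMaxTimes) > 0 {
		minValidTime = blockMaxTimes[len(blockMaxTimes)-1] // newest block's maxt
	}
	return minValidTime
}
```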

@grafanabot
Contributor Author

It will only affect the replay, but the timestamp from which samples are allowed will still be taken from the maxt of the last block, to avoid overlaps.

What if we discuss an option in TSDB to only pick the latest timestamp from WAL (not blocks) and allow overlapping blocks? What would be the downsides?

Edit: you commented on this while I was posting the same question.

Submitted by: pracucci

@grafanabot
Contributor Author

The latest sample in the WAL also determines the minValidTime: the min valid time for ingestion is either the minValidTime set or head.MaxTime() - 1h, whichever is lower. The WAL replay depends on the initial minValidTime provided to discard old samples, so breaking down DB.Open as above would be the easier refactor, letting the first sample decide the minValidTime. That said, we will need to delete the WAL, else it will end up loading the entire WAL in this case :)

Submitted by: codesome

@grafanabot
Contributor Author

There is an even simpler solution: expose a method on the Head block to set the min valid time, and set it to math.MinInt64 after loading the DB. Then there is no need to delete the WAL before shutdown.

Submitted by: codesome
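
A sketch of that workaround, assuming Head exposes a setter along the lines of a SetMinValidTime method (treated as hypothetical here; whether such a method exists, or existed at the time of this discussion, is not verified):

```go
package ingester // illustrative sketch

import (
	"math"

	"github.com/prometheus/prometheus/tsdb"
)

// allowOldSamples drops the block-derived floor after the DB has been opened,
// so samples older than the newest shipped block are accepted again following
// a clean restart. Head.SetMinValidTime is an assumed API, not a confirmed one.
func allowOldSamples(db *tsdb.DB) {
	db.Head().SetMinValidTime(math.MinInt64)
}
```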

@grafanabot
Contributor Author

There is an even simpler solution: expose a method on the Head block to set the min valid time, and set it to math.MinInt64 after loading the DB. Then there is no need to delete the WAL before shutdown.

Could it conflict with a subsequent WAL replay (think about an ingester crash)? We may end up with out of order samples in the WAL.

Submitted by: pracucci

@grafanabot
Contributor Author

Next question: wouldn't Prometheus benefit from a similar approach too, basically using the WAL only to recover from crash?

Submitted by: pracucci

@grafanabot
Contributor Author

Next question: wouldn't Prometheus benefit from a similar approach too, basically using the WAL only to recover from crash?

How would this affect remote write, which also uses WAL?

Submitted by: pstibrany

@grafanabot
Contributor Author

Could it conflict with a subsequent WAL replay (think about an ingester crash)? We may end up with out of order samples in the WAL.

If we delete the WAL on graceful shutdown, we will be able to detect a graceful shutdown and change the minValidTime only at that time.

Submitted by: codesome

@grafanabot
Contributor Author

Next question: wouldn't Prometheus benefit from a similar approach too, basically using the WAL only to recover from crash?

I was going to propose this to Prometheus just now, until I saw @pstibrany's comment. Since remote-write depends on the WAL, it's highly unlikely we would want to throw away the old WAL on restart. (I don't know where remote-write starts from in the WAL after a restart, so there is some possibility.)

Submitted by: codesome

@grafanabot
Contributor Author

See: prometheus/prometheus#8415

Submitted by: gouthamve

@grafanabot
Contributor Author

Since remote-write depends on the WAL, it's highly unlikely we would want to throw away the old WAL on restart.

Under normal conditions remote-write should be a second or so behind the latest data, so "ensure remote-write is all flushed before removing WAL" should work.

Under abnormal conditions, e.g. remote-write is an hour behind and you cannot flush in a few seconds, don't remove the WAL.

Submitted by: bboreham

@grafanabot
Contributor Author

Seems related: prometheus-junkyard/tsdb#346

Submitted by: bboreham

@grafanabot
Contributor Author

I think the idea may still be valid so I would keep the issue open.

Submitted by: pracucci

@grafanabot
Contributor Author

Something else in the same area: "Snapshot in-memory chunks on shutdown for faster restarts" prometheus/prometheus#7229

Submitted by: bboreham

@shichanglin5

If the replication factor is set to 3, there will be three copies of the monitoring data, so could the WAL be turned off? Even if one ingester goes down and the data in memory is lost, the other two copies can still handle query requests normally.

@shichanglin5

shichanglin5 commented Jan 17, 2024

Does Mimir have an option to disable the WAL?

@pracucci
Collaborator

Does Mimir have an option to disable the WAL?

No, Mimir doesn't allow disabling the WAL. The WAL is required to restore the in-memory state after every restart.

@pracucci
Collaborator

I opened the original issue a few years ago. The original idea was to compact the TSDB Head at shutdown to speed up startup, but it has side effects. In the meantime, the WAL replay has been significantly optimised and parallel ingester rollouts are supported via the rollout-operator, so the original need I had no longer exists. Going to close my issue.
