Compacting TSDB head on every ingester shutdown to speed up startup #98
Comments
This, I assume, will be similar to how the flush/shutdown endpoint works, where you cut the block at the 2h boundary.
The checkpoint is blindly created from the first 2/3 of the WAL segments, discarding any old samples, which is basically everything here. So it would be a replay of the checkpoint for the series, plus a replay of the last 1/3 of the WAL. But since the WAL replay will skip all the samples, it will be faster than an actual replay, because building the chunks is a slow process (this was validated by the WAL replay with the m-mapped Head chunks).
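For context, a minimal sketch of what cutting the Head at 2h boundaries could look like with the upstream TSDB API. This is an illustration, not Mimir's actual flush code; the `*tsdb.DB` handle and the function name are assumptions.

```go
package flushsketch

import (
	"time"

	"github.com/prometheus/prometheus/tsdb"
)

// compactHeadAt2hBoundaries persists the in-memory Head as blocks aligned on
// 2h boundaries, similar in spirit to what the flush/shutdown endpoint does.
func compactHeadAt2hBoundaries(db *tsdb.DB) error {
	blockRange := (2 * time.Hour).Milliseconds()
	head := db.Head()

	mint, maxt := head.MinTime(), head.MaxTime()
	for mint <= maxt {
		// Upper bound of the current block, aligned to the next 2h boundary.
		rangeEnd := (mint/blockRange+1)*blockRange - 1
		if rangeEnd > maxt {
			rangeEnd = maxt
		}
		// CompactHead persists the given Head range as a block on disk.
		if err := db.CompactHead(tsdb.NewRangeHead(head, mint, rangeEnd)); err != nil {
			return err
		}
		mint = rangeEnd + 1
	}
	return nil
}
```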
Currently it cannot if the samples overlap a block :/. But if you delete the block from disk after shipping (which I don't think is in place, because of the queries, and also the WAL replay would depend on it), then it can. Submitted by: codesome
We can't delete blocks from disk because of queries (as you mentioned). What if we delete the whole WAL after the Head is compacted and the blocks are shipped? Would that change the behaviour? Submitted by: pracucci
If we can guarantee that no new samples come in during shutdown, then we could delete the WAL after a successful compaction of the entire Head, and the replay would be quick. Submitted by: codesome
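A rough sketch of that sequence, under the stated assumption that ingestion is already stopped. The function name and directory layout are illustrative, not Mimir's actual shutdown code; the full-Head compaction reuses the same `CompactHead` call shown in the earlier sketch.

```go
package shutdownsketch

import (
	"os"
	"path/filepath"
)

// dropWALAfterCompaction deletes the WAL only when the preceding full-Head
// compaction succeeded (and no new samples can arrive anymore).
// compactionErr is the result of compacting the entire Head, e.g. via the
// CompactHead call from the earlier sketch; tsdbDir is the TSDB data
// directory, whose WAL conventionally lives in "<tsdbDir>/wal".
func dropWALAfterCompaction(compactionErr error, tsdbDir string) error {
	if compactionErr != nil {
		// Keep the WAL: the next startup can still replay it.
		return compactionErr
	}
	return os.RemoveAll(filepath.Join(tsdbDir, "wal"))
}
```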
We already do. Submitted by: pstibrany
Race detected :) It will only affect the replay, but the timestamp from which samples are allowed will be taken from the maxt of the last block, to avoid overlaps. Submitted by: codesome
If we could split the Head initialisation out of DB.Open, we could possibly skip taking the block maxt as the min valid time, and then the first sample would determine the acceptable ranges. Submitted by: codesome
What if we discuss an option in TSDB to only pick the latest timestamp from the WAL (not from blocks) and allow overlapping blocks? What would be the downsides? Edit: you commented on this while I was posting the same question. Submitted by: pracucci
The latest sample in the WAL also determines the minValidTime; the min valid time for ingestion is either the minValidTime set or the … Submitted by: codesome
There is an even simpler solution: expose a method on the Head block to set the min valid time, and set it to … Submitted by: codesome
Could it conflict with a subsequent WAL replay (think about an ingester crash)? We may end up with out-of-order samples in the WAL. Submitted by: pracucci
Next question: wouldn't Prometheus benefit from a similar approach too, basically using the WAL only to recover from a crash? Submitted by: pracucci
How would this affect remote write, which also uses the WAL? Submitted by: pstibrany
If we delete the WAL on graceful shutdown, we will be able to detect a graceful shutdown and change the minValidTime only at that time. Submitted by: codesome |
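A hypothetical illustration of that proposal. Neither this function nor a setter on the Head exists in TSDB today; the names, the graceful-shutdown marker, and the threshold logic are all assumptions used to sketch the decision being discussed.

```go
package minvalidsketch

import "math"

// headMinValidTime decides the lowest timestamp the Head should accept.
// gracefulShutdown would be derived from some marker written during a clean
// shutdown; lastBlockMaxt is the maxt of the newest persisted block.
func headMinValidTime(gracefulShutdown bool, lastBlockMaxt int64) int64 {
	if gracefulShutdown {
		// The Head was fully compacted and the WAL dropped at shutdown, so
		// let the first incoming sample determine the acceptable range, even
		// if it overlaps blocks that were already shipped.
		return math.MinInt64
	}
	// Crash recovery: keep the current behaviour and reject samples older
	// than the last block, to avoid producing overlapping blocks.
	return lastBlockMaxt
}
```

The result would then be fed to whatever setter the Head ends up exposing for the min valid time, which is the method proposed a few comments above.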
I was going to propose this to Prometheus just now, until I saw @pstibrany's comment. Since remote write depends on the WAL, it's highly unlikely we would want to throw away the old WAL on restart. (I don't know from where remote write resumes in the WAL after a restart, so there is some possibility.) Submitted by: codesome
See: prometheus/prometheus#8415 Submitted by: gouthamve |
Under normal conditions remote-write should be a second or so behind the latest data, so "ensure remote-write is all flushed before removing WAL" should work. Under abnormal conditions, e.g. remote-write is an hour behind and you cannot flush in a few seconds, don't remove the WAL. Submitted by: bboreham |
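A sketch of that safety check, with hypothetical inputs: the two timestamps stand in for "newest data appended to the WAL" and "newest data sent by remote write" (information Prometheus exposes via its remote-write queue metrics); the threshold is arbitrary.

```go
package walsafety

import "time"

// maxAcceptableLag is an arbitrary threshold for this sketch.
const maxAcceptableLag = 10 * time.Second

// safeToRemoveWAL returns true only when remote write has (roughly) caught
// up with the newest data appended to the WAL. If remote write is far
// behind (e.g. an hour), the WAL must be kept so that data is not lost.
func safeToRemoveWAL(highestAppended, highestSent time.Time) bool {
	return highestAppended.Sub(highestSent) <= maxAcceptableLag
}
```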
Seems related: prometheus-junkyard/tsdb#346 Submitted by: bboreham |
I think the idea may still be valid, so I would keep the issue open. Submitted by: pracucci
Something else in the same area: "Snapshot in-memory chunks on shutdown for faster restarts" prometheus/prometheus#7229 Submitted by: bboreham |
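For reference, that snapshotting idea later landed in Prometheus as an opt-in feature (the memory-snapshot-on-shutdown feature flag). For an embedded TSDB it can be enabled through the options, roughly like this; the wrapper function is just an illustration.

```go
package snapshotsketch

import "github.com/prometheus/prometheus/tsdb"

// tsdbOptionsWithSnapshot enables snapshotting of in-memory chunks on
// shutdown, so the next startup restores the Head from the snapshot instead
// of rebuilding it from the WAL.
func tsdbOptionsWithSnapshot() *tsdb.Options {
	opts := tsdb.DefaultOptions()
	opts.EnableMemorySnapshotOnShutdown = true
	return opts
}
```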
If the replication factor is set to 3, there will be three copies of the monitoring data, so can you turn off the WAL? Even if one ingester goes down and the data in memory is lost, the other two copies can still handle query requests normally.
Does Mimir have an option to disable the WAL?
No, Mimir doesn't allow disabling the WAL. The WAL is required to restore the in-memory state after every restart.
I opened the original issue a few years ago. The original idea was to compact the TSDB Head at shutdown to speed up startup, but it has side effects. In the meantime, the WAL replay has been significantly optimised and parallel ingester rollouts have been supported via rollout-operator, so the original need I had is no longer there. Going to close my issue.
Describe the solution you'd like
The typical use case of an ingester shutdown is a rolling update. We currently close TSDB and, at the subsequent startup, we replay the WAL before the ingester is ready. Replaying the WAL is slow, and we recently found out that compacting the head and shipping it to the storage on /shutdown is actually faster than replaying the WAL.

Idea: what if we always compact the TSDB head and ship it to the storage at shutdown?
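A rough sketch of that idea, assuming a stand-in for the ingester's existing shipper; this is an illustration of the shutdown sequence being proposed, not the actual ingester code.

```go
package ingestersketch

import (
	"context"

	"github.com/prometheus/prometheus/tsdb"
)

// shipBlocks stands in for the ingester's existing shipper, which uploads
// newly produced blocks to object storage; its implementation is not shown.
func shipBlocks(ctx context.Context, db *tsdb.DB) error { return nil }

// compactAndShipOnShutdown compacts whatever is left in the Head into a
// block and ships it, so the next startup does not depend on a WAL replay
// before the ingester becomes ready.
func compactAndShipOnShutdown(ctx context.Context, db *tsdb.DB) error {
	head := db.Head()
	if err := db.CompactHead(tsdb.NewRangeHead(head, head.MinTime(), head.MaxTime())); err != nil {
		return err
	}
	return shipBlocks(ctx, db)
}
```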
Question:

Pros:
… the /shutdown API beforehand)

Cons (potential blockers):
Let's discuss it.
Submitted by: pracucci
Cortex Issue Number: 3723