[feature request]: Implement Flush method. #346
Comments
CC: @domgreen |
IIUC the problem is the pending WAL, right? Otherwise, wouldn't the Flush method suffer the same problems as the snapshots and maybe cause some other race conditions? |
Or did you mean that the Flush will be run after Prometheus is shut down? |
I agree, this is best handled by shutting down Prometheus and then working with the data. |
Hmm, but with just shutting down Prometheus we don't have the data from memory in the form of a TSDB block. And when we do |
@krasi-georgiev I mean rather
I think it is totally the same as Snapshot API, right? It just gives us potential overlaps with what TSDB writes down to the main directory |
There won't be any since Flush (that truncates head) is synchronized with compaction.
Again, this is synced, but in the worst case the flushed TSDB block is super tiny (which is not perfect, I agree) |
Or ... @brian-brazil @krasi-georgiev is there some TSDB CLI tool that gives me a TSDB block from the WAL file somehow after Prometheus termination? 🤔 That would work as well for us I think ^ BTW, thanks for the quick responses 👍 |
I was referring to this one. |
Yea, adding some logic that understands the WAL file and is able to write a block from it maybe solves this. |
@krasi-georgiev I take it this means we would want to have Prometheus shut down and call To then ensure that we do not upload multiple blocks, would we need to clear / delete the contents of |
I will wait for the Scan PR to be merged and will add this next. |
So for our use case we would end up doing something like the following:
However, if we added With either approach there is some work to be done; the main benefit of exposing it as an API is that the instance could be reused after it has been flushed, so we don't have to scale down until after a time period in which we are confident we will not have to scale up again, which would make auto scaling much cleaner. |
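For illustration, a minimal sketch of the flow being argued for here, assuming a hypothetical `Flush()` method on `*tsdb.DB` (no such method exists at this point; the method name is an assumption, not an existing API):

```go
package main

import (
	"log"

	"github.com/prometheus/tsdb"
)

// drain lame-ducks a scraper: the hypothetical Flush() would persist the
// in-memory head as a block and truncate it, so the Thanos sidecar can upload
// that block while the instance keeps running and can take new targets later.
func drain(db *tsdb.DB) error {
	// 1. (Done outside tsdb) stop assigning new scrape targets to this instance.

	// 2. Force the head to disk; afterwards the data dir contains only regular,
	//    non-overlapping blocks that the sidecar can ship.
	if err := db.Flush(); err != nil { // hypothetical, proposed API
		return err
	}

	// 3. The instance is now "empty" but still running, so it can be reused if
	//    we need to scale up again before the scale-down actually happens.
	return nil
}

func main() {
	db, err := tsdb.Open("/prometheus-data", nil, nil, nil) // placeholder data dir
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := drain(db); err != nil {
		log.Fatal(err)
	}
}
```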
I just checked the code again and everything that is in the HEAD is added to the WAL at the same time. I think at the moment the only way would be to shut down Prometheus so it closes the WAL and then you convert it to a block, but there is also a PR that should allow doing this while the TSDB is in use. #332 |
IIUC the problem is not how to take the current data, as snapshotting is perfect for that, but how to leave Prometheus running while making sure that no block changes happen locally (due to compaction or WAL truncation) while Thanos uploads the snapshot? |
Correct, I thought taking In an ideal world we would remove targets from Prometheus and upload blocks before scaling down the statefulset (after a given duration); however, if we get more targets and need to scale up during that window it would be better to use this empty Prometheus (which we just flushed) rather than wait for the scale down and then scale the statefulsets up again. Again, thanks for taking the time to discuss 😄 it's really helping me understand tsdb much better 👍 |
Even if you lock the DB, it will be released as soon as you finish with the snapshotting, so compaction or new blocks can happen right after that. P.S. use cases help me learn a lot as well, so glad to discuss. |
Hm.. can we dig into this actually? Is that really wrong? With our proposed My only worry is that you end up with a block not equal to 2h, but that is up to the admin I think. |
I must be missing something then, what is the difference between the current snapshotting and your idea for flush? |
Hey @krasi-georgiev, so from my understanding of how tsdb persists to disk, I will try to outline why I think snapshotting is a solution, but not the best solution, for what we are trying to achieve. Please correct me if I am missing something; I am new to the tsdb code and operations 👍 and sorry if this is already all obvious and I'm explaining the wrong thing :) Aim - Auto scale scrapers but allow draining / lame ducking of a scraper before we take it fully offline, so it can then take new load if it appears without the overhead of scaling a stateful set down and up. Snapshotting - Original Approach
Data (at snapshot):
If for whatever reason lots of new targets come online and we want to scale up at this point in time (after the snapshot), we will have to wait for this scraper to be scaled down and come back up again, as if the normal persist of the WAL occurs we would end up uploading duplicate chunks. Data (after snapshot and after next block duration):
Here we now have an extra block that has overlapping data with the snapshot, so we cannot upload this block, which means we may end up losing data and we have to terminate the entire node and scale down. Flushing - Alternative approach
Data (at flush):
Now, if for whatever reason we decide not to scale that instance down, or the next block duration min|max time passes and a new block is created, we can still continue using the instance of Prometheus as there will be no overlapping blocks: Data (after flush and after next block duration):
For our scenario we do not currently mind if the blocks have been fully compacted or not due to Thanos not currently supporting local compaction. |
I'm not seeing how this is adding anything that can't be done with the current shutdown and snapshotting. This seems like a microoptimisation. |
With I think we are up to the idea of extending |
Hey @brian-brazil, maybe - I think @Bplotka just cleared up my confusion with the shutdown / snapshotting approach 👍 I agree that flush is an optimisation 😄 My confusion was all around the assumption that if we shut down and start, the WAL will be read on startup and we still get overlapping blocks. If we remove the contents of |
Agreed, if we can get a flush in the tsdb cli as @krasi-georgiev suggested earlier in the thread, that would mean we can work around all race conditions and correctly drain an instance. |
IIUC this would work for you? Maybe another thing you can try is restarting Prometheus with a low value for |
Yea, not a safe area I would say (: |
@krasi-georgiev would you envisage the tsdb cli removing the WAL in the flush command? I was thinking if it didn't and someone used it they would end up with overlap. Not sure if this is an issue but just something I was thinking about. |
you would be the first user so you have the privilege to say how you want this implemented. |
We need to consider things beyond what the first user requests. I would not expect that a "flush" method would break my Prometheus. |
How is it even possible that in the normal flow Prometheus compacts to a 2h block super quickly? |
I haven't measured it, but persisting a block should take half the time compared to wal -> block. The main difference is that writing a block reads from memory (the head), whereas in the case of wal -> block the data first needs to be loaded into memory and then written to disk. |
will measure again now with the change I did where the checkpointing deletes all the duplicated data that is already persisted in a block. |
Even with the checkpointing changes, the compactor writes a block when the head/wal reaches 3h of data (i.e. 1.5x the 2h block range), leaving 1h in the head/wal. So in the best case you would have 1h in the wal and in the worst case close to 3h, depending on when you stop Prometheus. |
with the current changes in prometheus/prometheus#4653 the following works:
Since the wal might include up to 3h of data in the worst case, converting the wal into a block will take up to 20min at the load I tested it with: 2350963250 samples and 13173065 series. |
Awesome, thanks for this @krasi-georgiev - this is very useful! 20 min is a lot, but the data size is huge as well. Even with that data size, just the upload to object storage will take X minutes, so we just need to take into account that this operation is time consuming. |
The wal under this load with nearly 3h of data is 30GB. |
@bwplotka, @brancz an alternative, simpler method.
With the head included, snapshotting will create a block with a custom range, but with very small changes this will be handled nicely at the next compaction time. |
The nice thing about doing this "offline" though is that no new head chunk is started. |
Yes, that is true. Maybe we can even add an option to the snapshot api to allow snapshotting only the head. This will reduce the time needed by half compared to the WAL-to-block cli, as the WAL is already loaded in memory. I guess if tsdb.Snapshot allows creating a block just from the head, the WAL-to-block cli would just open the db with compaction disabled and call that snapshot api. |
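For concreteness, a minimal sketch of such a WAL-to-block tool along the lines described above: open the DB (which replays the WAL into the head) with compaction disabled, then call the snapshot API with the head included. It assumes the prometheus/tsdb API of that era (`tsdb.Open`, `DisableCompactions`, `Snapshot(dir, withHead)`); exact signatures may differ between versions:

```go
package main

import (
	"log"
	"os"

	"github.com/prometheus/tsdb"
)

func main() {
	if len(os.Args) != 3 {
		log.Fatalf("usage: %s <data-dir> <out-dir>", os.Args[0])
	}
	dataDir, outDir := os.Args[1], os.Args[2]

	// Opening the DB replays the WAL of the (stopped) Prometheus into the head.
	db, err := tsdb.Open(dataDir, nil, nil, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Keep the background compactor from writing its own block into dataDir
	// while we snapshot, which would otherwise risk an overlap.
	db.DisableCompactions()

	// Snapshot with the head included persists the head (i.e. the replayed WAL
	// data) as a block into outDir, ready to be uploaded by e.g. Thanos.
	if err := db.Snapshot(outDir, true); err != nil {
		log.Fatal(err)
	}
}
```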
I don't think that would be equivalent. Equivalent would be not to accept writes anymore, perform the snapshot and shutdown or somehow allow "manually" enabling writes again, which seems too specific of a workflow to be part of Prometheus/tsdb. |
Hi, reading it again, I think the Snapshot API does what you want already no? For shutdown, you do the following:
I might have missed something, but if we stop looking at the data-dir once the Snapshot API returns, we'll be fine no? |
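As a sketch of that flow against a running server, using Prometheus' admin snapshot endpoint (requires `--web.enable-admin-api`); the URL and data dir below are placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"path/filepath"
)

func main() {
	const promURL = "http://localhost:9090" // placeholder
	const dataDir = "/prometheus"           // placeholder --storage.tsdb.path

	// 1. (Done elsewhere) remove all scrape targets so the head stops growing.

	// 2. Take a snapshot; by default it includes the head block.
	resp, err := http.Post(promURL+"/api/v1/admin/tsdb/snapshot", "", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var out struct {
		Status string `json:"status"`
		Data   struct {
			Name string `json:"name"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}

	// 3. From now on only read from the snapshot directory, never from the live
	//    data dir; upload from there and then terminate the instance.
	fmt.Println("upload blocks from:", filepath.Join(dataDir, "snapshots", out.Data.Name))
}
```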
I think what @gouthamve suggests probably suffices; I'm just a little concerned about the atomicity of step 1: it's hard to tell when the head will not receive any more data, and removing targets doesn't necessarily mean that all in-progress scrapes/inserts are immediately done. This could however be treated similarly to a missed scrape, so it could probably be waited for on a best-effort basis and then just ignored. |
Well, you have this atomicity problem anyway, even with our previous ideas of Yea @gouthamve, I mentioned the Snapshot API in the very beginning as something we can use right now, but this issue was all about finding a solution that will: |
It looks to me like starting with using Snapshot API and gathering more data regarding issues is the way to go with this. |
I hope you find the time soon to test all discussed solutions. |
Any chance to get this revived? |
Couldn't you use readonly mode plus db.Snapshot for this? |
aaah yeah that is an idea. |
The current idea is to do Snapshot + potentially read-only mode. So nothing is required from TSDB for now. cc @devnev Closing. |
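For reference, a sketch of that closing suggestion, assuming the read-only mode being added to tsdb exposes something like `OpenDBReadOnly` plus a way to flush the WAL to a block (names and signatures here are approximations, not a confirmed API):

```go
package main

import (
	stdlog "log"

	kitlog "github.com/go-kit/kit/log"
	"github.com/prometheus/tsdb"
)

func main() {
	const dataDir = "/prometheus" // placeholder data dir of a stopped Prometheus

	// Open the TSDB without taking write ownership of it.
	db, err := tsdb.OpenDBReadOnly(dataDir, kitlog.NewNopLogger()) // signature approximate
	if err != nil {
		stdlog.Fatal(err)
	}
	defer db.Close()

	// Write whatever is still only in the WAL out as a regular block, so an
	// uploader such as Thanos can pick it up like any other block.
	if err := db.FlushWAL(dataDir); err != nil { // name approximate
		stdlog.Fatal(err)
	}
}
```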
Hello TSDB folks!
We are chasing a safe solution to quickly "terminate" a Prometheus server without losing any monitoring data stored in memory (and WAL). By terminate, we mean killing the whole instance, including the potential persistent disk. We use Thanos for uploading the blocks that are in `tsdb-path` into the object store, so we would like to dump the in-mem HEAD block to the filesystem on demand and let Thanos upload it. But there is no `flush` API for TSDB (thus no `Flush` endpoint for Prometheus). The example scenario would look like: ... `Flush` endpoint. Head block is flushed to the filesystem and truncated in memory.

The obvious workaround is the TSDB `Snapshot` method, but that is actually not "safe". TSDB blocks are immutable and overlaps are not tolerated, so: after we do the snapshot with `withHead=true` to a separate directory (and make Thanos upload from those), we indeed have a portion of HEAD in the object storage (let's call it `A`) as we wanted. However:
- The instance is `dirty`, because any new TSDB block from HEAD that got "written" into the filesystem (because `db.compact()` decided so) as block `B` is strictly overlapping with `A`, and thus this instance cannot be used anymore.
- The `B` block can be created and also uploaded by Thanos.

All of these problems make our case really difficult to handle, and just a single `flush` logic would help us a lot here. Do you think we can enable this in TSDB (and maybe further in Prometheus)? Would you be ok to take a PR for it?

We would propose something like a `Flush` method that will have logic similar to the `db.compact()` method, but with a forced `db.compactor.Write` of the head block (a rough sketch follows below).

What do you think? @gouthamve @fabxc @krasi-georgiev
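A hedged sketch of what such a `Flush` could look like inside tsdb, loosely following what `db.compact()` and `db.Snapshot(dir, true)` already do for the head; the field and method names used here (`cmtx`, `compactor.Write`, `head.Truncate`, ...) are assumptions about the tsdb internals of the time, not an actual patch:

```go
// Flush persists the whole in-memory head as a block in the main data
// directory and then truncates the head, regardless of whether the head
// spans a full block range.
func (db *DB) Flush() error {
	db.cmtx.Lock() // serialize with the background compaction loop
	defer db.cmtx.Unlock()

	mint, maxt := db.head.MinTime(), db.head.MaxTime()

	// Force-write the head as a block into db.dir, the same kind of call the
	// normal compaction path makes once the head grows past 1.5x the block range.
	if _, err := db.compactor.Write(db.dir, db.head, mint, maxt, nil); err != nil {
		return err
	}

	// Drop the persisted data from the head (and, with it, from the WAL on the
	// next truncation), so no overlapping block can be produced later.
	return db.head.Truncate(maxt)
}
```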
For context: We are experimenting with something that will auto-scale Prometheus servers horizontally in a highly dynamic environment (scrape targets changing a lot). We have implemented code that assigns targets to each Prometheus server automatically and scales the number of Prometheus instances up and down.