This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

[feature request]: Implement Flush method. #346

Closed
bwplotka opened this issue Jun 11, 2018 · 65 comments

@bwplotka
Contributor

bwplotka commented Jun 11, 2018

Hello TSDB folks!

We are looking for a safe way to quickly "terminate" a Prometheus server without losing any monitoring data stored in memory (and in the WAL). By terminate, we mean killing the whole instance, including any persistent disk. We use Thanos to upload the blocks in the tsdb path to object storage, so we would like to dump the in-memory HEAD block to the filesystem on demand and let Thanos upload it. But there is no flush API for TSDB (and thus no Flush endpoint for Prometheus). The example scenario would look like:

  1. We need to scale down Prometheus servers
  2. We remove all scrape targets, so nothing new is scraped (nothing new is added to TSDB)
  3. We hit the Flush endpoint. The head block is flushed to the filesystem and truncated in memory.
  4. We wait until Thanos uploads everything, including the flushed head block.
  5. We terminate the instance.

The obvious workaround is the TSDB Snapshot method, but that is not actually "safe". TSDB blocks are immutable and overlaps are not tolerated, so:

After we take the snapshot with withHead=true into a separate directory (and make Thanos upload from there), we do have a portion of the HEAD in object storage (let's call it A), as we wanted. However:

  • we ultimately mark this instance as dirty, because any new TSDB block B that gets written from the HEAD to the filesystem (whenever db.compact() decides so) strictly overlaps with A, so this instance cannot be used again.
  • there is a possible race condition: while we are taking the snapshot and uploading it, block B can be created and also uploaded by Thanos.

All of these problems make our case really difficult to handle, and a single flush operation would help us a lot here. Do you think we could enable this in TSDB (and maybe later in Prometheus)? Would you be OK with taking a PR for it?

We would propose something like a Flush method with logic similar to the db.compact() method, but with a forced db.compactor.Write of the head block:

func (db *DB) Flush() error {
	db.cmtx.Lock()
	defer db.cmtx.Unlock()

	db.mtx.RLock()
	defer db.mtx.RUnlock()

	// Persist everything the head currently covers.
	mint, maxt := db.head.MinTime(), db.head.MaxTime()

	// Wrap head into a range that bounds all reads to it.
	head := &rangeHead{
		head: db.head,
		mint: mint,
		maxt: maxt,
	}
	// Force-write the head range as a persisted block.
	if _, err := db.compactor.Write(db.dir, head, mint, maxt); err != nil {
		return errors.Wrap(err, "persist head block")
	}

	runtime.GC()
	// Reload blocks from disk so the new block is picked up and the
	// head can be truncated to what has been persisted.
	if err := db.reload(); err != nil {
		return errors.Wrap(err, "reload blocks")
	}
	runtime.GC()
	return nil
}
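
For illustration, a minimal and purely hypothetical sketch of how such a Flush could back the "Flush endpoint" from the scenario above; neither the endpoint nor the handler exists in tsdb or Prometheus, and the flushable interface just stands in for the proposed (*DB).Flush:

package adminapi

import "net/http"

// flushable stands in for the proposed (*tsdb.DB).Flush method (hypothetical).
type flushable interface {
	Flush() error
}

// flushHandler persists the in-memory head as a block on POST.
func flushHandler(db flushable) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST required", http.StatusMethodNotAllowed)
			return
		}
		if err := db.Flush(); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	}
}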

What do you think? @gouthamve @fabxc @krasi-georgiev

For context: we are experimenting with something that horizontally auto-scales Prometheus servers in a highly dynamic environment (scrape targets change a lot). We have implemented code that automatically assigns targets to each Prometheus server and scales the number of Prometheus instances up and down.

@bwplotka
Contributor Author

CC: @domgreen

@bwplotka changed the title from "[feature request]: Allow on-demand Flush method." to "[feature request]: Implement Flush method." on Jun 11, 2018
@krasi-georgiev
Contributor

IIUC the problem is the pending WAL, right?
Why not just add this to the tsdb CLI tool: read the WAL and make a block out of it?

Otherwise, wouldn't the Flush method suffer from the same problems as the snapshots and maybe cause some other race conditions?

  • What if another compaction gets triggered at the same time?
  • What if the WAL gets truncated at the same time as flush is called?

@krasi-georgiev
Contributor

krasi-georgiev commented Jun 11, 2018

Or did you mean that the Flush would be run after Prometheus is shut down?

@brian-brazil
Contributor

I agree, this is best handled by shutting down Prometheus and then working with the data.

@bwplotka
Contributor Author

I agree, this is best handled by shutting down Prometheus and then working with the data.

Hmm, but just shutting down Prometheus doesn't give us the data from memory in the form of a TSDB block. And when we take a Snapshot right before a shutdown, there might still be a race between what the snapshot gives us as the "head block" and Prometheus potentially writing some block just before shutdown.

@bwplotka
Contributor Author

bwplotka commented Jun 11, 2018

@krasi-georgiev I rather mean a Flush that writes what it has and then continues working.

Why not just add this to the tsdb CLI tool: read the WAL and make a block out of it?

I think that is essentially the same as the Snapshot API, right? It just gives us potential overlaps with what TSDB writes down to the main directory.

@bwplotka
Contributor Author

What if another compaction gets triggered at the same time?

There won't be any, since Flush (which truncates the head) is synchronized with compaction.

What if the WAL gets truncated at the same time as flush is called?

Again, this is synchronized, but in the worst case the flushed TSDB block is super tiny (which is not perfect, I agree).

@bwplotka
Contributor Author

bwplotka commented Jun 11, 2018

Or... @brian-brazil @krasi-georgiev is there some TSDB CLI tool that gives me a TSDB block from the WAL file somehow after Prometheus terminates? 🤔

That would work for us as well, I think ^

BTW, thanks for the quick responses 👍

@krasi-georgiev
Contributor

I was referring to this one:
https://github.com/prometheus/tsdb/tree/master/cmd/tsdb

@bwplotka
Contributor Author

Yeah, adding some logic that understands the WAL file and is able to write a block from it maybe solves this.

@domgreen

@krasi-georgiev I take it this means we would want to have Prometheus shut down and call WAL.Close() to flush the head. Then all the data that was in memory is in the WAL, which we can turn into a block for upload.

To then ensure that we do not upload multiple blocks with the same data, would we need to clear/delete the contents of /wal after we create the block?

@krasi-georgiev
Contributor

I will wait for the Scan PR to be merged and will add this next.

@domgreen

So for our use case we would end up doing something like the following:

- signal for termination of Thanos / Prometheus
- mark instance as dirty
- call /-/quit on Prom
- call Flush on CLI
- Upload created block
- scale down statefulset

However, if we added /admin/tsdb/flush via the admin API, it would allow us to flush the head/WAL to a block without having to mark the pod as dirty, which would be beneficial for scaling instances up/down.

With either approach there is some work to be done. The main benefit of exposing it as an API is that the instance could be reused after it has been flushed, so we wouldn't have to scale down until after a period in which we are confident we won't need to scale up again, which would make auto-scaling much cleaner.

@krasi-georgiev
Contributor

krasi-georgiev commented Jun 11, 2018

I just checked the code again and everything that is in the HEAD is added to the WAL at the same time.
https://github.com/prometheus/tsdb/blob/c848349f07c83bd38d5d19faa5ea71c7fd8923ea/head.go#L526-L530

I think at the moment the only way would be to shut down Prometheus so it closes the WAL, and then convert it to a block, but there is also a PR that should allow doing this while the TSDB is in use.

#332
Here is the PR for a new WAL format, which also adds checkpointing. I haven't looked into the checkpointing part yet, but it might align with what you are trying to do. The idea of the PR is to allow third-party systems to ingest recent metrics. The design doc is a few comments below the PR description.

@krasi-georgiev
Contributor

IIUC the problem is not how to take the current data, as snapshotting is perfect for that, but how to leave Prometheus running while making sure that no block changes happen locally (due to compaction or WAL truncation) while Thanos uploads the snapshot?

@domgreen

Correct, I thought taking db.cmtx.Lock() and db.mtx.Lock() may have allowed us to do this.

In an ideal world we would remove targets from Prometheus and upload blocks before scaling down the statefulset (after a given duration). However, if we get more targets and need to scale up during that window, it would be better to use this empty Prometheus (which we just flushed) rather than wait for a scale down and then scale the statefulsets up again.

Again, thanks for taking the time to discuss 😄 it's really helping me understand tsdb much better 👍

@krasi-georgiev
Contributor

Even if you lock the DB, the lock will be released as soon as you finish snapshotting, so compaction or new blocks can happen right after that.
What you need is some /admin/freeze, /admin/unfreeze API 😜

P.S. Use cases help me learn a lot as well, so I'm glad to discuss.

@bwplotka
Contributor Author

Even if you lock the DB, the lock will be released as soon as you finish snapshotting, so compaction or new blocks can happen right after that.

Hm... can we dig into this, actually? Is that really wrong? With our proposed flush flow you have just flushed the HEAD block into a TSDB block and truncated it from memory. So it does not matter what happens after the flush, does it?

My only worry is that you end up with a block not equal to 2h, but that is up to the admin, I think.

@krasi-georgiev
Contributor

I must be missing something then, what is the difference between the current snapshotting and your idea for flush?

@domgreen

Hey @krasi-georgiev, from my understanding of how tsdb persists to disk, I will try to outline why I think snapshotting is a solution, but not the best solution, for what we are trying to achieve.

Please correct me if I am missing something; I am new to the tsdb code and operations 👍 and sorry if this is all obvious already and I'm explaining the wrong thing :)

Aim - auto-scale scrapers while allowing draining / lame-ducking of a scraper before we take it fully offline, so it can take new load if it appears without the overhead of scaling a statefulset down and up.

Snapshotting - Original Approach

  1. remove targets
  2. mark scraper as dirty
  3. snapshot
  4. stop / quit Prometheus
  5. upload snapshotted wal block (we will only upload the newly created block)
  6. force scale down scrape pool

Data (at snapshot):

└ data/
   ├─ 01CETTJ2PVKBK54V36ZA541RBD/
   ├─ 01CETV4CMV15AY06HTV07FMA6K/
   ├─ wal/
   |    |─000001
   |    └─000002
   └─snapshots
          └─ 20180531T095042Z-77bc1ca430f8db98/
                 |─01CETTJ2PVKBK54V36ZA541RBD/
                 |─ 01CETV4CMV15AY06HTV07FMA6K/
                 └─01CETV4Z17NA95VQZH28BT4PER/      <- block created from wal during snapshot

If, for whatever reason, lots of new targets come online and we want to scale up at this point in time (after the snapshot), we will have to wait for this scraper to be scaled down and come back up again, because if the normal persist of the WAL occurs we would end up uploading duplicate chunks.

Data (after snapshot and after next block duration):

└ data/
   ├─ 01CETTJ2PVKBK54V36ZA541RBD/
   ├─ 01CETV4CMV15AY06HTV07FMA6K/
   ├─ 01CETVPPJV7PG0NJMRT8PRYZPS/   <- new block persisted that contains chunks from snapshot
   ├─ wal/
   |    |─000001
   |    |─000002
   |    └─000003
   └─snapshots
          └─ 20180531T095042Z-77bc1ca430f8db98/
                 |─01CETTJ2PVKBK54V36ZA541RBD/
                 |─ 01CETV4CMV15AY06HTV07FMA6K/
                 └─01CETV4Z17NA95VQZH28BT4PER/      <- block created from wal during snapshot

Here we now have an extra block whose data overlaps with the snapshot, so we cannot upload this block, which means we may end up losing data, and we have to terminate the entire node and scale down.

Flushing - Alternative approach

  1. remove targets
  2. flush head/wal to a block (updating tsdb.head min max)
  3. upload new block
  4. drained scraper waits an allotted time to ensure there is no need to keep the scraper

Data (at flush):

└ data/
   ├─ 01CETTJ2PVKBK54V36ZA541RBD/
   ├─ 01CETV4CMV15AY06HTV07FMA6K/
   ├─ 01CETV4Z17NA95VQZH28BT4PER/ <- block created from flush
   └─ wal/
       |─000001
       └─000002

Now, if for whatever reason we decide not to scale that instance down, or the next block duration min/max time passes and a new block is created, we can still continue using the Prometheus instance, as there will be no overlapping blocks:

Data (after flush and after next block duration):

└ data/
   ├─ 01CETTJ2PVKBK54V36ZA541RBD/
   ├─ 01CETV4CMV15AY06HTV07FMA6K/
   ├─ 01CETV4Z17NA95VQZH28BT4PER/ <- block created from flush
   ├─ 01CETVPPJV7PG0NJMRT8PRYZPS/ <- new block with no overlap
   └─ wal/
       |─000001
       |─000002
       └─000003

For our scenario we do not currently mind whether the blocks have been fully compacted, since Thanos does not currently support local compaction.

@brian-brazil
Contributor

I'm not seeing how this adds anything that can't be done with the current shutdown and snapshotting. This seems like a micro-optimisation.

@bwplotka
Contributor Author

With flush it would be a bit cleaner, without special logic and safeguards on our side.
For example, snapshot -> shutdown -> remove WAL -> start again would work, but there might still be a race condition between the end of the snapshot and the shutdown.

I think we are on board with the idea of extending the tsdb CLI to be able to make a post-shutdown snapshot with the head block. That will solve our issue, I think.

@domgreen

Hey @brian-brazil, I think @Bplotka just cleared up my confusion with the shutdown / snapshotting approach 👍 I agree that flush is an optimisation 😄

My confusion was all around the assumption that if we shut down and start again, the WAL will be read on startup and we would still get overlapping blocks. If we remove the contents of /wal and restart, then we should be okay to drain.

@domgreen

Agreed, if we can get a flush in the tsdb CLI as @krasi-georgiev suggested earlier in the thread, that would mean we can work around all the race conditions and correctly drain an instance.

@krasi-georgiev
Contributor

IIUC this would work for you?
shutdown -> tsdb cli flush (convert the WAL to a block) -> remove WAL -> upload -> start again if needed

Maybe another thing you can try is restarting Prometheus with a low value for --storage.tsdb.min-block-duration=1m; this will persist the WAL/head every 1 min, but I imagine it might cause other issues as it will trigger compaction quite often.

@bwplotka
Contributor Author

Maybe another thing you can try is restarting Prometheus with a low value for --storage.tsdb.min-block-duration=1m; this will persist the WAL/head every 1 min, but I imagine it might cause other issues as it will trigger compaction quite often.

Yea, not safe area I would say (:

@domgreen

@krasi-georgiev would you envisage the tsdb cli removing the WAL in the flush command?

I was thinking that if it didn't, and someone used it, they would end up with an overlap. Not sure if this is an issue, but just something I was thinking about.

@krasi-georgiev
Contributor

you would be the first user so you have the privilege to say how you want this implemented.

@brian-brazil
Contributor

We need to consider things beyond what the first user requests. I would not expect that a "flush" method would break my Prometheus.

@bwplotka
Contributor Author

bwplotka commented Oct 5, 2018

How is it even possible that in the normal flow Prometheus compacts to a 2h block super quickly?

@krasi-georgiev
Contributor

krasi-georgiev commented Oct 5, 2018

I haven't measured it, but persisting a block should take half the time compared to WAL -> block.

The main difference is that writing a block reads from memory (the head), while in the case of WAL -> block the data first needs to be loaded into memory and then written to disk.

@krasi-georgiev
Contributor

I will measure again now with the change I made where checkpointing deletes all the duplicated data that is already persisted in a block.

@krasi-georgiev
Contributor

Even with the checkpointing changes, the compactor writes a block when the head/WAL reaches 3h of data, leaving 1h in the head/WAL. So in the best case you would have 1h in the WAL and in the worst case close to 3h, depending on when you stop Prometheus.

@krasi-georgiev
Contributor

with the current changes in prometheus/prometheus#4653 the following works:

  1. Stop Prometheus.
  2. Run the tsdb tool to create a block from the wal.
  3. Delete the wal.
  4. Start Prometheus Again.

Since the WAL might include up to 3h of data in the worst case, converting the WAL into a block will take up to 20 min at the load I tested it with: 2350963250 samples and 13173065 series.

>> start stage=reading the wal files into memory
>> completed stage=reading the wal files into memory duration=7m7.463992998s
level=info msg="wal loaded" mint="2018-10-08 04:38:34 +0000 UTC" maxt="2018-10-08 07:19:35 +0000 UTC"
>> start stage=read series from memory and write a block


level=info msg="write block" mint=1538973514358 maxt=1538983175805 ulid=01CS9ASG8M9DZXZJ4YNEBVZ4D4
>> completed stage=read series from memory and write a block duration=9m55.622007693s
>> start stage=opening new block
>> completed stage=opening new block duration=1.61196ms

>> total time for all stages:17m3.0877398s 

path=wal/01CS9ASG8M9DZXZJ4YNEBVZ4D4
BLOCK ULID                  MIN TIME                       MAX TIME                       NUM SAMPLES  NUM CHUNKS  NUM SERIES
01CS9ASG8M9DZXZJ4YNEBVZ4D4  2018-10-08 04:38:34 +0000 UTC  2018-10-08 07:19:35 +0000 UTC  2350963250   24157929    13173065

@bwplotka
Contributor Author

bwplotka commented Oct 8, 2018

Awesome, thanks for this @krasi-georgiev - this is very useful!

20 min is a lot, but the data size is huge as well. Even with that data size, just the upload to object storage will take X minutes, so we just need to take into account that this operation is time-consuming.

@krasi-georgiev
Contributor

The WAL under this load with nearly 3h of data is 30 GB, and the block is 5.7 GB.

@krasi-georgiev
Contributor

@bwplotka, @brancz an alternative, simpler method:

  • disable compaction (there is already an API for this in tsdb)
  • run a snapshot with the head included (already implemented in tsdb, can tweak it if needed)

With the head included, snapshotting will create a block with a custom range, but with very small changes this will be handled nicely at the next compaction time.
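
A minimal sketch of this approach against the tsdb API, assuming the db.DisableCompactions(), db.EnableCompactions(), and db.Snapshot(dir, withHead) methods and tsdb.DefaultOptions of the prometheus/tsdb library at this point in time; directory paths are placeholders:

package main

import (
	"log"
	"path/filepath"
	"time"

	"github.com/prometheus/tsdb"
)

func main() {
	// Open the existing TSDB data directory (placeholder path).
	db, err := tsdb.Open("data", nil, nil, tsdb.DefaultOptions)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Stop background compactions so no new block can appear while
	// the snapshot is taken and uploaded.
	db.DisableCompactions()

	// Snapshot all persisted blocks plus the in-memory head
	// (withHead=true) into a separate directory.
	snapDir := filepath.Join("snapshots", time.Now().UTC().Format("20060102T150405Z"))
	if err := db.Snapshot(snapDir, true); err != nil {
		log.Fatal(err)
	}

	// Upload snapDir (e.g. with Thanos) before re-enabling compactions.
	db.EnableCompactions()
}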

@brancz
Contributor

brancz commented Nov 26, 2018

The nice thing about doing this "offline" though is that no new head chunk is started.

@krasi-georgiev
Contributor

krasi-georgiev commented Nov 26, 2018

Yes, that is true. Maybe we can even add an option to the snapshot API to allow snapshotting only the head. This would cut the time needed in half compared to the WAL-to-block CLI, as the WAL is already loaded in memory.

I guess if tsdb.Snapshot allows creating a block just from the head, the WAL-to-block CLI would just open the DB with compaction disabled and call that snapshot API.

@brancz
Contributor

brancz commented Nov 26, 2018

I don't think that would be equivalent. Equivalent would be to not accept writes anymore, perform the snapshot, and shut down, or somehow allow "manually" enabling writes again, which seems too specific a workflow to be part of Prometheus/tsdb.

@gouthamve
Collaborator

Hi, reading it again, I think the Snapshot API already does what you want, no? For shutdown, you do the following:

  1. We remove all scrape targets, so nothing new is scraped (nothing new is added to TSDB)
  2. We call the Snapshot API
  3. We stop looking at the data-dir, only look at the Snapshot dir. Upload what is in the snapshot dir
  4. Kill the pod

I might have missed something, but if we stop looking at the data dir once the Snapshot API returns, we'll be fine, no?
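
For reference, a minimal sketch of step 2 against the standard Prometheus admin API (which must be enabled with --web.enable-admin-api); the host/port and response handling here are illustrative:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// snapshotResponse mirrors the admin API response shape,
// e.g. {"status":"success","data":{"name":"20180531T095042Z-..."}}.
type snapshotResponse struct {
	Status string `json:"status"`
	Data   struct {
		Name string `json:"name"`
	} `json:"data"`
}

func main() {
	// Ask Prometheus to write all persisted blocks plus the head
	// into <data-dir>/snapshots/<name>.
	resp, err := http.Post("http://localhost:9090/api/v1/admin/tsdb/snapshot", "", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var sr snapshotResponse
	if err := json.NewDecoder(resp.Body).Decode(&sr); err != nil {
		log.Fatal(err)
	}
	// Upload only the snapshot directory, then the pod can be killed.
	fmt.Println("snapshot directory:", sr.Data.Name)
}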

@brancz
Contributor

brancz commented Nov 27, 2018

I think what @gouthamve suggests probably suffices. I'm just a little concerned about the atomicity of step 1: it's hard to tell when the head will stop receiving data, and removing targets doesn't necessarily mean that all in-progress scrapes/inserts are immediately done. This could, however, be treated similarly to a missed scrape, so it could probably be waited for on a best-effort basis and then just ignored.

@bwplotka
Contributor Author

Well, you have this atomicity problem anyway, even with our previous ideas of Flush, stopping Prometheus, or disabling compaction. Only some native tear-down logic (dump head + turn off) in the form of an endpoint would give us that safety - which would be nice to have, what do you think? (:

Yeah @gouthamve, I mentioned the Snapshot API at the very beginning as something we can use right now, but this issue was all about finding a solution that will:

  • allow reusing the same Prometheus after a flush (dynamic changes in autoscaling)
  • provide the atomicity (good point @brancz)

It looks to me like starting with the Snapshot API and gathering more data about the issues is the way to go with this.

@krasi-georgiev
Contributor

I hope you find the time soon to test all discussed solutions.

@bwplotka
Contributor Author

bwplotka commented Jan 4, 2019

Yup, we will be dealing with this issue soon.

CC @devnev @easyCZ

@krasi-georgiev
Contributor

Any chance to get this revived?

@brian-brazil
Contributor

Couldn't you use readonly mode plus db.Snapshot for this?
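
A rough sketch of that direction, assuming the read-only mode added to tsdb around this time (tsdb.OpenDBReadOnly) and its FlushWAL helper; method names and signatures may differ between versions, and FlushWAL is used here instead of Snapshot to turn the WAL of a stopped Prometheus into a block:

package main

import (
	"log"

	"github.com/prometheus/tsdb"
)

func main() {
	// Open the data directory of a stopped Prometheus in read-only mode.
	db, err := tsdb.OpenDBReadOnly("data", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Persist whatever is still only in the WAL as a regular block
	// in the same data directory, ready for upload.
	if err := db.FlushWAL("data"); err != nil {
		log.Fatal(err)
	}
}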

@krasi-georgiev
Contributor

aaah yeah that is an idea.

@bwplotka
Contributor Author

The current idea is to do Snapshot + potentially read-only mode. So nothing is required from TSDB for now. cc @devnev

Closing.
