This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

[feature request]: Implement Flush method. #346

Closed
bwplotka opened this issue Jun 11, 2018 · 65 comments

@bwplotka
Contributor

bwplotka commented Jun 11, 2018

Hello TSDB folks!

We are looking for a safe way to quickly "terminate" a Prometheus server without losing any monitoring data stored in memory (and in the WAL). By terminate, we mean killing the whole instance, including any persistent disk. We use Thanos to upload the blocks in the tsdb path to object storage, so we would like to dump the in-memory HEAD block to the filesystem on demand and let Thanos upload it. But there is no flush API for TSDB (and thus no Flush endpoint for Prometheus). The example scenario would look like:

  1. We need to scale down Prometheus servers
  2. We remove all scrape targets, so nothing new is scraped (nothing new is added to TSDB)
  3. We hit the Flush endpoint. The head block is flushed to the filesystem and truncated in memory.
  4. We wait until Thanos uploads everything, including the flushed head block.
  5. We terminate the instance.

The obvious workaround is the TSDB Snapshot method, but that is not actually "safe". TSDB blocks are immutable and overlaps are not tolerated, so:

After we take the snapshot with withHead=true into a separate directory (and make Thanos upload from there), we do have a portion of the HEAD in object storage (let's call it A), as we wanted. However:

  • we ultimately mark this instance as dirty, because any new TSDB block B that gets written from the HEAD to the filesystem (whenever db.compact() decides so) strictly overlaps with A, so this instance cannot be used again.
  • there is a possible race condition: while we are taking the snapshot and uploading it, block B can be created and also uploaded by Thanos.

All of these problems make our case really difficult to handle, and a single flush operation would help us a lot here. Do you think we could enable this in TSDB (and maybe later in Prometheus)? Would you be OK with taking a PR for it?

We would propose something like a Flush method with logic similar to the db.compact() method, but with a forced db.compactor.Write of the head block:

func (db *DB) Flush() error {
	db.cmtx.Lock()
	defer db.cmtx.Unlock()

	db.mtx.RLock()
	defer db.mtx.RUnlock()

	// Persist everything the head currently covers.
	mint, maxt := db.head.MinTime(), db.head.MaxTime()

	// Wrap head into a range that bounds all reads to it.
	head := &rangeHead{
		head: db.head,
		mint: mint,
		maxt: maxt,
	}
	// Force-write the head range as a persisted block.
	if _, err := db.compactor.Write(db.dir, head, mint, maxt); err != nil {
		return errors.Wrap(err, "persist head block")
	}

	runtime.GC()
	// Reload blocks from disk so the new block is picked up and the
	// head can be truncated to what has been persisted.
	if err := db.reload(); err != nil {
		return errors.Wrap(err, "reload blocks")
	}
	runtime.GC()
	return nil
}
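
For illustration, a minimal and purely hypothetical sketch of how such a Flush could back the "Flush endpoint" from the scenario above; neither the endpoint nor the handler exists in tsdb or Prometheus, and the flushable interface just stands in for the proposed (*DB).Flush:

package adminapi

import "net/http"

// flushable stands in for the proposed (*tsdb.DB).Flush method (hypothetical).
type flushable interface {
	Flush() error
}

// flushHandler persists the in-memory head as a block on POST.
func flushHandler(db flushable) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST required", http.StatusMethodNotAllowed)
			return
		}
		if err := db.Flush(); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	}
}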

What do you think? @gouthamve @fabxc @krasi-georgiev

For context: we are experimenting with something that horizontally auto-scales Prometheus servers in a highly dynamic environment (scrape targets change a lot). We have implemented code that automatically assigns targets to each Prometheus server and scales the number of Prometheus instances up and down.

@bwplotka
Contributor Author

CC: @domgreen

@bwplotka changed the title from "[feature request]: Allow on-demand Flush method." to "[feature request]: Implement Flush method." on Jun 11, 2018
@krasi-georgiev
Contributor

IIUC the problem is the pending WAL, right?
Why not just add this to the tsdb CLI tool: read the WAL and make a block out of it?

Otherwise, wouldn't the Flush method suffer from the same problems as the snapshots and maybe cause some other race conditions?

  • What if another compaction gets triggered at the same time?
  • What if the WAL gets truncated at the same time as flush is called?

@krasi-georgiev
Contributor

krasi-georgiev commented Jun 11, 2018

Or did you mean that the Flush would be run after Prometheus is shut down?

@brian-brazil
Contributor

I agree, this is best handled by shutting down Prometheus and then working with the data.

@bwplotka
Contributor Author

I agree, this is best handled by shutting down Prometheus and then working with the data.

Hmm, but just shutting down Prometheus doesn't give us the data from memory in the form of a TSDB block. And when we take a Snapshot right before a shutdown, there might still be a race between what the snapshot gives us as the "head block" and Prometheus potentially writing some block just before shutdown.

@bwplotka
Contributor Author

bwplotka commented Jun 11, 2018

@krasi-georgiev I rather mean a Flush that writes what it has and then continues working.

Why not just add this to the tsdb CLI tool: read the WAL and make a block out of it?

I think that is essentially the same as the Snapshot API, right? It just gives us potential overlaps with what TSDB writes down to the main directory.

@bwplotka
Contributor Author

What if another compaction gets triggered at the same time?

There won't be any, since Flush (which truncates the head) is synchronized with compaction.

What if the WAL gets truncated at the same time as flush is called?

Again, this is synchronized, but in the worst case the flushed TSDB block is super tiny (which is not perfect, I agree).

@bwplotka
Contributor Author

bwplotka commented Jun 11, 2018

Or... @brian-brazil @krasi-georgiev is there some TSDB CLI tool that gives me a TSDB block from the WAL file somehow after Prometheus terminates? 🤔

That would work for us as well, I think ^

BTW, thanks for the quick responses 👍

@krasi-georgiev
Contributor

I was referring to this one:
https://github.com/prometheus/tsdb/tree/master/cmd/tsdb

@bwplotka
Contributor Author

Yeah, adding some logic that understands the WAL file and is able to write a block from it maybe solves this.

@domgreen

@krasi-georgiev I take it this means we would want to have Prometheus shut down and call WAL.Close() to flush the head. Then all the data that was in memory is in the WAL, which we can turn into a block for upload.

To then ensure that we do not upload multiple blocks with the same data, would we need to clear/delete the contents of /wal after we create the block?

@krasi-georgiev
Contributor

I will wait for the Scan PR to be merged and will add this next.

@domgreen

So for our use case we would end up doing something like the following:

- signal for termination of Thanos / Prometheus
- mark instance as dirty
- call /-/quit on Prom
- call Flush on CLI
- Upload created block
- scale down statefulset

However, if we added /admin/tsdb/flush via the admin API, it would allow us to flush the head/WAL to a block without having to mark the pod as dirty, which would be beneficial for scaling instances up/down.

With either approach there is some work to be done. The main benefit of exposing it as an API is that the instance could be reused after it has been flushed, so we wouldn't have to scale down until after a period in which we are confident we won't need to scale up again, which would make auto-scaling much cleaner.

@krasi-georgiev
Contributor

krasi-georgiev commented Jun 11, 2018

I just checked the code again and everything that is in the HEAD is added to the WAL at the same time.
https://github.com/prometheus/tsdb/blob/c848349f07c83bd38d5d19faa5ea71c7fd8923ea/head.go#L526-L530

I think at the moment the only way would be to shut down Prometheus so it closes the WAL, and then convert it to a block, but there is also a PR that should allow doing this while the TSDB is in use.

#332
Here is the PR for a new WAL format, which also adds checkpointing. I haven't looked into the checkpointing part yet, but it might align with what you are trying to do. The idea of the PR is to allow third-party systems to ingest recent metrics. The design doc is a few comments below the PR description.

@krasi-georgiev
Contributor

IIUC the problem is not how to take the current data, as snapshotting is perfect for that, but how to leave Prometheus running while making sure that no block changes happen locally (due to compaction or WAL truncation) while Thanos uploads the snapshot?

@domgreen

Correct, I thought taking db.cmtx.Lock() and db.mtx.Lock() may have allowed us to do this.

In an ideal world we would remove targets from Prometheus and upload blocks before scaling down the statefulset (after a given duration). However, if we get more targets and need to scale up during that window, it would be better to use this empty Prometheus (which we just flushed) rather than wait for a scale down and then scale the statefulsets up again.

Again, thanks for taking the time to discuss 😄 it's really helping me understand tsdb much better 👍

@krasi-georgiev
Contributor

Even if you lock the DB, the lock will be released as soon as you finish snapshotting, so compaction or new blocks can happen right after that.
What you need is some /admin/freeze, /admin/unfreeze API 😜

P.S. Use cases help me learn a lot as well, so I'm glad to discuss.

@bwplotka
Contributor Author

Even if you lock the DB, the lock will be released as soon as you finish snapshotting, so compaction or new blocks can happen right after that.

Hm... can we dig into this, actually? Is that really wrong? With our proposed flush flow you have just flushed the HEAD block into a TSDB block and truncated it from memory. So it does not matter what happens after the flush, does it?

My only worry is that you end up with a block not equal to 2h, but that is up to the admin, I think.

@krasi-georgiev
Contributor

I must be missing something then, what is the difference between the current snapshotting and your idea for flush?

@domgreen

Hey @krasi-georgiev, from my understanding of how tsdb persists to disk, I will try to outline why I think snapshotting is a solution, but not the best solution, for what we are trying to achieve.

Please correct me if I am missing something; I am new to the tsdb code and operations 👍 and sorry if this is all obvious already and I'm explaining the wrong thing :)

Aim - auto-scale scrapers while allowing draining / lame-ducking of a scraper before we take it fully offline, so it can take new load if it appears without the overhead of scaling a statefulset down and up.

Snapshotting - Original Approach

  1. remove targets
  2. mark scraper as dirty
  3. snapshot
  4. stop / quit Prometheus
  5. upload snapshotted wal block (we will only upload the newly created block)
  6. force scale down scrape pool

Data (at snapshot):

└ data/
   ├─ 01CETTJ2PVKBK54V36ZA541RBD/
   ├─ 01CETV4CMV15AY06HTV07FMA6K/
   ├─ wal/
   |    |─000001
   |    └─000002
   └─snapshots
          └─ 20180531T095042Z-77bc1ca430f8db98/
                 |─01CETTJ2PVKBK54V36ZA541RBD/
                 |─ 01CETV4CMV15AY06HTV07FMA6K/
                 └─01CETV4Z17NA95VQZH28BT4PER/      <- block created from wal during snapshot

If, for whatever reason, lots of new targets come online and we want to scale up at this point in time (after the snapshot), we will have to wait for this scraper to be scaled down and come back up again, because if the normal persist of the WAL occurs we would end up uploading duplicate chunks.

Data (after snapshot and after next block duration):

└ data/
   ├─ 01CETTJ2PVKBK54V36ZA541RBD/
   ├─ 01CETV4CMV15AY06HTV07FMA6K/
   ├─ 01CETVPPJV7PG0NJMRT8PRYZPS/   <- new block persisted that contains chunks from snapshot
   ├─ wal/
   |    |─000001
   |    |─000002
   |    └─000003
   └─snapshots
          └─ 20180531T095042Z-77bc1ca430f8db98/
                 |─01CETTJ2PVKBK54V36ZA541RBD/
                 |─ 01CETV4CMV15AY06HTV07FMA6K/
                 └─01CETV4Z17NA95VQZH28BT4PER/      <- block created from wal during snapshot

Here we now have an extra block whose data overlaps with the snapshot, so we cannot upload this block, which means we may end up losing data, and we have to terminate the entire node and scale down.

Flushing - Alternative approach

  1. remove targets
  2. flush head/wal to a block (updating tsdb.head min max)
  3. upload new block
  4. drained scraper waits an allotted time to ensure there is no need to keep the scraper

Data (at flush):

└ data/
   ├─ 01CETTJ2PVKBK54V36ZA541RBD/
   ├─ 01CETV4CMV15AY06HTV07FMA6K/
   ├─ 01CETV4Z17NA95VQZH28BT4PER/ <- block created from flush
   └─ wal/
       |─000001
       └─000002

Now, if for whatever reason we decide not to scale that instance down, or the next block duration min/max time passes and a new block is created, we can still continue using the Prometheus instance, as there will be no overlapping blocks:

Data (after flush and after next block duration):

└ data/
   ├─ 01CETTJ2PVKBK54V36ZA541RBD/
   ├─ 01CETV4CMV15AY06HTV07FMA6K/
   ├─ 01CETV4Z17NA95VQZH28BT4PER/ <- block created from flush
   ├─ 01CETVPPJV7PG0NJMRT8PRYZPS/ <- new block with no overlap
   └─ wal/
       |─000001
       |─000002
       └─000003

For our scenario we do not currently mind whether the blocks have been fully compacted, since Thanos does not currently support local compaction.

@brian-brazil
Contributor

I'm not seeing how this adds anything that can't be done with the current shutdown and snapshotting. This seems like a micro-optimisation.

@bwplotka
Contributor Author

With flush it would be a bit cleaner, without special logic and safeguards on our side.
For example, snapshot -> shutdown -> remove WAL -> start again would work, but there might still be a race condition between the end of the snapshot and the shutdown.

I think we are on board with the idea of extending the tsdb CLI to be able to make a post-shutdown snapshot with the head block. That will solve our issue, I think.

@domgreen

Hey @brian-brazil, I think @Bplotka just cleared up my confusion with the shutdown / snapshotting approach 👍 I agree that flush is an optimisation 😄

My confusion was all around the assumption that if we shut down and start again, the WAL will be read on startup and we would still get overlapping blocks. If we remove the contents of /wal and restart, then we should be okay to drain.

@domgreen

Agreed, if we can get a flush in the tsdb CLI as @krasi-georgiev suggested earlier in the thread, that would mean we can work around all the race conditions and correctly drain an instance.

@krasi-georgiev
Contributor

IIUC this would work for you?
shutdown -> tsdb cli flush (convert the WAL to a block) -> remove WAL -> upload -> start again if needed

Maybe another thing you can try is restarting Prometheus with a low value for --storage.tsdb.min-block-duration=1m; this will persist the WAL/head every 1 min, but I imagine it might cause other issues as it will trigger compaction quite often.

@bwplotka
Contributor Author

Maybe another thing you can try is restarting Prometheus with a low value for --storage.tsdb.min-block-duration=1m; this will persist the WAL/head every 1 min, but I imagine it might cause other issues as it will trigger compaction quite often.

Yea, not safe area I would say (:

@domgreen

@krasi-georgiev would you envisage the tsdb cli removing the WAL in the flush command?

I was thinking that if it didn't, and someone used it, they would end up with an overlap. Not sure if this is an issue, but just something I was thinking about.

@krasi-georgiev
Contributor

you would be the first user so you have the privilege to say how you want this implemented.

@brian-brazil
Contributor

We need to consider things beyond what the first user requests. I would not expect that a "flush" method would break my Prometheus.

@bwplotka
Contributor Author

bwplotka commented Oct 5, 2018

How is it even possible that in the normal flow Prometheus compacts to a 2h block super quickly?

@krasi-georgiev
Contributor

krasi-georgiev commented Oct 5, 2018

I haven't measured it, but persisting a block should take half the time compared to WAL -> block.

The main difference is that writing a block reads from memory (the head), while in the case of WAL -> block the data first needs to be loaded into memory and then written to disk.

@krasi-georgiev
Contributor

I will measure again now with the change I made where checkpointing deletes all the duplicated data that is already persisted in a block.

@krasi-georgiev
Contributor

Even with the checkpointing changes, the compactor writes a block when the head/WAL reaches 3h of data, leaving 1h in the head/WAL. So in the best case you would have 1h in the WAL and in the worst case close to 3h, depending on when you stop Prometheus.

@krasi-georgiev
Contributor

with the current changes in prometheus/prometheus#4653 the following works:

  1. Stop Prometheus.
  2. Run the tsdb tool to create a block from the wal.
  3. Delete the wal.
  4. Start Prometheus Again.

Since the WAL might include up to 3h of data in the worst case, converting the WAL into a block will take up to 20 min at the load I tested it with: 2350963250 samples and 13173065 series.

>> start stage=reading the wal files into memory
>> completed stage=reading the wal files into memory duration=7m7.463992998s
level=info msg="wal loaded" mint="2018-10-08 04:38:34 +0000 UTC" maxt="2018-10-08 07:19:35 +0000 UTC"
>> start stage=read series from memory and write a block


level=info msg="write block" mint=1538973514358 maxt=1538983175805 ulid=01CS9ASG8M9DZXZJ4YNEBVZ4D4
>> completed stage=read series from memory and write a block duration=9m55.622007693s
>> start stage=opening new block
>> completed stage=opening new block duration=1.61196ms

>> total time for all stages:17m3.0877398s 

path=wal/01CS9ASG8M9DZXZJ4YNEBVZ4D4
BLOCK ULID                  MIN TIME                       MAX TIME                       NUM SAMPLES  NUM CHUNKS  NUM SERIES
01CS9ASG8M9DZXZJ4YNEBVZ4D4  2018-10-08 04:38:34 +0000 UTC  2018-10-08 07:19:35 +0000 UTC  2350963250   24157929    13173065

@bwplotka
Contributor Author

bwplotka commented Oct 8, 2018

Awesome, thanks for this @krasi-georgiev - this is very useful!

20 min is a lot, but the data size is huge as well. Even with that data size, just the upload to object storage will take X minutes, so we just need to take into account that this operation is time-consuming.

@krasi-georgiev
Contributor

The WAL under this load with nearly 3h of data is 30 GB, and the block is 5.7 GB.

@krasi-georgiev
Contributor

@bwplotka, @brancz an alternative, simpler method:

  • disable compaction (there is already an API for this in tsdb)
  • run a snapshot with the head included (already implemented in tsdb, can tweak it if needed)

With the head included, snapshotting will create a block with a custom range, but with very small changes this will be handled nicely at the next compaction time.
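
A minimal sketch of this approach against the tsdb API, assuming the db.DisableCompactions(), db.EnableCompactions(), and db.Snapshot(dir, withHead) methods and tsdb.DefaultOptions of the prometheus/tsdb library at this point in time; directory paths are placeholders:

package main

import (
	"log"
	"path/filepath"
	"time"

	"github.com/prometheus/tsdb"
)

func main() {
	// Open the existing TSDB data directory (placeholder path).
	db, err := tsdb.Open("data", nil, nil, tsdb.DefaultOptions)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Stop background compactions so no new block can appear while
	// the snapshot is taken and uploaded.
	db.DisableCompactions()

	// Snapshot all persisted blocks plus the in-memory head
	// (withHead=true) into a separate directory.
	snapDir := filepath.Join("snapshots", time.Now().UTC().Format("20060102T150405Z"))
	if err := db.Snapshot(snapDir, true); err != nil {
		log.Fatal(err)
	}

	// Upload snapDir (e.g. with Thanos) before re-enabling compactions.
	db.EnableCompactions()
}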

@brancz
Contributor

brancz commented Nov 26, 2018

The nice thing about doing this "offline" though is that no new head chunk is started.

@krasi-georgiev
Contributor

krasi-georgiev commented Nov 26, 2018

Yes, that is true. Maybe we can even add an option to the snapshot API to allow snapshotting only the head. This would cut the time needed in half compared to the WAL-to-block CLI, as the WAL is already loaded in memory.

I guess if tsdb.Snapshot allows creating a block just from the head, the WAL-to-block CLI would just open the DB with compaction disabled and call that snapshot API.

@brancz
Contributor

brancz commented Nov 26, 2018

I don't think that would be equivalent. Equivalent would be to not accept writes anymore, perform the snapshot, and shut down, or somehow allow "manually" enabling writes again, which seems too specific a workflow to be part of Prometheus/tsdb.

@gouthamve
Collaborator

Hi, reading it again, I think the Snapshot API already does what you want, no? For shutdown, you do the following:

  1. We remove all scrape targets, so nothing new is scraped (nothing new is added to TSDB)
  2. We call the Snapshot API
  3. We stop looking at the data-dir, only look at the Snapshot dir. Upload what is in the snapshot dir
  4. Kill the pod

I might have missed something, but if we stop looking at the data dir once the Snapshot API returns, we'll be fine, no?
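
For reference, a minimal sketch of step 2 against the standard Prometheus admin API (which must be enabled with --web.enable-admin-api); the host/port and response handling here are illustrative:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// snapshotResponse mirrors the admin API response shape,
// e.g. {"status":"success","data":{"name":"20180531T095042Z-..."}}.
type snapshotResponse struct {
	Status string `json:"status"`
	Data   struct {
		Name string `json:"name"`
	} `json:"data"`
}

func main() {
	// Ask Prometheus to write all persisted blocks plus the head
	// into <data-dir>/snapshots/<name>.
	resp, err := http.Post("http://localhost:9090/api/v1/admin/tsdb/snapshot", "", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var sr snapshotResponse
	if err := json.NewDecoder(resp.Body).Decode(&sr); err != nil {
		log.Fatal(err)
	}
	// Upload only the snapshot directory, then the pod can be killed.
	fmt.Println("snapshot directory:", sr.Data.Name)
}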

@brancz
Contributor

brancz commented Nov 27, 2018

I think what @gouthamve suggests probably suffices. I'm just a little concerned about the atomicity of step 1: it's hard to tell when the head will stop receiving data, and removing targets doesn't necessarily mean that all in-progress scrapes/inserts are immediately done. This could, however, be treated similarly to a missed scrape, so it could probably be waited for on a best-effort basis and then just ignored.

@bwplotka
Contributor Author

Well, you have this atomicity problem anyway, even with our previous ideas of Flush, stopping Prometheus, or disabling compaction. Only some native tear-down logic (dump head + turn off) in the form of an endpoint would give us that safety - which would be nice to have, what do you think? (:

Yeah @gouthamve, I mentioned the Snapshot API at the very beginning as something we can use right now, but this issue was all about finding a solution that will:

  • allow reusing the same Prometheus after a flush (dynamic changes in autoscaling)
  • provide the atomicity (good point @brancz)

It looks to me like starting with the Snapshot API and gathering more data about the issues is the way to go with this.

@krasi-georgiev
Contributor

I hope you find the time soon to test all discussed solutions.

@bwplotka
Contributor Author

bwplotka commented Jan 4, 2019

Yup, we will be dealing with this issue soon.

CC @devnev @easyCZ

@krasi-georgiev
Contributor

Any chance to get this revived?

@brian-brazil
Contributor

Couldn't you use readonly mode plus db.Snapshot for this?
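
A rough sketch of that direction, assuming the read-only mode added to tsdb around this time (tsdb.OpenDBReadOnly) and its FlushWAL helper; method names and signatures may differ between versions, and FlushWAL is used here instead of Snapshot to turn the WAL of a stopped Prometheus into a block:

package main

import (
	"log"

	"github.com/prometheus/tsdb"
)

func main() {
	// Open the data directory of a stopped Prometheus in read-only mode.
	db, err := tsdb.OpenDBReadOnly("data", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Persist whatever is still only in the WAL as a regular block
	// in the same data directory, ready for upload.
	if err := db.FlushWAL("data"); err != nil {
		log.Fatal(err)
	}
}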

@krasi-georgiev
Contributor

aaah yeah that is an idea.

@bwplotka
Contributor Author

The current idea is to do Snapshot + potentially read-only mode. So nothing is required from TSDB for now. cc @devnev

Closing.
