Skip to content
This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

Add tsdb.Scan() to unblock from a corrupted db. #320

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

krasi-georgiev
Copy link
Contributor

@krasi-georgiev krasi-georgiev commented May 2, 2018

I have seen few issues where tsdb corruption forces users to delete the entire data folder so I would like to add a new public function tsdb.Scan to run different scans for overlapping blocks , corrupted chunks , missing meta files, tmp files etc.

At the end of the scan It will display a list and prompt if the user wishes to deleted these to allow a clean startup.

I was thinking to implement all this in the tsdb cli tool , but to avoid a lot of code duplication it would need access to many unexported funcs and structs so instead will add the Scan func as part of the tsdb package and just call it from the tsdb cli.

Let me know for any objections or other ideas.

fixes prometheus/prometheus#4058

Following PRs

  • Update the README.md
  • add it to the release page so it is easy to download.
  • maybe move the repairBadIndexVersion somewhere in the Scan due to stability guarantee as described by fabian this needs to remain as part of Prometheus.

Signed-off-by: Krasi Georgiev [email protected]

@gouthamve
Copy link
Collaborator

This sounds good to me but a few things, can we fix the block by excluding corrupt data rather than delete it?

For example, the querier should return the data for everything except that part and also show an error for the corruption. Then people can go ahead, run this tool and cleanup the corrupted data rather than losing potentially days worth of data for a single corruption.

@brian-brazil
Copy link
Contributor

brian-brazil commented May 7, 2018

Partial data is virtually impossible to deal with correctly for users, so we would need to lose everything in that time slice to keep things sane.

@krasi-georgiev
Copy link
Contributor Author

I will add block deletion and can explore partial recovery if there are user requests

@brian-brazil
Copy link
Contributor

We could quarantine them like we did in 1.x

@krasi-georgiev krasi-georgiev changed the title [PROPOSAL] Scan and delete corrupted blocks. [WIP] Scan and delete corrupted blocks. May 8, 2018
@krasi-georgiev
Copy link
Contributor Author

@brian-brazil can you point me tome some code or a branch for the quarantine?
What can we do with the quarantined data after that?

@gouthamve Not fully tested and not ready for a review yet, but thought I would share the current progress.
At the moment it tests for

  • unreadable meta files
  • blocks that can't be opened
  • overlapping blocks

can we test for anything else that has been a problem with the data corruption?

@gouthamve
Copy link
Collaborator

Yep, we could start with this and then add cases as we people issues.

@krasi-georgiev krasi-georgiev changed the title [WIP] Scan and delete corrupted blocks. Add tsdb.Scan() to scan for unreadable blocks. May 11, 2018
@krasi-georgiev
Copy link
Contributor Author

@gouthamve PTAL. let me know if you think some of this should be split in a separate PR.

@krasi-georgiev
Copy link
Contributor Author

@fabxc PTAL and will update the README as well.

@bwplotka
Copy link
Contributor

bwplotka commented Jun 11, 2018

Can we scan for this as well? https://github.com/improbable-eng/thanos/blob/master/pkg/block/index.go#L154

It seems that people are having "outsider" chunks (outside of meta time range) for TSDB produced blocks: thanos-io/thanos#354 (comment)

@krasi-georgiev
Copy link
Contributor Author

krasi-georgiev commented Jun 11, 2018

that is very useful , thanks
will add shortly.

@krasi-georgiev
Copy link
Contributor Author

krasi-georgiev commented Jun 26, 2018

@Bplotka, @gouthamve added the IndexRepair.

go run cmd/tsdb/main.go scan /tmp/test..gibberishNumbers

you can use the test to generate some indexes with invalid chunks.
Just comment few lines in the test so it leaves the invalid indexes on the disk.
defer close()
this removes the the db after the test so commenting will leave it on your disk in the /tmp folder.

also comment all lines after
// Repair the index.
this does the actual index repair and we want to leave these un-repaired so you can test the tsdb scan tool

Copy link
Contributor

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial review. I did not review the index stat & rewrite stuff super closely.

Some suggestions.

So far I am worried about 2 things:

  • It might be tricky to do black box repair. That's why thanos bucket repair only known or safe to repair things: https://github.com/improbable-eng/thanos/tree/master/pkg/verifier
    whereas here we just kill things which are broken. I guess it is fine for some use cases (people that do not care about missed metrics much and want to be just unblocked).
  • I would love to have single index repair logic for both Thanos and TSDB since we could use in Thanos just TSDB repair once we have it downloaded locally. But because the current logic is not repairing much, mostly killing unsure stuff (because we want to be generic here - we don't know the root cause of the issue that generated broken index), I won't be able to base on this code.

I guess this tool can go in 2 direction.

  • white box: so as Thanos verify package - we add repair step on every known issue
  • black box: just remove "unsure" corrupted data.

I can see totally the black box tool usefulness and simplicity and maybe we want this. But I have seen people asking questions "can I recover this data", so I would really be explicit here about the approach this tool take.

What do you think? it is really hard topic!

Also one thing that will help here is to separate SCAN step from Repair/delete as we do in Thanos. This way you can safely run this and gather all issues and see how much is broken and how. Then it easier to decide if you want to dig and repair, or maybe only 2h is corrupted so we are fine to delete those.

cmd/tsdb/main.go Show resolved Hide resolved
cmd/tsdb/main.go Show resolved Hide resolved
scan.go Outdated Show resolved Hide resolved
cmd/tsdb/main.go Outdated Show resolved Hide resolved
cmd/tsdb/main.go Outdated Show resolved Hide resolved
scan.go Outdated Show resolved Hide resolved
scan.go Outdated Show resolved Hide resolved
scan.go Outdated Show resolved Hide resolved
scan.go Outdated Show resolved Hide resolved
scan.go Outdated Show resolved Hide resolved
Copy link
Contributor Author

@krasi-georgiev krasi-georgiev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good/bad news.

I addresses all your comments, but decided to rewrite it to completely to brake it into smaller more logical blocks, hoping it will make it easier to read and maintain.

Also decided to test a db from disk instead of creating the db on every run. I think this will cause less maintenance in the long run as creating a fresh db on every run uses few internal funcs and logics which are likely to chance so if the tests relies on these will require maintenance more often.

also added more code to rewrite partial outsiders by just dropping the outsider samples and keeping the once in range.

cmd/tsdb/main.go Show resolved Hide resolved
cmd/tsdb/main.go Outdated Show resolved Hide resolved
cmd/tsdb/main.go Outdated Show resolved Hide resolved
cmd/tsdb/main.go Outdated
promptDelete(unreadableBlocks, scanCmdHumanReadable)
}

repaired, err := scan.RepairIndex()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.
I have been also thinking whether all scan command should take the block to e scanned as a parameter instead of doing this inside the function, but still unsure , maybe will update later when we see how it can be used.

cmd/tsdb/main.go Outdated
if len(overlappingBlocks) > 0 {
didSomething = true
fmt.Println("Overlaping blocks.")
fmt.Println("Deleting these will remove all data in the listed time range.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my idea is that first we try to repair all known issues and when everything fails we just delete to unblock the startup.
If you can think of another repair we can do will put it before this stage.

Last issue I saw the user deleted the offending block anyway and less experienced users would delete the entire data folder.

Not sure, wonder if this portion of code will not harm people (they will miss the fact that it is killing their metrics)

you don't think the comment I added is enough?

scan.go Outdated
}

loop:
for {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch , this was some left over from some previous code changes.

scan.go Outdated
unreadable := make(map[error][]*Block)

for _, dir := range dirs {
block := &Block{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

scan.go Outdated Show resolved Hide resolved
scan.go Outdated Show resolved Hide resolved
scan.go Outdated Show resolved Hide resolved
@krasi-georgiev krasi-georgiev changed the title Add tsdb.Scan() to unblock prometheus pieces Add tsdb.Scan() to unblock from a corrupted db. Sep 20, 2018
@krasi-georgiev
Copy link
Contributor Author

ping @bwplotka

@krasi-georgiev
Copy link
Contributor Author

ping @bwplotka

Copy link
Contributor

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial review (TBH I have this halfy reviewed in July, but forgot to submit ): )

This PR is huuuuge 😢

cmd/tsdb/main.go Outdated Show resolved Hide resolved
cmd/tsdb/main.go Outdated Show resolved Hide resolved
cmd/tsdb/main.go Outdated
}
}

func scanTmps(scanPath string, hformat *bool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scanTmpCompactFiles to be more verbose?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto with error return

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there were cases with other temps like WAL checkpoint so will leave as is to be more generic and just updated the print message.

cmd/tsdb/main.go Outdated Show resolved Hide resolved
cmd/tsdb/main.go Show resolved Hide resolved
cmd/tsdb/main.go Show resolved Hide resolved
cmd/tsdb/main.go Outdated Show resolved Hide resolved
cmd/tsdb/main.go Outdated Show resolved Hide resolved
@@ -534,8 +534,9 @@ func (db *DB) reload() (err error) {
sort.Slice(blocks, func(i, j int) bool {
return blocks[i].Meta().MinTime < blocks[j].Meta().MinTime
})
if err := validateBlockSequence(blocks); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about changing anything in db.go before resolving conflicts (: I would aim for changing as little as possible. (e.g -type Overlaps map[TimeRange][]BlockMeta -> type Overlaps map[TimeRange][]*Block change)

// TimeRange specifies minTime and maxTime range.
type TimeRange struct {
Min, Max int64
}

// Overlaps contains overlapping blocks aggregated by overlapping range.
type Overlaps map[TimeRange][]BlockMeta
type Overlaps map[TimeRange][]*Block
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using meta.ULID?

Krasi Georgiev added 4 commits December 14, 2018 14:27
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Copy link
Contributor Author

@krasi-georgiev krasi-georgiev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed all comments and resolved conflicts.

cmd/tsdb/main.go Outdated Show resolved Hide resolved
cmd/tsdb/main.go Outdated Show resolved Hide resolved
cmd/tsdb/main.go Outdated Show resolved Hide resolved
cmd/tsdb/main.go Outdated Show resolved Hide resolved
cmd/tsdb/main.go Outdated Show resolved Hide resolved
cmd/tsdb/main.go Outdated Show resolved Hide resolved
cmd/tsdb/main.go Outdated
}
}

func scanTmps(scanPath string, hformat *bool) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there were cases with other temps like WAL checkpoint so will leave as is to be more generic and just updated the print message.

cmd/tsdb/main.go Outdated Show resolved Hide resolved
cmd/tsdb/main.go Outdated Show resolved Hide resolved
// TimeRange specifies minTime and maxTime range.
type TimeRange struct {
Min, Max int64
}

// Overlaps contains overlapping blocks aggregated by overlapping range.
type Overlaps map[TimeRange][]BlockMeta
type Overlaps map[TimeRange][]*Block
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any harm in this refactoring as validateBlockSequence is doing the same thing.

@krasi-georgiev
Copy link
Contributor Author

ping @bwplotka

@mmencner
Copy link

I just hit the issue of overlapping blocks with prometheus 2.6.0 running with local storage, inside a docker container:

err="opening storage failed: invalid block sequence: block time ranges overlap: [mint: 1551146400000, maxt: 1551153600000, range: 2h0m0s, blocks: 597]:

When it happened, a number of new directories were created, each time the container was trying to start (unsuccessfully). Only after deleting all of them, we were able to get the prometheus running.

In this case, auto-check/auto-discard would be really helpful.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

does not start up after corrupted meta.json file
5 participants