-
Notifications
You must be signed in to change notification settings - Fork 179
Add tsdb.Scan() to unblock from a corrupted db. #320
base: master
Are you sure you want to change the base?
Add tsdb.Scan() to unblock from a corrupted db. #320
Conversation
This sounds good to me but a few things, can we fix the block by excluding corrupt data rather than delete it? For example, the querier should return the data for everything except that part and also show an error for the corruption. Then people can go ahead, run this tool and cleanup the corrupted data rather than losing potentially days worth of data for a single corruption. |
Partial data is virtually impossible to deal with correctly for users, so we would need to lose everything in that time slice to keep things sane. |
I will add block deletion and can explore partial recovery if there are user requests |
We could quarantine them like we did in 1.x |
@brian-brazil can you point me tome some code or a branch for the quarantine? @gouthamve Not fully tested and not ready for a review yet, but thought I would share the current progress.
can we test for anything else that has been a problem with the data corruption? |
Yep, we could start with this and then add cases as we people issues. |
@gouthamve PTAL. let me know if you think some of this should be split in a separate PR. |
@fabxc PTAL and will update the README as well. |
Can we scan for this as well? https://github.com/improbable-eng/thanos/blob/master/pkg/block/index.go#L154 It seems that people are having "outsider" chunks (outside of meta time range) for TSDB produced blocks: thanos-io/thanos#354 (comment) |
that is very useful , thanks |
@Bplotka, @gouthamve added the IndexRepair.
you can use the test to generate some indexes with invalid chunks. also comment all lines after |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial review. I did not review the index stat & rewrite stuff super closely.
Some suggestions.
So far I am worried about 2 things:
- It might be tricky to do black box repair. That's why thanos
bucket
repair only known or safe to repair things: https://github.com/improbable-eng/thanos/tree/master/pkg/verifier
whereas here we just kill things which are broken. I guess it is fine for some use cases (people that do not care about missed metrics much and want to be just unblocked). - I would love to have single index repair logic for both Thanos and TSDB since we could use in Thanos just TSDB repair once we have it downloaded locally. But because the current logic is not repairing much, mostly killing unsure stuff (because we want to be generic here - we don't know the root cause of the issue that generated broken index), I won't be able to base on this code.
I guess this tool can go in 2 direction.
- white box: so as Thanos
verify
package - we add repair step on every known issue - black box: just remove "unsure" corrupted data.
I can see totally the black box tool usefulness and simplicity and maybe we want this. But I have seen people asking questions "can I recover this data", so I would really be explicit here about the approach this tool take.
What do you think? it is really hard topic!
Also one thing that will help here is to separate SCAN
step from Repair/delete
as we do in Thanos. This way you can safely run this and gather all issues and see how much is broken and how. Then it easier to decide if you want to dig and repair, or maybe only 2h is corrupted so we are fine to delete those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good/bad news.
I addresses all your comments, but decided to rewrite it to completely to brake it into smaller more logical blocks, hoping it will make it easier to read and maintain.
Also decided to test a db from disk instead of creating the db on every run. I think this will cause less maintenance in the long run as creating a fresh db on every run uses few internal funcs and logics which are likely to chance so if the tests relies on these will require maintenance more often.
also added more code to rewrite partial outsiders by just dropping the outsider samples and keeping the once in range.
cmd/tsdb/main.go
Outdated
promptDelete(unreadableBlocks, scanCmdHumanReadable) | ||
} | ||
|
||
repaired, err := scan.RepairIndex() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
I have been also thinking whether all scan command should take the block to e scanned as a parameter instead of doing this inside the function, but still unsure , maybe will update later when we see how it can be used.
cmd/tsdb/main.go
Outdated
if len(overlappingBlocks) > 0 { | ||
didSomething = true | ||
fmt.Println("Overlaping blocks.") | ||
fmt.Println("Deleting these will remove all data in the listed time range.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my idea is that first we try to repair all known issues and when everything fails we just delete to unblock the startup.
If you can think of another repair we can do will put it before this stage.
Last issue I saw the user deleted the offending block anyway and less experienced users would delete the entire data folder.
Not sure, wonder if this portion of code will not harm people (they will miss the fact that it is killing their metrics)
you don't think the comment I added is enough?
scan.go
Outdated
} | ||
|
||
loop: | ||
for { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch , this was some left over from some previous code changes.
scan.go
Outdated
unreadable := make(map[error][]*Block) | ||
|
||
for _, dir := range dirs { | ||
block := &Block{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
f7f2578
to
d577ed4
Compare
ping @bwplotka |
ping @bwplotka |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial review (TBH I have this halfy reviewed in July, but forgot to submit ): )
This PR is huuuuge 😢
cmd/tsdb/main.go
Outdated
} | ||
} | ||
|
||
func scanTmps(scanPath string, hformat *bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scanTmpCompactFiles
to be more verbose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto with error return
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there were cases with other temps like WAL checkpoint so will leave as is to be more generic and just updated the print message.
@@ -534,8 +534,9 @@ func (db *DB) reload() (err error) { | |||
sort.Slice(blocks, func(i, j int) bool { | |||
return blocks[i].Meta().MinTime < blocks[j].Meta().MinTime | |||
}) | |||
if err := validateBlockSequence(blocks); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about changing anything in db.go
before resolving conflicts (: I would aim for changing as little as possible. (e.g -type Overlaps map[TimeRange][]BlockMeta
-> type Overlaps map[TimeRange][]*Block
change)
// TimeRange specifies minTime and maxTime range. | ||
type TimeRange struct { | ||
Min, Max int64 | ||
} | ||
|
||
// Overlaps contains overlapping blocks aggregated by overlapping range. | ||
type Overlaps map[TimeRange][]BlockMeta | ||
type Overlaps map[TimeRange][]*Block |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not using meta.ULID?
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed all comments and resolved conflicts.
cmd/tsdb/main.go
Outdated
} | ||
} | ||
|
||
func scanTmps(scanPath string, hformat *bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there were cases with other temps like WAL checkpoint so will leave as is to be more generic and just updated the print message.
// TimeRange specifies minTime and maxTime range. | ||
type TimeRange struct { | ||
Min, Max int64 | ||
} | ||
|
||
// Overlaps contains overlapping blocks aggregated by overlapping range. | ||
type Overlaps map[TimeRange][]BlockMeta | ||
type Overlaps map[TimeRange][]*Block |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any harm in this refactoring as validateBlockSequence
is doing the same thing.
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
ping @bwplotka |
I just hit the issue of overlapping blocks with prometheus 2.6.0 running with local storage, inside a docker container:
When it happened, a number of new directories were created, each time the container was trying to start (unsuccessfully). Only after deleting all of them, we were able to get the prometheus running. In this case, auto-check/auto-discard would be really helpful. |
I have seen few issues where tsdb corruption forces users to delete the entire data folder so I would like to add a new public function
tsdb.Scan
to run different scans for overlapping blocks , corrupted chunks , missing meta files, tmp files etc.At the end of the scan It will display a list and prompt if the user wishes to deleted these to allow a clean startup.
I was thinking to implement all this in the tsdb cli tool , but to avoid a lot of code duplication it would need access to many unexported funcs and structs so instead will add the
Scan
func as part of the tsdb package and just call it from the tsdb cli.Let me know for any objections or other ideas.
fixes prometheus/prometheus#4058
Following PRs
maybe move thedue to stability guarantee as described by fabian this needs to remain as part of Prometheus.repairBadIndexVersion
somewhere in theScan
Signed-off-by: Krasi Georgiev [email protected]