write compaction: chunk 63 not found: invalid encoding "none" #1345
Comments
Seems like a case of #1335. Were there any partial uploads? Could you elaborate more on how to reproduce this or what happened?
We are encountering the same problem as described above.
However, if we run …
Looks like some block corruption (unclean shutdown, crash, using remote storage, etc.). Could you send me that block privately (thanos-dev Slack or prometheus-dev IRC) and I will be able to confirm.
I was able to replace the broken block, because we keep the original blocks on Prometheus itself for quite a while. But Thanos could handle this gracefully instead of just stopping. Sorry, I deleted the corrupt block - it would have been interesting to know why verify did not find any problems ...
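For anyone hitting this later, a rough sketch of what "replacing the block" can look like in code is below. It only uses the generic objstore.Bucket interface from the Thanos codebase (Iter, Delete, Upload); the exact import path differs between releases, the bucket client construction is left out, and the helper names (deleteDir, uploadDir, ReplaceBlock) are made up for illustration. Stop the compactor before doing anything like this.

```go
// Package blockfix is a sketch of swapping a corrupted block in object
// storage for the original copy still present on the Prometheus host.
// The bucket client construction is intentionally left out; any
// objstore.Bucket implementation (S3, radosgw, ...) should work.
package blockfix

import (
	"context"
	"os"
	"path"
	"path/filepath"
	"strings"

	"github.com/thanos-io/thanos/pkg/objstore" // import path varies by Thanos version
)

// deleteDir removes every object below the given prefix, e.g. "<ULID>/".
func deleteDir(ctx context.Context, bkt objstore.Bucket, dir string) error {
	return bkt.Iter(ctx, dir, func(name string) error {
		if strings.HasSuffix(name, "/") { // nested "directory": recurse
			return deleteDir(ctx, bkt, name)
		}
		return bkt.Delete(ctx, name)
	})
}

// uploadDir walks the good local copy of the block and re-uploads every file
// under the same relative path in the bucket.
func uploadDir(ctx context.Context, bkt objstore.Bucket, localDir, dstPrefix string) error {
	return filepath.Walk(localDir, func(p string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		rel, err := filepath.Rel(localDir, p)
		if err != nil {
			return err
		}
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		defer f.Close()
		return bkt.Upload(ctx, path.Join(dstPrefix, filepath.ToSlash(rel)), f)
	})
}

// ReplaceBlock drops the corrupted block from the bucket and uploads the
// local copy (e.g. <prometheus-data-dir>/<ULID>) under the same ULID.
func ReplaceBlock(ctx context.Context, bkt objstore.Bucket, ulid, localBlockDir string) error {
	if err := deleteDir(ctx, bkt, ulid+"/"); err != nil {
		return err
	}
	return uploadDir(ctx, bkt, localBlockDir, ulid)
}
```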
Verify does not check whether any of the chunks are malformed, especially not in this way. I can't think of any operation that would cause this, not even a partial upload... It looks really odd. I am really interested in the root cause of this. I guess you deleted it from object storage? What object storage do you use? I wonder if the replaced block will cause the same problem, because it looks like the original block was wrongly produced by Prometheus.
Some colleagues who use Prometheus at large scale mention that with data corruption a hard fail is better than just logging an error and continuing. I think the main reason is that continuing would lead to false alerts and misleading data when querying.
@bwplotka we're using Ceph with radosgw as object storage, and the blocks have been processed now. I think this could be related to some problems within our storage infrastructure we had around that time, but I did not find any errors in the sidecar logs regarding failed uploads ...
Ah, that makes sense. Anyway, what would be the action items for Thanos in this case? Maybe …
Incrementing a metric for malformed blocks and continuing to compact the healthy groups seems like a good idea - shouldn't thanos_compact_group_compactions_failures_total already be that metric? But it was always at 0, because the service just died and restarted.
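To make the "count it and move on" suggestion concrete, here is a minimal, hypothetical sketch of that behaviour (not the actual compactor code): the Group type and its Compact method are stand-ins, only the metric name is taken from the discussion above.

```go
package main

import (
	"fmt"
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// Group is a hypothetical stand-in for a compaction group.
type Group struct{ Key string }

// Compact is a placeholder for the real compaction work.
func (g *Group) Compact() error {
	return nil
}

var groupFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		// Metric name taken from the thread above.
		Name: "thanos_compact_group_compactions_failures_total",
		Help: "Total number of failed group compactions.",
	},
	[]string{"group"},
)

// compactGroups records a failure for a malformed group and keeps going with
// the remaining healthy groups instead of crashing the whole process.
func compactGroups(groups []*Group) {
	for _, g := range groups {
		if err := g.Compact(); err != nil {
			groupFailures.WithLabelValues(g.Key).Inc()
			log.Printf("compaction failed for group %s: %v", g.Key, err)
			continue
		}
		fmt.Printf("compacted group %s\n", g.Key)
	}
}

func main() {
	prometheus.MustRegister(groupFailures)
	compactGroups([]*Group{{Key: "example"}})
}
```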
Hello, we are facing the same issue. We have this block:
And the compactor is in a crash loop with:
What "surprises" me is the caller=streamed_block_writer.go:219 msg="finalized downsampled block" log line before the invalid encoding "none" error. Even if the block is "corrupted", the finalized downsample log would be called? Is there any script I could run to validate the block is corrupted and that there is not so many things to do anymore...? This is quite a big range of time ^^' More info: Thanos 0.9.0, Objectstore: S3 Scality on-premise |
This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
Thanos, Prometheus and Golang version used
Thanos Version: 0.5.0
Prometheus Version: 2.7.2
What happened
What you expected to happen
How to reproduce it (as minimally and precisely as possible):
Full logs to relevant components
Anything else we need to know
Environment:
OpenShift Version: v3.11.43
Kubernetes Version: v1.11.0+d4cacc0
Object Storage: S3