write compaction: chunk 63 not found: invalid encoding "none" #1345

Closed
itcloudy opened this issue Jul 22, 2019 · 11 comments

Comments

@itcloudy commented Jul 22, 2019

Thanos, Prometheus and Golang version used

Thanos Version: 0.5.0
Prometheus Version: 2.7.2

What happened

What you expected to happen

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components

Logs

level=debug ts=2019-07-22T11:29:07.217897384Z caller=compact.go:824 compactionGroup="0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}" msg="downloaded and verified blocks" blocks="[/var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG5S1JWSR4HMJWDDVR9ZV7AT /var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG6Q0F4K5BVCR334NJB92CMX /var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG90J9GADDXQVMNJ9EX2VHJA /var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG9E988WREQM2XKTK3SVGYFQ]" duration=1m1.245448142s
level=error ts=2019-07-22T11:29:08.386446072Z caller=main.go:182 msg="running command failed" err="error executing compaction: compaction failed: compaction failed for group 0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}: compact blocks [/var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG5S1JWSR4HMJWDDVR9ZV7AT /var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG6Q0F4K5BVCR334NJB92CMX /var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG90J9GADDXQVMNJ9EX2VHJA /var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG9E988WREQM2XKTK3SVGYFQ]: write compaction: chunk 63 not found: invalid encoding \"none\""

Anything else we need to know

Environment:
Openshift Version: v3.11.43
Kubernetes Version: v1.11.0+d4cacc0
Object Storage: S3

@GiedriusS (Member)

Seems like a case of #1335. Were there any partial uploads? Could you elaborate more on how to reproduce this or what happened?

@rsommer commented Oct 2, 2019

We encounter the same problem as described above.

Oct 02 08:26:03 thanoscompact01 thanos[105774]: level=warn ts=2019-10-02T08:26:03.028686797Z caller=prober.go:154 msg="changing probe status" status=unhealthy reason="error executing compaction: compaction failed: compaction failed for group 0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}: compact blocks [/var/cache/thanos/compact/0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}/01DNHEV4WJ042J94AS3HJHGJXH /var/cache/thanos/compact/0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}/01DNHNPW4JPNWSP76WH18G6DMR /var/cache/thanos/compact/0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}/01DNHWJKCK1Y0DG1N3MGBD5W4R]: write compaction: chunk 266 not found: invalid encoding \"none\""
Oct 02 08:26:03 thanoscompact01 thanos[105774]: level=error ts=2019-10-02T08:26:03.028868443Z caller=main.go:213 msg="running command failed" err="error executing compaction: compaction failed: compaction failed for group 0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}: compact blocks [/var/cache/thanos/compact/0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}/01DNHEV4WJ042J94AS3HJHGJXH /var/cache/thanos/compact/0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}/01DNHNPW4JPNWSP76WH18G6DMR /var/cache/thanos/compact/0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}/01DNHWJKCK1Y0DG1N3MGBD5W4R]: write compaction: chunk 266 not found: invalid encoding \"none\""

However, if we run bucket verify, no errors are found. The local cache has been deleted.
Thanos is version 0.7.0.

@krasi-georgiev (Contributor)

Looks like some block corruption (unclean shutdown, crash, using remote storage, etc.). Could you send me that block privately (thanos-dev Slack or prometheus-dev IRC) so I can confirm?
Otherwise, to unblock, you can just delete this block.
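
For reference, "deleting the block" means removing every object under the block's ULID prefix in the bucket (<ULID>/meta.json, <ULID>/index, <ULID>/chunks/...). Below is a minimal sketch of doing that against an S3-compatible store with the minio-go client; the endpoint, credentials, bucket name and ULID are placeholders, not values from this issue:

```go
package main

import (
	"context"
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Placeholder endpoint, credentials, bucket and block ULID.
	client, err := minio.New("s3.example.internal", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	bucket := "thanos-blocks"
	blockULID := "REPLACE_WITH_BLOCK_ULID"

	// A block lives entirely under its ULID prefix, so remove every object below it.
	for obj := range client.ListObjects(ctx, bucket, minio.ListObjectsOptions{
		Prefix:    blockULID + "/",
		Recursive: true,
	}) {
		if obj.Err != nil {
			log.Fatal(obj.Err)
		}
		if err := client.RemoveObject(ctx, bucket, obj.Key, minio.RemoveObjectOptions{}); err != nil {
			log.Fatal(err)
		}
		log.Printf("deleted %s", obj.Key)
	}
}
```

Later Thanos releases grew a deletion-mark mechanism for this; on the versions discussed here, removing the objects directly (and clearing the compactor's local cache directory) is what "delete this block" amounts to.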

@rsommer commented Oct 3, 2019

I was able to replace the broken block because we keep the original blocks on Prometheus itself for quite a while. But Thanos could handle this gracefully instead of just stopping. Sorry, I deleted the corrupt block - it would have been interesting to see why verify did not find any problems ...

@bwplotka (Member) commented Oct 3, 2019

Verify does not check whether any of the chunks are malformed, especially in such a way. I can't think of any operation that would cause this - even a partial upload... It looks really odd.

I am really interested in the root cause of this. I guess you deleted it from object storage? What object storage do you use?

I wonder whether the replaced block will cause the same problem, because it looks like the original block was wrongly produced by Prometheus.

@krasi-georgiev (Contributor) commented Oct 3, 2019

> But thanos could handle this gracefully instead of just stopping.

Some colleagues who use Prometheus at large scale mention that with data corruption a hard fail is better than just logging an error and continuing. I think the main reason is that continuing would lead to false alerts and misleading data when querying.

@rsommer commented Oct 4, 2019

@bwplotka we're using Ceph with radosgw as object storage, and the blocks have been processed now. I think this could be related to some problems in our storage infrastructure around that time, but I did not find any errors in the sidecar logs regarding failed uploads ...

@bwplotka (Member) commented Oct 4, 2019

Ah, that makes sense.

Anyway, what would be the action items for Thanos in this case? Maybe:

  • Increment a metric for malformed blocks? (a rough sketch follows after this list)
  • Continue compacting whatever we can, avoiding this "group"
  • Add a check against malformed blocks in verify? This might be quite difficult.
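
A hedged sketch of what the first item could look like, using the standard client_golang counter API. The metric name, the "group" label, and the compactGroup wrapper are illustrative assumptions, not existing Thanos code:

```go
package main

import (
	"errors"
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative metric; the name and "group" label are assumptions, not an existing Thanos metric.
var malformedBlocks = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "thanos_compact_group_malformed_blocks_total",
		Help: "Compactions skipped because a block's chunks could not be decoded.",
	},
	[]string{"group"},
)

func init() {
	prometheus.MustRegister(malformedBlocks)
}

// compactGroup stands in for the per-group compaction loop: instead of halting
// the whole compactor, it records the failure and lets other groups proceed.
func compactGroup(group string, compact func() error) error {
	if err := compact(); err != nil {
		malformedBlocks.WithLabelValues(group).Inc()
		return fmt.Errorf("skipping group %s: %w", group, err)
	}
	return nil
}

func main() {
	// Simulate the failure seen in this issue for one group.
	err := compactGroup(`0@{replica="prom01"}`, func() error {
		return errors.New(`write compaction: chunk 266 not found: invalid encoding "none"`)
	})
	fmt.Println(err)
}
```

As noted in the next comment, thanos_compact_group_compactions_failures_total only moves if the process survives the failure, which is exactly what skipping the group instead of exiting would allow.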

@rsommer commented Oct 4, 2019

Incrementing a metric for malformed blocks and continuing to compact healthy groups seems like a good idea - shouldn't thanos_compact_group_compactions_failures_total already be that metric? But it was always at 0, because the service just died and restarted.

@ahurtaud (Contributor) commented Jan 7, 2020

Hello, we are facing the same issue.

We have this block:

| 01DX0WGA3DBRS0E964JGYHYG63 | 12-12-2019 01:00:00 | 26-12-2019 01:00:00 | 336h0m0s       | -296h0m0s     | 7,389,671  | 268,583,707,631 | 2,237,389,868 | 4          | false       | <MY_LABELS_KEY_VALUE> | 0s         | compactor |

And the compactor is in a crash loop with:

level=info ts=2020-01-07T11:17:09.156025559Z caller=compact.go:290 msg="start first pass of downsampling"
level=info ts=2020-01-07T12:22:00.795975326Z caller=downsample.go:269 msg="downloaded block" id=01DX0WGA3DBRS0E964JGYHYG63 duration=1h4m32.513475834s
level=info ts=2020-01-07T14:51:13.634701294Z caller=streamed_block_writer.go:219 msg="finalized downsampled block" mint=1576108800000 maxt=1577318400000 ulid=01DXZZDE3MW3QC78VJYSFMYQ38 resolution=300000
level=warn ts=2020-01-07T14:51:35.919396231Z caller=prober.go:117 msg="changing probe status" status=not-ready reason="error executing compaction: first pass of downsampling failed: downsampling to 5 min: downsample block 01DX0WGA3DBRS0E964JGYHYG63 to window 300000: get chunk 2594428682246, series 447305847: invalid encoding \"none\""
level=info ts=2020-01-07T14:51:35.920068131Z caller=http.go:78 service=http/server component=compact msg="internal server shutdown" err="error executing compaction: first pass of downsampling failed: downsampling to 5 min: downsample block 01DX0WGA3DBRS0E964JGYHYG63 to window 300000: get chunk 2594428682246, series 447305847: invalid encoding \"none\""
level=info ts=2020-01-07T14:51:35.920141985Z caller=prober.go:137 msg="changing probe status" status=not-healthy reason="error executing compaction: first pass of downsampling failed: downsampling to 5 min: downsample block 01DX0WGA3DBRS0E964JGYHYG63 to window 300000: get chunk 2594428682246, series 447305847: invalid encoding \"none\""
level=error ts=2020-01-07T14:51:35.920212385Z caller=main.go:194 msg="running command failed" err="error executing compaction: first pass of downsampling failed: downsampling to 5 min: downsample block 01DX0WGA3DBRS0E964JGYHYG63 to window 300000: get chunk 2594428682246, series 447305847: invalid encoding \"none\""

What "surprises" me is the caller=streamed_block_writer.go:219 msg="finalized downsampled block" log line before the invalid encoding "none" error.

Even if the block is "corrupted", the finalized downsample log would be called?

Is there any script I could run to validate the block is corrupted and that there is not so many things to do anymore...? This is quite a big range of time ^^'

More info: Thanos 0.9.0, Objectstore: S3 Scality on-premise
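
Not an official tool, but a rough answer to the "is there any script" question above: with a local copy of the block you can walk every series in its index and try to open each of its chunks via the Prometheus TSDB Go packages; a chunk whose encoding byte is broken should fail here with a similar invalid encoding error. A minimal sketch, written against the TSDB API of roughly that era; import paths and signatures have shifted between Prometheus versions, so treat it as a starting point rather than a drop-in tool:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/prometheus/prometheus/pkg/labels"
	"github.com/prometheus/prometheus/tsdb"
	"github.com/prometheus/prometheus/tsdb/chunkenc"
	"github.com/prometheus/prometheus/tsdb/chunks"
	"github.com/prometheus/prometheus/tsdb/index"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <block-dir>", os.Args[0])
	}
	dir := os.Args[1] // local path to the block, e.g. ./01DX0WGA3DBRS0E964JGYHYG63

	b, err := tsdb.OpenBlock(nil, dir, chunkenc.NewPool())
	if err != nil {
		log.Fatal(err)
	}
	defer b.Close()

	ir, err := b.Index()
	if err != nil {
		log.Fatal(err)
	}
	defer ir.Close()

	cr, err := b.Chunks()
	if err != nil {
		log.Fatal(err)
	}
	defer cr.Close()

	// Walk every series in the index and try to open each of its chunks.
	name, value := index.AllPostingsKey()
	p, err := ir.Postings(name, value)
	if err != nil {
		log.Fatal(err)
	}

	var (
		lset labels.Labels
		chks []chunks.Meta
		bad  int
	)
	for p.Next() {
		if err := ir.Series(p.At(), &lset, &chks); err != nil {
			log.Fatal(err)
		}
		for _, m := range chks {
			if _, err := cr.Chunk(m.Ref); err != nil {
				bad++
				fmt.Printf("series %s, chunk ref %d: %v\n", lset, m.Ref, err)
			}
		}
	}
	if err := p.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("checked block %s: %d unreadable chunks\n", dir, bad)
}
```

If it reports unreadable chunks, the block itself is damaged, and re-uploading the original from Prometheus (as done earlier in this thread) or deleting it are the realistic options.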

@stale bot commented Feb 6, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

stale bot added the stale label Feb 6, 2020
stale bot closed this as completed Feb 13, 2020