write compaction: chunk 63 not found: invalid encoding "none" #1345

Closed
itcloudy opened this issue Jul 22, 2019 · 11 comments

Comments

@itcloudy commented Jul 22, 2019

Thanos, Prometheus and Golang version used

Thanos Version: 0.5.0
Prometheus Version: 2.7.2

What happened

What you expected to happen

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components

Logs

level=debug ts=2019-07-22T11:29:07.217897384Z caller=compact.go:824 compactionGroup="0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}" msg="downloaded and verified blocks" blocks="[/var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG5S1JWSR4HMJWDDVR9ZV7AT /var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG6Q0F4K5BVCR334NJB92CMX /var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG90J9GADDXQVMNJ9EX2VHJA /var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG9E988WREQM2XKTK3SVGYFQ]" duration=1m1.245448142s
level=error ts=2019-07-22T11:29:08.386446072Z caller=main.go:182 msg="running command failed" err="error executing compaction: compaction failed: compaction failed for group 0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}: compact blocks [/var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG5S1JWSR4HMJWDDVR9ZV7AT /var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG6Q0F4K5BVCR334NJB92CMX /var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG90J9GADDXQVMNJ9EX2VHJA /var/thanos/store/monitoring-thanos/compact/0@{cluster_replica=\"shenzhen_test\",dc_replica=\"lab\",project_replica=\"test\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-1\"}/01DG9E988WREQM2XKTK3SVGYFQ]: write compaction: chunk 63 not found: invalid encoding \"none\""

Anything else we need to know

Environment:
Openshift Version: v3.11.43
Kubernetes Version: v1.11.0+d4cacc0
Object Storage: S3

@GiedriusS (Member)

Seems like a case of #1335. Were there any partial uploads? Could you elaborate more on how to reproduce this or what happened?

@rsommer commented Oct 2, 2019

We encounter the same problem as described above.

Oct 02 08:26:03 thanoscompact01 thanos[105774]: level=warn ts=2019-10-02T08:26:03.028686797Z caller=prober.go:154 msg="changing probe status" status=unhealthy reason="error executing compaction: compaction failed: compaction failed for group 0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}: compact blocks [/var/cache/thanos/compact/0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}/01DNHEV4WJ042J94AS3HJHGJXH /var/cache/thanos/compact/0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}/01DNHNPW4JPNWSP76WH18G6DMR /var/cache/thanos/compact/0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}/01DNHWJKCK1Y0DG1N3MGBD5W4R]: write compaction: chunk 266 not found: invalid encoding \"none\""
Oct 02 08:26:03 thanoscompact01 thanos[105774]: level=error ts=2019-10-02T08:26:03.028868443Z caller=main.go:213 msg="running command failed" err="error executing compaction: compaction failed: compaction failed for group 0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}: compact blocks [/var/cache/thanos/compact/0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}/01DNHEV4WJ042J94AS3HJHGJXH /var/cache/thanos/compact/0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}/01DNHNPW4JPNWSP76WH18G6DMR /var/cache/thanos/compact/0@{environment=\"production\",monitor=\"infrastructure\",replica=\"prom01\"}/01DNHWJKCK1Y0DG1N3MGBD5W4R]: write compaction: chunk 266 not found: invalid encoding \"none\""

However, if we run bucket verify, no errors are found. The local cache has been deleted.
Thanos is version 0.7.0.

@krasi-georgiev (Contributor)

Looks like some block corruption (unclean shutdown, crash, using remote storage, etc.). Could you send me that block privately (thanos-dev Slack or prometheus-dev IRC) so I can confirm?
Otherwise, to unblock, you can just delete this block.
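
For reference, "deleting the block" means removing every object under the block's ULID prefix in the bucket (<ULID>/meta.json, <ULID>/index, <ULID>/chunks/...). Below is a minimal sketch of doing that against an S3-compatible store with the minio-go client; the endpoint, credentials, bucket name and ULID are placeholders, not values from this issue:

```go
package main

import (
	"context"
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Placeholder endpoint, credentials, bucket and block ULID.
	client, err := minio.New("s3.example.internal", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	bucket := "thanos-blocks"
	blockULID := "REPLACE_WITH_BLOCK_ULID"

	// A block lives entirely under its ULID prefix, so remove every object below it.
	for obj := range client.ListObjects(ctx, bucket, minio.ListObjectsOptions{
		Prefix:    blockULID + "/",
		Recursive: true,
	}) {
		if obj.Err != nil {
			log.Fatal(obj.Err)
		}
		if err := client.RemoveObject(ctx, bucket, obj.Key, minio.RemoveObjectOptions{}); err != nil {
			log.Fatal(err)
		}
		log.Printf("deleted %s", obj.Key)
	}
}
```

Later Thanos releases grew a deletion-mark mechanism for this; on the versions discussed here, removing the objects directly (and clearing the compactor's local cache directory) is what "delete this block" amounts to.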

@rsommer commented Oct 3, 2019

I was able to replace the broken block because we keep the original blocks on Prometheus itself for quite a while. But Thanos could handle this gracefully instead of just stopping. Sorry, I deleted the corrupt block - it would have been interesting to see why verify did not find any problems ...

@bwplotka (Member) commented Oct 3, 2019

Verify does not check whether any of the chunks are malformed, especially in such a way. I can't think of any operation that would cause this - even a partial upload... It looks really odd.

I am really interested in the root cause of this. I guess you deleted it from object storage? What object storage do you use?

I wonder whether the replaced block will cause the same problem, because it looks like the original block was wrongly produced by Prometheus.

@krasi-georgiev (Contributor) commented Oct 3, 2019

> But thanos could handle this gracefully instead of just stopping.

Some colleagues who use Prometheus at large scale mention that with data corruption a hard fail is better than just logging an error and continuing. I think the main reason is that continuing would lead to false alerts and misleading data when querying.

@rsommer commented Oct 4, 2019

@bwplotka we're using Ceph with radosgw as object storage, and the blocks have been processed now. I think this could be related to some problems in our storage infrastructure around that time, but I did not find any errors in the sidecar logs regarding failed uploads ...

@bwplotka (Member) commented Oct 4, 2019

Ah, that makes sense.

Anyway, what would be the action items for Thanos in this case? Maybe:

  • Increment a metric for malformed blocks? (a rough sketch follows after this list)
  • Continue compacting whatever we can, avoiding this "group"
  • Add a check against malformed blocks in verify? This might be quite difficult.
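
A hedged sketch of what the first item could look like, using the standard client_golang counter API. The metric name, the "group" label, and the compactGroup wrapper are illustrative assumptions, not existing Thanos code:

```go
package main

import (
	"errors"
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative metric; the name and "group" label are assumptions, not an existing Thanos metric.
var malformedBlocks = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "thanos_compact_group_malformed_blocks_total",
		Help: "Compactions skipped because a block's chunks could not be decoded.",
	},
	[]string{"group"},
)

func init() {
	prometheus.MustRegister(malformedBlocks)
}

// compactGroup stands in for the per-group compaction loop: instead of halting
// the whole compactor, it records the failure and lets other groups proceed.
func compactGroup(group string, compact func() error) error {
	if err := compact(); err != nil {
		malformedBlocks.WithLabelValues(group).Inc()
		return fmt.Errorf("skipping group %s: %w", group, err)
	}
	return nil
}

func main() {
	// Simulate the failure seen in this issue for one group.
	err := compactGroup(`0@{replica="prom01"}`, func() error {
		return errors.New(`write compaction: chunk 266 not found: invalid encoding "none"`)
	})
	fmt.Println(err)
}
```

As noted in the next comment, thanos_compact_group_compactions_failures_total only moves if the process survives the failure, which is exactly what skipping the group instead of exiting would allow.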

@rsommer commented Oct 4, 2019

Incrementing a metric for malformed blocks and continuing to compact healthy groups seems like a good idea - shouldn't thanos_compact_group_compactions_failures_total already be that metric? But it was always at 0, because the service just died and restarted.

@ahurtaud (Contributor) commented Jan 7, 2020

Hello, we are facing the same issue.

We have this block:

| 01DX0WGA3DBRS0E964JGYHYG63 | 12-12-2019 01:00:00 | 26-12-2019 01:00:00 | 336h0m0s       | -296h0m0s     | 7,389,671  | 268,583,707,631 | 2,237,389,868 | 4          | false       | <MY_LABELS_KEY_VALUE> | 0s         | compactor |

And the compactor is in a crash loop with:

level=info ts=2020-01-07T11:17:09.156025559Z caller=compact.go:290 msg="start first pass of downsampling"
level=info ts=2020-01-07T12:22:00.795975326Z caller=downsample.go:269 msg="downloaded block" id=01DX0WGA3DBRS0E964JGYHYG63 duration=1h4m32.513475834s
level=info ts=2020-01-07T14:51:13.634701294Z caller=streamed_block_writer.go:219 msg="finalized downsampled block" mint=1576108800000 maxt=1577318400000 ulid=01DXZZDE3MW3QC78VJYSFMYQ38 resolution=300000
level=warn ts=2020-01-07T14:51:35.919396231Z caller=prober.go:117 msg="changing probe status" status=not-ready reason="error executing compaction: first pass of downsampling failed: downsampling to 5 min: downsample block 01DX0WGA3DBRS0E964JGYHYG63 to window 300000: get chunk 2594428682246, series 447305847: invalid encoding \"none\""
level=info ts=2020-01-07T14:51:35.920068131Z caller=http.go:78 service=http/server component=compact msg="internal server shutdown" err="error executing compaction: first pass of downsampling failed: downsampling to 5 min: downsample block 01DX0WGA3DBRS0E964JGYHYG63 to window 300000: get chunk 2594428682246, series 447305847: invalid encoding \"none\""
level=info ts=2020-01-07T14:51:35.920141985Z caller=prober.go:137 msg="changing probe status" status=not-healthy reason="error executing compaction: first pass of downsampling failed: downsampling to 5 min: downsample block 01DX0WGA3DBRS0E964JGYHYG63 to window 300000: get chunk 2594428682246, series 447305847: invalid encoding \"none\""
level=error ts=2020-01-07T14:51:35.920212385Z caller=main.go:194 msg="running command failed" err="error executing compaction: first pass of downsampling failed: downsampling to 5 min: downsample block 01DX0WGA3DBRS0E964JGYHYG63 to window 300000: get chunk 2594428682246, series 447305847: invalid encoding \"none\""

What "surprises" me is the caller=streamed_block_writer.go:219 msg="finalized downsampled block" log line before the invalid encoding "none" error.

Even if the block is "corrupted", the finalized downsample log would be called?

Is there any script I could run to validate the block is corrupted and that there is not so many things to do anymore...? This is quite a big range of time ^^'

More info: Thanos 0.9.0, Objectstore: S3 Scality on-premise
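
Not an official tool, but a rough answer to the "is there any script" question above: with a local copy of the block you can walk every series in its index and try to open each of its chunks via the Prometheus TSDB Go packages; a chunk whose encoding byte is broken should fail here with a similar invalid encoding error. A minimal sketch, written against the TSDB API of roughly that era; import paths and signatures have shifted between Prometheus versions, so treat it as a starting point rather than a drop-in tool:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/prometheus/prometheus/pkg/labels"
	"github.com/prometheus/prometheus/tsdb"
	"github.com/prometheus/prometheus/tsdb/chunkenc"
	"github.com/prometheus/prometheus/tsdb/chunks"
	"github.com/prometheus/prometheus/tsdb/index"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <block-dir>", os.Args[0])
	}
	dir := os.Args[1] // local path to the block, e.g. ./01DX0WGA3DBRS0E964JGYHYG63

	b, err := tsdb.OpenBlock(nil, dir, chunkenc.NewPool())
	if err != nil {
		log.Fatal(err)
	}
	defer b.Close()

	ir, err := b.Index()
	if err != nil {
		log.Fatal(err)
	}
	defer ir.Close()

	cr, err := b.Chunks()
	if err != nil {
		log.Fatal(err)
	}
	defer cr.Close()

	// Walk every series in the index and try to open each of its chunks.
	name, value := index.AllPostingsKey()
	p, err := ir.Postings(name, value)
	if err != nil {
		log.Fatal(err)
	}

	var (
		lset labels.Labels
		chks []chunks.Meta
		bad  int
	)
	for p.Next() {
		if err := ir.Series(p.At(), &lset, &chks); err != nil {
			log.Fatal(err)
		}
		for _, m := range chks {
			if _, err := cr.Chunk(m.Ref); err != nil {
				bad++
				fmt.Printf("series %s, chunk ref %d: %v\n", lset, m.Ref, err)
			}
		}
	}
	if err := p.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("checked block %s: %d unreadable chunks\n", dir, bad)
}
```

If it reports unreadable chunks, the block itself is damaged, and re-uploading the original from Prometheus (as done earlier in this thread) or deleting it are the realistic options.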

@stale bot commented Feb 6, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

stale bot added the stale label Feb 6, 2020
stale bot closed this as completed Feb 13, 2020