Thanos receiver issue when 1 receiver is down #6368

thibautmery · 2023-05-17T08:10:24Z

Thanos, Prometheus and Golang version used:

thanos v0.31.0
prometheus v2.41.0

What happened:

I have one hard tenancy configured across 3 Thanos receiver hosts. Prometheus remote-writes to one of the three Thanos receivers.

The replication factor is 1.

If I shut down one Thanos receiver, Prometheus switches to another one (based on a Consul query). But I see errors about "forwarding request" to the node that was shut down.

Even though the node is shut down, the remaining Thanos receivers continue to load-balance onto it.
The endpoints are hardcoded in the hashring configuration:
endpoints: ["endpoint1", "endpoint2", "endpoint3"]
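
To illustrate why a static hashring with a replication factor of 1 behaves this way, here is a minimal sketch. It is not Thanos code: the `owner` and `route` helpers, the SHA-256 based hash, and the series keys are all assumptions made for illustration. The point is only that each series maps to exactly one endpoint, so writes owned by a dead endpoint have nowhere else to go.

```python
import hashlib

# Placeholder endpoints, mirroring the hashring configuration above.
ENDPOINTS = ["endpoint1", "endpoint2", "endpoint3"]


def owner(series_key, endpoints):
    """Pick exactly one endpoint by hashing the key modulo the ring size.

    Simplified stand-in for a hashring lookup (hypothetical helper).
    """
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    return endpoints[int.from_bytes(digest[:8], "big") % len(endpoints)]


def route(series_keys, endpoints, down):
    """Split series into deliverable ones and ones whose only owner is down."""
    delivered, failed = [], []
    for key in series_keys:
        target = owner(key, endpoints)
        # With replication factor 1 there is no second replica to fall back on.
        (failed if target in down else delivered).append(key)
    return delivered, failed
```

Calling `route(keys, ENDPOINTS, down={"endpoint2"})` returns the series that would still be accepted and those that would fail because the static ring still points at the dead node.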

What you expected to happen:

Thanos receivers should stop sending to a dead receiver endpoint.

How to reproduce it (as minimally and precisely as possible):

3 Thanos receivers with 1 hard tenancy and 3 endpoints.
Shut down 1 node: remote writes forwarded to that node then fail with HTTP 503, as in the logs below.

Full logs to relevant components:

ts=2023-05-17T08:08:32.111Z caller=dedupe.go:112 component=remote level=warn remote_name=225c8a url=http://thanos.query.consul:XXXXX/api/v1/receive msg="Failed to send batch, retrying" err="server returned HTTP status 503 Service Unavailable: backing off forward request for endpoint X.X.X.X:XXXXX: target not available"
ts=2023-05-17T08:08:40.305Z caller=dedupe.go:112 component=remote level=warn remote_name=225c8a url=http://thanos.query.consul:XXXXX/api/v1/receive msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1684310488 minSendTimestamp=1684310910
ts=2023-05-17T08:08:42.123Z caller=dedupe.go:112 component=remote level=warn remote_name=225c8a url=http://thanos.query.consul:18908/api/v1/receive msg="Failed to send batch, retrying" err="server returned HTTP status 503 Service Unavailable: forwarding request to endpoint X.X.X.X:XXXXX: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp X.X.X.X:XXXXX: connect: connection refused\""


GiedriusS · 2023-05-17T14:21:02Z

Thanks for filing the issue. Have you seen #5809? I think it might be related.

thibautmery · 2023-05-17T14:26:13Z

Well, kind of. In fact, I think there are two things to do:

- add a replay for failed remote writes
- add a health check to remove dead receiver gRPC endpoints
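
The health-check idea could look roughly like the sketch below. This is an assumption about one possible approach, not how Thanos implements or plans to implement it: `is_reachable` and `healthy_endpoints` are hypothetical helpers, and a plain TCP connect stands in for a real gRPC health probe.

```python
import socket


def is_reachable(endpoint, timeout=1.0):
    """Return True if a TCP connection to 'host:port' opens within the timeout."""
    host, _, port = endpoint.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # Connection refused, timeout, DNS failure, or a malformed endpoint.
        return False


def healthy_endpoints(endpoints, probe=is_reachable):
    """Keep only the endpoints that pass the probe, preserving their order."""
    return [e for e in endpoints if probe(e)]
```

A receiver could run `healthy_endpoints` periodically and forward only to the returned list. Note that shrinking the list changes which endpoint owns each series under a modulo-style mapping, so this only covers the "stop sending to a dead node" half of the problem. The replay of failed writes would still be needed separately.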

dosubot bot mentioned this issue Oct 19, 2024

Missing metrics when a receiver is shutdown #7845

Open
