Thanos receiver issue when 1 receiver is down #6368

Open
thibautmery opened this issue May 17, 2023 · 2 comments

@thibautmery

Thanos, Prometheus and Golang version used:

thanos v0.31.0
prometheus v2.41.0

What happened:

I have one hard tenancy configured across 3 Thanos receiver hosts. Prometheus remote-writes to one of the three Thanos receivers.

The replication factor is 1.

If I shut down one Thanos receiver, Prometheus switches to another one (based on a Consul query). But I then see errors about "forwarding request" to the node that was shut down.

Even though the node is shut down, the remaining Thanos receivers continue to load balance requests to it.
The endpoints are hardcoded in the hashring configuration:
endpoints: ["endpoint1", "endpoint2", "endpoint3"]

What you expected to happen:

Thanos receivers stop forwarding requests to a dead receiver endpoint.

How to reproduce it (as minimally and precisely as possible):

Run 3 Thanos receivers with 1 hard tenancy whose hashring lists all 3 endpoints.
Shut down 1 node: remote writes that need to be forwarded to that node fail with HTTP 503.

Full logs to relevant components:

ts=2023-05-17T08:08:32.111Z caller=dedupe.go:112 component=remote level=warn remote_name=225c8a url=http://thanos.query.consul:XXXXX/api/v1/receive msg="Failed to send batch, retrying" err="server returned HTTP status 503 Service Unavailable: backing off forward request for endpoint X.X.X.X:XXXXX: target not available"
ts=2023-05-17T08:08:40.305Z caller=dedupe.go:112 component=remote level=warn remote_name=225c8a url=http://thanos.query.consul:XXXXX/api/v1/receive msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1684310488 minSendTimestamp=1684310910
ts=2023-05-17T08:08:42.123Z caller=dedupe.go:112 component=remote level=warn remote_name=225c8a url=http://thanos.query.consul:18908/api/v1/receive msg="Failed to send batch, retrying" err="server returned HTTP status 503 Service Unavailable: forwarding request to endpoint X.X.X.X:XXXXX: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp X.X.X.X:XXXXX: connect: connection refused\""
@GiedriusS
Member

Thanks for filing the issue. Have you seen #5809? I think it might be related.

@thibautmery
Author

Well, kind of. In fact, I think there are two things to do:

  • add a replay for failed remote writes
  • add a health check that removes dead gRPC receiver endpoints
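
To illustrate the health check idea, here is a minimal sketch in Go. It is not how Thanos Receive is implemented: `probeFunc`, `tcpProbe`, and `filterHealthy` are hypothetical names, the endpoint addresses are placeholders, and a plain TCP dial stands in for a real gRPC health check. The idea is only that the list of endpoints is filtered through a probe before requests are forwarded.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// probeFunc reports whether an endpoint is currently reachable.
// Hypothetical type for this sketch, not part of Thanos.
type probeFunc func(endpoint string) bool

// tcpProbe returns a probe that only checks whether a TCP connection
// can be opened within the timeout. A real check would more likely use
// the gRPC health checking protocol; this is a simplification.
func tcpProbe(timeout time.Duration) probeFunc {
	return func(endpoint string) bool {
		conn, err := net.DialTimeout("tcp", endpoint, timeout)
		if err != nil {
			return false
		}
		conn.Close()
		return true
	}
}

// filterHealthy returns the endpoints for which probe succeeds,
// preserving their original order.
func filterHealthy(endpoints []string, probe probeFunc) []string {
	healthy := make([]string, 0, len(endpoints))
	for _, ep := range endpoints {
		if probe(ep) {
			healthy = append(healthy, ep)
		}
	}
	return healthy
}

func main() {
	// Placeholder endpoints mirroring the hashring in this issue.
	endpoints := []string{"endpoint1", "endpoint2", "endpoint3"}

	// Fake probe that pretends endpoint2 is down, so the example runs
	// without any network access. Swap in tcpProbe(time.Second) with
	// real host:port addresses to probe actual receivers.
	fake := func(ep string) bool { return ep != "endpoint2" }

	fmt.Println(filterHealthy(endpoints, fake)) // prints [endpoint1 endpoint3]
}
```

Note that dropping an endpoint from a hashring changes which node each series maps to, so with a replication factor of 1 the series owned by the dead node would land on a different receiver while it is down.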
