-
Notifications
You must be signed in to change notification settings - Fork 569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
distributor, ruler: Expect METHOD_NOT_ALLOWED from pusher #7618
Conversation
Signed-off-by: Vladimir Varankin <[email protected]>
I think we should not do this change, and treat it as internal error instead. User/client errors don't trigger our alerts, but internal errors do -- and this is clearly misconfiguration error. I think current state is fine as-is. WDYT? |
I was thinking about this Marco's comment to my original changes:
The concern is that in case of the misconfiguration, the clients will keep retrying their Push requests (will they?), which may not be ideal. But I also agree on the point about the alerting. I'm ok, with closing this one. |
i believe both error types will not be retried, but I may be missing something Line 93 in bf37da3
I agree that it's better if the ruler fails more openly and start triggering alerts for failed writes ( |
I should have noted that by "clients" I meant grafana-agent or prometheus. I think (although haven't double-checked that), they retry on 5xx. |
Yes, they do. This is also part of remote-write protocol: https://prometheus.io/docs/concepts/remote_write_spec/#retries-backoff |
So my understanding is that the case we're solving for is when a misconfigured or an old distributor replica is still using gRPC to talk to the ingesters. In this case we can either choose to discard the data that the client sent to that distributor (mapping to HTTP 400) or to continue retrying (mapping to HTTP 500). IMO it's safer and more responsible to keep retrying the data instead of throwing it on the floor. The clients can keep retrying until the misconfiguration is fixed. They usually have some decent buffering capacity (tens of minutes, even hours). It's also an option that the Mimir cluster finishes some rollout and the "unimplemented" errors are only temporary and fix themselves after the rollout is complete. |
The arguments above make very good sense; let's close this one. |
What this PR does
This is a follow up to #7503
I realized that
distributor
(andruler
) now need to expect theMETHOD_NOT_ALLOWED
(gRPCUnimplemented
), from ingester, if the latter disables itsPush
gRPC method. This PR addresses that.Note, the changes here chose to assume such a case as a client-side error, even though this is a misconfiguration (that is I believe this is not expected to happen in reality). What do you think, @pstibrany?
Which issue(s) this PR fixes or relates to
n/a
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]
.about-versioning.md
updated with experimental features.