
distributor, ruler: Expect METHOD_NOT_ALLOWED from pusher #7618

Closed
wants to merge 1 commit

Conversation

@narqo (Contributor) commented Mar 13, 2024

What this PR does

This is a follow-up to #7503.

I realized that the distributor (and the ruler) now need to expect METHOD_NOT_ALLOWED (gRPC Unimplemented) from the ingester if the latter has its Push gRPC method disabled. This PR addresses that.

Note that the changes here treat such a case as a client-side error, even though it is a misconfiguration (that is, I believe it is not expected to happen in practice). What do you think, @pstibrany?
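
For illustration, here's a minimal sketch of the kind of mapping this PR proposes; `mapIngesterPushError` and the package layout are hypothetical, not the actual diff, and only the gRPC status handling mirrors the idea:

```go
package distributor

import (
	"net/http"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// mapIngesterPushError translates the error returned by the ingester's Push
// into the HTTP status code reported back to the write client.
func mapIngesterPushError(err error) int {
	if err == nil {
		return http.StatusOK
	}
	if st, ok := status.FromError(err); ok && st.Code() == codes.Unimplemented {
		// The ingester has its Push gRPC method disabled: report
		// METHOD_NOT_ALLOWED so the request is treated as a client-side,
		// non-retryable error.
		return http.StatusMethodNotAllowed
	}
	// Everything else stays a server-side (retryable) error.
	return http.StatusInternalServerError
}
```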

Which issue(s) this PR fixes or relates to

n/a

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@narqo requested review from a team as code owners March 13, 2024 11:17
@pstibrany (Member)

I think we should not make this change and should keep treating it as an internal error instead. User/client errors don't trigger our alerts, but internal errors do -- and this is clearly a misconfiguration error. I think the current state is fine as-is. WDYT?

@narqo (Contributor, Author) commented Mar 13, 2024

I was thinking about this comment from Marco on my original changes:

I'm wondering if we should add a NOT_ALLOWED error, and then we map it as 405 status code and treat it as non retryable.

The concern is that, in case of the misconfiguration, the clients will keep retrying their Push requests (will they?), which may not be ideal. But I also agree with the point about alerting. I'm OK with closing this one.

@dimitarvdimitrov (Contributor)

I believe neither error type will be retried, but I may be missing something:

_, err := a.pusher.Push(user.InjectOrgID(a.ctx, a.userID), req)

I agree that it's better if the ruler fails more openly and starts triggering alerts for failed writes (MimirRulerTooManyFailedPushes) instead of failing silently.

@narqo (Contributor, Author) commented Mar 14, 2024

I believe neither error type will be retried [in the ruler], but I may be missing something

I should have noted that by "clients" I meant grafana-agent or prometheus. I think (although I haven't double-checked) they retry on 5xx.

@pstibrany (Member)

I should have noted that by "clients" I meant grafana-agent or prometheus. I think (although I haven't double-checked) they retry on 5xx.

Yes, they do. This is also part of the remote-write protocol: https://prometheus.io/docs/concepts/remote_write_spec/#retries-backoff
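
As a rough sketch of the retry rule from that spec (illustrative only, not Prometheus' or grafana-agent's actual code; the 429 behaviour depends on sender configuration):

```go
package main

import "net/http"

// shouldRetry decides whether a remote-write sender retries a failed push,
// following the spirit of the spec's "Retries & Backoff" section.
func shouldRetry(statusCode int) bool {
	switch {
	case statusCode >= 500:
		return true // 5xx responses are retried with backoff
	case statusCode == http.StatusTooManyRequests:
		return true // some senders optionally retry on 429
	default:
		return false // other 4xx responses are dropped, not retried
	}
}
```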

@dimitarvdimitrov (Contributor)

So my understanding is that the case we're solving for is when a misconfigured or old distributor replica is still using gRPC to talk to the ingesters. In this case we can either choose to discard the data that the client sent to that distributor (mapping to HTTP 400) or continue retrying (mapping to HTTP 500).

IMO it's safer and more responsible to keep retrying the data instead of throwing it on the floor. The clients can keep retrying until the misconfiguration is fixed, and they usually have some decent buffering capacity (tens of minutes, even hours). It's also possible that the Mimir cluster is in the middle of a rollout and the "unimplemented" errors are only temporary, fixing themselves once the rollout is complete.
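
To make the two options concrete, a small hypothetical sketch of the choice being weighed (the function and flag are illustrative, not Mimir code):

```go
// statusForUnimplemented shows the two mappings discussed in this thread for
// a gRPC Unimplemented error coming back from the ingester.
func statusForUnimplemented(treatAsClientError bool) int {
	if treatAsClientError {
		// This PR's proposal: 405 Method Not Allowed, the sender drops the data.
		return 405
	}
	// Status quo: 500 Internal Server Error, the sender keeps retrying with
	// backoff until the misconfiguration (or rollout) is resolved.
	return 500
}
```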

@narqo (Contributor, Author) commented Mar 15, 2024

The arguments above make very good sense; let's close this one.
