Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix regression that caused client errors to be tracked in cortex_ruler_write_requests_failed_total #7472

Merged
merged 2 commits into from
Feb 27, 2024

Conversation

pracucci
Copy link
Collaborator

@pracucci pracucci commented Feb 26, 2024

What this PR does

Today I've been paged by MimirRulerTooManyFailedPushes. The Mimir cluster health was good and there was no critical error on the write path. However, some recording rules were failing to append samples to ingesters because of:

the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp)

In Mimir we consider such error a client error, because we have no control over it (e.g. recording rules generating clashing series). For this reason, in the ruler we have a logic to not count client errors in cortex_ruler_write_requests_failed_total. However, the recent work on gRPC errors introduced a regression, because the ruler code looks for a HTTPgRPC 4xx status code, but now the Distributor.Push() doesn't return a HTTPgRPC anymore.

In this PR I've fixed it introducing mimirpb.IsClientError() utility. The reason why I've placed it in mimirpb is because the error codes are defined there and I needed to use the same utility both in pkg/storage/ingest (used by ingest storage) and pkg/ruler.

I've also keep backward compatibility because it's something we also have in other places where we handle such errors, but I'm open to feedback.

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

…r_write_requests_failed_total

Signed-off-by: Marco Pracucci <[email protected]>
@pracucci pracucci marked this pull request as ready for review February 26, 2024 16:57
@pracucci pracucci requested review from a team as code owners February 26, 2024 16:57
Copy link
Contributor

@duricanikolic duricanikolic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks good to me, I am just concerned about INGESTION_RATE_LIMITED and REQUEST_RATE_LIMITED should be considered client errors when distributor.service-overload-status-code-on-rate-limit-enabled is true.

Copy link
Contributor

@duricanikolic duricanikolic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Thanks for this fix.

@@ -293,7 +293,7 @@ func wrapPartitionPushError(err error, partitionID int32) error {
return errors.Wrap(err, fmt.Sprintf("%s %d", failedPushingToPartitionMessage, partitionID))
}

func isClientError(err error) bool {
func isIngesterClientError(err error) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"client" here refers to client pushing data to distributor, right? I don't see how this rename makes sense in that context.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my understanding, not in this context. This function only checks ingester errors and not distributor errors, that's why I named it specifically isIngesterClientError().

@duricanikolic Is my understanding correct or not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Among other arguments of dskit's ring.DoBatchWithOptions() there are:

  • a callback, which is the actual function we want to invoke on a set of instances, and that might return an error
  • an optional ring.DoBatchOptions containing a IsClientError function, used to distinguish between client and server errors returned by the callback function.

In this particular context, distributor.push() calls ring.DoBatchWithOptions() and passes ingester.Push() as a callback and isIngesterClientError (ex isClientError) as ring.DoBatchOptions.IsClientError function. This function should, therefore, distinguish between ingester's client and server errors.

So, @pracucci has a correct understanding.

Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me.

@pracucci pracucci merged commit ab2e749 into main Feb 27, 2024
28 checks passed
@pracucci pracucci deleted the fix-client-error-detection-in-ruler branch February 27, 2024 10:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants