Fix regression that caused client errors to be tracked in cortex_ruler_write_requests_failed_total #7472

pracucci · 2024-02-26T16:50:44Z

What this PR does

Today I've been paged by MimirRulerTooManyFailedPushes. The Mimir cluster health was good and there was no critical error on the write path. However, some recording rules were failing to append samples to ingesters because of:

the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp)

In Mimir we consider such error a client error, because we have no control over it (e.g. recording rules generating clashing series). For this reason, in the ruler we have a logic to not count client errors in cortex_ruler_write_requests_failed_total. However, the recent work on gRPC errors introduced a regression, because the ruler code looks for a HTTPgRPC 4xx status code, but now the Distributor.Push() doesn't return a HTTPgRPC anymore.

In this PR I've fixed it introducing mimirpb.IsClientError() utility. The reason why I've placed it in mimirpb is because the error codes are defined there and I needed to use the same utility both in pkg/storage/ingest (used by ingest storage) and pkg/ruler.

I've also keep backward compatibility because it's something we also have in other places where we handle such errors, but I'm open to feedback.

Which issue(s) this PR fixes or relates to

N/A

Checklist

Tests updated.
Documentation added.
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
about-versioning.md updated with experimental features.

…r_write_requests_failed_total Signed-off-by: Marco Pracucci <[email protected]>

duricanikolic

The fix looks good to me, I am just concerned about INGESTION_RATE_LIMITED and REQUEST_RATE_LIMITED should be considered client errors when distributor.service-overload-status-code-on-rate-limit-enabled is true.

pkg/mimirpb/errors.go

Signed-off-by: Marco Pracucci <[email protected]>

duricanikolic

LGTM
Thanks for this fix.

pstibrany · 2024-02-27T09:33:33Z

pkg/distributor/errors.go

@@ -293,7 +293,7 @@ func wrapPartitionPushError(err error, partitionID int32) error {
 	return errors.Wrap(err, fmt.Sprintf("%s %d", failedPushingToPartitionMessage, partitionID))
 }

-func isClientError(err error) bool {
+func isIngesterClientError(err error) bool {


"client" here refers to client pushing data to distributor, right? I don't see how this rename makes sense in that context.

To my understanding, not in this context. This function only checks ingester errors and not distributor errors, that's why I named it specifically isIngesterClientError().

@duricanikolic Is my understanding correct or not?

Among other arguments of dskit's ring.DoBatchWithOptions() there are:

a callback, which is the actual function we want to invoke on a set of instances, and that might return an error

an optional ring.DoBatchOptions containing a IsClientError function, used to distinguish between client and server errors returned by the callback function.

In this particular context, distributor.push() calls ring.DoBatchWithOptions() and passes ingester.Push() as a callback and isIngesterClientError (ex isClientError) as ring.DoBatchOptions.IsClientError function. This function should, therefore, distinguish between ingester's client and server errors.

So, @pracucci has a correct understanding.

pstibrany

Makes sense to me.

Fix regression that caused client errors to be tracked in cortex_rule…

9d5d281

…r_write_requests_failed_total Signed-off-by: Marco Pracucci <[email protected]>

pracucci marked this pull request as ready for review February 26, 2024 16:57

pracucci requested review from a team as code owners February 26, 2024 16:57

duricanikolic reviewed Feb 26, 2024

View reviewed changes

pkg/mimirpb/errors.go Outdated Show resolved Hide resolved

Consider only BAD_DATA as a client error

56e11ce

Signed-off-by: Marco Pracucci <[email protected]>

duricanikolic approved these changes Feb 27, 2024

View reviewed changes

pstibrany reviewed Feb 27, 2024

View reviewed changes

pstibrany approved these changes Feb 27, 2024

View reviewed changes

pracucci merged commit ab2e749 into main Feb 27, 2024
28 checks passed

pracucci deleted the fix-client-error-detection-in-ruler branch February 27, 2024 10:46

krajorama mentioned this pull request Apr 3, 2024

Ruler: MimirRulerTooManyFailedQueries alert due to user error #7668

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix regression that caused client errors to be tracked in cortex_ruler_write_requests_failed_total #7472

Fix regression that caused client errors to be tracked in cortex_ruler_write_requests_failed_total #7472

pracucci commented Feb 26, 2024 •

edited

Loading

duricanikolic left a comment

duricanikolic left a comment

pstibrany Feb 27, 2024

pracucci Feb 27, 2024

duricanikolic Feb 27, 2024

pstibrany left a comment

Fix regression that caused client errors to be tracked in cortex_ruler_write_requests_failed_total #7472

Fix regression that caused client errors to be tracked in cortex_ruler_write_requests_failed_total #7472

Conversation

pracucci commented Feb 26, 2024 • edited Loading

What this PR does

Which issue(s) this PR fixes or relates to

Checklist

duricanikolic left a comment

Choose a reason for hiding this comment

duricanikolic left a comment

Choose a reason for hiding this comment

pstibrany Feb 27, 2024

Choose a reason for hiding this comment

pracucci Feb 27, 2024

Choose a reason for hiding this comment

duricanikolic Feb 27, 2024

Choose a reason for hiding this comment

pstibrany left a comment

Choose a reason for hiding this comment

pracucci commented Feb 26, 2024 •

edited

Loading