-
Notifications
You must be signed in to change notification settings - Fork 569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix regression that caused client errors to be tracked in cortex_ruler_write_requests_failed_total #7472
Conversation
…r_write_requests_failed_total Signed-off-by: Marco Pracucci <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix looks good to me, I am just concerned about INGESTION_RATE_LIMITED
and REQUEST_RATE_LIMITED
should be considered client errors when distributor.service-overload-status-code-on-rate-limit-enabled
is true.
Signed-off-by: Marco Pracucci <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for this fix.
@@ -293,7 +293,7 @@ func wrapPartitionPushError(err error, partitionID int32) error { | |||
return errors.Wrap(err, fmt.Sprintf("%s %d", failedPushingToPartitionMessage, partitionID)) | |||
} | |||
|
|||
func isClientError(err error) bool { | |||
func isIngesterClientError(err error) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"client" here refers to client pushing data to distributor, right? I don't see how this rename makes sense in that context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my understanding, not in this context. This function only checks ingester errors and not distributor errors, that's why I named it specifically isIngesterClientError()
.
@duricanikolic Is my understanding correct or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Among other arguments of dskit's ring.DoBatchWithOptions()
there are:
- a callback, which is the actual function we want to invoke on a set of instances, and that might return an error
- an optional
ring.DoBatchOptions
containing aIsClientError
function, used to distinguish between client and server errors returned by the callback function.
In this particular context, distributor.push()
calls ring.DoBatchWithOptions()
and passes ingester.Push()
as a callback and isIngesterClientError
(ex isClientError
) as ring.DoBatchOptions.IsClientError
function. This function should, therefore, distinguish between ingester's client and server errors.
So, @pracucci has a correct understanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me.
What this PR does
Today I've been paged by
MimirRulerTooManyFailedPushes
. The Mimir cluster health was good and there was no critical error on the write path. However, some recording rules were failing to append samples to ingesters because of:In Mimir we consider such error a client error, because we have no control over it (e.g. recording rules generating clashing series). For this reason, in the ruler we have a logic to not count client errors in
cortex_ruler_write_requests_failed_total
. However, the recent work on gRPC errors introduced a regression, because the ruler code looks for a HTTPgRPC 4xx status code, but now theDistributor.Push()
doesn't return a HTTPgRPC anymore.In this PR I've fixed it introducing
mimirpb.IsClientError()
utility. The reason why I've placed it inmimirpb
is because the error codes are defined there and I needed to use the same utility both inpkg/storage/ingest
(used by ingest storage) andpkg/ruler
.I've also keep backward compatibility because it's something we also have in other places where we handle such errors, but I'm open to feedback.
Which issue(s) this PR fixes or relates to
N/A
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]
.about-versioning.md
updated with experimental features.