-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix not retryable error codes from: Unavailable -> FailedPrecondition #12985
Fix not retryable error codes from: Unavailable -> FailedPrecondition #12985
Conversation
Seems its not as obvious as it seems:
|
LGTM. Even for the learner case, this is not a transient error from the server point of view. So the client should not rely on We need to either fix the test or implement the learner switch in a more reliable way :). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thanks for the fix
888c4b2
to
e591871
Compare
PTAL. Decided to add dedicated case to retry interceptor: |
api/v3rpc/rpctypes/error.go
Outdated
ErrGRPCStopped = status.New(codes.Unavailable, "etcdserver: server stopped").Err() | ||
ErrGRPCTimeout = status.New(codes.Unavailable, "etcdserver: request timed out").Err() | ||
ErrGRPCTimeoutDueToLeaderFail = status.New(codes.Unavailable, "etcdserver: request timed out, possibly due to previous leader failure").Err() | ||
ErrGRPCTimeoutDueToConnectionLost = status.New(codes.Unavailable, "etcdserver: request timed out, possibly due to connection lost").Err() | ||
ErrGRPCUnhealthy = status.New(codes.Unavailable, "etcdserver: unhealthy cluster").Err() | ||
ErrGRPCCorrupt = status.New(codes.DataLoss, "etcdserver: corrupt cluster").Err() | ||
ErrGPRCNotSupportedForLearner = status.New(codes.Unavailable, "etcdserver: rpc not supported for learner").Err() | ||
ErrGPRCNotSupportedForLearner = status.New(codes.InvalidArgument, "etcdserver: rpc not supported for learner").Err() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems Unimplemented
or Not Found
status is better fit here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about FailedPrecondition ?
- not found - sounds like a problem with 'resource' not 'semantic'.
- not implemented - sounds like a way to fix the problem is to upgrade the server.
FailedPrecondition is the most neutral in my opinion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not implemented
can mean two things:
- The operation is not implemented
- The operation is not supported/enabled
This case can fall into the second one.
ref: https://grpc.github.io/grpc/core/md_doc_statuscodes.html.
The precondition failure indicates that the service is in a state that cannot execute the operation. Usually, it is an external triggered state and can be fixed externally (for example to remove an unempty director, client needs to remove all files under it). learner
is an internal state of the service and cannot be changed by users, but it also makes sense in a border sense.
I have no strong opinion.
client/v3/retry_interceptor.go
Outdated
if isContextError(err) { | ||
return false | ||
} | ||
|
||
// Situation when learner refuses RPC its supposed to not serve is from server |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea. We should have a more reliable way for client to retry on the learner case. The client should be able to tell if the server instance is a leaner or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added TODO
e591871
to
13b6791
Compare
- ErrGRPCNotCapable("etcdserver: not capable") -> codes.FailedPrecondition (it will not autofix, it requires new version of server) - ErrGPRCNotSupportedForLearner("etcdserver: rpc not supported for learner") -> codes.FailedPrecondition (as long as its learner, the call will not work) - ErrGRPCClusterVersionUnavailable("etcdserver: cluster version not found during downgrade") -> codes.FailedPrecondition (backend does not contain the version (old etcd?) so retry will not help) https://github.com/etcd-io/etcd/runs/2599598633?check_suite_focus=true ``` {"level":"warn","ts":"2021-05-17T09:55:30.246Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000539880/#initially=[unix://localhost:m30]","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: rpc not supported for learner"} {"level":"warn","ts":"2021-05-17T09:55:30.270Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000539880/#initially=[unix://localhost:m30]","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: rpc not supported for learner"}` ```
13b6791
to
16d51d8
Compare
Codecov Report
@@ Coverage Diff @@
## main #12985 +/- ##
==========================================
- Coverage 47.24% 42.47% -4.77%
==========================================
Files 438 393 -45
Lines 34110 31699 -2411
==========================================
- Hits 16116 13465 -2651
- Misses 16088 16543 +455
+ Partials 1906 1691 -215
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
LGTM |
@ptabor This is a bug fix. Can we backport to 3.5? |
https://github.com/etcd-io/etcd/runs/2599598633?check_suite_focus=true
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.