Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix not retryable error codes from: Unavailable -> FailedPrecondition #12985

Merged
merged 1 commit into from
May 19, 2021

Conversation

ptabor
Copy link
Contributor

@ptabor ptabor commented May 17, 2021

  • ErrGRPCNotCapable("etcdserver: not capable") -> codes.FailedPrecondition (it will not autofix, it requires new version of server)
  • ErrGPRCNotSupportedForLearner("etcdserver: rpc not supported for learner") -> codes.FailedPrecondition (as long as its learner, the call will not work)
  • ErrGRPCClusterVersionUnavailable("etcdserver: cluster version not found during downgrade") -> codes.FailedPrecondition (backend does not contain the version (old etcd?) so retry will not help)

https://github.com/etcd-io/etcd/runs/2599598633?check_suite_focus=true

{"level":"warn","ts":"2021-05-17T09:55:30.246Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000539880/#initially=[unix://localhost:m30]","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: rpc not supported for learner"}
{"level":"warn","ts":"2021-05-17T09:55:30.270Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying
of unary invoker
failed","target":"etcd-endpoints://0xc000539880/#initially=[unix://localhost:m30]","attempt":1,"error":"rpc
error: code = Unavailable desc = etcdserver: rpc not supported for
learner"}`

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@ptabor ptabor assigned hexfusion and unassigned hexfusion May 17, 2021
@ptabor
Copy link
Contributor Author

ptabor commented May 17, 2021

Seems its not as obvious as it seems:

TestBalancerSupportLearner tests that expects requests (not supported by learners) to be retried. This makes sense if the retry happens on another instance that is likely not a learner.
But if the learner(s) the only endpoint that client connects to, the retrying does not make sense.

@xiang90
Copy link
Contributor

xiang90 commented May 17, 2021

LGTM.

Even for the learner case, this is not a transient error from the server point of view. So the client should not rely on Unavailable to trigger the rebalance to the actual etcd server.

We need to either fix the test or implement the learner switch in a more reliable way :).

Copy link
Contributor

@gyuho gyuho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks for the fix

@ptabor ptabor force-pushed the 20210517-fix-not-a-learner-retries branch 2 times, most recently from 888c4b2 to e591871 Compare May 18, 2021 17:41
@ptabor
Copy link
Contributor Author

ptabor commented May 18, 2021

@ptabor ptabor requested a review from xiang90 May 18, 2021 17:57
ErrGRPCStopped = status.New(codes.Unavailable, "etcdserver: server stopped").Err()
ErrGRPCTimeout = status.New(codes.Unavailable, "etcdserver: request timed out").Err()
ErrGRPCTimeoutDueToLeaderFail = status.New(codes.Unavailable, "etcdserver: request timed out, possibly due to previous leader failure").Err()
ErrGRPCTimeoutDueToConnectionLost = status.New(codes.Unavailable, "etcdserver: request timed out, possibly due to connection lost").Err()
ErrGRPCUnhealthy = status.New(codes.Unavailable, "etcdserver: unhealthy cluster").Err()
ErrGRPCCorrupt = status.New(codes.DataLoss, "etcdserver: corrupt cluster").Err()
ErrGPRCNotSupportedForLearner = status.New(codes.Unavailable, "etcdserver: rpc not supported for learner").Err()
ErrGPRCNotSupportedForLearner = status.New(codes.InvalidArgument, "etcdserver: rpc not supported for learner").Err()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems Unimplemented or Not Found status is better fit here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about FailedPrecondition ?

  • not found - sounds like a problem with 'resource' not 'semantic'.
  • not implemented - sounds like a way to fix the problem is to upgrade the server.

FailedPrecondition is the most neutral in my opinion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not implemented can mean two things:

  1. The operation is not implemented
  2. The operation is not supported/enabled

This case can fall into the second one.

ref: https://grpc.github.io/grpc/core/md_doc_statuscodes.html.

The precondition failure indicates that the service is in a state that cannot execute the operation. Usually, it is an external triggered state and can be fixed externally (for example to remove an unempty director, client needs to remove all files under it). learner is an internal state of the service and cannot be changed by users, but it also makes sense in a border sense.

I have no strong opinion.

if isContextError(err) {
return false
}

// Situation when learner refuses RPC its supposed to not serve is from server
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea. We should have a more reliable way for client to retry on the learner case. The client should be able to tell if the server instance is a leaner or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added TODO

@ptabor ptabor force-pushed the 20210517-fix-not-a-learner-retries branch from e591871 to 13b6791 Compare May 18, 2021 23:39
@ptabor ptabor requested a review from xiang90 May 18, 2021 23:40
 - ErrGRPCNotCapable("etcdserver: not capable") -> codes.FailedPrecondition  (it will not autofix, it requires new version of server)
 - ErrGPRCNotSupportedForLearner("etcdserver: rpc not supported for learner") -> codes.FailedPrecondition (as long as its learner, the call will not work)
 - ErrGRPCClusterVersionUnavailable("etcdserver: cluster version not found during downgrade") -> codes.FailedPrecondition (backend does not contain the version (old etcd?) so retry will not help)

https://github.com/etcd-io/etcd/runs/2599598633?check_suite_focus=true

```
{"level":"warn","ts":"2021-05-17T09:55:30.246Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000539880/#initially=[unix://localhost:m30]","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: rpc not supported for learner"}
{"level":"warn","ts":"2021-05-17T09:55:30.270Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying
of unary invoker
failed","target":"etcd-endpoints://0xc000539880/#initially=[unix://localhost:m30]","attempt":1,"error":"rpc
error: code = Unavailable desc = etcdserver: rpc not supported for
learner"}`
```
@ptabor ptabor force-pushed the 20210517-fix-not-a-learner-retries branch from 13b6791 to 16d51d8 Compare May 19, 2021 00:09
@codecov-commenter
Copy link

Codecov Report

Merging #12985 (16d51d8) into main (80ccb27) will decrease coverage by 4.76%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #12985      +/-   ##
==========================================
- Coverage   47.24%   42.47%   -4.77%     
==========================================
  Files         438      393      -45     
  Lines       34110    31699    -2411     
==========================================
- Hits        16116    13465    -2651     
- Misses      16088    16543     +455     
+ Partials     1906     1691     -215     
Flag Coverage Δ
all 42.47% <0.00%> (-4.77%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
api/v3rpc/rpctypes/error.go 47.05% <ø> (-17.65%) ⬇️
client/v3/retry_interceptor.go 0.00% <0.00%> (-23.25%) ⬇️
client/v3/ctx.go 0.00% <0.00%> (-100.00%) ⬇️
pkg/idutil/id.go 0.00% <0.00%> (-100.00%) ⬇️
raft/quorum/joint.go 0.00% <0.00%> (-100.00%) ⬇️
raft/quorum/quorum.go 0.00% <0.00%> (-100.00%) ⬇️
raft/tracker/state.go 0.00% <0.00%> (-100.00%) ⬇️
client/v3/compact_op.go 0.00% <0.00%> (-100.00%) ⬇️
raft/quorum/majority.go 0.00% <0.00%> (-100.00%) ⬇️
raft/raftpb/confstate.go 0.00% <0.00%> (-100.00%) ⬇️
... and 166 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 80ccb27...16d51d8. Read the comment docs.

@xiang90
Copy link
Contributor

xiang90 commented May 19, 2021

LGTM

@ptabor ptabor merged commit 159d191 into etcd-io:main May 19, 2021
@ptabor ptabor deleted the 20210517-fix-not-a-learner-retries branch May 19, 2021 06:05
@gyuho
Copy link
Contributor

gyuho commented May 19, 2021

@ptabor This is a bug fix. Can we backport to 3.5?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants