Watcher.Watch() hangs when some endpoints are not available #7247
Comments
It will block until
Sorry, to be more clear: it's not the returned channel that is blocking; it's this Watch() call not returning a watch channel and just hanging here, because I used context.Background(). I dug around a bit more and seem to have found something related. I was running a cluster of 5 etcd nodes, etcd[1-5], but my etcd1 was unhealthy:
on the rest of the cluster:
So when I create the etcd client with etcd[2-5], Watch() works just fine, but when I create the client with only etcd1 in the endpoint list, Watch() just hangs. When I create the client with endpoints etcd[1-5], it will randomly talk to one of the endpoints, so when it connects to etcd[2-5] it is fine, but when it connects to etcd1 it hangs. My question is: why does the simpleBalancer in the etcd client not try to contact the other instances? Does this mean the watch connection is sticky?
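For reference, a minimal Go sketch of the setup being described, with placeholder endpoint addresses and key; the import path assumes the coreos-era clientv3 package this thread is about (newer releases live under go.etcd.io):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Client configured with all five members; the balancer pins one
	// endpoint, so a watch can end up stuck on the unhealthy etcd1.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd1:2379", "etcd2:2379", "etcd3:2379", "etcd4:2379", "etcd5:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// With context.Background() the watch never gives up on its own; if the
	// pinned endpoint accepts connections but serves nothing, this hangs.
	for resp := range cli.Watch(context.Background(), "abc") {
		for _, ev := range resp.Events {
			fmt.Printf("%s %q -> %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```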
@cw9 OK, something may be wrong with the etcd1 member. Should the client endpoint be 127.0.0.1? The balancer will pin an address if it can open a connection. If requests time out, it won't know about that; it will keep issuing requests on that endpoint so long as the connection is up. This isn't necessarily a problem isolated to watches. It seems like 127.0.0.1:2379 is accepting connections, then doing nothing. For example, this "hangs" in a similar manner:

$ nc -l -p 2379 &
$ ETCDCTL_API=3 etcdctl watch abc

The fix would probably involve some kind of endpoint poisoning in the balancer so the client can abandon malfunctioning nodes.
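For anyone wanting to reproduce this without nc, here is a rough Go equivalent (a sketch, not part of the etcd codebase): a listener that accepts TCP connections but never speaks gRPC, plus a request with a deadline so the hang shows up as a timeout instead:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Blackhole "endpoint": accept TCP connections and then do nothing,
	// just like `nc -l -p 2379`. The port is only for illustration.
	ln, err := net.Listen("tcp", "127.0.0.1:2379")
	if err != nil {
		panic(err)
	}
	go func() {
		for {
			if _, err := ln.Accept(); err != nil {
				return
			}
		}
	}()

	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{"127.0.0.1:2379"},
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Without the deadline this request would hang indefinitely, exactly
	// like the watch in the report above.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	if _, err := cli.Get(ctx, "abc"); err != nil {
		fmt.Println("request against the blackhole endpoint:", err)
	}
}
```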
If you provide etcd clientv3 a blackholed endpoint, it will hang on Watch, Put, or any other request that has no timeout. Watch is especially important here since you do not really want to put a timeout on it in most cases. An off-channel endpoint health-checking mechanism is required to break out of the RPC waiting, I assume. See https://github.com/grpc/grpc/blob/master/doc/health-checking.md. Not sure if gRPC-go already supports this or not.
Thanks for the explanation, I'll probably do something on my end to prevent this sort of blackholing from happening. Besides that, I'd like to confirm two things: 1) when there is a network partition, say the 5 nodes split into 3+2, and the client can still talk to any of the etcd boxes, will it be able to get watch updates that happened in the bigger partition if the original watch channel was connected to an etcd instance that is now in the smaller partition? 2) if the client is connected to proxy etcd instances and one of the proxy instances loses its connection to the main etcd cluster, will the watch channel established with that box catch up / retry against other proxy boxes?
By default, the watchers connected to the minority will hang there until the network partition recovers. However, you can provide the WithRequireLeader option on the watch context so the watcher returns an error when its member loses the leader. You probably want to try this feature yourself to understand how it works. Let us know if you see any issues with it.
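A minimal sketch of that option, assuming the coreos-era clientv3 import path and a hypothetical helper name; WithRequireLeader attaches the leader requirement to the watch context so the watcher gets an error instead of hanging behind a partitioned member:

```go
package example

import (
	"context"
	"fmt"

	"github.com/coreos/etcd/clientv3"
)

// watchRequiringLeader is a hypothetical helper: the watch context requires
// a leader, so a watcher served by a minority member returns an error that
// the caller can use to recreate the watch elsewhere.
func watchRequiringLeader(cli *clientv3.Client, key string) {
	ctx := clientv3.WithRequireLeader(context.Background())
	for resp := range cli.Watch(ctx, key) {
		if err := resp.Err(); err != nil {
			// e.g. the member serving this watch lost its leader; recreate
			// the watch so the client can move to the majority partition.
			fmt.Println("watch error:", err)
			return
		}
		for _, ev := range resp.Events {
			fmt.Printf("%s %q -> %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```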
Correct. If you enable serializable reads, local reads are allowed.
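A small sketch of what that looks like with the clientv3 Get option (hypothetical helper name): a serializable read is answered by the local member without a quorum round trip, at the cost of possibly stale data:

```go
package example

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// serializableGet is a hypothetical helper: WithSerializable lets the read
// be served locally, so it can still succeed when the member is cut off
// from the leader, trading linearizability for availability.
func serializableGet(cli *clientv3.Client, key string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	resp, err := cli.Get(ctx, key, clientv3.WithSerializable())
	if err != nil {
		return err
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%q -> %q\n", kv.Key, kv.Value)
	}
	return nil
}
```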
It should. But the gRPC proxy is an alpha feature; I have not tried it personally.
Cool, this looks interesting, I'll definitely try it out.
Got it, I'm not using the proxy feature either; I'll make sure before I onboard to it.
@gyuho is this fixed on master?
Closing via #8545.
@cw9 probably give it a try with current master + current master client. Thank you!
1. Send ErrWatchStopped to the caller only once. Currently ErrWatchStopped gets sent to the caller multiple times, causing a resubscribing watch to fail as well.
2. Use a context with the leader requirement for the Watch API. By default, etcd watchers will hang during a network partition when they are connected to the minority. As mentioned in etcd-io/etcd#7247 (comment), setting the leader requirement for watchers allows them to switch to the majority partition.
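For the first point of that commit message, a rough sketch of the "deliver the stop error only once" pattern; ErrWatchStopped and the proxy type here are hypothetical stand-ins for that project's wrapper, not part of clientv3:

```go
package example

import (
	"errors"
	"sync"
)

// errWatchStopped stands in for the wrapper's ErrWatchStopped.
var errWatchStopped = errors.New("watch stopped")

// watchProxy sketches the "send to the caller only once" fix: however many
// internal code paths notice the stop, the subscriber sees a single error
// and the channel is closed exactly once.
type watchProxy struct {
	once sync.Once
	errc chan error
}

func newWatchProxy() *watchProxy {
	return &watchProxy{errc: make(chan error, 1)}
}

func (w *watchProxy) stop() {
	w.once.Do(func() {
		w.errc <- errWatchStopped
		close(w.errc)
	})
}

// Err exposes the one-shot error channel to the caller.
func (w *watchProxy) Err() <-chan error { return w.errc }
```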
Hi, I've experienced a few times that Watcher.Watch() would hang forever; the hang happens on the following code:
I'm using the release-3.1 version of the client; is this behavior expected? What is the reason for the hang? The key being watched already exists, but I don't think that matters?
My current workaround is to add a timeout and retry to this call; let me know if you have any concerns with this approach.
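For what it's worth, a sketch of that workaround (hypothetical helper, not an etcd API): give each Watch() call a bounded time to return a channel, and if it is still hanging, cancel its context and retry on a fresh one:

```go
package example

import (
	"context"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// watchWithRetry bounds how long a single Watch() call may take to hand back
// a channel. The returned cancel func stops the watch stream when the caller
// is done with it.
func watchWithRetry(cli *clientv3.Client, key string, per time.Duration) (clientv3.WatchChan, context.CancelFunc) {
	for {
		ctx, cancel := context.WithCancel(context.Background())
		done := make(chan clientv3.WatchChan, 1) // buffered so a late return cannot leak the goroutine

		go func() { done <- cli.Watch(ctx, key) }()

		select {
		case wch := <-done:
			return wch, cancel
		case <-time.After(per):
			cancel() // abandon the hung call and try again with a fresh context
		}
	}
}
```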