session.Done() returns even when all etcd servers in a cluster are online #8181
Comments
The channel for Done() closes if the client does not receive a keepalive response from the etcd server within the TTL limit.
If it switches to a member that is partitioned from the rest of the cluster then it could possibly time out waiting for keepalives.
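To make the mechanism described above concrete, here is a minimal stdlib-only sketch of "no keepalive response within the TTL closes the Done channel". It is a simplified model, not etcd's implementation: `modelSession`, `KeepAliveResponse`, and `runDemo` are hypothetical names, and the 200 ms TTL is an arbitrary value chosen for the demo.

```go
package main

import (
	"fmt"
	"time"
)

// modelSession is a hypothetical stand-in for a lease-backed session.
// It only models the rule "no keepalive response within the TTL => Done closes".
type modelSession struct {
	ttl       time.Duration
	keepalive chan struct{} // signalled for each keepalive response received
	done      chan struct{} // closed once the TTL elapses without a response
}

func newModelSession(ttl time.Duration) *modelSession {
	s := &modelSession{
		ttl:       ttl,
		keepalive: make(chan struct{}, 1),
		done:      make(chan struct{}),
	}
	go s.run()
	return s
}

// run restarts the TTL timer on every keepalive response and closes done
// when the timer fires first.
func (s *modelSession) run() {
	timer := time.NewTimer(s.ttl)
	defer timer.Stop()
	for {
		select {
		case <-s.keepalive:
			// Stop and drain the timer before resetting it.
			if !timer.Stop() {
				select {
				case <-timer.C:
				default:
				}
			}
			timer.Reset(s.ttl)
		case <-timer.C:
			close(s.done)
			return
		}
	}
}

// KeepAliveResponse records that a keepalive response arrived (non-blocking).
func (s *modelSession) KeepAliveResponse() {
	select {
	case s.keepalive <- struct{}{}:
	default:
	}
}

// Done returns a channel that is closed when the modelled lease has expired.
func (s *modelSession) Done() <-chan struct{} { return s.done }

// runDemo refreshes the session for a while, then goes silent.
func runDemo() (aliveWhileRefreshed, expiredAfterSilence bool) {
	s := newModelSession(200 * time.Millisecond)
	for i := 0; i < 15; i++ {
		s.KeepAliveResponse()
		time.Sleep(20 * time.Millisecond)
	}
	select {
	case <-s.Done():
	default:
		aliveWhileRefreshed = true
	}
	select {
	case <-s.Done():
		expiredAfterSilence = true
	case <-time.After(2 * time.Second):
	}
	return
}

func main() {
	alive, expired := runDemo()
	fmt.Println("alive while refreshed:", alive)
	fmt.Println("expired after silence:", expired)
}
```

In this model the servers being online is irrelevant; only the arrival time of responses at the client matters, which matches the partitioned-member scenario mentioned above.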
3.0.3? There have been some lease fixes to 3.0.x since then:
thanks @heyitsanthony
@sangleganesh Any updates?
We just worked around the issue. Closing the bug.
@sangleganesh I am not happy with a workaround. If this is an issue, we should get it fixed.
@sangleganesh Have you tried the latest etcd?
@sangleganesh @gyuho @xiang90 I have verified the issue with the reported server version v3.0.3 and client library version v3.2.0. I also retested with server and client version v3.3.1, and the issue is reproducible there as well. Falls into the
@dvonthenen great. thanks for looking into this.
maybe the same problem. I have a three-node etcd cluster.
I have investigated this issue. The problem is reproducible even in the latest 3.5.0 release. To reproduce it, it is enough to create a client and request a sufficiently large number of sessions (each of which obtains a lease); the problem may appear even when getting the first lease. At some point, a situation occurs in the cluster in which the leader sends a request to commit changes to the lease store but has not yet applied those changes itself. In my view, there are two solutions to this problem.
I think this was addressed by @ahrtr. Can you confirm?
This issue should have been resolved in 3.5.3 and the main branch. Please see #13932. Please feel free to reopen this issue if you can still reproduce this on etcd 3.5.3+.
We followed the code in #6699 and are using concurrency.Session to detect loss of connectivity to all etcd servers.
The way we do it is to start a watch and in the watch function start a Session.
The assumption is that if all etcd servers in a cluster go down, Session.Done() returns, indicating loss of connection to etcd, and we exit the process.
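A minimal sketch of the consuming side of that pattern, for illustration only; it is not the reporter's code. It assumes only that the session exposes a `Done() <-chan struct{}` method, as `concurrency.Session` in the etcd client does. The names `doneNotifier`, `waitForSessionLoss`, `errSessionExpired`, `fakeSession`, and `demoSessionLoss` are hypothetical, and the fake session stands in for a real one so the example runs without a cluster.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// doneNotifier is a hypothetical interface describing the one method this
// sketch needs; a real session type with a Done() channel would satisfy it.
type doneNotifier interface {
	Done() <-chan struct{}
}

// errSessionExpired is a hypothetical sentinel for "the Done channel closed".
var errSessionExpired = errors.New("session done: lease no longer being refreshed")

// waitForSessionLoss blocks until the session's Done channel closes or ctx
// ends. The channel carries no reason, so the only extra detail that can be
// reported here is how long the session lasted.
func waitForSessionLoss(ctx context.Context, s doneNotifier) error {
	start := time.Now()
	select {
	case <-s.Done():
		lasted := time.Since(start).Round(time.Millisecond)
		return fmt.Errorf("%w (after %s)", errSessionExpired, lasted)
	case <-ctx.Done():
		return ctx.Err()
	}
}

// fakeSession is a test double used in place of a real session.
type fakeSession struct{ done chan struct{} }

func (f *fakeSession) Done() <-chan struct{} { return f.done }

// demoSessionLoss exercises both exits of waitForSessionLoss.
func demoSessionLoss() (lost, cancelled bool) {
	// Case 1: the session ends on its own shortly after starting.
	f := &fakeSession{done: make(chan struct{})}
	go func() {
		time.Sleep(20 * time.Millisecond)
		close(f.done)
	}()
	lost = errors.Is(waitForSessionLoss(context.Background(), f), errSessionExpired)

	// Case 2: the caller shuts down first; the session itself is still fine.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	healthy := &fakeSession{done: make(chan struct{})}
	cancelled = errors.Is(waitForSessionLoss(ctx, healthy), context.Canceled)
	return
}

func main() {
	lost, cancelled := demoSessionLoss()
	fmt.Println("detected session loss:", lost)
	fmt.Println("detected caller cancellation:", cancelled)
}
```

Separating the two exits lets the caller tell its own shutdown apart from a closed Done channel, and logging the elapsed time makes an early closure a few seconds after startup easy to spot.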
In the following code, sometimes (not always), soon after the process restarts (about 3-8 seconds), session.Done() returns even though all servers in the etcd cluster are alive.
How can I log more information about it, such as the reason Done returned?
Can it be triggered if the etcd client switches its connection to another etcd server in the cluster?
Note that we moved the etcd client code to 3.2.0 release so that it could do load balancing (and automatically move the connections without disrupting the watches) to other etcd servers when an etcd server goes down.
Any help is greatly appreciated.
The etcd server version is: etcd v3.0.3
client library version: 3.2.0
Opening an issue as suggested in this forum:
https://groups.google.com/forum/?hl=en#!topic/etcd-dev/0qXtLqRfLTk