(error "EOF", ServerName "") error on etcd servers #10587
Comments
I can confirm that the rejected connection log entries do not appear when I use the
I can confirm the same issue.
It seems etcdctl does not use the correct server IP to validate the server certificate; it always uses the first endpoint's IP to validate all of the servers.
Okay, this was already fixed by #11184.
I hit the same issue with etcd 3.3.18 after upgrading from 3.2.24. It seems a similar issue still exists.
We are seeing this issue on 3.3.18 as well.
@spzala Sorry for commenting again on this closed issue.
The endpoint status is below:
I have followed issues #11184, #10391, and #10634, and the cluster is healthy, with the Kubernetes production cluster operating stably. Did I miss anything important here?
Still hitting this on
I hit the same issue on 3.4.14 when starting flanneld.

etcdctl version: 3.4.14
flanneld version: v0.13.1-rc1

The log messages are below. Could anyone suggest where the issue is, and why flanneld fails to start?
@jejer @spzala Sorry for commenting on this closed issue, but I thought this could be useful. I tried applying this idea to our own clusters, deployed with kubeadm and with etcd managed locally by the kubelet, per the docs.

Version: k8s/kubeadm 1.17.4

In the kubeadmcfg.yaml config file, as per the documentation, the endpoints are not separated by spaces. The generated manifest (/etc/kubernetes/manifests/etcd.yaml) therefore also does not have spaces between endpoints. This results in the famous (error "EOF", ServerName "") messages on the etcd servers. If I edit the manifest under /etc/kubernetes/manifests/etcd.yaml on each node to add the spaces, the kubelet restarts each etcd pod, the error messages disappear, and everything is fine.

Issue: the main problem arose when I tried initializing a new cluster with the spaces already in the kubeadmcfg.yaml file. The generated manifest (/etc/kubernetes/manifests/etcd.yaml) then also has these spaces, as expected. This results in the kubelet reporting the etcd pods as CrashLoopBackOff and, as you can expect, nothing works. Adding quotes didn't seem to help in this case either.

Is this a bug, or have I applied it incorrectly? Does this mean I should be concerned about the clusters where I manually edited /etc/kubernetes/manifests/etcd.yaml so the error messages no longer appear? Cheers
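For context, the comma-separated endpoints end up inside a single flag value in the static pod manifest. A hypothetical excerpt of a kubeadm-generated /etc/kubernetes/manifests/etcd.yaml (member names and addresses are placeholders, not taken from the report above):

```yaml
# Hypothetical excerpt; each list item below is passed to etcd
# as one literal command-line argument.
spec:
  containers:
  - command:
    - etcd
    - --initial-cluster=infra0=https://10.0.0.1:2380,infra1=https://10.0.0.2:2380
```

Because each list item is handed to etcd as a single literal argument, adding spaces after the commas embeds those spaces inside the flag's value, which would plausibly explain the CrashLoopBackOff seen when the spaced form is written into the manifest.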
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
For anyone else coming across this issue: After rebooting an etcd node, my clients couldn't reconnect due to timeouts. The following messages were being logged multiple times per second:
The culprit turned out to be a lack of CPU cores. https://etcd.io/docs/v3.4/op-guide/hardware/ says:
My etcd nodes had 2 CPU cores each. Contrary to the documentation, that was not nearly enough. With 'only' a few hundred clients, etcd now easily hogs 8 CPU cores when starting. I still see EOF messages, but only for a few seconds while all CPU cores are busy. There are two possible reasons, and the log does not reveal which is the case:
Right now, the tail of my startup log looks like below, as all CPU cores are hogged on startup still, but only for a few seconds. Note that before, it would not get further than
In case it helps someone: I faced the same error on a single etcd instance while the other two etcd instances were okay. This single node never recovered; it used 100% CPU and was running out of memory. This did not change whether the etcd service, or the host where etcd runs, was restarted/rebooted.

The thing to look out for is the error preceding this one. In my case, there were warnings about broken WAL files and etcd ignoring/skipping them. Apparently it is not able to move forward with all of those files in place and chokes on them. I followed the steps below. Note that this is only valid when you have a multi-node etcd cluster:

1. Stop the etcd service or the pod which has the problem.
2. Restart the etcd service or the pod, and it should be healthy.

If you have only one etcd instance without a cluster, you can hope to recover from this error by restoring a backup of the etcd db, assuming you have db backups taken beforehand.
Environment
Server version: 3.3.10
Client version: 3.3.10
Issue
Running
will cause etcd servers to log errors. However, when adding spaces after the commas, such as
no errors will be logged on the etcd servers. This is reproducible both when specifying the
--endpoints
flag inline and when using the ETCDCTL_ENDPOINTS
environment variable. This issue was discovered while upgrading from 3.2.24 to 3.3.10; we did not see the same issue with 3.2.24. Note also that the
etcdctl member list
commands return successfully, without issue and with proper data. This is only an issue of logs. Example error logs:
Full example command for testing:
Relevant other issues: #10040 and #10391 were both closed as duplicates of #9949; however, #9949 does not appear to be related to this particular issue.