Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple endpoints in ETCDCTL_ENDPOINTS lead to rejected connection from "1.2.3.4:5678" (error "EOF", ServerName "") spam in server log #10391

Closed
taladar opened this issue Jan 9, 2019 · 3 comments

Comments

@taladar
Copy link

taladar commented Jan 9, 2019

We have a Cluster of 3 etcd nodes and use it from various shell scripts running in systemd services which all use etcdctl. Two nodes run on Debian stretch, the third on Centos 7. All nodes use etcd downloaded directly from Github.

etcdctl version: 3.3.10
API version: 3.3
etcd Version: 3.3.10
Git SHA: 27fc7e2
Go Version: go1.10.4
Go OS/Arch: linux/amd64

A lot of the systemd services check some system parameter (e.g. an IP or whether a service is running) and write data about it to etcd. All of the scripts use leases to ensure no outdated information remains in etcd. Some use put, some use txn with a check whether the key exists followed by put.

etcdctl is configured via environment variables (IPs changed for privacy)

ETCDCTL_API=3
ETCDCTL_CACERT=/etc/etcd/client_certs/ca.crt
ETCDCTL_CERT=/etc/etcd/client_certs/client.crt
ETCDCTL_DIAL_TIMEOUT=3s
ETCDCTL_ENDPOINTS=https://1.2.3.4:2379,https://1.2.3.5:2379,https://1.2.3.6:2379
ETCDCTL_KEY=/etc/etcd/client_keys/client.key
`
etcd is started by a systemd unit we wrote (adjusted heavily)

`
[Unit]
Description=etcd
Documentation=https://github.com/coreos/etcd
Conflicts=etcd2.service

[Service]
Type=notify
Restart=always
RestartSec=5s
LimitNOFILE=40000
TimeoutStartSec=60s
User=etcd
Group=etcd

ExecStart=/opt/etcd/etcd \
    --name node1 \
    --data-dir /var/lib/etcd \
    --listen-client-urls https://1.2.3.4:2379,https://127.0.0.1:2379 \
    --cert-file=/etc/etcd/server_certs/server_client.crt \
    --key-file=/etc/etcd/server_keys/server_client.key \
    --client-cert-auth \
    --trusted-ca-file=/etc/etcd/server_certs/ca.crt \
    --advertise-client-urls https://1.2.3.4:2379 \
    --listen-peer-urls https://1.2.3.4:2380 \
    --peer-cert-file=/etc/etcd/server_certs/server_peer.crt \
    --peer-key-file=/etc/etcd/server_keys/server_peer.key \
    --peer-client-cert-auth \
    --peer-trusted-ca-file=/etc/etcd/server_certs/ca.crt \
    --initial-advertise-peer-urls https://1.2.3.4:2380 \
    --initial-cluster node1=https://1.2.3.4:2380,node2=https://1.2.3.5:2380,node3=https://1.2.3.6:2380 \
    --initial-cluster-token "etcd_cluster1" \
    --initial-cluster-state new \
    --auto-compaction-mode=revision \
    --auto-compaction-retention=1000

[Install]
WantedBy=multi-user.target

We get several messages per second with the IPs of the two nodes where etcdctl processes run (two of the nodes that are part of the cluster)

rejected connection from "1.2.3.4:23456" (error "EOF", ServerName "")
rejected connection from "1.2.3.5:34567" (error "EOF", ServerName "")

as well as occasional (once every couple of seconds)

rejected connection from "1.2.3.4:54321" (error "read tcp 1.2.3.4:2379->1.2.3.4:54321: read: connection reset by peer", ServerName "")

(the first and last IP and port are identical here)

I can't quite pin down what causes this but I strongly assume that it is related to the way etcdctl connects to the cluster when multiple endpoints are supplied. Looking at the debug output of a simple get call

~ # etcdctl --debug get /internet/elected_router                                                                                                                                                                    
ETCDCTL_CACERT=/etc/etcd/client_certs/ca.crt
ETCDCTL_CERT=/etc/etcd/client_certs/client.crt
ETCDCTL_COMMAND_TIMEOUT=5s
ETCDCTL_DEBUG=true
ETCDCTL_DIAL_TIMEOUT=3s
ETCDCTL_DISCOVERY_SRV=
ETCDCTL_ENDPOINTS=[https://1.2.3.4:2379,https://1.2.3.5:2379,https://1.2.3.6:2379]
ETCDCTL_HEX=false
ETCDCTL_INSECURE_DISCOVERY=true
ETCDCTL_INSECURE_SKIP_TLS_VERIFY=false
ETCDCTL_INSECURE_TRANSPORT=true
ETCDCTL_KEEPALIVE_TIME=2s
ETCDCTL_KEEPALIVE_TIMEOUT=6s
ETCDCTL_KEY=/etc/etcd/client_keys/client.key
ETCDCTL_USER=
ETCDCTL_WRITE_OUT=simple
INFO: 2019/01/09 14:42:15 ccBalancerWrapper: updating state and picker called by balancer: IDLE, 0xc4202d2a20
INFO: 2019/01/09 14:42:15 dialing to target with scheme: ""
INFO: 2019/01/09 14:42:15 could not get resolver for scheme: ""
INFO: 2019/01/09 14:42:15 balancerWrapper: is pickfirst: false
INFO: 2019/01/09 14:42:15 balancerWrapper: got update addr from Notify: [{1.2.3.4:2379 <nil>} {1.2.3.5:2379 <nil>} {1.2.3.6:2379 <nil>}]
INFO: 2019/01/09 14:42:15 ccBalancerWrapper: new subconn: [{1.2.3.4:2379 0  <nil>}]
INFO: 2019/01/09 14:42:15 ccBalancerWrapper: new subconn: [{1.2.3.5:2379 0  <nil>}]
INFO: 2019/01/09 14:42:15 ccBalancerWrapper: new subconn: [{1.2.3.6:2379 0  <nil>}]
INFO: 2019/01/09 14:42:15 balancerWrapper: handle subconn state change: 0xc4201986b0, CONNECTING
INFO: 2019/01/09 14:42:15 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4202d2a20
INFO: 2019/01/09 14:42:15 balancerWrapper: handle subconn state change: 0xc420198700, CONNECTING
INFO: 2019/01/09 14:42:15 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4202d2a20
INFO: 2019/01/09 14:42:15 balancerWrapper: handle subconn state change: 0xc420198750, CONNECTING
INFO: 2019/01/09 14:42:15 ccBalancerWrapper: updating state and picker called by balancer: CONNECTING, 0xc4202d2a20
INFO: 2019/01/09 14:42:15 balancerWrapper: handle subconn state change: 0xc420198700, READY
INFO: 2019/01/09 14:42:15 clientv3/balancer: pin "1.2.3.5:2379"
INFO: 2019/01/09 14:42:15 ccBalancerWrapper: updating state and picker called by balancer: READY, 0xc4202d2a20
INFO: 2019/01/09 14:42:15 balancerWrapper: got update addr from Notify: [{1.2.3.5:2379 <nil>}]
INFO: 2019/01/09 14:42:15 ccBalancerWrapper: removing subconn
INFO: 2019/01/09 14:42:15 ccBalancerWrapper: removing subconn
INFO: 2019/01/09 14:42:15 balancerWrapper: handle subconn state change: 0xc4201986b0, SHUTDOWN
INFO: 2019/01/09 14:42:15 ccBalancerWrapper: updating state and picker called by balancer: READY, 0xc4202d2a20
INFO: 2019/01/09 14:42:15 balancerWrapper: handle subconn state change: 0xc420198750, SHUTDOWN
INFO: 2019/01/09 14:42:15 ccBalancerWrapper: updating state and picker called by balancer: READY, 0xc4202d2a20
/internet/elected_router
node1

it looks like etcdctl connects to all three endpoints and only uses one connection, presumably it races them and uses the first one that responds. I suspect the EOF errors in the logs are caused by those two other connection attempts.

Sadly there seems to be no way to suppress the logging of TLS handshake failures like this.

@taladar
Copy link
Author

taladar commented Jan 9, 2019

The closed issue (for lack of information from submitter) #10040 seems to show a similar problem.

@jsok
Copy link
Contributor

jsok commented Jan 15, 2019

This is a duplicate of #9949

@hexfusion
Copy link
Contributor

This is a duplicate of #9949
yeah closing lets track #9949

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

3 participants