etcd cluster fails to start when using DNS SRV discovery with non-TLS #11321

Closed
samuelvl opened this issue Oct 31, 2019 · 15 comments · Fixed by #11776

@samuelvl

I am running etcd (version 3.4.3) on Fedora CoreOS (version 30) using Podman.

When running etcd without TLS and with SRV discovery, startup fails because etcd cannot find _etcd-server-ssl entries. This should not be a fatal error: those entries do not exist precisely because TLS is not being used.

2019-10-31 09:53:30.575647 E | embed: couldn't resolve during SRV discovery (error querying DNS SRV records for _etcd-server-ssl lookup _etcd-server-ssl._tcp.libvirt.labs on 172.16.10.1:53: no such host)
2019-10-31 09:53:30.575892 C | etcdmain: error setting up initial cluster: error querying DNS SRV records for _etcd-server-ssl lookup _etcd-server-ssl._tcp.libvirt.labs on 172.16.10.1:53: no such host

It also fails on 3.4.2, 3.4.1, and 3.4.0. However, it works properly in 3.3.17 (see the table below), and I don't see anything in the 3.4 changelog that forces TLS when SRV discovery is enabled. Is this the intended behaviour in 3.4?

+--------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|            ENDPOINT            |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://etcd1.libvirt.labs:2379 | ceb796e1dfaeb27e |  3.3.17 |   20 kB |      true |      false |        11 |          9 |                  0 |        |
| http://etcd2.libvirt.labs:2379 | b8dfd5ef2d30984a |  3.3.17 |   20 kB |     false |      false |        11 |          9 |                  0 |        |
| http://etcd3.libvirt.labs:2379 | dde9feb56ac9a7ad |  3.3.17 |   20 kB |     false |      false |        11 |          9 |                  0 |        |
+--------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

The first member of the cluster is started with:

ETCD_UUID="5d9701ad-6c02-4f64-b614-1e4561c29181" # $(uuidgen)
ETCD_VERSION="v3.4.3"
ETCD_NODE_NAME="$(hostname -s)"
ETCD_NODE_CLIENT_ADVERTISE_URL="http://$(hostname | cut -d' ' -f1):2379"
ETCD_NODE_SERVER_ADVERTISE_URL="http://$(hostname | cut -d' ' -f1):2380"
ETCD_NODE_CLIENT_LISTEN_URL="http://$(hostname -I | cut -d' ' -f1):2379"
ETCD_NODE_SERVER_LISTEN_URL="http://$(hostname -I | cut -d' ' -f1):2380"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_DNS_SRV_DOMAIN="$(dnsdomainname)"

mkdir -p ${ETCD_DATA_DIR}

podman run \
  --name etcd \
  --volume ${ETCD_DATA_DIR}:/etcd-data:z \
  --net=host \
  quay.io/coreos/etcd:${ETCD_VERSION} \
    /usr/local/bin/etcd \
      --name ${ETCD_NODE_NAME} \
      --data-dir /etcd-data \
      --initial-cluster-state new \
      --initial-cluster-token ${ETCD_UUID} \
      --discovery-srv ${ETCD_DNS_SRV_DOMAIN} \
      --advertise-client-urls ${ETCD_NODE_CLIENT_ADVERTISE_URL} \
      --initial-advertise-peer-urls ${ETCD_NODE_SERVER_ADVERTISE_URL} \
      --listen-client-urls ${ETCD_NODE_CLIENT_LISTEN_URL} \
      --listen-peer-urls ${ETCD_NODE_SERVER_LISTEN_URL}

The DNS SRV entries for the etcd cluster are:

$ dig +noall +answer SRV _etcd-server._tcp.libvirt.labs _etcd-client._tcp.libvirt.labs
_etcd-server._tcp.libvirt.labs.	0 IN	SRV	0 0 2380 etcd1.libvirt.labs.
_etcd-server._tcp.libvirt.labs.	0 IN	SRV	0 0 2380 etcd3.libvirt.labs.
_etcd-server._tcp.libvirt.labs.	0 IN	SRV	0 0 2380 etcd2.libvirt.labs.
_etcd-client._tcp.libvirt.labs.	0 IN	SRV	0 0 2379 etcd3.libvirt.labs.
_etcd-client._tcp.libvirt.labs.	0 IN	SRV	0 0 2379 etcd2.libvirt.labs.
_etcd-client._tcp.libvirt.labs.	0 IN	SRV	0 0 2379 etcd1.libvirt.labs.

The DNS A entries for the etcd cluster are:

$ dig +noall +answer etcd1.libvirt.labs etcd2.libvirt.labs etcd3.libvirt.labs
etcd1.libvirt.labs.	0	IN	A	172.16.10.49
etcd2.libvirt.labs.	0	IN	A	172.16.10.188
etcd3.libvirt.labs.	0	IN	A	172.16.10.36
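
For reference, the two lookups that SRV discovery issues can be reproduced with a few lines of Go (a rough standalone sketch, not etcd code; it only mirrors the _etcd-server-ssl and _etcd-server service names and the libvirt.labs domain above). With the records above, the _etcd-server-ssl query returns "no such host", which is exactly the error in the log:

package main

import (
	"fmt"
	"net"
)

func main() {
	// Query the same two SRV services that etcd's SRV discovery uses,
	// against the libvirt.labs domain from the setup above.
	for _, service := range []string{"etcd-server-ssl", "etcd-server"} {
		_, addrs, err := net.LookupSRV(service, "tcp", "libvirt.labs")
		if err != nil {
			fmt.Printf("_%s._tcp lookup failed: %v\n", service, err)
			continue
		}
		for _, a := range addrs {
			fmt.Printf("_%s._tcp -> %s:%d\n", service, a.Target, a.Port)
		}
	}
}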

The etcd version is:

etcd Version: 3.4.3
Git SHA: 3cf2f69b5
Go Version: go1.12.12
Go OS/Arch: linux/amd64
@ngrigoriev

Same here; it does not seem to work as documented.

@maliqu

maliqu commented Feb 24, 2020

Agreed, it still does not work correctly.

@brandond
Contributor

brandond commented Apr 10, 2020

According to the code, GetDNSClusterNames is supposed to try both lookups. However, that function returns the error from the _etcd-server-ssl lookup if it fails, ignoring the fact that the _etcd-server lookup succeeded and clusterStrs contains valid addresses. Unfortunately, PeerURLsMapAndToken sees the error from the failed TLS lookup and returns early.

This seems to have been broken here: b664b91. The tests don't catch it because they hardcode the SRV result set without actually testing the record it comes from (_etcd-server-ssl._tcp.example.com for the https scheme or _etcd-server._tcp.example.com for http).
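
In other words, the _etcd-server-ssl error should be treated as non-fatal when the plain _etcd-server lookup returned usable members. A simplified, self-contained sketch of that behaviour follows (function and variable names are illustrative, not the actual etcd code or the patch in #11776):

package main

import "fmt"

// clusterFromSRV sketches the behaviour described above: query both SRV
// services and fail only if neither yields any members.
func clusterFromSRV(lookup func(service string) ([]string, error)) ([]string, error) {
	var members []string
	var errs []error
	for _, service := range []string{"etcd-server-ssl", "etcd-server"} {
		addrs, err := lookup(service)
		if err != nil {
			errs = append(errs, err) // remember the error, but keep going
			continue
		}
		members = append(members, addrs...)
	}
	if len(members) == 0 {
		return nil, fmt.Errorf("SRV discovery failed: %v", errs)
	}
	// A failed _etcd-server-ssl lookup no longer masks a successful
	// _etcd-server lookup.
	return members, nil
}

func main() {
	// Fake lookup standing in for the real SRV query: the -ssl record is missing,
	// matching the situation in this issue.
	fake := func(service string) ([]string, error) {
		if service == "etcd-server-ssl" {
			return nil, fmt.Errorf("_%s._tcp: no such host", service)
		}
		return []string{"etcd1.libvirt.labs:2380", "etcd2.libvirt.labs:2380", "etcd3.libvirt.labs:2380"}, nil
	}
	members, err := clusterFromSRV(fake)
	fmt.Println(members, err)
}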

brandond added a commit to brandond/etcd that referenced this issue Apr 10, 2020
Fix issues identified in etcd-io#11321 (comment)
@vladimirtiukhtin

Facing the same issue on Kubernetes. Had to go back to v3.3, where it works perfectly.

@mluds

mluds commented Jul 30, 2020

I'm also seeing this on v3.4.10

@der-eismann

Yeah, because #11776 is still not merged.

@brandond
Contributor

I don't know who you have to poke to get things merged, but the PR still works on master, so it's ready whenever.

@stale

stale bot commented Oct 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 29, 2020
@brandond
Contributor

Not stale, just no response from maintainers...

@stale stale bot removed the stale label Oct 29, 2020
@stale

stale bot commented Jan 27, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 27, 2021
@der-eismann

Not stale ...

@stale stale bot removed the stale label Jan 27, 2021
@truong-hua

Still a bug on 3.4.17

@hexfusion
Contributor

@truong-hua #11776 needs to be backported to release-3.4. If you can do that, I plan on cutting a new release soon.

@der-eismann

It would be great if this could also be backported to 3.3, or is it EOL already?

@misstt123

Still a bug on 3.4.16
