Bug 1698456 *: use wildcard domain in DNS: SAN for etcd server certs #676

Merged
merged 1 commit into from
Apr 29, 2019

Conversation

hexfusion
Contributor

@hexfusion hexfusion commented Apr 27, 2019

This PR resolves an issue with the etcd client balancer. The balancer is populated with a list of etcd peer endpoints. When we dial, endpoint[0] is used as the target and the other endpoints are dialed as subconnections. I have verified with Wireshark that each connection, the target and the subconnections alike, makes a proper TLS handshake.

The issue, which the logs below illustrate well, is that when etcd-0 fails and the balancer fails over to etcd-1, the connection fails because the balancer's TLS context still assumes the original target (etcd-0).

1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 0 }. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...

The solution, for now, is to add a wildcard entry to the DNS: SAN of the etcd server certs. This allows TLS authentication to complete successfully so the balancer can work properly, because the target name etcd-0 now matches the *.clustername.domain.com entry in the SAN.
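
To illustrate the hostname check involved, here is a minimal sketch using Go's standard `crypto/x509` package. It is not the code changed in this PR. The helper names `sansFor` and `verifies` are hypothetical, and the SAN list is copied from the error message above. It assumes the client verifies every peer's cert against the original target name (etcd-0), as the logs suggest.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// sansFor builds an illustrative DNS SAN list for one etcd member,
// optionally adding a wildcard entry for the cluster domain.
// (Hypothetical helper; the entries mirror the error message in the logs.)
func sansFor(member, clusterDomain string, wildcard bool) []string {
	sans := []string{
		"localhost",
		"etcd.kube-system.svc",
		"etcd.kube-system.svc.cluster.local",
		member + "." + clusterDomain,
	}
	if wildcard {
		sans = append(sans, "*."+clusterDomain)
	}
	return sans
}

// verifies reports whether a cert carrying the given DNS SANs would pass
// the standard library's hostname check for serverName.
func verifies(sans []string, serverName string) bool {
	cert := &x509.Certificate{DNSNames: sans}
	return cert.VerifyHostname(serverName) == nil
}

func main() {
	const domain = "jliu-demo.qe.devcluster.openshift.com"
	target := "etcd-0." + domain // name the client's TLS context expects

	// Failover to etcd-1: its cert is checked against the etcd-0 name.
	fmt.Println("without wildcard:", verifies(sansFor("etcd-1", domain, false), target))
	fmt.Println("with wildcard:   ", verifies(sansFor("etcd-1", domain, true), target))
}
```

Without the wildcard entry the etcd-1 cert is rejected for the name etcd-0, which is the x509 error seen in the logs; with the wildcard entry the same check passes. Note that a wildcard only covers a single leftmost label, so it matches etcd-N.clustername.domain.com but not deeper subdomains.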

Check openshift-kube-apiserver logs
I0423 06:53:53.612060       1 resolver_conn_wrapper.go:116] ccResolverWrapper: sending new addresses to cc: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 0  <nil>}]
I0423 06:53:53.612145       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-2.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.654818       1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...
I0423 06:53:53.660209       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.660243       1 clientconn.go:953] Failed to dial etcd-1.jliu-demo.qe.devcluster.openshift.com:2379: context canceled; please retry.
I0423 06:53:53.686866       1 master.go:228] Using reconciler: lease
I0423 06:53:53.687309       1 clientconn.go:551] parsed scheme: ""
I0423 06:53:53.687324       1 clientconn.go:557] scheme "" not registered, fallback to default scheme
I0423 06:53:53.687354       1 resolver_conn_wrapper.go:116] ccResolverWrapper: sending new addresses to cc: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 0  <nil>}]
I0423 06:53:53.687387       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-2.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.688678       1 clientconn.go:953] Failed to dial etcd-2.jliu-demo.qe.devcluster.openshift.com:2379: grpc: the connection is closing; please retry.
W0423 06:53:53.693322       1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...
W0423 06:53:53.693536       1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-2.jliu-demo.qe.devcluster.openshift.com:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-2.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...
I0423 06:53:53.733572       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.733612       1 clientconn.go:953] Failed to dial etcd-1.jliu-demo.qe.devcluster.openshift.com:2379: context canceled; please retry.
W0423 06:53:53.733621       1 clientconn.go:953] Failed to dial etcd-2.jliu-demo.qe.devcluster.openshift.com:2379: context canceled; please retry.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1698456

Ref:

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 27, 2019
@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 27, 2019
@hexfusion
Contributor Author

/cc @ericavonb

@smarterclayton
Contributor

This is reasonable to me, and the failure scenario is bad.

/approve

for 4.1 (the code looks correct to me, but want others to review deeper)

@smarterclayton smarterclayton added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 28, 2019
@smarterclayton
Contributor

As a side note, we should make sure you are in OWNERS under cmd/setup-etcd-environment and a few other dirs.

@runcom
Member

runcom commented Apr 28, 2019

/approve

from the MCO point of view; will leave it to others as well

@smarterclayton
Contributor

This lgtm, and I've heard no comment, and this is a huge blocker

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 29, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, runcom, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton
Contributor

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 29, 2019
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@smarterclayton
Contributor

/retest

@hexfusion
Contributor Author

level=error msg="1 error occurred:"
level=error msg="\t* module.vpc.aws_route.to_nat_gw[1]: 1 error occurred:"
level=error msg="\t* aws_route.to_nat_gw.1: Error creating route: timeout while waiting for state to become 'success' (timeout: 20m0s)"

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@hexfusion
Contributor Author

/test e2e-aws

@hexfusion
Contributor Author

/test e2e-aws

@hexfusion
Contributor Author

level=error msg="1 error occurred:"
level=error msg="\t* module.vpc.aws_route_table_association.route_net[1]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.route_net.1: timeout while waiting for state to become 'success' (timeout: 5m0s)"

/test e2e-aws

@openshift-merge-robot openshift-merge-robot merged commit 03ff4af into openshift:master Apr 29, 2019
6 participants