This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

Lost K8S autoscaled nodes contacted indefinitely #3300

Closed

sgmiller opened this issue May 14, 2018 · 32 comments

Comments

@sgmiller

What did you expect to happen?

When a Kubernetes node is removed from our autoscaled cluster, Weave continues trying to contact it until an operator manually runs weave rmpeer and weave forget for the node from the command line.

What happened?

weave-net-mlqj4 weave INFO: 2018/05/14 14:30:16.028571 ->[10.0.3.96:6783] attempting connection
weave-net-p5mjh weave INFO: 2018/05/14 14:30:17.357611 Discovered remote MAC 5a:b9:a1:0c:57:a6 at 1e:bc:cb:84:0d:2a(ip-10-0-3-37.us-east-2.compute.internal)
weave-net-mlqj4 weave INFO: 2018/05/14 14:30:19.091097 ->[10.0.3.96:6783] error during connection attempt: dial tcp4 :0->10.0.3.96:6783: connect: no route to host
weave-net-2bgzv weave INFO: 2018/05/14 14:30:25.250138 ->[10.0.3.93:6783] attempting connection

... about once a minute

How to reproduce it?

Remove a Kubernetes node (terminate the instance)

Anything else we need to know?

Kubernetes 1.9.4, kubeadm cluster on AWS.

Versions:

$ weave version
weave 2.3.0
$ docker version
Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:10:54 2017
 OS/Arch:      linux/amd64
$ uname -a
Linux ip-10-0-3-160.us-east-2.compute.internal 4.8.0-59-generic #64-Ubuntu SMP Thu Jun 29 19:38:34 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.5", GitCommit:"f01a2bf98249a4db383560443a59bed0c13575df", GitTreeState:"clean", BuildDate:"2018-03-19T15:59:24Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.4", GitCommit:"bee2d1505c4fe820744d26d41ecd3fdd4a3d6546", GitTreeState:"clean", BuildDate:"2018-03-12T16:21:35Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Logs:

Grepped out a relevant host:

INFO: 2018/05/11 00:40:34.163026 Launch detected - using supplied peer list: [10.0.1.109 10.0.1.128 10.0.1.200 10.0.1.41 10.0.1.43 10.0.1.64 10.0.3.114 10.0.3.37 10.0.3.64 10.0.3.73 10.0.3.93]
INFO: 2018/05/11 00:40:34.175099 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:40:34.176893 ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2018/05/11 00:40:34.177060 overlay_switch ->[5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)] using fastdp
INFO: 2018/05/11 00:40:34.179623 ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: connection added (new peer)
INFO: 2018/05/11 00:40:34.200111 ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: connection fully established
INFO: 2018/05/11 00:40:34.206113 sleeve ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: Effective MTU verified at 8939
INFO: 2018/05/11 00:40:35.102470 [kube-peers] Added myself to peer list &{[{e6:4b:b7:4c:df:d0 ip-10-0-1-128.us-east-2.compute.internal} {c2:18:9c:74:e7:f2 ip-10-0-3-64.us-east-2.compute.internal} {aa:84:b9:7a:13:5a ip-10-0-3-114.us-east-2.compute.internal} {86:41:ea:22:67:e5 ip-10-0-3-73.us-east-2.compute.internal} {e6:d6:b7:a2:bf:26 ip-10-0-1-41.us-east-2.compute.internal} {86:f7:66:2b:87:19 ip-10-0-1-200.us-east-2.compute.internal} {7a:e5:64:52:cd:43 ip-10-0-3-93.us-east-2.compute.internal} {1e:bc:cb:84:0d:2a ip-10-0-3-37.us-east-2.compute.internal} {22:02:44:cb:1a:e5 ip-10-0-1-109.us-east-2.compute.internal} {5e:31:34:2d:2e:e9 ip-10-0-1-64.us-east-2.compute.internal} {8a:50:e4:5b:fc:b3 ip-10-0-1-43.us-east-2.compute.internal}]}
INFO: 2018/05/11 00:40:35.968265 Discovered remote MAC 5e:31:34:2d:2e:e9 at 5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)
INFO: 2018/05/11 00:40:36.070057 Discovered remote MAC e2:05:43:3e:09:b9 at 5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)
INFO: 2018/05/11 00:40:36.070427 Discovered remote MAC b2:1b:27:4c:a4:52 at 5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)
INFO: 2018/05/11 00:40:36.215259 Discovered remote MAC 3a:3c:9f:e7:7d:08 at 5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)
INFO: 2018/05/11 00:42:00.633047 ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: connection shutting down due to error: read tcp4 10.0.1.43:50037->10.0.1.64:6783: read: connection reset by peer
INFO: 2018/05/11 00:42:00.633138 ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: connection deleted
INFO: 2018/05/11 00:42:00.640213 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:00.640585 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: connection refused
INFO: 2018/05/11 00:42:00.646541 Removed unreachable peer 5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)
INFO: 2018/05/11 00:42:03.457897 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:03.458262 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: connection refused
INFO: 2018/05/11 00:42:05.602050 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:05.602441 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: connection refused
INFO: 2018/05/11 00:42:11.717182 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:11.717543 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: connection refused
INFO: 2018/05/11 00:42:21.712218 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:40.031406 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:42:53.518432 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:56.575443 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:43:12.400009 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:43:15.455412 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:43:41.529976 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:43:44.607391 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:44:32.622034 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:44:35.679413 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:45:27.154253 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:45:30.207380 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:46:46.325098 ->[10.0.1.64:6783] attempting connection

.. and so on for days.

See also issue #2797

@bboreham
Contributor

This is the same as #3021; see particularly #3021 (comment)

@sgmiller
Author

I've read that issue. How often does "eventually restart" happen? It seems to be never.

@bboreham
Contributor

I was thinking things like kernel upgrades, power-downs, hardware failures would lead to restarts of the Weave Net pods.

@sgmiller
Author

Okay, so it's expected that we'll have these errors. It created some concern for us because our nodes cycle fairly frequently based on spot price bidding, so many of the Weave pods are constantly logging contact attempts for numerous old nodes. It feels like it should give up after some long (configurable?) time.

@bboreham
Contributor

I'm not saying the behaviour is correct, merely noting my comments last time it was raised.

Happy to have an issue open, and/or to receive PRs improving it.

@sgmiller
Author

Understood. My apologies, I thought this was intended behaviour.

@frittentheke

@bboreham in your comment #3300 (comment) you talk about "having an issue open".

Isn't this issue, #3300, reported in enough detail to narrow down the cause and fix / improve this?

@bboreham
Contributor

@frittentheke I believe so.

The cause is that there is no code to re-fetch the peer list from Kubernetes after startup.

@Dmitry1987

Hi, we have the same issue, also with weaveworks/weave-kube:2.3.0.
Is there a fix? Would upgrading to 2.4.0 solve this?

@bboreham
Contributor

bboreham commented Aug 9, 2018

No fix yet.

@Raffo

Raffo commented Aug 23, 2018

@bboreham What would be needed for a fix? Is there a PR or something I can contribute to? I hit this issue today in a cluster where we scale the nodes up and down a lot, and I had to manually remove the local .db file and restart all the nodes to clean them up. I'd love to work on automating a solution, either by integrating it into Weave or in an additional pod/container.

@bboreham
Contributor

bboreham commented Aug 23, 2018

Something needs to wake up periodically, re-fetch the peer list from Kubernetes and supply it to the Weave Net router process.

It's a little fiddly since we have two binaries, one linked with Kubernetes client-go and one not.

Here is the code that fetches the peer list from Kubernetes:
https://github.com/weaveworks/weave/blob/master/prog/kube-utils/main.go#L28

Here is code that handles a REST call to replace the peer list inside the router:
https://github.com/weaveworks/weave/blob/master/router/http.go#L16

and then write a test. (Note #3372 has a test for this, but it's unfinished)
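To make the shape of that concrete, here is a minimal hedged sketch, assuming a recent client-go (context-taking List) and an invented /replace-peers endpoint with a peers form field on the router's local HTTP API; it is not the actual kube-utils or router code linked above:

```go
// Hedged sketch only: periodically re-list Kubernetes nodes and push the
// addresses to the local Weave router. The "/replace-peers" path and the
// "peers" form field are invented placeholders, not Weave's real API.
package main

import (
	"context"
	"log"
	"net/http"
	"net/url"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	for range time.Tick(5 * time.Minute) {
		// Re-fetch the current node list from the Kubernetes API.
		nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			log.Printf("failed to list nodes: %v", err)
			continue
		}
		var addrs []string
		for _, n := range nodes.Items {
			for _, a := range n.Status.Addresses {
				if a.Type == "InternalIP" {
					addrs = append(addrs, a.Address)
				}
			}
		}
		// Hand the fresh peer list to the router over its local HTTP API
		// (placeholder path; 6784 is the port the weave script talks to).
		resp, err := http.PostForm("http://127.0.0.1:6784/replace-peers",
			url.Values{"peers": {strings.Join(addrs, " ")}})
		if err != nil {
			log.Printf("failed to update router peer list: %v", err)
			continue
		}
		resp.Body.Close()
	}
}
```

Presumably the list half would live in the kube-utils binary (the one already linked against client-go) and the HTTP handler in the router, matching the split described above.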

@Raffo

Raffo commented Aug 23, 2018

The problem here is really that we don't want to immediately drop nodes that are gone, since it could just be a temporary issue. We probably need a "timeout" after which we consider a node dead.
I'm actually quite surprised that this is not handled automatically, and it doesn't seem like a minor issue to me, so it's probably worth discussing further and in a bit more detail. What do you think?

@bboreham
Contributor

It's not a problem to delete a node from one peer's list of targets; if the node comes back it can add itself into the cluster and everyone will talk.

Then again, why are you deleting nodes from Kubernetes on a temporary basis?

@Raffo

Raffo commented Aug 23, 2018

We scale down for autoscaling purposes: on a testing cluster, we want to scale the AWS autoscaling group to 0 to make sure we don't waste money.

@bboreham
Contributor

The total impact that I am aware of is some failed connection attempts and associated log noise, so I consider it a minor issue. If you have evidence of some real consequence please open a new issue with the details.

@sgmiller
Author

sgmiller commented Aug 24, 2018 via email

@Raffo

Raffo commented Aug 24, 2018

Thanks for adding a bit more context. Did you ever experience issues like the one reported in #3384 ?

@sgmiller
Author

sgmiller commented Aug 24, 2018 via email

@Dmitry1987

By the way, the issue seems like an easy fix if it's just log noise. Those reconnect attempts need a global retention so that nodes which have been down for more than a week, at least, get cleaned up, IMHO.
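As a rough illustration of that retention idea (invented names, not Weave's internals), the bookkeeping could be as small as remembering when each target was last reachable and refusing to retry once a configurable window has passed:

```go
// Illustrative sketch of a retention window for connection targets.
// All names here are invented for the example.
package main

import (
	"fmt"
	"time"
)

type peerTracker struct {
	retention time.Duration
	lastOK    map[string]time.Time // target address -> time added or last successful contact
}

func (t *peerTracker) addTarget(addr string) {
	if _, ok := t.lastOK[addr]; !ok {
		t.lastOK[addr] = time.Now()
	}
}

func (t *peerTracker) markSuccess(addr string) {
	t.lastOK[addr] = time.Now()
}

// shouldRetry reports whether connection attempts to addr should continue,
// i.e. whether it has been reachable within the retention window.
func (t *peerTracker) shouldRetry(addr string) bool {
	last, ok := t.lastOK[addr]
	if !ok {
		return false // not a known target
	}
	return time.Since(last) < t.retention
}

func main() {
	t := &peerTracker{retention: 7 * 24 * time.Hour, lastOK: map[string]time.Time{}}
	t.addTarget("10.0.1.64:6783")
	fmt.Println(t.shouldRetry("10.0.1.64:6783")) // true: added just now
	t.lastOK["10.0.1.64:6783"] = time.Now().Add(-8 * 24 * time.Hour)
	fmt.Println(t.shouldRetry("10.0.1.64:6783")) // false: unreachable for over a week
}
```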

@dcowden

dcowden commented Sep 4, 2018

We have this problem too, though not because we use spot instances. In production, we maintain nodes by slowly deleting them on a rolling basis; over a 6-week period, the whole cluster is recycled. This keeps us free of problems with full disks, keeps the nodes patched, and keeps us free of rootkits installed by bad guys.

In development, we do it more often (nightly), which produces this issue.

@Raffo what is the recommended fix?

--UPDATED--

We are using Weave 2.4.0 via kops on k8s 1.10.3

@Raffo

Raffo commented Sep 4, 2018

AFAIK there is no recommended automated fix, and it's actually good info that you're running Weave 2.4.0. I'd say to continue on #3384, which I opened as suggested by @bboreham.

@Dmitry1987

k8s 1.11.3, AWS, weave 2.4.1
still the same :(

@Raffo

Raffo commented Sep 28, 2018 via email

@murali-reddy
Contributor

Quick update on this issue: please see

#3401 (comment) and #3399

I am planning to add forget once #3399 is merged: basically, watch for k8s node delete events and perform the necessary steps on receiving a node delete event.
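For anyone curious what "watch for k8s node delete events" might look like, here is a hedged sketch using a client-go shared informer; cleanUpPeer is an invented placeholder for whatever rmpeer/forget steps the real change in #3399 performs:

```go
// Hedged sketch: react to Kubernetes node deletions with a client-go
// informer. cleanUpPeer is an invented placeholder, not the code in #3399.
package main

import (
	"log"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func cleanUpPeer(nodeName string) {
	// Placeholder: here the real implementation would tell the Weave router
	// to forget the departed peer and reclaim its IP allocation.
	log.Printf("node %s deleted; would run the rmpeer/forget steps now", nodeName)
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	factory := informers.NewSharedInformerFactory(client, 0)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			// The object can be a tombstone (cache.DeletedFinalStateUnknown);
			// this sketch only handles the plain *v1.Node case.
			if node, ok := obj.(*v1.Node); ok {
				cleanUpPeer(node.Name)
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	select {} // block forever; real code would wire this to a shutdown signal
}
```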

@Dmitry1987

> Do the logs show more information with the new version? Can you post them? I'd be trying this soon myself, but any info would be great.


No, it is all exactly the same as before... I was just hoping it was fixed over time...

@murali-reddy that sounds great 👍

@Raffo

Raffo commented Oct 6, 2018

@murali-reddy Sorry to bother you with a silly question, but does this PR mean that the issues reported as fixed in 2.4.1 (https://github.com/weaveworks/weave/releases/tag/v2.4.1) still aren't fixed, and that there could be cases where reusing IPs can lead to the same problems reported, for example, in #3384?

@bboreham
Contributor

bboreham commented Oct 6, 2018

@Raffo the issues we reported as fixed are believed to be fixed.
"this PR" (which? Murali named two) is fixing other things.

No problem was ever identified in reusing IPs. Re-using node names delayed reclaim of IP ranges, but that was fixed in 2.4.1.

@weaveworks weaveworks deleted a comment from Raffo Oct 6, 2018
@weaveworks weaveworks deleted a comment from Raffo Oct 6, 2018
@weaveworks weaveworks deleted a comment from Raffo Oct 6, 2018
@Raffo

Raffo commented Oct 7, 2018

Sorry I meant "this issue", not PR. Thanks for the clarification.

@m0rganic

What's the status of this issue?

I have Weave Net components in one of my clusters reporting 7 hosts unreachable. I looked into it on my end, and it appears that all of the IPs that Weave Net is reporting as unreachable belong to nodes the cluster autoscaler deleted days ago.

I'm running on AWS with kops and k8s 1.10.3. We are also using a number of spot instances in this cluster, so there's a chance we're seeing more churn than average with these node groups.

Manually running a command like weave rmpeer <peer_id> doesn't seem like a viable long-term workaround. Is there any other way to avoid this, or an alternative remedy?

@Dmitry1987

Hi @m0rganic, this was recently fixed, as you can see in another issue I had open that was relevant (I think) to this old-node cleanup task: #3427 (comment)

@murali-reddy
Contributor

This issue is fixed in the 2.5.0 release (as part of #3399). weave rmpeer and weave forget are automated now. Please reopen if you still see any issue with the 2.5.0 release.
