This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

Lost K8S autoscaled nodes contacted indefinitely #3300

Closed

sgmiller opened this issue May 14, 2018 · 32 comments

Comments

@sgmiller

What did you expect to happen?

When a Kubernetes node is removed from our autoscaled cluster, Weave continues trying to contact it until an operator manually runs weave rmpeer and weave forget for the node from the command line.

What happened?

weave-net-mlqj4 weave INFO: 2018/05/14 14:30:16.028571 ->[10.0.3.96:6783] attempting connection
weave-net-p5mjh weave INFO: 2018/05/14 14:30:17.357611 Discovered remote MAC 5a:b9:a1:0c:57:a6 at 1e:bc:cb:84:0d:2a(ip-10-0-3-37.us-east-2.compute.internal)
weave-net-mlqj4 weave INFO: 2018/05/14 14:30:19.091097 ->[10.0.3.96:6783] error during connection attempt: dial tcp4 :0->10.0.3.96:6783: connect: no route to host
weave-net-2bgzv weave INFO: 2018/05/14 14:30:25.250138 ->[10.0.3.93:6783] attempting connection

... about once a minute

How to reproduce it?

Remove a Kubernetes node (terminate the instance)

Anything else we need to know?

Kubernetes 1.9.4, kubeadm cluster on AWS.

Versions:

$ weave version
weave 2.3.0
$ docker version
Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:10:54 2017
 OS/Arch:      linux/amd64
$ uname -a
Linux ip-10-0-3-160.us-east-2.compute.internal 4.8.0-59-generic #64-Ubuntu SMP Thu Jun 29 19:38:34 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.5", GitCommit:"f01a2bf98249a4db383560443a59bed0c13575df", GitTreeState:"clean", BuildDate:"2018-03-19T15:59:24Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.4", GitCommit:"bee2d1505c4fe820744d26d41ecd3fdd4a3d6546", GitTreeState:"clean", BuildDate:"2018-03-12T16:21:35Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Logs:

Grepped out a relevant host:

INFO: 2018/05/11 00:40:34.163026 Launch detected - using supplied peer list: [10.0.1.109 10.0.1.128 10.0.1.200 10.0.1.41 10.0.1.43 10.0.1.64 10.0.3.114 10.0.3.37 10.0.3.64 10.0.3.73 10.0.3.93]
INFO: 2018/05/11 00:40:34.175099 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:40:34.176893 ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2018/05/11 00:40:34.177060 overlay_switch ->[5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)] using fastdp
INFO: 2018/05/11 00:40:34.179623 ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: connection added (new peer)
INFO: 2018/05/11 00:40:34.200111 ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: connection fully established
INFO: 2018/05/11 00:40:34.206113 sleeve ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: Effective MTU verified at 8939
INFO: 2018/05/11 00:40:35.102470 [kube-peers] Added myself to peer list &{[{e6:4b:b7:4c:df:d0 ip-10-0-1-128.us-east-2.compute.internal} {c2:18:9c:74:e7:f2 ip-10-0-3-64.us-east-2.compute.internal} {aa:84:b9:7a:13:5a ip-10-0-3-114.us-east-2.compute.internal} {86:41:ea:22:67:e5 ip-10-0-3-73.us-east-2.compute.internal} {e6:d6:b7:a2:bf:26 ip-10-0-1-41.us-east-2.compute.internal} {86:f7:66:2b:87:19 ip-10-0-1-200.us-east-2.compute.internal} {7a:e5:64:52:cd:43 ip-10-0-3-93.us-east-2.compute.internal} {1e:bc:cb:84:0d:2a ip-10-0-3-37.us-east-2.compute.internal} {22:02:44:cb:1a:e5 ip-10-0-1-109.us-east-2.compute.internal} {5e:31:34:2d:2e:e9 ip-10-0-1-64.us-east-2.compute.internal} {8a:50:e4:5b:fc:b3 ip-10-0-1-43.us-east-2.compute.internal}]}
INFO: 2018/05/11 00:40:35.968265 Discovered remote MAC 5e:31:34:2d:2e:e9 at 5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)
INFO: 2018/05/11 00:40:36.070057 Discovered remote MAC e2:05:43:3e:09:b9 at 5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)
INFO: 2018/05/11 00:40:36.070427 Discovered remote MAC b2:1b:27:4c:a4:52 at 5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)
INFO: 2018/05/11 00:40:36.215259 Discovered remote MAC 3a:3c:9f:e7:7d:08 at 5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)
INFO: 2018/05/11 00:42:00.633047 ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: connection shutting down due to error: read tcp4 10.0.1.43:50037->10.0.1.64:6783: read: connection reset by peer
INFO: 2018/05/11 00:42:00.633138 ->[10.0.1.64:6783|5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)]: connection deleted
INFO: 2018/05/11 00:42:00.640213 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:00.640585 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: connection refused
INFO: 2018/05/11 00:42:00.646541 Removed unreachable peer 5e:31:34:2d:2e:e9(ip-10-0-1-64.us-east-2.compute.internal)
INFO: 2018/05/11 00:42:03.457897 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:03.458262 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: connection refused
INFO: 2018/05/11 00:42:05.602050 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:05.602441 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: connection refused
INFO: 2018/05/11 00:42:11.717182 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:11.717543 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: connection refused
INFO: 2018/05/11 00:42:21.712218 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:40.031406 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:42:53.518432 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:42:56.575443 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:43:12.400009 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:43:15.455412 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:43:41.529976 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:43:44.607391 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:44:32.622034 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:44:35.679413 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:45:27.154253 ->[10.0.1.64:6783] attempting connection
INFO: 2018/05/11 00:45:30.207380 ->[10.0.1.64:6783] error during connection attempt: dial tcp4 :0->10.0.1.64:6783: connect: no route to host
INFO: 2018/05/11 00:46:46.325098 ->[10.0.1.64:6783] attempting connection

.. and so on for days.

See also issue #2797

@bboreham
Contributor

This is the same as #3021; see particularly #3021 (comment)

@sgmiller
Author

I've read that issue. How often does "eventually restart" happen? It seems to be never.

@bboreham
Contributor

I was thinking things like kernel upgrades, power-downs, hardware failures would lead to restarts of the Weave Net pods.

@sgmiller
Author

Okay, so it's expected that we'll have these errors. It created some concern for us because our nodes cycle fairly frequently based on spot price bidding, so many of the Weave pods are constantly logging contact attempts for numerous old nodes. It feels like it should give up after some long (configurable?) time.

@bboreham
Contributor

I'm not saying the behaviour is correct, merely noting my comments last time it was raised.

Happy to have an issue open, and/or to receive PRs improving it.

@sgmiller
Author

Understood. My apologies, I thought this was intended behaviour.

@frittentheke

@bboreham in your comment #3300 (comment) you talk about "having an issue open".

Isn't this issue, #3300, reported in enough detail to narrow down the cause and fix / improve this?

@bboreham
Contributor

@frittentheke I believe so.

The cause is that there is no code to re-fetch the peer list from Kubernetes after startup.

@Dmitry1987

Hi, we have the same issue, also with weaveworks/weave-kube:2.3.0.
Is there a fix? Would upgrading to 2.4.0 solve this?

@bboreham
Contributor

bboreham commented Aug 9, 2018

No fix yet.

@Raffo

Raffo commented Aug 23, 2018

@bboreham What would be needed for a fix? Is there a PR or something I can contribute to? I hit this issue today in a cluster where we scale the nodes up and down a lot, and I had to manually remove the local .db file and restart all the nodes to clean them up. I'd love to work on automating a solution, either by integrating it into Weave or in an additional pod/container.

@bboreham
Contributor

bboreham commented Aug 23, 2018

Something needs to wake up periodically, re-fetch the peer list from Kubernetes and supply it to the Weave Net router process.

It's a little fiddly since we have two binaries, one linked with Kubernetes client-go and one not.

Here is the code that fetches the peer list from Kubernetes:
https://github.com/weaveworks/weave/blob/master/prog/kube-utils/main.go#L28

Here is code that handles a REST call to replace the peer list inside the router:
https://github.com/weaveworks/weave/blob/master/router/http.go#L16

and then write a test. (Note #3372 has a test for this, but it's unfinished)
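To make the shape of that concrete, here is a minimal hedged sketch, assuming a recent client-go (context-taking List) and an invented /replace-peers endpoint with a peers form field on the router's local HTTP API; it is not the actual kube-utils or router code linked above:

```go
// Hedged sketch only: periodically re-list Kubernetes nodes and push the
// addresses to the local Weave router. The "/replace-peers" path and the
// "peers" form field are invented placeholders, not Weave's real API.
package main

import (
	"context"
	"log"
	"net/http"
	"net/url"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	for range time.Tick(5 * time.Minute) {
		// Re-fetch the current node list from the Kubernetes API.
		nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			log.Printf("failed to list nodes: %v", err)
			continue
		}
		var addrs []string
		for _, n := range nodes.Items {
			for _, a := range n.Status.Addresses {
				if a.Type == "InternalIP" {
					addrs = append(addrs, a.Address)
				}
			}
		}
		// Hand the fresh peer list to the router over its local HTTP API
		// (placeholder path; 6784 is the port the weave script talks to).
		resp, err := http.PostForm("http://127.0.0.1:6784/replace-peers",
			url.Values{"peers": {strings.Join(addrs, " ")}})
		if err != nil {
			log.Printf("failed to update router peer list: %v", err)
			continue
		}
		resp.Body.Close()
	}
}
```

Presumably the list half would live in the kube-utils binary (the one already linked against client-go) and the HTTP handler in the router, matching the split described above.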

@Raffo

Raffo commented Aug 23, 2018

The problem here is really that we don't want to immediately drop nodes that are gone, since it could just be a temporary issue. We probably need a "timeout" after which we consider a node dead.
I'm actually quite surprised that this is not handled automatically, and it doesn't seem like a minor issue to me, so it's probably worth discussing further and in a bit more detail. What do you think?

@bboreham
Contributor

It's not a problem to delete a node from one peer's list of targets; if the node comes back it can add itself into the cluster and everyone will talk.

Then again, why are you deleting nodes from Kubernetes on a temporary basis?

@Raffo

Raffo commented Aug 23, 2018

We scale down for autoscaling purposes: on a testing cluster, we want to scale the AWS autoscaling group to 0 to make sure we don't waste money.

@bboreham
Contributor

The total impact that I am aware of is some failed connection attempts and associated log noise, so I consider it a minor issue. If you have evidence of some real consequence please open a new issue with the details.

@sgmiller
Author

sgmiller commented Aug 24, 2018 via email

@Raffo

Raffo commented Aug 24, 2018

Thanks for adding a bit more context. Did you ever experience issues like the one reported in #3384 ?

@sgmiller
Author

sgmiller commented Aug 24, 2018 via email

@Dmitry1987

By the way, the issue seems like an easy fix if it's just log noise. Those reconnect attempts need a global retention so that nodes which have been down for more than a week, at least, get cleaned up, IMHO.
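As a rough illustration of that retention idea (invented names, not Weave's internals), the bookkeeping could be as small as remembering when each target was last reachable and refusing to retry once a configurable window has passed:

```go
// Illustrative sketch of a retention window for connection targets.
// All names here are invented for the example.
package main

import (
	"fmt"
	"time"
)

type peerTracker struct {
	retention time.Duration
	lastOK    map[string]time.Time // target address -> time added or last successful contact
}

func (t *peerTracker) addTarget(addr string) {
	if _, ok := t.lastOK[addr]; !ok {
		t.lastOK[addr] = time.Now()
	}
}

func (t *peerTracker) markSuccess(addr string) {
	t.lastOK[addr] = time.Now()
}

// shouldRetry reports whether connection attempts to addr should continue,
// i.e. whether it has been reachable within the retention window.
func (t *peerTracker) shouldRetry(addr string) bool {
	last, ok := t.lastOK[addr]
	if !ok {
		return false // not a known target
	}
	return time.Since(last) < t.retention
}

func main() {
	t := &peerTracker{retention: 7 * 24 * time.Hour, lastOK: map[string]time.Time{}}
	t.addTarget("10.0.1.64:6783")
	fmt.Println(t.shouldRetry("10.0.1.64:6783")) // true: added just now
	t.lastOK["10.0.1.64:6783"] = time.Now().Add(-8 * 24 * time.Hour)
	fmt.Println(t.shouldRetry("10.0.1.64:6783")) // false: unreachable for over a week
}
```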

@dcowden

dcowden commented Sep 4, 2018

We have this problem too, though not because we use spot instances. In production, we maintain nodes by slowly deleting them on a rolling basis; over a 6-week period, the whole cluster is recycled. This keeps us free of problems with full disks, keeps the nodes patched, and keeps us free of rootkits installed by bad guys.

In development, we do it more often (nightly), which produces this issue.

@Raffo what is the recommended fix?

--UPDATED--

We are using Weave 2.4.0 via kops on k8s 1.10.3

@Raffo

Raffo commented Sep 4, 2018

AFAIK there is no recommended automated fix, and it's actually good info that you're running Weave 2.4.0. I'd say to continue on #3384, which I opened as suggested by @bboreham.

@Dmitry1987

k8s 1.11.3, AWS, weave 2.4.1
still the same :(

@Raffo

Raffo commented Sep 28, 2018 via email

@murali-reddy
Contributor

Quick update on this issue: please see

#3401 (comment) and #3399

I am planning to add forget once #3399 is merged: basically, watch for k8s node delete events and perform the necessary steps on receiving a node delete event.
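For anyone curious what "watch for k8s node delete events" might look like, here is a hedged sketch using a client-go shared informer; cleanUpPeer is an invented placeholder for whatever rmpeer/forget steps the real change in #3399 performs:

```go
// Hedged sketch: react to Kubernetes node deletions with a client-go
// informer. cleanUpPeer is an invented placeholder, not the code in #3399.
package main

import (
	"log"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func cleanUpPeer(nodeName string) {
	// Placeholder: here the real implementation would tell the Weave router
	// to forget the departed peer and reclaim its IP allocation.
	log.Printf("node %s deleted; would run the rmpeer/forget steps now", nodeName)
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	factory := informers.NewSharedInformerFactory(client, 0)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			// The object can be a tombstone (cache.DeletedFinalStateUnknown);
			// this sketch only handles the plain *v1.Node case.
			if node, ok := obj.(*v1.Node); ok {
				cleanUpPeer(node.Name)
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	select {} // block forever; real code would wire this to a shutdown signal
}
```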

@Dmitry1987

> Do the logs show more information with the new version? Can you post them? I'd be trying this soon myself, but any info would be great.


No, it is all exactly the same as before... I was just hoping it was fixed over time...

@murali-reddy that sounds great 👍

@Raffo

Raffo commented Oct 6, 2018

@murali-reddy Sorry to bother you with a silly question, but does this PR mean that the issues reported as fixed in 2.4.1 (https://github.com/weaveworks/weave/releases/tag/v2.4.1) still aren't fixed, and that there could be cases where reusing IPs can lead to the same problems reported, for example, in #3384?

@bboreham
Contributor

bboreham commented Oct 6, 2018

@Raffo the issues we reported as fixed are believed to be fixed.
"this PR" (which? Murali named two) is fixing other things.

No problem was ever identified in reusing IPs. Re-using node names delayed reclaim of IP ranges, but that was fixed in 2.4.1.

@weaveworks weaveworks deleted a comment from Raffo Oct 6, 2018
@weaveworks weaveworks deleted a comment from Raffo Oct 6, 2018
@weaveworks weaveworks deleted a comment from Raffo Oct 6, 2018
@Raffo

Raffo commented Oct 7, 2018

Sorry I meant "this issue", not PR. Thanks for the clarification.

@m0rganic

What's the status of this issue?

I have Weave Net components in one of my clusters reporting 7 hosts unreachable. I looked into it on my end, and it appears that all of the IPs that Weave Net is reporting as unreachable belong to nodes the cluster autoscaler deleted days ago.

I'm running on AWS with kops and k8s 1.10.3. We are also using a number of spot instances in this cluster, so there's a chance we're seeing more churn than average with these node groups.

Manually running a command like weave rmpeer <peer_id> doesn't seem like a viable long-term workaround. Is there any other way to avoid this, or an alternative remedy?

@Dmitry1987

Hi @m0rganic, this was recently fixed, as you can see in another issue I had open that was relevant (I think) to this old-node cleanup task: #3427 (comment)

@murali-reddy
Contributor

This issue is fixed in the 2.5.0 release (as part of #3399). weave rmpeer and weave forget are automated now. Please reopen if you still see any issue with the 2.5.0 release.
