Lost K8S autoscaled nodes contacted indefinitely #3300
Comments
This is the same as #3021; see particularly #3021 (comment) |
I read that issue. How often does "eventually restart" actually happen? For us it seems to be never. |
I was thinking things like kernel upgrades, power-downs, hardware failures would lead to restarts of the Weave Net pods. |
Okay, so it's expected that we'll have these errors. This caused some concern for us, since our nodes cycle frequently based on spot-price bidding, so many of the Weave pods are constantly logging contact attempts to numerous old nodes. It feels like Weave should give up after some long (configurable?) time. |
I'm not saying the behaviour is correct, merely noting my comments last time it was raised. Happy to have an issue open, and/or to receive PRs improving it. |
Understood. My apologies, I thought this was intended behaviour. |
@bboreham in your comment #3300 (comment) you talk about "having an issue open". Isn't this issue, #3300, reported in enough detail to narrow down the cause and fix / improve this? |
@frittentheke I believe so. The cause is that there is no code to re-fetch the peer list from Kubernetes after startup. |
Hi, we have the same issue, also with weaveworks/weave-kube:2.3.0. |
No fix yet. |
@bboreham What would be needed for a fix? Is there a PR or something I can contribute on? I had this issue in a cluster today where we scale the nodes up and down a lot and I had to manually remove the local |
Something needs to wake up periodically, re-fetch the peer list from Kubernetes, and supply it to the Weave Net router process. It's a little fiddly since we have two binaries, one linked with Kubernetes client-go and one not. Here is the code that fetches the peer list from Kubernetes; here is the code that handles a REST call to replace the peer list inside the router. Then a test needs to be written. (Note: #3372 has a test for this, but it's unfinished.) |
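To make the shape of that fix easier to picture, here is a minimal Go sketch of such a reconcile loop. It is not Weave Net's actual code: it assumes an in-cluster client-go config, an arbitrary 5-minute interval, and a hypothetical `/replace-peers` path on the router's local HTTP API standing in for the real REST call mentioned above.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"net/url"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Run inside the cluster, as the weave-net DaemonSet pods do.
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Wake up periodically, re-fetch the node list, and push it to the router.
	for range time.Tick(5 * time.Minute) {
		nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
		if err != nil {
			log.Printf("listing nodes: %v", err)
			continue
		}
		var peers []string
		for _, n := range nodes.Items {
			for _, addr := range n.Status.Addresses {
				if addr.Type == corev1.NodeInternalIP {
					peers = append(peers, addr.Address)
				}
			}
		}
		// Hypothetical endpoint: ask the local router to replace its target
		// list with the freshly fetched set of peers.
		form := url.Values{"peers": {strings.Join(peers, " ")}}
		resp, err := http.PostForm("http://127.0.0.1:6784/replace-peers", form)
		if err != nil {
			log.Printf("updating router peers: %v", err)
			continue
		}
		resp.Body.Close()
	}
}
```

The eventual fix (#3399, shipped in 2.5.0 according to the comments below) follows this general idea; the sketch above only illustrates the two moving parts described in the comment: a periodic fetch via client-go and a local REST call to the router. |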
The real problem here is that we don't want to immediately drop nodes that are gone, since it could just be a temporary issue. We probably need a timeout after which we consider the nodes dead. |
It's not a problem to delete a node from one peer's list of targets; if the node comes back it can add itself into the cluster and everyone will talk. Then again, why are you deleting nodes from Kubernetes on a temporary basis? |
We scale down for autoscaling purposes: on a testing cluster, we want to scale the AWS autoscaling group to 0 to make sure we don't waste money. |
The total impact that I am aware of is some failed connection attempts and associated log noise, so I consider it a minor issue. If you have evidence of some real consequence please open a new issue with the details. |
To add some context: our Kubernetes clusters consist of a few dozen nodes allocated to a specific EC2 instance type, but we also run a controller which bids on spot instances to save cost. These instances semi-frequently get removed as spot prices fluctuate, so the node comes down and another is spun up. Over a few weeks we wind up with quite a lot of "lost" instances. All part of normal operation.
|
Thanks for adding a bit more context. Did you ever experience issues like the one reported in #3384 ? |
No, I haven't had that particular problem.
|
By the way, the issue seems easy to fix if it is just log noise. Those reconnect attempts need a global retention, so that nodes which have been down for more than, say, a week are cleaned up, IMHO. |
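To make that retention idea concrete, here is a rough Go sketch (hypothetical names, not Weave Net code) of tracking when each reconnect target was last reachable and pruning targets that have been unreachable for longer than a configurable window, such as a week.

```go
package peers

import (
	"sync"
	"time"
)

// staleTargetPruner records the last successful contact with each peer and
// drops peers that have been unreachable longer than the retention window.
type staleTargetPruner struct {
	mu        sync.Mutex
	retention time.Duration
	lastSeen  map[string]time.Time // peer address -> last successful contact
}

func newStaleTargetPruner(retention time.Duration) *staleTargetPruner {
	return &staleTargetPruner{retention: retention, lastSeen: make(map[string]time.Time)}
}

// MarkSeen records a successful connection to a peer.
func (p *staleTargetPruner) MarkSeen(peer string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.lastSeen[peer] = time.Now()
}

// Prune returns the peers that have not been seen within the retention
// window; the caller would stop retrying (and forget) these targets.
func (p *staleTargetPruner) Prune() []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	var stale []string
	cutoff := time.Now().Add(-p.retention)
	for peer, seen := range p.lastSeen {
		if seen.Before(cutoff) {
			stale = append(stale, peer)
			delete(p.lastSeen, peer)
		}
	}
	return stale
}
```

A loop like the one sketched earlier in this thread could call `Prune` periodically and tell the router to forget the returned peers; whether to favour this retention approach or re-fetching the peer list from Kubernetes is exactly the design question discussed above. |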
We have this problem too, though not because we use spot instances. In production, we maintain nodes by slowly deleting them on a rolling basis; over a 6-week period the whole cluster is recycled. This keeps us free of filled-disk problems, keeps the nodes patched, and keeps us free of rootkits installed by bad guys. In development we do it more often (nightly), which produces this issue. @Raffo what is the recommended fix? UPDATED: we are using Weave 2.4.0 via kops on k8s 1.10.3. |
k8s 1.11.3, AWS, weave 2.4.1; still the same :( |
Do the logs show more information with the new version? Can you post them? I'll be trying this soon myself, but any info would be great.
|
Quick update on this issue: please see #3401 (comment) and #3399. I am planning to add |
No, it is exactly the same as before... I was just hoping it had been fixed over time... @murali-reddy that sounds great 👍 |
@murali-reddy Sorry to bother you, and apologies for the silly question, but does this PR mean that the issue reported as fixed in |
@Raffo the issues we reported as fixed are believed to be fixed. No problem was ever identified in reusing IPs. Re-using node names delayed reclaim of IP ranges, but that was fixed in 2.4.1. |
Sorry I meant "this issue", not PR. Thanks for the clarification. |
What's the status of this issue? I have Weave Net components in one of my clusters reporting 7 hosts unreachable. I looked into it on my end, and it appears that all of the IPs that Weave Net reports as unreachable belong to nodes the cluster autoscaler deleted days ago. I'm running on AWS with kops and k8s 1.10.3. We also use a number of spot instances on this cluster, so there's a chance we're seeing more churn than average with these node groups. Manually running a command like |
Hi @m0rganic, this was recently fixed, as you can see in another issue I had open that was (I think) relevant to this cleanup of old nodes: #3427 (comment) |
This issue is fixed in the 2.5.0 release (as part of #3399). |
What you expected to happen?
When a Kubernetes node is removed from our autoscaled cluster, Weave continues trying to contact it until an operator manually runs `rmpeer` and `forget` for the node via the command line.
What happened?
Weave keeps trying to contact the removed node, about once a minute (see the grepped logs below).
How to reproduce it?
Remove a Kubernetes node (terminate the instance)
Anything else we need to know?
Kubernetes 1.9.4, kubeadm cluster on AWS.
Versions:
Logs:
Grepped out a relevant host:
... and so on for days.
See also issue #2797