This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

Weave (2.4.1) pods memory leak with "seeded by different peers" error #3427

Closed
Dmitry1987 opened this issue Oct 12, 2018 · 12 comments

Comments

@Dmitry1987

What you expected to happen?

The weave pod from the default DaemonSet should not eat up 14-16 GB of RAM 👍

What happened?

Memory leak of all weave pods in cluster:
[screenshot: memory usage graph of the weave pods]

At first we thought it was our app, but we were surprised to see it was the weave pod on all servers.

How to reproduce it?

  1. Create a new K8s 1.11.3 cluster with kubeadm, following the default instructions from the official documentation. The only custom step was modifying the weave daemonset manifest to set "CONN_LIMIT=200" so that more nodes could connect to the cluster (see the sketch after this list). We scaled up to 40 nodes per cluster and then back down to 10-12 nodes per cluster; the leak now happens with 10-12 nodes, on 3 different K8s clusters of the same version/type.
  2. Apply this weave manifest:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
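
A minimal sketch of how the CONN_LIMIT change can be applied in place, assuming the daemonset is named weave-net and the router container is named weave (the names used by the standard manifest):

# assumed names: daemonset "weave-net", container "weave"
kubectl -n kube-system set env daemonset/weave-net -c weave CONN_LIMIT=200
# wait for the daemonset to roll out pods with the new setting
kubectl -n kube-system rollout status daemonset/weave-net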

Anything else we need to know?

K8s master on DigitalOcean, 6 CPU / 16 GB RAM.

Versions:

Ubuntu:

4.15.0-30-generic #32-Ubuntu SMP Thu Jul 26 17:42:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

kubectl / kubelet / kubeadm are 1.11.3

Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T18:02:47Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

weave daemonset:

image: docker.io/weaveworks/weave-kube:2.4.1
image: docker.io/weaveworks/weave-npc:2.4.1

Logs:

The most problematic piece of the log appeared when someone re-attached a machine from one cluster to another without cleaning it up properly, I think. I'm not sure about the root cause, but I'd expect weave to handle this without memory/CPU leaks. (By the way, CPU usage was at 50% on a leaked pod I just checked with 'docker stats', so both memory and CPU suffer from this.)

{"log":"INFO: 2018/10/12 18:52:37.938078 -\u003e[(some-worker-node-prod-bare-metal)]: connection ready; using protocol version 2\n","stream":"stderr","time":"2018-10-12T18:52:37.938270141Z"}
{"log":"INFO: 2018/10/12 18:52:37.938186 overlay_switch -\u003e[(some-worker-node-prod-bare-metal)] using fastdp\n","stream":"stderr","time":"2018-10-12T18:52:37.938338473Z"}
{"log":"INFO: 2018/10/12 18:52:37.938244 -\u003e[(some-worker-node-prod-bare-metal)]: connection added (new peer)\n","stream":"stderr","time":"2018-10-12T18:52:37.938833501Z"}
{"log":"INFO: 2018/10/12 18:52:37.939725 -\u003e[(some-worker-node-prod-bare-metal)]: connection shutting down due to error: IP allocation was seeded by different peers (received: [master-node-2], ours: [master-node-1])\n","stream":"stderr","time":"2018-10-12T18:52:37.940003472Z"}
{"log":"INFO: 2018/10/12 18:52:37.940188 -\u003e[(some-worker-node-prod-bare-metal)]: connection deleted\n","stream":"stderr","time":"2018-10-12T18:52:37.940309289Z"}
{"log":"INFO: 2018/10/12 18:52:37.946334 -\u003e[(some-woreker-node-3)]: connection ready; using protocol version 2\n","stream":"stderr","time":"2018-10-12T18:52:37.946529924Z"}
{"log":"INFO: 2018/10/12 18:52:37.947290 -\u003e[(some-worker-node-prod-bare-metal-2)]: connection shutting down due to error: IP allocation was seeded by different peers (received: [master-node-2], ours: [master-node-1])\n","stream":"stderr","time":"2018-10-12T18:52:37.947450411Z"}
{"log":"INFO: 2018/10/12 18:52:37.947452 -\u003e[(some-worker-node-prod-bare-metal-2)]: connection deleted\n","stream":"stderr","time":"2018-10-12T18:52:37.947608711Z"}
{"log":"INFO: 2018/10/12 18:52:37.947809 overlay_switch -\u003e[(some-worker-node-prod-bare-metal-2)] fastdp write tcp4 xxxxxxxxx:6783-\u003exxxxxx:47036: use of closed network connection\n","stream":"stderr","time":"2018-10-12T18:52:37.947963738Z"}

I don't know if this is the cause, but either way it's expected that weave will not misbehave in terms of resource usage when this happens.

@Dmitry1987
Author

Also here's a screenshot from one of the leaking nodes:

[screenshot: process list on a leaking node showing high CPU usage]

The CPU usage is high, while on other nodes in similar clusters (exact same config and versions, AFAIK) that don't leak, I don't see this process using high CPU.

@bboreham
Contributor

Hi, thanks for the report, could you post the full logs of the weave container please?

@bboreham
Contributor

Since it's at 102% CPU, perhaps it is in a tight loop.
If you can repeat this, please send the process a SIGQUIT and then post the full logs.
Thanks!
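
A rough sketch of one way to do this on the affected node (the process name "weaver" and the pod/container names here are assumptions):

# "weaver" is assumed to be the router process running inside the weave container
pkill -QUIT weaver
# then capture the container logs, which should now include the goroutine stack dump
kubectl -n kube-system logs <weave-net-pod> -c weave > weave-sigquit.log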

@Dmitry1987
Author

Yes, I can post the logs; I'll try to collect them now (I need to clean up the IPs first).

I also tried to profile it, and here are the results (the same ones I reported in the Slack channel):
[screenshot: pprof heap profile output]

and after some time it looks like this:

go tool pprof http://127.0.0.1:6784/debug/pprof/heap?debug=2
Fetching profile from http://127.0.0.1:6784/debug/pprof/heap?debug=2
Saved profile in /root/pprof/pprof.127.0.0.1:6784.inuse_objects.inuse_space.007.pb.gz
Entering interactive mode (type "help" for commands)
(pprof) top5
277.56MB of 321.06MB total (86.45%)
Dropped 570 nodes (cum <= 1.61MB)
Showing top 5 nodes out of 55 (cum >= 14.50MB)
      flat  flat%   sum%        cum   cum%
  158.05MB 49.23% 49.23%   158.05MB 49.23%  github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.makeConnsMap
   63.50MB 19.78% 69.01%    63.50MB 19.78%  encoding/gob.decString
       27MB  8.41% 77.42%    29.63MB  9.23%  github.com/weaveworks/weave/router.(*OverlaySwitch).PrepareConnection
    14.51MB  4.52% 81.93%    14.51MB  4.52%  runtime.malg
    14.50MB  4.52% 86.45%    14.50MB  4.52%  github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.newPeerFromSummary

@Dmitry1987
Author

Here is the full log: https://gist.github.com/Dmitry1987/27c46a0ce9fd5bce097098044721de12

I noticed that, for example, one leaking machine is always mentioned in these messages:

 overlay_switch ->[(xxxxxxxxx)] fastdp write tcp4 xxxxxxxxxx:6783->xxxxxxxx:55818: use of closed network connection
using sleeve
overlay_switch ->[(xxxxxx)] sleeve write tcp4 xxxxxx:6783->xxxxxx:55818: use of closed network connection
connection shutting down due to error: write tcp4 xxxxxx:41378->xxxxx:6783: write: connection reset by peer

So this might be a hint about the issue.

@Dmitry1987
Author

I suspect it is because one of the nodes was redeployed to another cluster by mistake and produces these "connection shutting down due to error: IP allocation was seeded by different peers" messages... but I still can't find which node it is, because it appears that all weave pods in cluster-1 complain that the "cluster-2" nodes are "seeded by different peers", and I don't get how they learned about each other in the first place :D
All "cluster-2" weave pods also log errors about all "cluster-1" nodes being foreign intruders :) ...

The redeploy job we have uses kubeadm reset plus deleting the weave db at "/var/lib/weave/weave-netdata.db" (roughly as in the sketch below).
What else can we do to make sure such a thing never happens? And why does this "kill" weave instead of it (silently or not) ignoring those peers and functioning as usual with the rest of the valid nodes?
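
A simplified sketch of such a reset; the weave db path is the one above, while the CNI config path and the iptables flush are assumptions about what else might need cleaning:

# reset a node before re-joining it to another cluster
kubeadm reset
rm -f /var/lib/weave/weave-netdata.db     # Weave's persisted IPAM/peer state
rm -f /etc/cni/net.d/10-weave.conflist    # assumed path of the CNI config written by weave-kube
# assumed: flush any iptables rules left behind
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
kubeadm join <new-master>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>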

@bboreham
Contributor

I'm not clear how it gets to those lines in the profile or what it is trying to do. So it would still be good to get the log after SIGQUIT, or equivalently the goroutine profile.
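
A sketch of grabbing that goroutine profile, using the same pprof endpoint as the heap profile above:

# debug=2 prints full goroutine stacks as plain text
curl -s "http://127.0.0.1:6784/debug/pprof/goroutine?debug=2" > weave-goroutines.txt
# or interactively:
go tool pprof http://127.0.0.1:6784/debug/pprof/goroutine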

@Dmitry1987
Author

@bboreham ok, we figured this out. The leak was in a function related to the "IP allocation was seeded by different peers" error, after some of our machines were redeployed to another cluster and their weave db probably wasn't deleted by the automation scripts.
The moment we cleaned up the weave db (I just deleted the file on all machines, deleted the daemonset, then recreated it, roughly as in the sketch below), the memory leak stopped, and so did the 'bad IPs' errors in the logs.
I hope this helps to locate and fix the leak (it should reproduce easily by creating the IP collision and watching the effect on memory; we had 15 worker nodes and 1 master in that cluster, and the second cluster was also 15 nodes, so it may not reproduce with fewer machines, I don't know).
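
For the record, roughly what that cleanup looked like (a sketch; the daemonset name weave-net is an assumption based on the standard manifest):

kubectl -n kube-system delete daemonset weave-net
# on every node:
rm -f /var/lib/weave/weave-netdata.db
# then re-create from the same manifest as in the reproduction steps:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"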

@bboreham
Contributor

From the set of symptoms, it could be similar to #1830 - gossip data building up.

The "seeded by different peers" condition is absolutely fatal to Weave Net, so maybe we should think about ways to make that clearer. It doesn't really help to fix the memory blow-up when it then won't communicate between nodes.

@bboreham bboreham changed the title Weave (2.4.1) pods memory leak on K8s 1.11.3 Weave (2.4.1) pods memory leak with "seeded by different peers" error Oct 15, 2018
@Dmitry1987
Author

@bboreham that makes sense, but in our case the memory leak was worse for our application than losing weave and the overlay networking. We have a service on these clusters that doesn't communicate with any other pods, so it doesn't care about the overlay; it's essentially standalone. We used K8s for it just for the easy Helm templating and to keep the same "format" as all our other projects and services :) It's a set of several separate clusters that run an important service as HA on many machines.

So the moment weave died, we didn't even notice (I guess we would have noticed if we had microservices in this cluster that talk to each other, but our app is standalone and the world talks to it directly on all nodes). Only when the memory leak affected the performance of the "standalone" app on these nodes did we find out something was broken :)

I mean, the memory leak needs to be limited or fixed somehow anyway, IMHO, to avoid even more damage on top of the loss of overlay connectivity. (I'm not even sure what the 'fatal' condition you mean looks like in practice; the pods cannot talk to each other or something? I never tested this, since the app does not require pod-to-pod communication. I know it might sound like overkill to use K8s for an app like that, but it's just so easy and nice to use K8s daemonsets, kubeadm init/join for new nodes, configmaps and their tracking in git, and easy metrics and logging collection... so we went with K8s for that.)

One more question though: can the weave corruption happen even after the /var/lib/weave/ folder was cleaned (for sure) and we did "kubeadm reset" and then "kubeadm join" to another cluster? (Maybe it picked up the "bad IPs" of the previous cluster from iptables rules that were not flushed?)

I ask because iptables was the only thing that could still hold traces of this machine's old "peers". I can't find how the "bad peers" condition was caused; the logs of the "re-attach worker node to another K8s cluster" job (which we use to move nodes between clusters when we really need to) clearly show that /var/lib/weave/ was deleted before the new "kubeadm join" was run.

@bboreham
Contributor

bboreham commented Nov 1, 2018

When you remove a peer from a cluster, all the peers that were told to connect to it at startup will continue to attempt a re-connection. So if you want to add it to a new cluster, it is possible the old cluster will re-connect to it first.

After #3399, if you delete a node in the Kubernetes api-server then all the remaining nodes in that cluster will remove that node from their list and stop trying to re-connect.
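
A sketch of that cleanup path (the exec path and the pod/container names are assumptions based on the standard manifest):

# with #3399, deleting the node object is enough for the remaining peers to stop reconnecting
kubectl delete node <node-name>
# to check the peer list afterwards:
kubectl -n kube-system exec <weave-net-pod> -c weave -- /home/weave/weave --local status peers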

@Dmitry1987
Author

OK thanks, it makes sense to clean up deleted nodes automatically. That also solves the memory leak, right? I guess we can close this issue then 👍
