This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

watch for Kubernetes node delete events and reclaim removed peers IP space on delete event #3399

Merged
merged 8 commits into from
Nov 1, 2018

Conversation

murali-reddy
Contributor

Changes to invoke reclaimRemovedPeers in kube-utils on Kubernetes node delete event

@bboreham
Contributor

One thought: triggering a reclaim on all peers immediately will give a "thundering herd" effect with lots of contention over the ConfigMap we use as a lock.

Maybe have each peer sleep a random number of seconds before beginning, to avoid this?

@murali-reddy
Contributor Author

Maybe have each peer sleep a random number of seconds before beginning, to avoid this?

Agree. I will add random delay.

@murali-reddy
Contributor Author

Following up on the comments from the slack conversation
https://weave-community.slack.com/archives/C9N5HME4B/p1536572237000100

It proposes to run another process in the background to listen to Kubernetes events, and I wondered if it would be better to put it in a sidecar instead (i.e. a third container in the DaemonSet). Originally I was worried about defunct processes, but now I think that's ok because the shell will be the parent (in #3399).
Another downside of just firing off the process in the background is that if it hits a problem and dies, this might not be noticed.

I'm in favor of moving the functionality into weaver. I'm afraid that over time weave-kube/launch.sh will become a new weave (shell script).

@brb Are you suggesting moving kube-utils functionality into weaver, or the complete Kubernetes bits (launch.sh + weave-kube)? That would be significant work. The changes in this PR are very contained, and I was wondering if we should go with the sidecar approach for now, for immediate functional benefit?

@brb
Contributor

brb commented Sep 12, 2018

@murali-reddy

Are you suggesting moving kube-utils functionality into weaver?

Just the reclaim parts from kube-utils. It shouldn't require a lot of effort: move it to a separate pkg and start the reclaiming process in a goroutine.
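Started from inside weaver, the suggestion amounts to something like the sketch below: no extra process or sidecar, just a goroutine in the existing process. The `watchAndReclaim` function is a hypothetical stand-in for the reclaim logic that currently lives in kube-utils.

```go
package main

import (
	"fmt"
	"time"
)

// watchAndReclaim stands in for the reclaim logic from kube-utils;
// here it just signals on done once it is up (hypothetical).
func watchAndReclaim(done chan<- string) {
	// ... set up the node informer, handle delete events, reclaim IP space ...
	done <- "reclaimer running"
}

func main() {
	done := make(chan string, 1)
	// The point of the suggestion: run the reclaimer as a goroutine
	// inside the existing weaver process.
	go watchAndReclaim(done)
	select {
	case msg := <-done:
		fmt.Println(msg)
	case <-time.After(time.Second):
		fmt.Println("timed out")
	}
}
```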

@bboreham
Contributor

Chatting with Murali IRL yesterday, I felt the sidecar idea is bad because we will have to ask users to fetch two logs for troubleshooting.
I dislike the idea of weaver getting 35MB larger, too.

So, right now, I'm ok with running it as a background process from the launch.sh shell.

@brb
Contributor

brb commented Sep 12, 2018

If so, then we need to ask k8s to pass --init to the weave Docker container (not sure whether it's possible), as the shell script is not a perfect init system.

@bboreham
Contributor

the shell script is not a perfect init system

Can you expand? We already run stuff in the background in that shell script, and I hadn't noticed an issue.

@brb
Contributor

brb commented Sep 12, 2018

The "A simple init system" section in https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/ describes it well.

@bboreham
Contributor

That identifies a problem where the parent dies and leaves the child. Since in our case the parent is the main process of the container, if it dies the entire container will go away.

Or did I miss something?

@brb
Contributor

brb commented Sep 12, 2018

If the shell process (parent) is terminated, the child will receive SIGKILL which is unhandled (no chance to do a proper cleanup).

@murali-reddy
Contributor Author

At least in the context of the work this background process does, graceful shutdown is not critical.

@murali-reddy murali-reddy changed the title WIP: watch for Kubernetes node delete events and reclaim removed peers on delete event WIP: watch for Kubernetes node delete events and reclaim removed peers IP space on delete event Sep 18, 2018
@murali-reddy murali-reddy changed the title WIP: watch for Kubernetes node delete events and reclaim removed peers IP space on delete event watch for Kubernetes node delete events and reclaim removed peers IP space on delete event Sep 18, 2018
@murali-reddy
Contributor Author

I will add weave forget for the deleted node as a separate PR (need to add a Go weave client API call for forget). Let me know if we have consensus on the approach to go ahead.

Testing done:

On an AWS cluster, deleted a couple of instances in the ASG and verified that on node delete the IP space allocated for the node is reclaimed, for all deleted nodes.

Let me know if it's worth writing an integration test that does kubectl delete node and verifies the IP space is reclaimed.

@bboreham bboreham added this to the 2.5 milestone Sep 19, 2018
@brb
Contributor

brb commented Sep 21, 2018

Thanks for the PR.

Let me know if we have consensus on the approach to go ahead.

I still think that we should do it properly and run the reclaimer from the weaver process.

```go
common.Log.Debugln("registering for updates for node delete events")
nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	DeleteFunc: func(obj interface{}) {
		node := obj.(*v1core.Node)
```

```go
	common.Log.Fatalf("[kube-peers] Could not make Kubernetes connection: %v", err)
}
cml := newConfigMapAnnotations(configMapNamespace, configMapName, client)
weave := weaveapi.NewClient(os.Getenv("WEAVE_HTTP_ADDR"), common.Log)
```


Contributor

@bboreham bboreham left a comment


couple more thoughts

```go
	}
}
common.Log.Debugln("[kube-peers] Nodes deleted:", nodeObj.Name)
config, err := rest.InClusterConfig()
```


```go
	common.Log.Fatalf("[kube-peers] Tombstone contained object that is not a Node: %#v", obj)
}
}
common.Log.Debugln("[kube-peers] Nodes deleted:", nodeObj.Name)
```
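The fragments above show the usual informer delete-handler concern: the object handed to DeleteFunc may be a tombstone (in client-go, cache.DeletedFinalStateUnknown) rather than a *v1core.Node, so both cases have to be unwrapped. Stripped of the client-go types, the type switch looks roughly like this; the types here are local stand-ins, for illustration only:

```go
package main

import "fmt"

// Stand-ins for *v1core.Node and cache.DeletedFinalStateUnknown.
type Node struct{ Name string }
type DeletedFinalStateUnknown struct{ Obj interface{} }

// nodeName unwraps either a direct Node or a tombstone wrapping one,
// mirroring the DeleteFunc logic in the fragments above.
func nodeName(obj interface{}) (string, bool) {
	switch o := obj.(type) {
	case *Node:
		return o.Name, true
	case DeletedFinalStateUnknown:
		if n, ok := o.Obj.(*Node); ok {
			return n.Name, true
		}
	}
	return "", false // tombstone contained an object that is not a Node
}

func main() {
	name, _ := nodeName(&Node{Name: "ip-10-0-0-1"})
	fmt.Println("deleted:", name)
	name, _ = nodeName(DeletedFinalStateUnknown{Obj: &Node{Name: "ip-10-0-0-2"}})
	fmt.Println("deleted:", name)
	_, ok := nodeName(DeletedFinalStateUnknown{Obj: "garbage"})
	fmt.Println("node recovered:", ok)
}
```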


@murali-reddy
Contributor Author

I have included weave connect --replace as well now. Tested by deleting/adding nodes from the ASG on AWS.

On node delete now

  • rmpeer is done, so the IP addresses of deleted nodes are reclaimed and we have a clean weave status ipam
  • connect --replace with an empty peer list: a deleted node is no longer connected and we have a clean weave status connections

Please take a look.

3 participants