
Force Drain and detachment for Volumes for Unhealthy Nodes which were NotReady for over 5min #781

Closed
himanshu-kun opened this issue Feb 13, 2023 · 6 comments · Fixed by #839
Labels
area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related area/robustness Robustness, reliability, resilience related kind/bug Bug needs/planning Needs (more) planning with other MCM maintainers priority/2 Priority (lower number equals higher priority) status/closed Issue is closed (either delivered or triaged)

Comments


himanshu-kun commented Feb 13, 2023

How to categorize this issue?

/area performance
/kind bug
/priority 2

What happened:

Currently, we skip the drain for an unhealthy node that we want to remove if we see that it has been NotReady for more than 5 minutes.
We have now seen problems with this. Imagine a node with a lot of PVs attached. The node goes unhealthy, we skip the drain, and start deleting the VM.
On Azure, as part of deletion we currently detach the volumes from the VM (PV disks and root disk) and then proceed with VM deletion.
This means MCM detaches the backing disks of the pods' PVs directly and then triggers node deletion. Kubernetes only starts acting after the node is deleted, since until then everything was happening on the infrastructure side.
KCM deletes the node, and then the orphan-pod force-deletion logic removes the pods. Only then does the attach/detach (A/D) controller come into action.
The A/D controller relies on the kubelet to direct the CSI driver on the node to unmount the disk. Since there is no node, and hence no kubelet, the A/D controller waits until maxWaitForUnmountDuration (6 min, non-configurable) before force detachment starts.

So there are two detachments, one by MCM and then one by the A/D controller. This leads to downtime for customers.
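
For reference, the volumes that the A/D controller still considers attached to a node are recorded in node.Status.VolumesAttached. A minimal client-go sketch (the helper name is hypothetical, not MCM code) that prints this state:

```go
package drainsketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// printAttachedVolumes is a hypothetical helper (not MCM code) that prints the
// volumes the attach/detach controller still records as attached to a node.
func printAttachedVolumes(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, av := range node.Status.VolumesAttached {
		fmt.Printf("still attached: %s (device path: %s)\n", av.Name, av.DevicePath)
	}
	return nil
}
```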

What you expected to happen:

Pods with PVs should recover faster after unhealthy nodes are removed.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

During normal draining, we wait for detachment and then for reattachment (i.e. for a new VolumeAttachment to be created, if VolumeAttachment support is enabled); this was introduced in #608.
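
Only to illustrate the detachment part of that wait, here is a minimal sketch (client-go assumed; the helper name is hypothetical and not the actual drain code) that polls until no VolumeAttachment references the node anymore:

```go
package drainsketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForVolumeAttachmentsGone is a hypothetical helper (not MCM code) that
// polls until no VolumeAttachment object references the given node anymore,
// or the context is cancelled.
func waitForVolumeAttachmentsGone(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	for {
		vaList, err := cs.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
		if err != nil {
			return err
		}
		remaining := 0
		for _, va := range vaList.Items {
			if va.Spec.NodeName == nodeName {
				remaining++
			}
		}
		if remaining == 0 {
			return nil // all volumes detached from this node
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second): // poll interval, arbitrary
		}
	}
}
```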

live issue # 3645

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
@gardener-robot gardener-robot added area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related priority/1 Priority (lower number equals higher priority) labels Feb 13, 2023
@himanshu-kun (Contributor, Author) commented:

Proposed solutions:

  • Raise an issue upstream so that no unmounting attempt is made if the kubelet status is unknown or the node has been deleted.
  • MCM should force delete the pods and then wait until node.status.volumesAttached becomes empty. Only after that should it proceed with VM deletion (see the sketch after this list).
  • Delete the node object before deleting the VM. This way KCM's garbage collection would delete the pods and the A/D controller would start detachment. However, we need to figure out a way (e.g. using VolumeAttachments) to make sure VM deletion does not start before detachment is complete.
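
For the second option, a minimal sketch (client-go assumed; the helper name is hypothetical) of force deleting the pods bound to a node with a zero grace period, so that the A/D controller can start detaching their volumes:

```go
package drainsketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePodsOnNode is a hypothetical helper (not MCM code) that deletes
// all pods scheduled on the given node with gracePeriodSeconds=0.
func forceDeletePodsOnNode(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	pods, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	grace := int64(0)
	for _, pod := range pods.Items {
		err := cs.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
			GracePeriodSeconds: &grace,
		})
		if err != nil {
			return err
		}
	}
	return nil
}
```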

cc @dguendisch

@himanshu-kun himanshu-kun added the priority/2 Priority (lower number equals higher priority) label Feb 17, 2023
@gardener-robot gardener-robot removed the priority/1 Priority (lower number equals higher priority) label Feb 17, 2023
@himanshu-kun himanshu-kun added needs/planning Needs (more) planning with other MCM maintainers area/robustness Robustness, reliability, resilience related labels Feb 17, 2023
@himanshu-kun (Contributor, Author) commented:

Proposals after grooming

We have decided to do some testing where we would first delete the node object and then delete the VM, or delete both in parallel. This would mean the A/D controller detaching volumes while MCM Azure (for now) also performs the detachment.
This decision was taken after going through the recommended approach for deleting pods, https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#delete-pods, where it is stated that deleting nodes is favoured over force deleting pods, especially for pods with PVs.


vlerenc commented Feb 21, 2023

From issues#2556 mentioned by @himanshu-kun:

This makes the KCM wait for the unmounting of the volume. Unmounting is done by the CSI plugin at the direction of the node's kubelet. But since the node is gone => no kubelet, and unmounting never happens.
Why does KCM wait for the unmounting of volumes if the node is gone? Does that make sense? And how does it know that (where is the mount information persisted), and how does it do it?

The reconciler https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L232 works on https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/cache/actual_state_of_world.go#L156, which is set by https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/cache/actual_state_of_world.go#L388, which is called here https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/attach_detach_controller.go#L667, but that only works with a node resource. To me this sounds somewhat strange if it really depends on the node that is gone, doesn't it? Is it keeping that state in memory? Because otherwise it wouldn't know anymore, if the source of truth is the (missing) node resource and it got restarted.

I guess what I am trying to ask is: was I reading that wrongly, or are you saying KCM still waits 6m even if the node resource disappears?


vlerenc commented Feb 21, 2023

Deleting the node resource before the machine termination is confirmed (or doing both in parallel) is risky, isn't it? To me, it doesn't sound safe or worth exploring, because you can never know whether the kubelet would come back to life and recreate the node resource (unlikely, but possible, and a race condition in principle).

With a webhook in place, we could "purge" the volumesInUse on nodes we plan to delete anyway, but the more direct way is to forcefully terminate the pods on that node, then delete the VolumeAttachments for that node, and do what would otherwise only happen after maxWaitForUnmountDuration. Why is that not an option? The node is lost anyway once we have decided to start the machine deletion, and the faster we get those volumes detached the better. Is it because we are scared of parallel operations (CSI detacher and MCM machine deletion)? What if we wait for the CSI detacher to finish its job before commencing? If the node is also cordoned, no new pod would land on it and the existing ones are terminated, so the CSI attacher won't try to attach anything in that time window and would only detach the existing volumes. Wrong?
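
To make that concrete, a minimal sketch (client-go assumed; the helper name is hypothetical, not an agreed implementation) of deleting all VolumeAttachment objects that reference a node:

```go
package drainsketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteVolumeAttachmentsOfNode is a hypothetical helper (not MCM code) that
// deletes the VolumeAttachment objects referencing the given node, as
// suggested in the comment above.
func deleteVolumeAttachmentsOfNode(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	vaList, err := cs.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, va := range vaList.Items {
		if va.Spec.NodeName != nodeName {
			continue
		}
		if err := cs.StorageV1().VolumeAttachments().Delete(ctx, va.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```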


elankath commented Mar 30, 2023

Hi Vedran,

We discussed this and we agree with you. We propose the following enhancement:

High Level Proposal

  1. Initiate our drain routine (with forceful pod deletion) if the node has not had the Ready condition for 5 min or has had the ReadOnlyFileSystem condition true for 5 min. We also delete the VolumeAttachments associated with the node.

  2. Before initiating VM deletion, we wait for the node volumes to be detached by checking node.Status.volumesAttached (up to a timeout). This detachment is carried out by the A/D controller; MCM just waits (a polling sketch is shown after this list).

  3. Once the volumes are detached or the detach timeout has expired, we proceed with the VM deletion as usual. This involves the call to Driver.DeleteMachine.

  4. We then delete the node as usual.
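
A minimal sketch of the wait in step 2 (client-go assumed; the function name, retry interval, and timeout handling are illustrative, not the final MCM implementation):

```go
package drainsketch

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForNodeVolumesDetach is a hypothetical helper (not the final MCM code)
// that waits until node.Status.VolumesAttached is empty or the timeout expires.
func waitForNodeVolumesDetach(ctx context.Context, cs kubernetes.Interface, nodeName string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if len(node.Status.VolumesAttached) == 0 {
			return nil // safe to proceed with Driver.DeleteMachine
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for volumes to detach from node %s", nodeName)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second): // retry interval, arbitrary
		}
	}
}
```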

Detailed Proposal

Overview of Current Deletion Flow
  1. When the machine's deletion timestamp is set, the MC invokes the triggerDeletionFlow.
  2. The triggerDeletionFlow does the following:
    1. It sets the machine.Status.CurrentStatus.Phase to Terminating.
    2. Then the node drain stage is invoked after the operation is changed to InitiateDrain. This may be graceful deletion (pod eviction, waiting for detach/attach of the volume to the new pod) or forceful deletion of pods.
    3. Then the VM deletion stage is invoked after the operation is changed to InitiateVMDeletion.
    4. Then the node deletion stage is invoked after the operation is changed to InitiateNodeDeletion.
    5. Then the machine finalizers are removed after the operation is changed to InitiateFinalizerRemoval.
Proposed Enhancement

We enhance the triggerDeletion flow and introduce the following:

  1. In the node drain stage, initiate our drain routine with forceful pod deletion if the node has not had the Ready condition for 5 min or has had the ReadOnlyFileSystem condition true for 5 min. In the forceful-pod-deletion case, we additionally delete the VolumeAttachments associated with the node.
  2. We introduce a new operation, deleteNodeVolAttachments, prior to InitiateVMDeletion. It checks that all volumes have been detached from the node using node.Status.volumesAttached (the list should be empty) and, if not, returns with a ShortRetry to check again until a timeout is reached.
  3. Once the deleteNodeVolAttachments stage is done, we proceed to the InitiateVMDeletion stage as usual.

To ensure we can time out appropriately in waitForNodeVolumesDetachment, we propose enhancing LastOperation by adding a LastStateTransitionTime field (i.e. the time of the last operation state transition).

The end of InitiateDrain will update the field as follows (only illustrative). waitForNodeVolumesDetachment will leverage LastStateTransitionTime to determine when to time out and move to the InitiateVMDeletion stage.

{
  "Description": "waitForNodeVolumesDetach",
  "State": "Processing",
  "Type": "Delete",
  "LastUpdateTime": "1:00 PM",
  "LastStateTransitionTime": "1:00 PM"
}
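
Only to illustrate how the timeout check could use the proposed field, a small sketch follows; the struct below is a simplified stand-in, not the actual MCM LastOperation API type:

```go
package drainsketch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// lastOperation is a simplified stand-in for the machine's LastOperation with
// the proposed LastStateTransitionTime field; the real type lives in the MCM API.
type lastOperation struct {
	Description             string
	State                   string
	Type                    string
	LastUpdateTime          metav1.Time
	LastStateTransitionTime metav1.Time
}

// volumeDetachTimedOut reports whether waitForNodeVolumesDetachment has been
// in its current state longer than the allowed timeout.
func volumeDetachTimedOut(op lastOperation, timeout time.Duration) bool {
	return time.Since(op.LastStateTransitionTime.Time) > timeout
}
```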

@elankath elankath changed the title Don't skip Drain for Unhealthy Nodes which were NotReady for over 5min Force Drain and detachment for Volumes for Unhealthy Nodes which were NotReady for over 5min Aug 31, 2023

elankath commented Aug 31, 2023

Another live issue # 3645

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Sep 22, 2023