
Force Drain and detachment for Volumes for Unhealthy Nodes which were NotReady for over 5min #781

Closed
himanshu-kun opened this issue Feb 13, 2023 · 6 comments · Fixed by #839
Labels
area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related area/robustness Robustness, reliability, resilience related kind/bug Bug needs/planning Needs (more) planning with other MCM maintainers priority/2 Priority (lower number equals higher priority) status/closed Issue is closed (either delivered or triaged)

Comments


himanshu-kun commented Feb 13, 2023

How to categorize this issue?

/area performance
/kind bug
/priority 2

What happened:

Currently, we skip the drain for an unhealthy node that we want to remove if we see that it has been NotReady for more than 5 minutes.
We have now seen problems with this. Imagine a node with a lot of PVs attached. The node goes unhealthy, we skip the drain, and start deleting the VM.
On Azure, as part of deletion we currently detach the volumes from the VM (PV disks and root disk) and then proceed with VM deletion.
This means MCM detaches the backing disks of the pods' PVs directly and then triggers node deletion. Kubernetes only starts acting after the node is deleted, since until then everything was happening on the infrastructure side.
KCM deletes the node, and then the orphan-pod force-deletion logic removes the pods. Only then does the attach/detach (A/D) controller come into action.
The A/D controller relies on the kubelet to direct the CSI driver on the node to unmount the disk. Since there is no node, and hence no kubelet, the A/D controller waits until maxWaitForUnmountDuration (6 min, non-configurable) before force detachment starts.

So there are two detachments, one by MCM and then one by the A/D controller. This leads to downtime for customers.
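
For reference, the volumes that the A/D controller still considers attached to a node are recorded in node.Status.VolumesAttached. A minimal client-go sketch (the helper name is hypothetical, not MCM code) that prints this state:

```go
package drainsketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// printAttachedVolumes is a hypothetical helper (not MCM code) that prints the
// volumes the attach/detach controller still records as attached to a node.
func printAttachedVolumes(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, av := range node.Status.VolumesAttached {
		fmt.Printf("still attached: %s (device path: %s)\n", av.Name, av.DevicePath)
	}
	return nil
}
```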

What you expected to happen:

Pods with PVs should recover faster after unhealthy nodes are removed.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

During normal draining, we wait for detachment and then for reattachment (i.e. for a new VolumeAttachment to be created, if VolumeAttachment support is enabled); this was introduced in #608.
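
Only to illustrate the detachment part of that wait, here is a minimal sketch (client-go assumed; the helper name is hypothetical and not the actual drain code) that polls until no VolumeAttachment references the node anymore:

```go
package drainsketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForVolumeAttachmentsGone is a hypothetical helper (not MCM code) that
// polls until no VolumeAttachment object references the given node anymore,
// or the context is cancelled.
func waitForVolumeAttachmentsGone(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	for {
		vaList, err := cs.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
		if err != nil {
			return err
		}
		remaining := 0
		for _, va := range vaList.Items {
			if va.Spec.NodeName == nodeName {
				remaining++
			}
		}
		if remaining == 0 {
			return nil // all volumes detached from this node
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second): // poll interval, arbitrary
		}
	}
}
```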

live issue # 3645

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
@gardener-robot gardener-robot added area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related priority/1 Priority (lower number equals higher priority) labels Feb 13, 2023
@himanshu-kun (Contributor, Author) commented:

Proposed solutions:

  • Raise an issue upstream so that no unmounting attempt is made if the kubelet status is unknown or the node has been deleted.
  • MCM should force delete the pods and then wait until node.status.volumesAttached becomes empty. Only after that should it proceed with VM deletion (see the sketch after this list).
  • Delete the node object before deleting the VM. This way KCM's garbage collection would delete the pods and the A/D controller would start detachment. However, we need to figure out a way (e.g. using VolumeAttachments) to make sure VM deletion does not start before detachment is complete.
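
For the second option, a minimal sketch (client-go assumed; the helper name is hypothetical) of force deleting the pods bound to a node with a zero grace period, so that the A/D controller can start detaching their volumes:

```go
package drainsketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePodsOnNode is a hypothetical helper (not MCM code) that deletes
// all pods scheduled on the given node with gracePeriodSeconds=0.
func forceDeletePodsOnNode(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	pods, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	grace := int64(0)
	for _, pod := range pods.Items {
		err := cs.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
			GracePeriodSeconds: &grace,
		})
		if err != nil {
			return err
		}
	}
	return nil
}
```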

cc @dguendisch

@himanshu-kun himanshu-kun added the priority/2 Priority (lower number equals higher priority) label Feb 17, 2023
@gardener-robot gardener-robot removed the priority/1 Priority (lower number equals higher priority) label Feb 17, 2023
@himanshu-kun himanshu-kun added needs/planning Needs (more) planning with other MCM maintainers area/robustness Robustness, reliability, resilience related labels Feb 17, 2023
@himanshu-kun (Contributor, Author) commented:

Proposals after grooming

We have decided to do some testing where we would first delete the node object and then delete the VM, or delete both in parallel. This would mean the A/D controller detaching volumes while MCM Azure (for now) also performs the detachment.
This decision was taken after going through the recommended approach for deleting pods, https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#delete-pods, where it is stated that deleting nodes is favoured over force deleting pods, especially for pods with PVs.


vlerenc commented Feb 21, 2023

From issues#2556 mentioned by @himanshu-kun:

This makes the KCM wait for the unmounting of the volume. Unmounting is done by the CSI plugin at the direction of the node's kubelet. But since the node is gone => no kubelet, and unmounting never happens.
Why does KCM wait for the unmounting of volumes if the node is gone? Does that make sense? And how does it know that (where is the mount information persisted), and how does it do it?

The reconciler https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L232 works on https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/cache/actual_state_of_world.go#L156, which is set by https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/cache/actual_state_of_world.go#L388, which is called here https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/attach_detach_controller.go#L667, but that only works with a node resource. To me this sounds somewhat strange if it really depends on the node that is gone, doesn't it? Is it keeping that state in memory? Because otherwise it wouldn't know anymore, if the source of truth is the (missing) node resource and it got restarted.

I guess what I am trying to ask is: was I reading that wrongly, or are you saying KCM still waits 6m even if the node resource disappears?


vlerenc commented Feb 21, 2023

Deleting the node resource before the machine termination is confirmed (or doing both in parallel) is risky, isn't it? To me, it doesn't sound safe or worth exploring, because you can never know whether the kubelet would come back to life and recreate the node resource (unlikely, but possible, and a race condition in principle).

With a webhook in place, we could "purge" the volumesInUse on nodes we plan to delete anyway, but the more direct way is to forcefully terminate the pods on that node, then delete the VolumeAttachments for that node, and do what would otherwise only happen after maxWaitForUnmountDuration. Why is that not an option? The node is lost anyway once we have decided to start the machine deletion, and the faster we get those volumes detached the better. Is it because we are scared of parallel operations (CSI detacher and MCM machine deletion)? What if we wait for the CSI detacher to finish its job before commencing? If the node is also cordoned, no new pod would land on it and the existing ones are terminated, so the CSI attacher won't try to attach anything in that time window and would only detach the existing volumes. Wrong?
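
To make that concrete, a minimal sketch (client-go assumed; the helper name is hypothetical, not an agreed implementation) of deleting all VolumeAttachment objects that reference a node:

```go
package drainsketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteVolumeAttachmentsOfNode is a hypothetical helper (not MCM code) that
// deletes the VolumeAttachment objects referencing the given node, as
// suggested in the comment above.
func deleteVolumeAttachmentsOfNode(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	vaList, err := cs.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, va := range vaList.Items {
		if va.Spec.NodeName != nodeName {
			continue
		}
		if err := cs.StorageV1().VolumeAttachments().Delete(ctx, va.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```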


elankath commented Mar 30, 2023

Hi Vedran,

We discussed this and we agree with you. We propose the following enhancement:

High Level Proposal

  1. Initiate our drain routine (with forceful pod deletion) if the node has not had the Ready condition for 5 min or has had the ReadOnlyFileSystem condition true for 5 min. We also delete the VolumeAttachments associated with the node.

  2. Before initiating VM deletion, we wait for the node volumes to be detached by checking node.Status.volumesAttached (up to a timeout). This detachment is carried out by the A/D controller; MCM just waits (a polling sketch is shown after this list).

  3. Once the volumes are detached or the detach timeout has expired, we proceed with the VM deletion as usual. This involves the call to Driver.DeleteMachine.

  4. We then delete the node as usual.
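
A minimal sketch of the wait in step 2 (client-go assumed; the function name, retry interval, and timeout handling are illustrative, not the final MCM implementation):

```go
package drainsketch

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForNodeVolumesDetach is a hypothetical helper (not the final MCM code)
// that waits until node.Status.VolumesAttached is empty or the timeout expires.
func waitForNodeVolumesDetach(ctx context.Context, cs kubernetes.Interface, nodeName string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if len(node.Status.VolumesAttached) == 0 {
			return nil // safe to proceed with Driver.DeleteMachine
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for volumes to detach from node %s", nodeName)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second): // retry interval, arbitrary
		}
	}
}
```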

Detailed Proposal

Overview of Current Deletion Flow
  1. When the machine's deletion timestamp is set, the MC invokes the triggerDeletionFlow.
  2. The triggerDeletionFlow does the following:
    1. It sets the machine.Status.CurrentStatus.Phase to Terminating.
    2. Then the node drain stage is invoked after the operation is changed to InitiateDrain. This may be graceful deletion (pod eviction, waiting for detach/attach of the volume to the new pod) or forceful deletion of pods.
    3. Then the VM deletion stage is invoked after the operation is changed to InitiateVMDeletion.
    4. Then the node deletion stage is invoked after the operation is changed to InitiateNodeDeletion.
    5. Then the machine finalizers are removed after the operation is changed to InitiateFinalizerRemoval.
Proposed Enhancement

We enhance the triggerDeletion flow and introduce the following:

  1. In the node drain stage, initiate our drain routine with forceful pod deletion if the node has not had the Ready condition for 5 min or has had the ReadOnlyFileSystem condition true for 5 min. In the forceful-pod-deletion case, we additionally delete the VolumeAttachments associated with the node.
  2. We introduce a new operation, deleteNodeVolAttachments, prior to InitiateVMDeletion. It checks that all volumes have been detached from the node using node.Status.volumesAttached (the list should be empty) and, if not, returns with a ShortRetry to check again until a timeout is reached.
  3. Once the deleteNodeVolAttachments stage is done, we proceed to the InitiateVMDeletion stage as usual.

To ensure we can time out appropriately in waitForNodeVolumesDetachment, we propose enhancing LastOperation by adding a LastStateTransitionTime field (i.e. the time of the last operation state transition).

The end of InitiateDrain will update the field as follows (only illustrative). waitForNodeVolumesDetachment will leverage LastStateTransitionTime to determine when to time out and move to the InitiateVMDeletion stage.

{
  "Description": "waitForNodeVolumesDetach",
  "State": "Processing",
  "Type": "Delete",
  "LastUpdateTime": "1:00 PM",
  "LastStateTransitionTime": "1:00 PM"
}
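
Only to illustrate how the timeout check could use the proposed field, a small sketch follows; the struct below is a simplified stand-in, not the actual MCM LastOperation API type:

```go
package drainsketch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// lastOperation is a simplified stand-in for the machine's LastOperation with
// the proposed LastStateTransitionTime field; the real type lives in the MCM API.
type lastOperation struct {
	Description             string
	State                   string
	Type                    string
	LastUpdateTime          metav1.Time
	LastStateTransitionTime metav1.Time
}

// volumeDetachTimedOut reports whether waitForNodeVolumesDetachment has been
// in its current state longer than the allowed timeout.
func volumeDetachTimedOut(op lastOperation, timeout time.Duration) bool {
	return time.Since(op.LastStateTransitionTime.Time) > timeout
}
```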

@elankath elankath changed the title Don't skip Drain for Unhealthy Nodes which were NotReady for over 5min Force Drain and detachment for Volumes for Unhealthy Nodes which were NotReady for over 5min Aug 31, 2023

elankath commented Aug 31, 2023

Another live issue # 3645

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Sep 22, 2023