Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't track reattachment of provider-unrelated PVs #937

Merged
merged 3 commits into from
Oct 18, 2024

Conversation

timebertt
Copy link
Member

@timebertt timebertt commented Aug 23, 2024

/kind bug

What this PR does / why we need it:

When draining nodes, MCM already ignores provider-unrelated PVs when waiting for the detachments.
However, it does wait for such PVs to reattach to other nodes until running into a timeout.

This PR ensures that the same filter of the driver is used for tracking reattachments of PVs as well.

For example, when draining nodes where NFS volumes are used, MCM would wait for the reattachment of these volumes.
There is no such thing as an attachment to a node for NFS volumes. NFS volumes only need to be mounted (by kubelet).
Accordingly, MCM always runs into the configured PV reattach timeout, if there are pods with NFS PVCs on the to-be-drained node.
This prolongs the drain operation of such nodes severely.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

cc @xoxys

Release note:

A bug has been fixed for draining nodes with provider-unrelated volumes like NFS volumes. With this fix, the machine controller doesn't try to track their (non-existing) VolumeAttachments.

@timebertt timebertt requested a review from a team as a code owner August 23, 2024 10:31
@gardener-robot
Copy link

@timebertt Thank you for your contribution.

@gardener-robot-ci-3
Copy link
Contributor

Thank you @timebertt for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.

@gardener-robot gardener-robot added the size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) label Aug 23, 2024
@timebertt
Copy link
Member Author

We probably need to add tests for this PR. However, we wanted to open it to get some feedback on whether this is the correct approach/place to solve the issue.

@kon-angelo
Copy link
Contributor

@timebertt do you think, we should generalise this to also skip the check in case the CSIDriver is configured with attachRequired == false ? (Not in this PR even)

@timebertt
Copy link
Member Author

@timebertt do you think, we should generalise this to also skip the check in case the CSIDriver is configured with attachRequired == false ? (Not in this PR even)

Sounds good. I had the impression that the attachment tracking doesn't work at all if there are no VolumeAttachment objects, i.e., no CSI driver is used, and the operation just runs into the configured timeout. Is this correct?
If so, we should probably change this PR from "skip tracking NFS volumes" to "skip tracking non-CSI volumes".

Another PR could then enhance the check for CSI drivers where no VolumeAttachments are used.

@timebertt
Copy link
Member Author

@kon-angelo @gardener/mcm-maintainers can you provide feedback on the PR and the above discussion?

@unmarshall
Copy link
Contributor

unmarshall commented Sep 3, 2024

@timebertt do you think, we should generalise this to also skip the check in case the CSIDriver is configured with attachRequired == false ? (Not in this PR even)

I agree to @kon-angelo's observation that this can be generically handled. @timebertt since you have raised this PR we can also do this in 2 steps. We can skip NFS volumes (as you have done in this PR) and then later we can generalise this. Since generalisation would required additional lookup of the CSIDriver object and looking up its AttachRequired property to decide if wait for detachment for such a PV can be skipped.

@kon-angelo
Copy link
Contributor

kon-angelo commented Sep 3, 2024

Aside from my previous point, I kind of understand @timebertt 's suggestion. The MCM code relies on volumeattachments which also to my knowledge is only created for CSI volumes. We could probably update the logic in this PR to skip anything other than CSI volumes. (but my experience is just going through the code and this can be verified from one of the actual mcm maintainers).

Though one point is that I am not particularly fond that we "skip" reporting the existence of volumes, rather than not accounting them when draining - if it makes sense. Like this can be somewhat troubling to debug I find. But the actual function doing the work evictPodsWithPVInternal that also uses the getVolIDsFromDriver is used in many places and so I understand this implementation.

I just feel that at some point we will go down a debugging rabbit hole because getVolIDsFromDriver also does a little more extra than the name suggests which is the skipping part 🤷

@timebertt
Copy link
Member Author

Back from vacation, sorry for the delay.

I just feel that at some point we will go down a debugging rabbit hole because getVolIDsFromDriver also does a little more extra than the name suggests which is the skipping part 🤷

Got your point. I will rework the PR to skip any non-CSI volumes instead of handling NFS as a special case, improve the code flow (where the filtering is done), and add unit tests.

Another PR can be opened later to consult the CSIDriver object for the attachRequired == false check.

@timebertt
Copy link
Member Author

While looking into the filtering again, I noticed a misunderstanding on my side.
In short, the filtering I introduced in this PR and the filtering we envisioned wouldn't solve the problem at hand.

The actual problem is that PodVolumeInfo returned by getPodVolumeInfos has two different lists of volumes.

  • persistentVolumeList contains the referenced PV names of all PVCs referenced in the pod. This list is not filtered and is only used in waitForReattach.
  • volumeList contains the volume IDs/handles corresponding to persistentVolumeList. It could also be a subset thereof as it is filtered in Driver.GetVolumeIDs when mapping PV names to IDs. E.g., for openstack, this list only contains cinder volumes (in-tree or CSI). This list is only used in waitForDetach.

See doAccountingOfPvs:

func (o *Options) doAccountingOfPvs(ctx context.Context, pods []*corev1.Pod) map[string]PodVolumeInfo {
var (
pvMap = make(map[string][]string)
podVolumeInfoMap = make(map[string]PodVolumeInfo)
)
for _, pod := range pods {
podPVs, _ := o.getPVList(pod)
pvMap[getPodKey(pod)] = podPVs
}
// Filter the list of shared PVs
filterSharedPVs(pvMap)
for podKey, persistentVolumeList := range pvMap {
persistentVolumeListDeepCopy := persistentVolumeList
volumeList, err := o.getVolIDsFromDriver(ctx, persistentVolumeList)
if err != nil {
// In case of error, log and skip this set of volumes
klog.Errorf("error getting volume ID from cloud provider. Skipping volumes for pod: %v. Err: %v", podKey, err)
continue
}
podVolumeInfo := PodVolumeInfo{
persistentVolumeList: persistentVolumeListDeepCopy,
volumeList: volumeList,
}
podVolumeInfoMap[podKey] = podVolumeInfo
}
klog.V(4).Infof("PodVolumeInfoMap = %v", podVolumeInfoMap)
return podVolumeInfoMap
}

So when doing the filtering in getVolIDsFromDriver (as originally proposed by this PR), we only filter the volumeList used in waitForDetach.
However, this list never contained NFS or non-CSI volumes as it is already filtered by the driver.

The problem is that the persistentVolumeList is not filtered and the machine controller waits for reattachment of NFS or non-CSI volumes until it runs into a timeout.

To solve the problem, I reworked the PR so that there is only a single volume list in PodVolumeInfo that contains both types of information (PV name and volume ID). Only if the driver returns a volume ID, the PV is added to the list. Otherwise, we don't track detachment (this was already the case before) and don't track reattachment (new, this is the fix).

The PR is still missing unit tests.
First, I want to reproduce the issue on a test cluster and ensure that the change fixes the issue.
In the meantime, feedback on the change is highly appreciated.

@gardener-robot gardener-robot added size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) and removed size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Oct 9, 2024
@timebertt timebertt changed the title Don't track attachments of NFS volumes Don't track reattachment of provider-unrelated PVs Oct 9, 2024
@timebertt
Copy link
Member Author

timebertt commented Oct 10, 2024

I reproduced the issue described above on an OpenStack shoot using the following steps:

  1. Install https://github.com/rancher/local-path-provisioner for creating volumes that are filtered out by machine-controller-manager-provider-openstack in GetVolumeIDs.
kustomize build "github.com/rancher/local-path-provisioner/deploy?ref=v0.0.30" | kubectl apply -f -
  1. Create an example StatefulSet using a local-path-provisioner PVC:
Click me to expand

apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: null
  labels:
    app: test
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: test
    spec:
      containers:
      - name: volume-test
        image: nginx:stable-alpine
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - name: test
          mountPath: /data
        ports:
        - containerPort: 80
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: test
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 128Mi
      storageClassName: local-path

  1. Delete the machine where the test pod runs and test volume is mounted:
$ k get po -owide
NAME     READY   STATUS    RESTARTS   AGE   IP              NODE                                              NOMINATED NODE   READINESS GATES
test-0   1/1     Running   0          50s   100.80.121.84   shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p   <none>           <none>
$ k get pvc
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-test-0   Bound    pvc-283f2aef-3066-4b23-bc61-30e750ea5340   128Mi      RWO            local-path     4m30s
$ gardenctl target control-plane
$ k delete machine shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p
  1. Observe the drain operation in the MCM logs:
$ stern machine | grep -e shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p -e attach -e detach
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack I1010 12:14:13.429520       1 machine_util.go:1104] Normal delete/drain has been triggerred for machine "shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p" with providerID "openstack:///RegionOne/0edc5a3c-7ea3-4fd8-9b45-a75e386f178a" and backing node "shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p" with drain-timeout:2h0m0s & maxEvictRetries:360

machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack W1010 12:14:13.634676       1 drain.go:834] Node name: "shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p", list of pod PVs to wait for detach: []
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack I1010 12:14:13.634699       1 drain.go:914] Waiting for following volumes to reattach: [pvc-283f2aef-3066-4b23-bc61-30e750ea5340]

machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack W1010 12:15:43.635770       1 drain.go:932] Timeout occurred while waiting for PVs [pvc-283f2aef-3066-4b23-bc61-30e750ea5340] to reattach to a different node
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack W1010 12:15:43.635857       1 drain.go:771] Timeout occurred for following volumes to reattach: [pvc-283f2aef-3066-4b23-bc61-30e750ea5340]

machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack I1010 12:15:43.635954       1 drain.go:781] Pod + volume detachment from node shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p + volume reattachment to another node for Pod default/test-0 took 1m30.05787152s
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack I1010 12:15:43.636047       1 drain.go:200] Machine drain ended on 2024-10-10 12:15:43.636038402 +0000 UTC m=+2075.047348776 and took 1m30.163757152s for "shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p"

The logs show that waitForDetach skips the non-cinder volume, as volumeList is filtered correctly: list of pod PVs to wait for detach: [].
However, waitForReattach still waits for the volume to reattach, as persistentVolumeList is not filtered: Waiting for following volumes to reattach: [pvc-283f2aef-3066-4b23-bc61-30e750ea5340].
With this, MCM runs into the configured --machine-pv-reattach-timeout (90s by default).

The 90s for this unnecessary attachment tracking sums up quickly when rolling node pools having many provider-unrelated PVs.

@timebertt timebertt closed this Oct 10, 2024
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Oct 10, 2024
@timebertt timebertt reopened this Oct 10, 2024
@gardener-robot gardener-robot added status/accepted Issue was accepted as something we need to work on and removed status/closed Issue is closed (either delivered or triaged) labels Oct 10, 2024
@timebertt timebertt force-pushed the nfs-volumes branch 3 times, most recently from 9031fe6 to 03d4fed Compare October 10, 2024 15:46
@timebertt
Copy link
Member Author

Now, I verified my fix with a manually built image of mcm-provider-openstack and the steps described above.

With the change, both lists for detach and reattach tracking are empty and the drain operation finishes quickly without considering the non-cinder volume (as desired):

I1010 15:57:11.162420       1 machine_util.go:1130] Normal delete/drain has been triggerred for machine "shoot--garden--s-eu01-000-worker-z1-6f5b7-cbq7j" with providerID "openstack:///RegionOne/accb65ba-9cae-47b5-984d-0598daaae381" and backing node "shoot--garden--s-eu01-000-worker-z1-6f5b7-cbq7j" with drain-timeout:2h0m0s & maxEvictRetries:360

W1010 15:57:11.329585       1 drain.go:865] Node name: "shoot--garden--s-eu01-000-worker-z1-6f5b7-cbq7j", list of pod PVs to wait for detach: []
W1010 15:57:11.329639       1 drain.go:941] List of pod PVs waiting for reattachment is 0: []
I1010 15:57:11.329721       1 drain.go:810] Pod + volume detachment from node shoot--garden--s-eu01-000-worker-z1-6f5b7-cbq7j + volume reattachment to another node for Pod default/test-0 took 30.042329ms

I1010 15:57:11.329761       1 drain.go:228] Machine drain ended on 2024-10-10 15:57:11.329751778 +0000 UTC m=+486.934211889 and took 126.250956ms for "shoot--garden--s-eu01-000-worker-z1-6f5b7-cbq7j"

I also tested the change with a node that had 3 cinder volumes attached:

I1010 15:59:46.496527       1 machine_util.go:1130] Normal delete/drain has been triggerred for machine "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7" with providerID "openstack:///RegionOne/1fa6c33c-cb16-4ae9-b70e-7387aaf938cf" and backing node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7" with drain-timeout:2h0m0s & maxEvictRetries:360
I1010 15:59:46.684754       1 drain.go:574] PodVolumeInfos: map[garden/prometheus-aggregate-0:{[{pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03 668f1bc6-3c5e-4683-bd34-b26d0874ffd1}]} garden/prometheus-cache-0:{[{pv-shoot--garden--s-eu01-000-21332da3-f21d-41d3-8c06-f2fdef77c9cf b3c6dc59-ec47-4486-bc0a-844824a5081e}]} garden/vali-0:{[{pv-shoot--garden--s-eu01-000-cb1a711c-1c05-468a-8266-025acf81dff3 3aad22f5-cade-44a2-aae9-bf8305b73bc1}]}]

I1010 15:59:46.753823       1 drain.go:763] Pod eviction/deletion for Pod garden/prometheus-aggregate-0 in Node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7" and took 68.895848ms. Now waiting for volume detachment.
I1010 15:59:46.753853       1 drain.go:869] Waiting for following volumes to detach: [{pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03 668f1bc6-3c5e-4683-bd34-b26d0874ffd1}]
I1010 15:59:46.909744       1 drain.go:892] No of attached volumes for node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7" is [{kubernetes.io/csi/cinder.csi.openstack.org^b3c6dc59-ec47-4486-bc0a-844824a5081e } {kubernetes.io/csi/cinder.csi.openstack.org^668f1bc6-3c5e-4683-bd34-b26d0874ffd1 } {kubernetes.io/csi/cinder.csi.openstack.org^3aad22f5-cade-44a2-aae9-bf8305b73bc1 }]
I1010 15:59:46.909938       1 drain.go:908] Found volume "668f1bc6-3c5e-4683-bd34-b26d0874ffd1" still attached to node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7". Will re-check in 5s
...
I1010 15:59:53.514517       1 volume_attachment.go:42] Dispatching request for PV pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03
I1010 15:59:56.928776       1 drain.go:892] No of attached volumes for node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7" is [{kubernetes.io/csi/cinder.csi.openstack.org^b3c6dc59-ec47-4486-bc0a-844824a5081e } {kubernetes.io/csi/cinder.csi.openstack.org^3aad22f5-cade-44a2-aae9-bf8305b73bc1 }]
I1010 15:59:56.929045       1 drain.go:921] Detached volumes [{pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03 668f1bc6-3c5e-4683-bd34-b26d0874ffd1}] from node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7"
I1010 15:59:56.929243       1 drain.go:787] Pod + volume detachment from Node garden for Pod prometheus-aggregate-0/shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7 and took 10.244327791s
I1010 15:59:56.929338       1 drain.go:945] Waiting for following volumes to reattach: [{pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03 668f1bc6-3c5e-4683-bd34-b26d0874ffd1}]

Now, that the fix is verified and I'm confident in the change of this PR, I only need to add unit tests.

While verifying my change, I discovered another bug (unrelated to this change, but it breaks the reattach tracking): #945

@gardener-robot gardener-robot removed the size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) label Oct 11, 2024
@timebertt
Copy link
Member Author

Hey @gardener/mcm-maintainers @kon-angelo. I finished verifying this PR and adding tests.
It's ready for a full review now :)

break
}

response, err := o.Driver.GetVolumeIDs(ctx, &driver.GetVolumeIDsRequest{PVSpecs: []*corev1.PersistentVolumeSpec{pvSpec}})
Copy link
Contributor

@elankath elankath Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: The new implementation can result in excessive calls to Driver.GetVolumeIDs and is highly-dependent on the provider implementation making this cheap. Other can result in rate limit errors at provider. Earlier there was only a single call to Driver.GetVolumeIDs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is true. I knowingly changed this, because existing implementation that I looked into (aws and openstack) didn't call an API on GetVolumeIDs.
I was under the impression that drivers are expected to extract the volume IDs purely from the PV spec without the help of an external API.

If this is not the case, I can restructure the code again to perform a single GetVolumeIDs call and use the list of IDs from that instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elankath can you comment on how the PR should handle this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, the godoc for GetVolumeIDs does mention this (though we should fix the grammar). It is fine - you don't need to restructure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is not the case, I can restructure the code again to perform a single GetVolumeIDs call and use the list of IDs from that instead.

While thinking about it again, this won't be possible.
The reason is, that the driver filters the list of input volumes depending on whether it recognizes the volume. However, we won't be able to correlate the input volumes and filtered output volumes, which would be required to achieve the needed filtering in the drain logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fine - you don't need to restructure.

Ok, great. Then, let's go for merging this PR :)

@elankath
Copy link
Contributor

@timebertt sorry for dumb question, but was there any additional code introduced for NFS filtering ? Since you raised that as as point, but we can't find the same. I am assuming that the standard filtering that is currently implemented in the provider implemention for Driver.GetVolumeIDs already does this job, right ? Or am I mistaken ?

@timebertt
Copy link
Member Author

was there any additional code introduced for NFS filtering ? [...] I am assuming that the standard filtering that is currently implemented in the provider implemention for Driver.GetVolumeIDs already does this job, right ?

Yes, the filtering in Driver.GetVolumeIDs does the job. Before this PR, the filtering was only applied to the input of waitForDetach but not to the input of waitForReattach.

@timebertt
Copy link
Member Author

@elankath can you take another look please?

@elankath
Copy link
Contributor

elankath commented Oct 18, 2024

@timebertt Just to confirm: was the force-drain logic also tested ? One can test with by using k label mc <machineName> force-deletion=true. This does deletion of volume attachments which wont be valid for this use-case ? Also - just to confirm for clarity of understanding- there is no explicit filtering for NFS volumes anywhere in MCM right, ie we are depending on the provider impl doing the filtering.

Thanks!

@timebertt
Copy link
Member Author

was the force-drain logic also tested ? One can test with by using k label mc <machineName> force-deletion=true. This does deletion of volume attachments which wont be valid for this use-case ?

I just tested the force deletion of a machine that had three cinder volumes and one local-path volume (see steps to reproduce: #937 (comment)).
MCM successfully deleted all pods and VolumeAttachments on the node as expected:

I1018 06:33:31.228271       1 machine_util.go:1121] Force delete/drain has been triggerred for machine "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr" with providerID "openstack:///RegionOne/40370abf-174b-4872-83fa-343339a0ccaa" and backing node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr" due to Label:true, timeout:false
I1018 06:33:31.421384       1 drain.go:490] Forceful eviction of pods on the node: "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"

I1018 06:33:31.421736       1 drain.go:1028] Evicting pod garden/vali-0 from node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:31.421798       1 drain.go:387] Attempting to force-delete the pod:"vali-0" from node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"

I1018 06:33:31.422110       1 drain.go:1028] Evicting pod default/test-0 from node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:31.422129       1 drain.go:387] Attempting to force-delete the pod:"test-0" from node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"

I1018 06:33:31.489072       1 drain.go:228] Machine drain ended on 2024-10-18 06:33:31.489041721 +0000 UTC m=+515.338557327 and took 216.774911ms for "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:31.489190       1 machine_util.go:1185] Drain successful for machine "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr" ,providerID "openstack:///RegionOne/40370abf-174b-4872-83fa-343339a0ccaa", backing node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr".

I1018 06:33:36.515974       1 machine.go:116] reconcileClusterMachine: Start for "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr" with phase:"Terminating", description:"Force Drain successful. Delete Volume Attachments"
I1018 06:33:36.516125       1 machine_util.go:1683] (deleteVolumeAttachmentsForNode) Deleting #3 VolumeAttachment(s) for node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:36.562143       1 machine_util.go:1269] (deleteNodeVolAttachments) Successfully deleted all volume attachments for node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr", machine "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"

I looked into the code and found that the deletion of VolumeAttachments in case of a force deletion is not related to the drain code that I changed in this PR. It also doesn't use the driver at all and just looks at the VolumeAttachments objects.
Why do you think that this functionality might be affected by this PR?

@timebertt
Copy link
Member Author

Also - just to confirm for clarity of understanding- there is no explicit filtering for NFS volumes anywhere in MCM right, ie we are depending on the provider impl doing the filtering.

Yes, exactly.

@elankath
Copy link
Contributor

Thanks for the clarification and test Tim. (The test is not related to your PR - just a basic sanity check just to check that the force-deletion flow was not impacted by drain)

Copy link
Contributor

@elankath elankath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm. Thanks for the contribution.

@timebertt
Copy link
Member Author

I think #937 (comment) is still open. Can you take a look at my thoughts?

@elankath
Copy link
Contributor

elankath commented Oct 18, 2024

No, I believe there is no need to restructure. Ideally we should have have had a better response instead of just a []string but that will need to wait for the MCM overhaul.

@rishabh-11 rishabh-11 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/review Needs review labels Oct 18, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Oct 18, 2024
Copy link
Contributor

@rishabh-11 rishabh-11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

I'll run the IT once with provider AWS and then merge this PR.

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/second-opinion Needs second review by someone else labels Oct 18, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 18, 2024
@rishabh-11
Copy link
Contributor

tests passed on AWS. I'll fix #945. Thanks for raising this issue.

@rishabh-11 rishabh-11 merged commit 67e4579 into gardener:master Oct 18, 2024
8 checks passed
@gardener-robot gardener-robot added status/closed Issue is closed (either delivered or triaged) and removed status/accepted Issue was accepted as something we need to work on labels Oct 18, 2024
@timebertt timebertt deleted the nfs-volumes branch October 18, 2024 10:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Bug needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) reviewed/lgtm Has approval for merging reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) size/l Size of pull request is large (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants