Machine fails to finish draining/volume detachment after successful completion #11591

Open
Danil-Grigorev opened this issue Dec 17, 2024 · 5 comments · May be fixed by #11590
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@Danil-Grigorev
Member

Danil-Grigorev commented Dec 17, 2024

What steps did you take and what happened?

After upgrading CAPI to v1.9, we observed an issue with the CAPRKE2 provider.

RKE2 uses kubelet local mode by default, so its etcd membership management behaves as it does with kubeadm in k/k 1.32.
As a result, API server access is lost once the etcd member is removed, and infrastructure machine deletion cannot proceed.

The issue is that in RKE2 deployments the kubelet is configured to use the local API server (127.0.0.1:443), which in turn relies on the local etcd pod. Once the node is removed from the etcd cluster, the kubelet can no longer reach the API, so the node cannot be drained properly: all pods remain stuck in the Terminating state from the Kubernetes perspective.

Logs from the cluster:

12:21:45.068153       1 recorder.go:104] "success waiting for node volumes detaching Machine's node \"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q\"" logger="events" type="Normal" object={"kind":"Machine","namespace":"create-workload-cluster-s51eu2","name":"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q","uid":"825607f8-f44e-465b-a954-ce3de1eb291c","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"2116"} reason="NodeVolumesDetached"
12:21:56.066942       1 recorder.go:104] "error waiting for node volumes detaching, Machine's node \"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q\": failed to list VolumeAttachments: failed to list VolumeAttachments: Get \"https://172.18.0.3:6443/apis/storage.k8s.io/v1/volumeattachments?limit=100&timeout=10s\": context deadline exceeded" logger="events" type="Warning" object={"kind":"Machine","namespace":"create-workload-cluster-s51eu2","name":"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q","uid":"825607f8-f44e-465b-a954-ce3de1eb291c","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"2162"} reason="FailedWaitForVolumeDetach"
12:21:56.087814       1 controller.go:316] "Reconciler error" err="failed to list VolumeAttachments: failed to list VolumeAttachments: Get \"https://172.18.0.3:6443/apis/storage.k8s.io/v1/volumeattachments?limit=100&timeout=10s\": context deadline exceeded" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="create-workload-cluster-s51eu2/caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q" namespace="create-workload-cluster-s51eu2" name="caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q" reconcileID="960a3889-d9b7-41a3-92c4-63f438b0c980"
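
For readers unfamiliar with the `context deadline exceeded` text in the logs above: it is the error a Go HTTP client surfaces when a request outlives its context deadline, which is what happens when the endpoint stops answering. The following is a minimal, self-contained sketch of that failure mode against a local test server. It is not the Cluster API controller code; `listWithTimeout`, `newStalledServer`, and `demoTimeout` are hypothetical names introduced only for illustration, and the timings are arbitrary.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// listWithTimeout (hypothetical helper) issues a GET that is cancelled
// once the given timeout elapses.
func listWithTimeout(url string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

// newStalledServer (hypothetical helper) starts a local test server that
// stands in for an unresponsive API endpoint. The handler holds each
// request until the client gives up or the delay passes.
func newStalledServer(delay time.Duration) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-r.Context().Done():
		case <-time.After(delay):
		}
	}))
}

// demoTimeout reports whether a request against the stalled server fails
// with context.DeadlineExceeded.
func demoTimeout() bool {
	srv := newStalledServer(2 * time.Second)
	defer srv.Close()

	err := listWithTimeout(srv.URL, 100*time.Millisecond)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("deadline exceeded:", demoTimeout())
}
```

Because the error is only reported after the deadline passes, a reconcile loop that keeps retrying against an endpoint that will never come back keeps paying that wait on every attempt.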

What did you expect to happen?

Draining and volume detachment succeed, and the machine is deleted without issues.

Cluster API version

v1.9.0

Kubernetes version

v1.29.2 - management
v1.31.0 - workload

Anything else you would like to add?

Logs from CI run with all details: https://github.com/rancher/cluster-api-provider-rke2/actions/runs/12372669685/artifacts/2332172988

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 17, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrischdi
Member

chrischdi commented Dec 17, 2024

Questions:

  • Is this using KCP? (I guess not?)
  • v1.29.2 is only the management cluster, right? The workload cluster is >= v1.31.

@Danil-Grigorev
Member Author

It is using RKE2 as the bootstrap provider, and the workload cluster is v1.31.0. I opened a PR; from what I could see, machine deletion proceeds with that change.

@chrischdi
Member

Just to mention it: until we have a proper fix, a viable workaround might be for the control plane provider to add the following two annotations once the point is reached where no drain/detach should be done:

  • machine.cluster.x-k8s.io/exclude-node-draining
  • machine.cluster.x-k8s.io/exclude-wait-for-node-volume-detach
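
As a rough sketch of what the workaround could look like on the provider side, the snippet below sets the two annotations listed above on an annotation map. It uses a plain `map[string]string` instead of the real Machine type so that it stays self-contained; a provider would apply the same keys to the Machine's metadata and persist the change through its client. The function name `withSkipDrainAndDetach` and the choice of an empty value are assumptions for illustration, on the understanding that the presence of the key is what matters.

```go
package main

import "fmt"

// Annotation keys taken from the comment above.
const (
	excludeNodeDrainingAnnotation            = "machine.cluster.x-k8s.io/exclude-node-draining"
	excludeWaitForNodeVolumeDetachAnnotation = "machine.cluster.x-k8s.io/exclude-wait-for-node-volume-detach"
)

// withSkipDrainAndDetach (hypothetical helper) returns a copy of the given
// annotations with both exclusion keys added. The input map is not modified,
// and a nil input is handled.
func withSkipDrainAndDetach(annotations map[string]string) map[string]string {
	out := make(map[string]string, len(annotations)+2)
	for k, v := range annotations {
		out[k] = v
	}
	// Assumption: an empty value is enough because only the key is checked.
	out[excludeNodeDrainingAnnotation] = ""
	out[excludeWaitForNodeVolumeDetachAnnotation] = ""
	return out
}

func main() {
	existing := map[string]string{"example.com/owner": "team-a"}
	updated := withSkipDrainAndDetach(existing)

	for _, key := range []string{
		excludeNodeDrainingAnnotation,
		excludeWaitForNodeVolumeDetachAnnotation,
	} {
		_, ok := updated[key]
		fmt.Printf("%s present: %t\n", key, ok)
	}
}
```

Returning a copy keeps the helper free of side effects, which makes it easier to compare the old and new annotation sets before deciding whether an update call is needed.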

@enxebre
Member

enxebre commented Jan 7, 2025

Is this a control plane Node? Wouldn't this scenario also make any other upcoming node deletion fail to query through the remote client?
EDIT: let's follow up here #11590 (comment)
