Pod and namespace stuck in terminating state #647

Closed
nilsgstrabo opened this issue Nov 28, 2022 · 4 comments · Fixed by #652
Labels: bug, webhook

Comments

@nilsgstrabo
Contributor

Describe the bug
Pods with finalizers (e.g. "batch.kubernetes.io/job-tracking" when created by the Job controller in Kubernetes >= 1.23) are stuck in Terminating when deleting the Pod's namespace.

Deleting the namespace will delete the Job, the Pod and the ServiceAccount. Because of the finalizer, the Pod is only marked for deletion (its deletionTimestamp is set) and remains until the finalizer is removed. When the Job controller tries to remove the finalizer from the Pod, the update is rejected by the Azure Workload Identity mutating webhook because the ServiceAccount no longer exists. We have to stop the webhook to allow the Job controller to remove the finalizer. Once the finalizer has been removed, we can start the webhook again.

Steps To Reproduce

Create a namespace:
kubectl create ns awi

Create a ServiceAccount:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    azure.workload.identity/client-id: <a valid client id>
  labels:
    azure.workload.identity/use: "true"
  name: my-sa
  namespace: awi
EOF

Create a Job:

cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: quick-start-job
  namespace: awi
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: my-sa
      restartPolicy: Never
      containers:
        - image: busybox
          args: [/bin/sh, -c, 'sleep 300']
          name: busybox
EOF

Delete the namespace before the sleep 300 command finishes:
kubectl delete ns awi

The Pod and Namespace are now stuck in Terminating until the webhook is stopped.

Expected behavior

Not sure what the best/correct way to handle situations like this is. Perhaps ignore validation for pods with a deletionTimestamp?
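The suggestion above could be read as a guard that runs before the webhook looks up the ServiceAccount. The following is a minimal sketch under that assumption, not the project's actual code: `podMeta` and `shouldSkipMutation` are hypothetical names, and the pod JSON (including the timestamp value) is made up for illustration. Only the `metadata.deletionTimestamp` and `metadata.finalizers` fields come from the Kubernetes object model.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podMeta is a hypothetical, trimmed-down view of a Pod used only for this
// sketch; a real webhook would decode into the Kubernetes Pod type instead.
type podMeta struct {
	Metadata struct {
		Name              string   `json:"name"`
		DeletionTimestamp *string  `json:"deletionTimestamp"`
		Finalizers        []string `json:"finalizers"`
	} `json:"metadata"`
}

// shouldSkipMutation reports whether the pod is already marked for deletion.
// In that case a webhook could admit the request as-is, without needing the
// ServiceAccount to still exist.
func shouldSkipMutation(rawPod []byte) (bool, error) {
	var pod podMeta
	if err := json.Unmarshal(rawPod, &pod); err != nil {
		return false, err
	}
	return pod.Metadata.DeletionTimestamp != nil, nil
}

func main() {
	// Illustrative payloads; the timestamp is an invented example value.
	terminating := []byte(`{"metadata":{"name":"quick-start-job-rnpqt","deletionTimestamp":"2022-11-28T15:09:24Z","finalizers":["batch.kubernetes.io/job-tracking"]}}`)
	running := []byte(`{"metadata":{"name":"quick-start-job-rnpqt"}}`)

	for _, raw := range [][]byte{terminating, running} {
		skip, err := shouldSkipMutation(raw)
		if err != nil {
			panic(err)
		}
		fmt.Println("skip mutation:", skip)
	}
}
```

With a check like this, the Job controller's finalizer-removal update on a terminating Pod would pass through untouched, while Pods that are not being deleted would still go through the normal ServiceAccount lookup.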

Logs

E1128 15:09:24.741764       1 webhook.go:115] handler "msg"="failed to get service account" "error"="serviceaccounts \"my-sa\" not found" "namespace"="awi" "pod"="quick-start-job-rnpqt" "service-account"="my-sa"
E1128 15:09:24.795016       1 webhook.go:115] handler "msg"="failed to get service account" "error"="serviceaccounts \"my-sa\" not found" "namespace"="awi" "pod"="quick-start-job-rnpqt" "service-account"="my-sa"
E1128 15:09:24.823438       1 webhook.go:115] handler "msg"="failed to get service account" "error"="serviceaccounts \"my-sa\" not found" "namespace"="awi" "pod"="quick-start-job-rnpqt" "service-account"="my-sa"
E1128 15:09:25.311920       1 webhook.go:115] handler "msg"="failed to get service account" "error"="serviceaccounts \"my-sa\" not found" "namespace"="awi" "pod"="quick-start-job-rnpqt" "service-account"="my-sa"
E1128 15:09:35.344306       1 webhook.go:115] handler "msg"="failed to get service account" "error"="serviceaccounts \"my-sa\" not found" "namespace"="awi" "pod"="quick-start-job-rnpqt" "service-account"="my-sa"
E1128 15:09:38.938559       1 webhook.go:115] handler "msg"="failed to get service account" "error"="serviceaccounts \"my-sa\" not found" "namespace"="awi" "pod"="quick-start-job-rnpqt" "service-account"="my-sa"
E1128 15:09:39.237131       1 webhook.go:115] handler "msg"="failed to get service account" "error"="serviceaccounts \"my-sa\" not found" "namespace"="awi" "pod"="quick-start-job-rnpqt" "service-account"="my-sa"
E1128 15:09:55.390677       1 webhook.go:115] handler "msg"="failed to get service account" "error"="serviceaccounts \"my-sa\" not found" "namespace"="awi" "pod"="quick-start-job-rnpqt" "service-account"="my-sa"
E1128 15:12:35.440088       1 webhook.go:115] handler "msg"="failed to get service account" "error"="serviceaccounts \"my-sa\" not found" "namespace"="awi" "pod"="quick-start-job-rnpqt" "service-account"="my-sa"

Environment
AKS - version 1.23.8

nilsgstrabo added the bug label Nov 28, 2022
@aramase
Member

aramase commented Nov 29, 2022

> Not sure what the best/correct way to handle situations like this is. Perhaps ignore validation for pods with a deletionTimestamp?

I think that's a valid point.

We're making a change to only mutate pods that have labels (ref: #601), but even with that, if the pod is terminating I don't think we need to handle that request.

Thanks for the detailed issue. I'll include this update in the next release.

@aramase
Member

aramase commented Dec 7, 2022

Closed with #652

aramase closed this as completed Dec 7, 2022
@peter-edb

peter-edb commented Nov 23, 2023

Hi @aramase, I think we are still seeing something similar in AKS v1.27.3 with Workload Identity v0.14.0. Is this expected, given that the fix is in v0.15.0 per #652?

k patch pod podName -n namespace -p '{"metadata":{"finalizers":null}}'
Error from server: admission webhook "mutation.azure-workload-identity.io" denied the request: serviceaccounts "podName" not found

@ringerc

ringerc commented Nov 23, 2023

This issue can lead to the Job resource being deleted before the Pod has its batch.kubernetes.io/job-tracking finalizer removed, which shouldn't happen per Kubernetes 1.26 job tracking.

The webhook should gracefully tolerate the absence of a service account when the pod is being modified to remove a finalizer, instead of failing with serviceaccounts "thepodserviceaccount" not found.

The webhook should not mutate the pod when a finalizer is being removed. See kubernetes/kubernetes#121828 (comment)
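To make the "finalizer is being removed" condition concrete, here is a minimal sketch of how an update could be classified by comparing the finalizer lists of the old and new objects. This is an illustration only, not code from the webhook: `finalizerRemoved` is a hypothetical helper, and it assumes the caller has already extracted `metadata.finalizers` from both versions of the Pod.

```go
package main

import "fmt"

// finalizerRemoved reports whether at least one finalizer present on the old
// object is missing from the new object. A webhook could use a check like
// this to pass finalizer-removal updates through without mutating the pod.
func finalizerRemoved(oldFinalizers, newFinalizers []string) bool {
	remaining := make(map[string]struct{}, len(newFinalizers))
	for _, f := range newFinalizers {
		remaining[f] = struct{}{}
	}
	for _, f := range oldFinalizers {
		if _, ok := remaining[f]; !ok {
			return true
		}
	}
	return false
}

func main() {
	before := []string{"batch.kubernetes.io/job-tracking"}
	after := []string{}
	fmt.Println("finalizer removed:", finalizerRemoved(before, after))
}
```

The comparison is one-directional on purpose: adding a finalizer does not count as a removal, so only updates that drop a finalizer would be passed through by this check.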

To work around the issue, you can temporarily re-create the service account the mutation.azure-workload-identity.io webhook expects to find, then patch the pod to delete the finalizer.

The fix in #652 (included in v0.15.0) makes the webhook skip firing on UPDATE of pods completely, so another workaround option is to patch the webhook manifest to apply that change locally.
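As described above, the fix stops the webhook from firing on Pod UPDATE requests at all. The same effect can be sketched at the handler level, purely as an illustration and not as a description of how #652 is implemented: `admissionReview` and `needsMutation` are hypothetical names, and the struct covers only the `request.uid` and `request.operation` fields of an AdmissionReview payload.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// admissionReview is a hypothetical, minimal subset of an AdmissionReview
// payload, reduced to the fields this sketch needs.
type admissionReview struct {
	Request struct {
		UID       string `json:"uid"`
		Operation string `json:"operation"`
	} `json:"request"`
}

// needsMutation returns true only for CREATE requests. Any other operation,
// including the UPDATE that removes a finalizer, would be admitted unchanged.
func needsMutation(rawReview []byte) (bool, error) {
	var review admissionReview
	if err := json.Unmarshal(rawReview, &review); err != nil {
		return false, err
	}
	return review.Request.Operation == "CREATE", nil
}

func main() {
	for _, op := range []string{"CREATE", "UPDATE"} {
		// The uid value is a placeholder for illustration.
		raw := []byte(fmt.Sprintf(`{"request":{"uid":"example","operation":%q}}`, op))
		mutate, err := needsMutation(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> mutate: %v\n", op, mutate)
	}
}
```

Filtering in the webhook configuration, as the fix is described to do, avoids the round trip to the webhook entirely; a handler-level check like this one would still require the webhook to be reachable for every UPDATE.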
