Pods stuck in "Terminating" state on Windows nodes when deleting the pod after docker restart #207

Closed
ilanitr opened this issue Mar 9, 2022 · 35 comments
Labels
Windows on Kubernetes Windows Containers using Kubernetes

@ilanitr

ilanitr commented Mar 9, 2022

We are using MCR version 20.10.9 (https://docs.mirantis.com/mcr/20.10/release-notes/20-10-9.html). After restarting the docker service and then deleting a pod, the pod gets stuck in the "Terminating" state. Describing the pod shows:

container ecddba6213ecd857171e6af11988979b5f0d16629509ad6f42385147ab24f2d9: driver "windowsfilter" failed to remove root filesystem: failed to detach VHD: failed to open virtual disk: The process cannot access the file because it is being used by another process.: rename C:\ProgramData\docker\windowsfilter\ecddba6213ecd857171e6af11988979b5f0d16629509ad6f42385147ab24f2d9 C:\ProgramData\docker\windowsfilter\ecddba6213ecd857171e6af11988979b5f0d16629509ad6f42385147ab24f2d9-removing: Access is denied.

Kubelet logs:

I0309 16:38:47.933233    2040 kuberuntime_container.go:661] "Killing container with a grace period override" pod="default/windows-log-test-6c9956b86d-t5jcp" podUID=9afb7084-8a8d-4239-8d7c-18176c80ae21 containerName="winloggertest" containerID="docker://8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2" gracePeriod=30
I0309 16:38:47.933773    2040 fake_topology_manager.go:47] "RemoveContainer" containerID="8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2"
E0309 16:38:47.945199    2040 remote_runtime.go:296] "RemoveContainer from runtime service failed" err="rpc error: code = Unknown desc = failed to remove container \"8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2\": Error response from daemon: container 8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2: driver \"windowsfilter\" failed to remove root filesystem: failed to detach VHD: failed to open virtual disk: The process cannot access the file because it is being used by another process.: rename C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2 C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2-removing: Access is denied." containerID="8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2"
E0309 16:38:47.945199    2040 kuberuntime_gc.go:146] "Failed to remove container" err="rpc error: code = Unknown desc = failed to remove container \"8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2\": Error response from daemon: container 8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2: driver \"windowsfilter\" failed to remove root filesystem: failed to detach VHD: failed to open virtual disk: The process cannot access the file because it is being used by another process.: rename C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2 C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2-removing: Access is denied." containerID="8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2"

Using kubectl delete pod --force did remove the pod from the cluster, but according to the kubelet logs, the kubelet keeps trying to remove the old container without success:

I0309 16:50:48.263635    2040 kuberuntime_container.go:661] "Killing container with a grace period override" pod="default/windows-log-test-6c9956b86d-t5jcp" podUID=9afb7084-8a8d-4239-8d7c-18176c80ae21 containerName="winloggertest" containerID="docker://8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2" gracePeriod=30
I0309 16:50:48.264169    2040 fake_topology_manager.go:47] "RemoveContainer" containerID="8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2"
E0309 16:50:48.275910    2040 remote_runtime.go:296] "RemoveContainer from runtime service failed" err="rpc error: code = Unknown desc = failed to remove container \"8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2\": Error response from daemon: container 8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2: driver \"windowsfilter\" failed to remove root filesystem: failed to detach VHD: failed to open virtual disk: The process cannot access the file because it is being used by another process.: rename C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2 C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2-removing: Access is denied." containerID="8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2"
E0309 16:50:48.275910    2040 kuberuntime_gc.go:146] "Failed to remove container" err="rpc error: code = Unknown desc = failed to remove container \"8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2\": Error response from daemon: container 8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2: driver \"windowsfilter\" failed to remove root filesystem: failed to detach VHD: failed to open virtual disk: The process cannot access the file because it is being used by another process.: rename C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2 C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2-removing: Access is denied." containerID="8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2"
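
To see how often each container hits this failure, the kubelet log can be scanned for the error text quoted above. The following is only a minimal sketch: the function name `failed_removals` is hypothetical, and it assumes the log lines have the same shape as the excerpts in this issue (a 64-character hex container ID after "failed to remove container", plus the "failed to detach VHD" message).

```python
import re

# Matches the container ID that follows "failed to remove container".
# The quotes may or may not be backslash-escaped, depending on how the
# log line was captured.
REMOVE_FAILED = re.compile(
    r'failed to remove container \\?"(?P<id>[0-9a-f]{64})\\?"'
)


def failed_removals(log_lines):
    """Count VHD-detach removal failures per container ID.

    Returns a dict mapping container ID to the number of matching lines.
    """
    counts = {}
    for line in log_lines:
        # Only consider the windowsfilter failure reported in this issue.
        if "failed to detach VHD" not in line:
            continue
        match = REMOVE_FAILED.search(line)
        if match:
            container_id = match.group("id")
            counts[container_id] = counts.get(container_id, 0) + 1
    return counts
```

Passing the lines of a kubelet log file to `failed_removals` gives a quick view of which containers are being retried repeatedly. Note that in the excerpts above each failed attempt is logged twice (once by remote_runtime.go and once by kuberuntime_gc.go), so the counts reflect log lines, not attempts.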

Searching for open handles on this path shows that the System process holds sandbox.vhdx:

System	4	File	C:\ProgramData\docker\windowsfilter\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2\sandbox.vhdx
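
If the handle search results are exported as text, the entries that pin a container's sandbox.vhdx can be filtered out programmatically. This is a rough sketch under stated assumptions: it assumes tab-separated rows in the order shown above (process name, PID, handle type, path), and the names `HandleEntry`, `parse_handle_line`, and `sandbox_holders` are hypothetical helpers, not part of any tool.

```python
from typing import List, NamedTuple, Optional


class HandleEntry(NamedTuple):
    process: str
    pid: int
    kind: str
    path: str


def parse_handle_line(line: str) -> Optional[HandleEntry]:
    """Parse one tab-separated row; return None if it does not fit the assumed layout."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != 4 or not parts[1].isdigit():
        return None
    return HandleEntry(parts[0], int(parts[1]), parts[2], parts[3])


def sandbox_holders(lines) -> List[HandleEntry]:
    """Return the entries whose path points at a sandbox.vhdx file."""
    holders = []
    for line in lines:
        entry = parse_handle_line(line)
        # Windows paths are case-insensitive, so compare in lower case.
        if entry and entry.path.lower().endswith("\\sandbox.vhdx"):
            holders.append(entry)
    return holders
```

An entry owned by the System process (PID 4), as in the output above, suggests the VHD is still attached at the kernel level rather than being held by a user-mode process.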

Environment:

  • Kubernetes version: v1.21.5
  • OS version: Windows Server 2019 Datacenter 10.0.17763.2565
  • Docker version: docker://20.10.9
@ghost ghost added the triage New and needs attention label Mar 9, 2022
@cwilhit cwilhit added Windows on Kubernetes Windows Containers using Kubernetes and removed triage New and needs attention labels Mar 11, 2022
@cwilhit
Contributor

cwilhit commented Mar 12, 2022

Is this Kubernetes cluster running in AKS or elsewhere? It looks similar to #106, although that issue appeared to be related to cleaning up a log file and yours does not. @kevpar or @dcantah, do you have any insights here?

@ilanitr
Author

ilanitr commented Mar 14, 2022

The issue reproduces on a GKE on-prem cluster running on vSphere 7.0.

@ghost

ghost commented Apr 14, 2022

This issue has been open for 30 days with no updates. Please provide an update or close this issue.

@ilanitr
Author

ilanitr commented Apr 14, 2022

We are still experiencing this issue. Could you reproduce this issue?

@cwilhit
Contributor

cwilhit commented Apr 21, 2022

Hi, just following up to let you know we are currently investigating a similar issue. Nothing more to report right now though.

@fady-azmy-msft
Contributor

Hey @ilanitr, we're in the middle of a fix for this and tracking a similar issue internally (internal ID: 31661934)

@fady-azmy-msft
Contributor

No updates to share yet, but we're making progress on the fix for this.

@fady-azmy-msft
Contributor

We've already worked on a fix, but we don't have a timeline to share yet for when it will be available.

@microsoft-github-policy-service
Contributor

This issue has been open for 30 days with no updates.
no assignees, please provide an update or close this issue.

@jeffwmiles

@fady-azmy-msft you were the last to comment on this a while ago - has there been any progress on this item?

@fady-azmy-msft
Contributor

@jeffwmiles yes, the fix was validated internally and we expect the fix to roll out soon. I don't have a timeline to share unfortunately since this depends on our backporting priority and process but I will update the thread when it comes out.

@braveness23

I too am looking forward to a fix.

@jeffwmiles

Any update on this? We're closing in on 2 years since a fix was developed, and it seemingly has yet to be released.

@fady-azmy-msft
Contributor

Hey all, thank you for your patience on this. We shipped the fix for this issue with the February 2024 release of Windows Server 2022. Can you confirm that the issue has been resolved on your end?

@ntrappe-msft
Contributor

^ Bumping this.

@ntrappe-msft
Contributor

Closing due to lack of response.
