Windows Containerd: Kubelet processes not cleaned up due to container_log_manager cleanup issues, results in hung exec processes and memory explosion over time #98102
Comments
/sig windows |
@knabben can you make it so the netpol tests give us fine-grained info on startup? That will make them a better diagnostic for this. |
In this case we're running kube-proxy in userspace, and in certain cases it's failing to close the service port portal. That may be part of what causes this mess to accumulate. |
I've uploaded the kubelet logs here |
note to self, |
Container cc99130d8956da668e5b70ca5913389cf68ed82b9a539d5b7a88eab1932aba96 is unable to create 0.log, but then it cannot delete the folder. |
|
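Not from the issue itself, but a minimal sketch of the Windows behavior that makes this failure mode possible: as long as something still holds a handle to 0.log without delete sharing, removing the file (and therefore emptying its parent directory) fails, which matches the "cannot delete the folder" symptom above. Paths and names below are illustrative placeholders.

```go
// Minimal sketch (illustrative paths/names): on Windows an open handle
// without FILE_SHARE_DELETE typically blocks deletion, so the per-container
// log directory cannot be removed while 0.log is still held open.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dir := filepath.Join(os.TempDir(), "cc99130d-logs") // stand-in for the container log dir
	if err := os.MkdirAll(dir, 0o755); err != nil {
		panic(err)
	}

	// Simulate the runtime (or a scanner such as Defender) holding 0.log open.
	held, err := os.Create(filepath.Join(dir, "0.log"))
	if err != nil {
		panic(err)
	}

	// While the handle is open, this usually fails on Windows with a
	// sharing-violation/"in use" error; the same call succeeds on Linux.
	if err := os.RemoveAll(dir); err != nil {
		fmt.Println("cleanup blocked while 0.log is open:", err)
	}

	held.Close()

	// Once every handle is released, the directory can be removed.
	if err := os.RemoveAll(dir); err != nil {
		fmt.Println("cleanup still failing:", err)
	} else {
		fmt.Println("cleanup succeeded after closing the handle")
	}
}
```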
Full reproducer:
|
https://www.youtube.com/watch?v=OnRBkhutGP8&t=37s <-- recording is here |
After a while, you eventually end up with tons of stray agnhost processes: only a few pods, but tons of agnhost processes floating around in containerd. |
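For anyone trying to spot these strays, here is a rough sketch of listing containerd's view of containers and their task state with the Go client. The Windows named-pipe address and the k8s.io namespace are assumed defaults, not taken from this issue; ctr -n k8s.io tasks ls gives roughly the same picture from the CLI.

```go
// Rough sketch: enumerate containers in the k8s.io namespace and print their
// task status, to spot agnhost containers whose pods are long gone.
// The pipe address and namespace are assumed defaults, not from the issue.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New(`\\.\pipe\containerd-containerd`) // default Windows address
	if err != nil {
		log.Fatalf("connecting to containerd: %v", err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")
	containers, err := client.Containers(ctx)
	if err != nil {
		log.Fatalf("listing containers: %v", err)
	}

	for _, c := range containers {
		task, err := c.Task(ctx, nil)
		if err != nil {
			fmt.Printf("%s\t(no task)\n", c.ID())
			continue
		}
		status, err := task.Status(ctx)
		if err != nil {
			fmt.Printf("%s\t(status error: %v)\n", c.ID(), err)
			continue
		}
		fmt.Printf("%s\t%s\n", c.ID(), status.Status)
	}
}
```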
This might be better off as a containerd issue; not sure yet. |
I think I see the issue...
There's a TODO saying
|
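For context, the kubelet's log rotation lives in pkg/kubelet/logs/container_log_manager.go. The sketch below is a simplified illustration of that rename-then-reopen pattern and of where a Windows failure to recreate 0.log would leave things inconsistent; names and error handling are illustrative, not the actual kubelet code.

```go
// Simplified, illustrative sketch of a rotate-then-reopen flow similar to the
// kubelet's container_log_manager; not the real kubelet code.
package logs

import (
	"fmt"
	"os"
	"time"
)

// runtimeService is a stand-in for the CRI call the kubelet makes.
type runtimeService interface {
	ReopenContainerLog(containerID string) error
}

func rotateLatestLog(rs runtimeService, containerID, logPath string) error {
	rotated := fmt.Sprintf("%s.%s", logPath, time.Now().Format("20060102-150405"))

	// Step 1: move the live log (e.g. 0.log) out of the way.
	if err := os.Rename(logPath, rotated); err != nil {
		return fmt.Errorf("rotate %q: %w", logPath, err)
	}

	// Step 2: ask the runtime to recreate/reopen the log file. This is where
	// the Windows failure described above bites: if the new 0.log cannot be
	// created, the rename has to be rolled back, otherwise the container's
	// log directory is left in a state that later cleanup cannot remove.
	if err := rs.ReopenContainerLog(containerID); err != nil {
		if rbErr := os.Rename(rotated, logPath); rbErr != nil {
			return fmt.Errorf("reopen failed (%v) and rollback failed: %w", err, rbErr)
		}
		return fmt.Errorf("reopen container log for %q: %w", containerID, err)
	}
	return nil
}
```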
We are seeing this error in the e2e tests as well:
|
ahhhhhhh! thank you for confirming this :) |
Still seeing this error in the latest test runs. I also saw memory slowly tick up with these errors in the logs when scaling simple IIS images. /triage accepted |
I was able to reproduce this:
I can see there are now no Windows pods:
But on the Windows node there are leftover containers running:
Containerd thinks there are two:
The container is killed at
10 mins later there is another stop-container issued:
It continues to try to rotate the logs, and then there is this error. 30 mins later it has a
Initially thought
Logs at 22:40:
|
Interestingly, there don't seem to be any containerd tasks running, but HCS and containerd think the containers are still running:
|
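A rough way to see that disagreement from the HCS side is to list the compute systems HCS knows about and compare the IDs against containerd's task list (or ctr -n k8s.io tasks ls). The sketch below assumes the legacy top-level hcsshim API and is not taken from the issue; hcsdiag list shows similar information from the command line.

```go
// Sketch: list HCS compute systems so they can be compared with containerd's
// view. Uses the legacy hcsshim GetContainers API; assumed, not from the issue.
package main

import (
	"fmt"
	"log"

	"github.com/Microsoft/hcsshim"
)

func main() {
	systems, err := hcsshim.GetContainers(hcsshim.ComputeSystemQuery{
		Types: []string{"Container"}, // skip utility VMs
	})
	if err != nil {
		log.Fatalf("querying HCS: %v", err)
	}
	for _, cs := range systems {
		// An ID that shows up here with no matching containerd task is the
		// kind of orphan described in this comment.
		fmt.Printf("%s\tstate=%s\towner=%s\n", cs.ID, cs.State, cs.Owner)
	}
}
```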
/remove-lifecycle rotten |
In the case on CAPZ, Windows Defender was consuming high CPU (90%+); adding an exclusion for the offending process fixed the issue. Now that I have a better understanding of the failure, and confirmation that high CPU does indeed cause this issue, I will see if I can track down the bug, which I believe is related to handling containers on the containerd/HCS side. It seems that if a container starts the removal process and containerd thinks it has deleted the task, but the container is in an unknown state in HCS, it doesn't know how to properly recover. It may be that the container needs to be properly cleaned up in HCS when it is in this state. I found that |
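If the problem really is an HCS compute system stuck in an unknown state after containerd has already dropped the task, one manual mitigation would be to terminate the orphaned compute system directly. This is a sketch only, assuming the legacy hcsshim OpenContainer/Terminate API and a placeholder container ID; it is not something proposed in the issue.

```go
// Sketch of manually cleaning up an orphaned HCS compute system once its ID
// is known (e.g. from hcsdiag list). The ID below is a placeholder, and the
// legacy hcsshim API usage is an assumption, not taken from the issue.
package main

import (
	"log"

	"github.com/Microsoft/hcsshim"
)

func main() {
	const orphanedID = "<container-id-from-hcsdiag-list>" // placeholder

	container, err := hcsshim.OpenContainer(orphanedID)
	if err != nil {
		log.Fatalf("opening compute system %s: %v", orphanedID, err)
	}
	defer container.Close()

	// Terminate forcibly stops the compute system; hcsshim may report the
	// operation as still pending, which callers typically tolerate here.
	if err := container.Terminate(); err != nil && !hcsshim.IsPending(err) {
		log.Fatalf("terminating %s: %v", orphanedID, err)
	}
	log.Printf("terminate issued for %s", orphanedID)
}
```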
@dcantah do we think this issue has been fixed by any of the recent hcsshim/containerd fixes? |
Mmm, I'd have to do an audit of the recent changes, but not that I know of. Has this been seen lately? These two changes seem up this issue's alley though: |
I don't know if this has been seen recently or not. |
Gotcha, but other than those two issues (which aren't checked in) I'm not sure of anything that would've remedied this. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
I got this error message; not sure if the root cause is the same on AKS 1.23.3 when allocating Windows pods
|
/remove-lifecycle stale |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
@jsturtevant Has the issue been fixed? |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What happened:
We are seeing that, on kubelets that use containerd on Windows, processes periodically don't get cleaned up and get leaked. This leads to a gradual increase in resource usage over time, and also leads to what appear to be scenarios where kubectl exec 'hangs' and never actually completes running a command. I found this when running e2e tests for Windows, wherein:
- several agnhost connect commands would run
- ctr ls showed several agnhost processes on the Windows host
- kubectl exec commands hang right when creating the SPDY connections
Details
Here are the kubelet logs taken on systems which exhibit these leaked containerd processes.
This results in pods not getting cleaned up properly, and processes leak through the roof!
Looking at the kubelet's exec processes, we still see agnhost processes floating around, long after all pods are deleted.
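One way to surface the hang programmatically rather than waiting on kubectl exec: a sketch using client-go's remotecommand SPDY executor with a context timeout, so a stall at stream setup becomes a visible error. The pod name, namespace, and command are hypothetical placeholders, and StreamWithContext needs a reasonably recent client-go.

```go
// Sketch: run an exec against a pod over SPDY with a deadline, so a hang at
// connection/stream setup shows up as a timeout error. Pod name, namespace,
// and command are hypothetical placeholders.
package main

import (
	"context"
	"log"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/remotecommand"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace("default").  // placeholder
		Name("agnhost-test").  // placeholder
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Command: []string{"cmd", "/c", "echo", "ok"}, // placeholder command
			Stdout:  true,
			Stderr:  true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		log.Fatal(err)
	}

	// With the bug, this can block at SPDY stream setup; the deadline turns
	// that into a visible error instead of a hung process.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdout: os.Stdout,
		Stderr: os.Stderr,
	}); err != nil {
		log.Fatalf("exec did not complete: %v", err)
	}
}
```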
What you expected to happen:
Running containers managed by the kubelet with containerd would go away cleanly on pod deletion.
How to reproduce it (as minimally and precisely as possible):
Run the tests defined here #98077
Anything else we need to know?:
On these clusters, the final manifestation of this bug is that, when running these tests, exec commands sometimes hang.
Environment:
Kubernetes version (use kubectl version): 1.19
OS (e.g. cat /etc/os-release): Windows 2019