Windows Containerd: Kubelet processes not cleaned up due to container_log_manager cleanup issues, results in hung exec processes and memory explosion over time #98102
Comments
/sig windows |
@knabben can you make it so the netpol tests give us fine-grained info on startup? That will make them a better diagnostic for this. |
In this case we're running kube-proxy in userspace, and in certain cases it's failing to close the service port portal. That may be part of what causes this mess to accumulate. |
I've uploaded the kubelet logs here |
note to self, |
Container cc99130d8956da668e5b70ca5913389cf68ed82b9a539d5b7a88eab1932aba96 is unable to create 0.log, but then it cannot delete the folder. |
|
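Not from the issue itself, but a minimal sketch of the Windows behavior that makes this failure mode possible: as long as something still holds a handle to 0.log without delete sharing, removing the file (and therefore emptying its parent directory) fails, which matches the "cannot delete the folder" symptom above. Paths and names below are illustrative placeholders.

```go
// Minimal sketch (illustrative paths/names): on Windows an open handle
// without FILE_SHARE_DELETE typically blocks deletion, so the per-container
// log directory cannot be removed while 0.log is still held open.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dir := filepath.Join(os.TempDir(), "cc99130d-logs") // stand-in for the container log dir
	if err := os.MkdirAll(dir, 0o755); err != nil {
		panic(err)
	}

	// Simulate the runtime (or a scanner such as Defender) holding 0.log open.
	held, err := os.Create(filepath.Join(dir, "0.log"))
	if err != nil {
		panic(err)
	}

	// While the handle is open, this usually fails on Windows with a
	// sharing-violation/"in use" error; the same call succeeds on Linux.
	if err := os.RemoveAll(dir); err != nil {
		fmt.Println("cleanup blocked while 0.log is open:", err)
	}

	held.Close()

	// Once every handle is released, the directory can be removed.
	if err := os.RemoveAll(dir); err != nil {
		fmt.Println("cleanup still failing:", err)
	} else {
		fmt.Println("cleanup succeeded after closing the handle")
	}
}
```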
Full reproducer:
|
https://www.youtube.com/watch?v=OnRBkhutGP8&t=37s <-- recording is here |
After a while, you eventually end up with tons of stray agnhost processes: only a few pods, but tons of agnhost processes floating around in containerd. |
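For anyone trying to spot these strays, here is a rough sketch of listing containerd's view of containers and their task state with the Go client. The Windows named-pipe address and the k8s.io namespace are assumed defaults, not taken from this issue; ctr -n k8s.io tasks ls gives roughly the same picture from the CLI.

```go
// Rough sketch: enumerate containers in the k8s.io namespace and print their
// task status, to spot agnhost containers whose pods are long gone.
// The pipe address and namespace are assumed defaults, not from the issue.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New(`\\.\pipe\containerd-containerd`) // default Windows address
	if err != nil {
		log.Fatalf("connecting to containerd: %v", err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")
	containers, err := client.Containers(ctx)
	if err != nil {
		log.Fatalf("listing containers: %v", err)
	}

	for _, c := range containers {
		task, err := c.Task(ctx, nil)
		if err != nil {
			fmt.Printf("%s\t(no task)\n", c.ID())
			continue
		}
		status, err := task.Status(ctx)
		if err != nil {
			fmt.Printf("%s\t(status error: %v)\n", c.ID(), err)
			continue
		}
		fmt.Printf("%s\t%s\n", c.ID(), status.Status)
	}
}
```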
This might be better off as a containerd issue; not sure yet. |
I think I see the issue...
There's a TODO saying
|
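For context, the kubelet's log rotation lives in pkg/kubelet/logs/container_log_manager.go. The sketch below is a simplified illustration of that rename-then-reopen pattern and of where a Windows failure to recreate 0.log would leave things inconsistent; names and error handling are illustrative, not the actual kubelet code.

```go
// Simplified, illustrative sketch of a rotate-then-reopen flow similar to the
// kubelet's container_log_manager; not the real kubelet code.
package logs

import (
	"fmt"
	"os"
	"time"
)

// runtimeService is a stand-in for the CRI call the kubelet makes.
type runtimeService interface {
	ReopenContainerLog(containerID string) error
}

func rotateLatestLog(rs runtimeService, containerID, logPath string) error {
	rotated := fmt.Sprintf("%s.%s", logPath, time.Now().Format("20060102-150405"))

	// Step 1: move the live log (e.g. 0.log) out of the way.
	if err := os.Rename(logPath, rotated); err != nil {
		return fmt.Errorf("rotate %q: %w", logPath, err)
	}

	// Step 2: ask the runtime to recreate/reopen the log file. This is where
	// the Windows failure described above bites: if the new 0.log cannot be
	// created, the rename has to be rolled back, otherwise the container's
	// log directory is left in a state that later cleanup cannot remove.
	if err := rs.ReopenContainerLog(containerID); err != nil {
		if rbErr := os.Rename(rotated, logPath); rbErr != nil {
			return fmt.Errorf("reopen failed (%v) and rollback failed: %w", err, rbErr)
		}
		return fmt.Errorf("reopen container log for %q: %w", containerID, err)
	}
	return nil
}
```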
We are seeing this error in the e2e tests as well:
|
ahhhhhhh! thank you for confirming this :) |
Still seeing this error in the latest test runs. I also saw memory slowly tick up with these errors in the logs when scaling simple IIS images. /triage accepted |
I was able to reproduce this:
I can see there are now no Windows pods:
But on the Windows node there are leftover containers running:
Containerd thinks there are two:
The container is killed at
10 mins later there is another stop-container issued:
It continues to try to rotate the logs, and then there is this error. 30 mins later it has a
Initially thought
Logs at 22:40:
|
Interestingly, there don't seem to be any containerd tasks running, but HCS and containerd think the containers are still running:
|
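A rough way to see that disagreement from the HCS side is to list the compute systems HCS knows about and compare the IDs against containerd's task list (or ctr -n k8s.io tasks ls). The sketch below assumes the legacy top-level hcsshim API and is not taken from the issue; hcsdiag list shows similar information from the command line.

```go
// Sketch: list HCS compute systems so they can be compared with containerd's
// view. Uses the legacy hcsshim GetContainers API; assumed, not from the issue.
package main

import (
	"fmt"
	"log"

	"github.com/Microsoft/hcsshim"
)

func main() {
	systems, err := hcsshim.GetContainers(hcsshim.ComputeSystemQuery{
		Types: []string{"Container"}, // skip utility VMs
	})
	if err != nil {
		log.Fatalf("querying HCS: %v", err)
	}
	for _, cs := range systems {
		// An ID that shows up here with no matching containerd task is the
		// kind of orphan described in this comment.
		fmt.Printf("%s\tstate=%s\towner=%s\n", cs.ID, cs.State, cs.Owner)
	}
}
```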
/remove-lifecycle rotten |
In the case on CAPZ, Windows Defender was consuming high CPU (90%+); adding an exclusion for the offending process fixed the issue. Now that I have a better understanding of the failure, and confirmation that high CPU does indeed cause this issue, I will see if I can track down the bug, which I believe is related to handling containers on the containerd/HCS side. It seems that if a container starts the removal process and containerd thinks it has deleted the task, but the container is in an unknown state in HCS, it doesn't know how to properly recover. It may be that the container needs to be properly cleaned up in HCS when it is in this state. I found that |
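If the problem really is an HCS compute system stuck in an unknown state after containerd has already dropped the task, one manual mitigation would be to terminate the orphaned compute system directly. This is a sketch only, assuming the legacy hcsshim OpenContainer/Terminate API and a placeholder container ID; it is not something proposed in the issue.

```go
// Sketch of manually cleaning up an orphaned HCS compute system once its ID
// is known (e.g. from hcsdiag list). The ID below is a placeholder, and the
// legacy hcsshim API usage is an assumption, not taken from the issue.
package main

import (
	"log"

	"github.com/Microsoft/hcsshim"
)

func main() {
	const orphanedID = "<container-id-from-hcsdiag-list>" // placeholder

	container, err := hcsshim.OpenContainer(orphanedID)
	if err != nil {
		log.Fatalf("opening compute system %s: %v", orphanedID, err)
	}
	defer container.Close()

	// Terminate forcibly stops the compute system; hcsshim may report the
	// operation as still pending, which callers typically tolerate here.
	if err := container.Terminate(); err != nil && !hcsshim.IsPending(err) {
		log.Fatalf("terminating %s: %v", orphanedID, err)
	}
	log.Printf("terminate issued for %s", orphanedID)
}
```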
@dcantah do we think this issue has been fixed by any of the recent hcsshim/containerd fixes? |
Mmm, I'd have to do an audit of the recent changes, but not that I know of. Has this been seen lately? These two changes seem up this issue's alley though: |
I don't know if this has been seen recently or not. |
Gotcha, but other than those two issues (which aren't checked in) I'm not sure of anything that would've remedied this. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
I got this error message; not sure if the root cause is the same on AKS 1.23.3 when allocating Windows pods
|
/remove-lifecycle stale |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
@jsturtevant Has the issue been fixed? |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What happened:
We are seeing that, on kubelets that use containerd on Windows, processes periodically don't get cleaned up and get leaked. This leads to a gradual increase in resource usage over time, and also leads to what appear to be scenarios where kubectl exec 'hangs' and never actually completes running a command. I found this when running e2e tests for Windows, wherein:
- several agnhost connect commands would run
- ctr ls showed several agnhost processes on the Windows host
- kubectl exec commands hang right when creating the SPDY connections
Details
Here are the kubelet logs taken on systems which exhibit these leaked containerd processes.
This results in pods not getting cleaned up properly, and processes leak through the roof!
Looking at the kubelet's exec processes, we still see agnhost processes floating around, long after all pods are deleted.
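One way to surface the hang programmatically rather than waiting on kubectl exec: a sketch using client-go's remotecommand SPDY executor with a context timeout, so a stall at stream setup becomes a visible error. The pod name, namespace, and command are hypothetical placeholders, and StreamWithContext needs a reasonably recent client-go.

```go
// Sketch: run an exec against a pod over SPDY with a deadline, so a hang at
// connection/stream setup shows up as a timeout error. Pod name, namespace,
// and command are hypothetical placeholders.
package main

import (
	"context"
	"log"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/remotecommand"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace("default").  // placeholder
		Name("agnhost-test").  // placeholder
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Command: []string{"cmd", "/c", "echo", "ok"}, // placeholder command
			Stdout:  true,
			Stderr:  true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		log.Fatal(err)
	}

	// With the bug, this can block at SPDY stream setup; the deadline turns
	// that into a visible error instead of a hung process.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdout: os.Stdout,
		Stderr: os.Stderr,
	}); err != nil {
		log.Fatalf("exec did not complete: %v", err)
	}
}
```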
What you expected to happen:
Running containers managed by the kubelet with containerd would go away cleanly on pod deletion.
How to reproduce it (as minimally and precisely as possible):
Run the tests defined here #98077
Anything else we need to know?:
On these clusters, the final manifestation of this bug is that, when running these tests, exec commands sometimes hang.
Environment:
Kubernetes version (use kubectl version): 1.19
OS (e.g. cat /etc/os-release): Windows 2019