
Windows Containerd: Kubelet processes not cleaned up due to container_log_manager cleanup issues, results in hung exec processes and memory explosion over time #98102

Closed
jayunit100 opened this issue Jan 15, 2021 · 42 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/windows Categorizes an issue or PR as relevant to SIG Windows. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@jayunit100
Member

jayunit100 commented Jan 15, 2021

What happened:

We are seeing that, on kubelets that use containerd on Windows, processes periodically don't get cleaned up and are leaked. This leads to a gradual increase in resource usage over time, and also to what appear to be scenarios where kubectl exec 'hangs' and never actually completes running a command.

I found this when running e2e tests for Windows, wherein several agnhost connect commands would run.

  • After deleting all e2e pods, I could still see that ctr -n "k8s.io" c ls listed several agnhost containers on the Windows host.
  • Coincidentally, at the same time, the e2e tests sometimes hang when running kubectl exec commands. They hang right when creating the SPDY connections.

Details

Here are the kubelet logs taken on systems that exhibit these leaked containerd processes:

I0115 15:25:03.849485    4516 fake_cpu_manager.go:49] [fake cpumanager] AddContainer (pod: a, container: cont-81-udp, container id: c5d449b9fc60b9e67e1b1b6ff2e271d2801eb564c4a92f15bf62f0df1b261d07)
I0115 15:25:03.849485    4516 fake_topology_manager.go:43] [fake topologymanager] AddContainer  pod: &Pod{ObjectMeta:{a  netpol-7397-z /api/v1/namespaces/netpol-7397-z/pods/a baf64d23-ffa9-4e44-b93f-d7c3c6675668 396241 0 20
21-01-15 07:24:26 -0800 PST <nil> <nil> map[pod:a] map[kubernetes.io/config.seen:2021-01-15T15:24:26.612122900-08:00 kubernetes.io/config.source:api] [] []  [{e2e.test.win.12 Update v1 2021-01-15 07:24:26 -0800 PST FieldsV1
 {"f:metadata":{"f:labels":{".":{},"f:pod":{}}},"f:spec":{"f:containers":{"k:{\"name\":\"cont-80-tcp\"}":{".":{},"f:command":{},"f:image":{},"f:imagePullPolicy":{},"f:name":{},"f:ports":{".":{},"k:{\"containerPort\":80,\"pr
otocol\":\"TCP\"}":{".":{},"f:containerPort":{},"f:name":{},"f:protocol":{}}},"f:resources":{},"f:securityContext":{},"f:terminationMessagePath":{},"f:terminationMessagePolicy":{}},"k:{\"name\":\"cont-80-udp\"}":{".":{},"f:
command":{},"f:image":{},"f:imagePullPolicy":{},"f:name":{},"f:ports":{".":{},"k:{\"containerPort\":80,\"protocol\":\"UDP\"}":{".":{},"f:containerPort":{},"f:name":{},"f:protocol":{}}},"f:resources":{},"f:securityContext":{
},"f:terminationMessagePath":{},"f:terminationMessagePolicy":{}},"k:{\"name\":\"cont-81-tcp\"}":{".":{},"f:command":{},"f:image":{},"f:imagePullPolicy":{},"f:name":{},"f:ports":{".":{},"k:{\"containerPort\":81,\"protocol\":
\"TCP\"}":{".":{},"f:containerPort":{},"f:name":{},"f:protocol":{}}},"f:resources":{},"f:securityContext":{},"f:terminationMessagePath":{},"f:terminationMessagePolicy":{}},"k:{\"name\":\"cont-81-udp\"}":{".":{},"f:command":
{},"f:image":{},"f:imagePullPolicy":{},"f:name":{},"f:ports":{".":{},"k:{\"containerPort\":81,\"protocol\":\"UDP\"}":{".":{},"f:containerPort":{},"f:name":{},"f:protocol":{}}},"f:resources":{},"f:securityContext":{},"f:term
inationMessagePath":{},"f:terminationMessagePolicy":{}}},"f:dnsPolicy":{},"f:enableServiceLinks":{},"f:nodeSelector":{".":{},"f:kubernetes.io/os":{}},"f:restartPolicy":{},"f:schedulerName":{},"f:securityContext":{},"f:termi
nationGracePeriodSeconds":{}}}}]},Spec:PodSpec{Volumes:[]Volume{Volume{Name:default-token-z6nkr,VolumeSource:VolumeSource{HostPath:nil,EmptyDir:nil,GCEPersistentDisk:nil,AWSElasticBlockStore:nil,GitRepo:nil,Secret:&SecretVo
lumeSource{SecretName:default-token-z6nkr,Items:[]KeyToPath{},DefaultMode:*420,Optional:nil,},NFS:nil,ISCSI:nil,Glusterfs:nil,PersistentVolumeClaim:nil,RBD:nil,FlexVolume:nil,Cinder:nil,CephFS:nil,Flocker:nil,DownwardAPI:ni
l,FC:nil,AzureFile:nil,ConfigMap:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Projected:nil,StorageOS:nil,CSI:nil,Ephemeral:nil,},},},Containers:[]Container{Contain
er{Name:cont-80-tcp,Image:k8s.gcr.io/e2e-test-images/agnhost:2.21,Command:[/agnhost serve-hostname --tcp --http=false --port 80],Args:[],WorkingDir:,Ports:[]ContainerPort{ContainerPort{Name:serve-80-tcp,HostPort:0,Container
Port:80,Protocol:TCP,HostIP:,},},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:default-token-z6nkr,ReadOnly:true,MountPath:/var/run
/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContex
t:&SecurityContext{Capabilities:nil,Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil
,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,},Container{Name:cont-80-udp,Image:k8s.gcr.io/e2e-test-images/agnhost:2.21,Com
mand:[/agnhost serve-hostname --udp --http=false --port 80],Args:[],WorkingDir:,Ports:[]ContainerPort{ContainerPort{Name:serve-80-udp,HostPort:0,ContainerPort:80,Protocol:UDP,HostIP:,},},Env:[]EnvVar{},Resources:ResourceReq
uirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:default-token-z6nkr,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,S
ubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:nil,SELinuxOptions:nil
,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},Te
rminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,},Container{Name:cont-81-tcp,Image:k8s.gcr.io/e2e-test-images/agnhost:2.21,Command:[/agnhost serve-hostname --tcp --http=false --port 81],Args:[],W
orkingDir:,Ports:[]ContainerPort{ContainerPort{Name:serve-81-tcp,HostPort:0,ContainerPort:81,Protocol:TCP,HostIP:,},},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMount
s:[]VolumeMount{VolumeMount{Name:default-token-z6nkr,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,Te
rminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivi
legeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupPro
be:nil,},Container{Name:cont-81-udp,Image:k8s.gcr.io/e2e-test-images/agnhost:2.21,Command:[/agnhost serve-hostname --udp --http=false --port 81],Args:[],WorkingDir:,Ports:[]ContainerPort{ContainerPort{Name:serve-81-udp,Host
Port:0,ContainerPort:81,Protocol:UDP,HostIP:,},},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:default-token-z6nkr,ReadOnly:true,Mo
untPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresen
t,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,Se
ccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,},},RestartPolicy:Always,TerminationGracePeriodSeconds:*0,Acti
veDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{kubernetes.io/os: windows,},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:clusterapi-peri-md-0-windows-containerd-5dc857845c
-5p4jt,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:n
il,FSGroupChangePolicy:nil,SeccompProfile:nil,},ImagePullSecrets:[]LocalObjectReference{},Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[]Container{},AutomountServiceAccountToken:nil,Toler
ations:[]Toleration{Toleration{Key:node.kubernetes.io/not-ready,Operator:Exists,Value:,Effect:NoExecute,TolerationSeconds:*300,},Toleration{Key:node.kubernetes.io/unreachable,Operator:Exists,Value:,Effect:NoExecute,Tolerati
onSeconds:*300,},},HostAliases:[]HostAlias{},PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[]PodReadinessGate{},RuntimeClassName:nil,EnableServiceLinks:*true,PreemptionPolicy:*Preempt
LowerPriority,Overhead:ResourceList{},TopologySpreadConstraints:[]TopologySpreadConstraint{},EphemeralContainers:[]EphemeralContainer{},SetHostnameAsFQDN:nil,},Status:PodStatus{Phase:Pending,Conditions:[]PodCondition{PodCon
dition{Type:PodScheduled,Status:True,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2021-01-15 07:24:26 -0800 PST,Reason:,Message:,},},Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[]Cont
ainerStatus{},QOSClass:BestEffort,InitContainerStatuses:[]ContainerStatus{},NominatedNodeName:,PodIPs:[]PodIP{},EphemeralContainerStatuses:[]ContainerStatus{},},} container id:  c5d449b9fc60b9e67e1b1b6ff2e271d2801eb564c4a92
f15bf62f0df1b261d07
E0115 15:25:07.471605    4516 remote_runtime.go:563] ReopenContainerLog "cc99130d8956da668e5b70ca5913389cf68ed82b9a539d5b7a88eab1932aba96" from runtime service failed: rpc error: code = Unknown desc = failed to create and o
pen log file: The system cannot find the path specified.
E0115 15:25:07.472669    4516 container_log_manager.go:243] Container "cc99130d8956da668e5b70ca5913389cf68ed82b9a539d5b7a88eab1932aba96" log "\\var\\log\\pods\\netpol-2992-z_a_d54facd6-41e5-4655-8342-51fcfb538634\\cont-81-t
cp\\0.log" doesn't exist, reopen container log failed: rpc error: code = Unknown desc = failed to create and open log file: The system cannot find the path specified.
E0115 15:25:17.530934    4516 remote_runtime.go:563] ReopenContainerLog "cc99130d8956da668e5b70ca5913389cf68ed82b9a539d5b7a88eab1932aba96" from runtime service failed: rpc error: code = Unknown desc = failed to create and o
pen log file: The system cannot find the path specified.
E0115 15:25:17.530934    4516 container_log_manager.go:243] Container "cc99130d8956da668e5b70ca5913389cf68ed82b9a539d5b7a88eab1932aba96" log "\\var\\log\\pods\\netpol-2992-z_a_d54facd6-41e5-4655-8342-51fcfb538634\\cont-81-t
cp\\0.log" doesn't exist, reopen container log failed: rpc error: code = Unknown desc = failed to create and open log file: The system cannot find the path specified.
E0115 15:25:27.553521    4516 remote_runtime.go:563] ReopenContainerLog "cc99130d8956da668e5b70ca5913389cf68ed82b9a539d5b7a88eab1932aba96" from runtime service failed: rpc error: code = Unknown desc = failed to create and o
pen log file: The system cannot find the path specified.
E0115 15:25:27.553521    4516 container_log_manager.go:243] Container "cc99130d8956da668e5b70ca5913389cf68ed82b9a539d5b7a88eab1932aba96" log "\\var\\log\\pods\\netpol-2992-z_a_d54facd6-41e5-4655-8342-51fcfb538634\\cont-81-t
cp\\0.log" doesn't exist, reopen container log failed: rpc error: code = Unknown desc = failed to create and open log file: The system cannot find the path specified.

This results in pods not getting cleaned up properly, and the number of leaked processes goes through the roof!

E0115 15:25:37.576247    4516 remote_runtime.go:563] ReopenContainerLog "cc99130d8956da668e5b70ca5913389cf68ed82b9a539d5b7a88eab1932aba96" from runtime service failed: rpc error: code = Unknown desc = failed to create and o
pen log file: The system cannot find the path specified.

Looking at the processes on the Windows host, we still see agnhost containers floating around long after all pods are deleted.


PS C:\Program Files\containerd> .\ctr.exe -n "k8s.io" c ls
CONTAINER                                                           IMAGE                                                                                                             RUNTIME
0492cc71a808c24eb791ed24c88326023809f657e23ea7d3639e8f74409bd2fe    mcr.microsoft.com/oss/kubernetes/pause@sha256:0343aed381a7b6892e85ec85dc0f8862b864e75584f08f215795a4696e3adae8    io.containerd.runhcs.v1
1d64098a2c06b2563dc6a7cf6a9ec627d6e15aa639e6971f3f5a29824d35dfcc    mcr.microsoft.com/windows/servercore:1809                                                                         io.containerd.runhcs.v1
24f81ca7dea57b3a5051fdc03c45c25f6d5d2b84e3daaf073d2ce8a524884cf5    k8s.gcr.io/e2e-test-images/agnhost:2.21                                                                           io.containerd.runhcs.v1
4d9f0374d8665bf665aed710d69be8ae00ec6516cec0bb5d2c4db659b850a573    mcr.microsoft.com/oss/kubernetes/pause@sha256:0343aed381a7b6892e85ec85dc0f8862b864e75584f08f215795a4696e3adae8    io.containerd.runhcs.v1
649edf8ab2238d2d9ac388929d197e4a8ddaa30f0295694bf0336283b56a40d9    mcr.microsoft.com/oss/kubernetes/pause@sha256:0343aed381a7b6892e85ec85dc0f8862b864e75584f08f215795a4696e3adae8    io.containerd.runhcs.v1
932ecfa4f796e84a7fd37e2dea50ea0377fa658045fc3f092a8127542da017f3    mcr.microsoft.com/windows/servercore/iis:latest                                                                   io.containerd.runhcs.v1
cc99130d8956da668e5b70ca5913389cf68ed82b9a539d5b7a88eab1932aba96    k8s.gcr.io/e2e-test-images/agnhost:2.21                                                                           io.containerd.runhcs.v1
f6cd258c8c9916d455edbfe7cc1444e1a47edc5eee9e034358017fe2eda63cae    mcr.microsoft.com/oss/kubernetes/pause@sha256:0343aed381a7b6892e85ec85dc0f8862b864e75584f08f215795a4696e3adae8    io.containerd.runhcs.v1
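
To tie the leaked containers back to the log-rotation errors above, a quick check on the Windows host is to confirm that containerd still knows about the container while the log path the kubelet keeps trying to reopen is gone. This is a minimal PowerShell sketch, assuming containerd lives under C:\Program Files\containerd and that the \var\log\pods path in the kubelet errors maps to C:\var\log\pods on the node; the container ID and log path are copied from the errors above.

$containerId = "cc99130d8956da668e5b70ca5913389cf68ed82b9a539d5b7a88eab1932aba96"
$logPath     = "C:\var\log\pods\netpol-2992-z_a_d54facd6-41e5-4655-8342-51fcfb538634\cont-81-tcp\0.log"

# Is the container still registered in containerd?
& "C:\Program Files\containerd\ctr.exe" -n "k8s.io" c ls | Select-String $containerId

# Does the log file (or even its parent directory) still exist?
Test-Path $logPath               # expected: False while this bug is occurring
Test-Path (Split-Path $logPath)  # False here means the whole container log directory is gone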

What you expected to happen:

Running containers managed by the kubelet via containerd should go away cleanly on pod deletion.

How to reproduce it (as minimally and precisely as possible):

Run the tests defined in #98077.

Anything else we need to know?:

On these clusters, the final manifestation of this bug is that, when running these tests, exec commands sometimes hang:

I0115 13:51:54.599746   60646 round_trippers.go:425] curl -k -v -XPOST  -H "User-Agent: e2e.test.win.12/v1.21.0 (linux/amd64) kubernetes/e745983 -- [sig-network] Netpol [LinuxOnly] NetworkPolicy between server a
nd client should deny ingress access to updated pod [Feature:NetworkPolicy]" -H "X-Stream-Protocol-Version: v4.channel.k8s.io" -H "X-Stream-Protocol-Version: v3.channel.k8s.io" -H "X-Stream-Protocol-Version: v2.
channel.k8s.io" -H "X-Stream-Protocol-Version: channel.k8s.io" 'https://12.10.10.205:6443/api/v1/namespaces/netpol-9074-y/pods/a/exec?command=%2Fagnhost&command=connect&command=s-netpol-9074-y-c.netpol-9074-y.sv
c.cluster.local%3A80&command=--timeout%3D1s&command=--protocol%3Dtcp&container=cont-81-tcp&container=cont-81-tcp&stderr=true&stdout=true'
Jan 15 13:51:54.599: INFO: probes 30 out of 81
I0115 13:51:54.639204   60646 round_trippers.go:445] POST https://12.10.10.205:6443/api/v1/namespaces/netpol-9074-y/pods/a/exec?command=%2Fagnhost&command=connect&command=s-netpol-9074-y-b.netpol-9074-y.svc.clus
ter.local%3A80&command=--timeout%3D1s&command=--protocol%3Dtcp&container=cont-81-tcp&container=cont-81-tcp&stderr=true&stdout=true 101 Switching Protocols in 41 milliseconds
I0115 13:51:54.639250   60646 round_trippers.go:451] Response Headers:
I0115 13:51:54.639270   60646 round_trippers.go:454]     Upgrade: SPDY/3.1
I0115 13:51:54.639278   60646 round_trippers.go:454]     X-Stream-Protocol-Version: v4.channel.k8s.io
I0115 13:51:54.639285   60646 round_trippers.go:454]     Date: Sat, 16 Jan 2021 05:51:53 GMT
I0115 13:51:54.639293   60646 round_trippers.go:454]     Connection: Upgrade
I0115 13:51:54.643250   60646 round_trippers.go:445] POST https://12.10.10.205:6443/api/v1/namespaces/netpol-9074-y/pods/a/exec?command=%2Fagnhost&command=connect&command=s-netpol-9074-y-a.netpol-9074-y.svc.clus
ter.local%3A80&command=--timeout%3D1s&command=--protocol%3Dtcp&container=cont-81-tcp&container=cont-81-tcp&stderr=true&stdout=true 101 Switching Protocols in 48 milliseconds
I0115 13:51:54.643295   60646 round_trippers.go:451] Response Headers:
I0115 13:51:54.643303   60646 round_trippers.go:454]     Date: Sat, 16 Jan 2021 05:51:53 GMT
I0115 13:51:54.643307   60646 round_trippers.go:454]     Connection: Upgrade
I0115 13:51:54.643311   60646 round_trippers.go:454]     Upgrade: SPDY/3.1
I0115 13:51:54.643326   60646 round_trippers.go:454]     X-Stream-Protocol-Version: v4.channel.k8s.io
I0115 13:51:54.644129   60646 round_trippers.go:445] POST https://12.10.10.205:6443/api/v1/namespaces/netpol-9074-y/pods/a/exec?command=%2Fagnhost&command=connect&command=s-netpol-9074-y-c.netpol-9074-y.svc.clus
ter.local%3A80&command=--timeout%3D1s&command=--protocol%3Dtcp&container=cont-81-tcp&container=cont-81-tcp&stderr=true&stdout=true 101 Switching Protocols in 44 milliseconds
I0115 13:51:54.644155   60646 round_trippers.go:451] Response Headers:
I0115 13:51:54.644165   60646 round_trippers.go:454]     X-Stream-Protocol-Version: v4.channel.k8s.io
I0115 13:51:54.644172   60646 round_trippers.go:454]     Date: Sat, 16 Jan 2021 05:51:53 GMT
I0115 13:51:54.644181   60646 round_trippers.go:454]     Connection: Upgrade
I0115 13:51:54.644191   60646 round_trippers.go:454]     Upgrade: SPDY/3.1

Environment:

  • Kubernetes version (use kubectl version): 1.19
  • Cloud provider or hardware configuration: vsphere
  • OS (e.g: cat /etc/os-release): windows 2019
  • Network plugin and version (if this is a network-related bug): antrea
@jayunit100 jayunit100 added the kind/bug Categorizes issue or PR as related to a bug. label Jan 15, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 15, 2021
@jayunit100 jayunit100 changed the title Windows Containerd: Kubelet instability when running concurrent exec processes Windows Containerd: Kubelet processes not cleaned up due to container_log_manager cleanup issues, results in hung exec processes and memory explosion over time Jan 15, 2021
@jayunit100
Member Author

/sig windows
/sig node

@k8s-ci-robot k8s-ci-robot added sig/windows Categorizes an issue or PR as relevant to SIG Windows. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 15, 2021
@jayunit100
Member Author

@knabben can you make it so the netpol tests give us fine-grained info on startup? That will make them a better diagnostic for this.

@jayunit100
Member Author

In this case we're running kube-proxy in userspace mode, and in certain cases it fails to close the service port portal. That may be part of what causes this mess to accumulate.

@perithompson

I've uploaded the kubelet logs here.

@jayunit100
Member Author

Note to self: .\ctr.exe -n "k8s.io" c ls | sls "agn" | measure is how we're grepping this out of the containerd container list :)

@perithompson

Container cc99130d8956da668e5b70ca5913389cf68ed82b9a539d5b7a88eab1932aba96 is unable to create 0.log, and then the folder cannot be deleted.

@perithompson

Note to self: .\ctr.exe -n "k8s.io" c ls | sls "agn" | measure is how we're grepping this out of the containerd container list :)
cd "c:\program files\containerd" first, @jayunit100

@jayunit100
Member Author

Full reproducer:

wget https://storage.googleapis.com/jayunit100/content/e2e.test.win.12 <-- e2e test binary that starts Windows NetworkPolicy pods


./e2e.test.win.12 --provider=local --kubeconfig=/home/kubo/.kube/config --ginkgo.focus="Netpol.*" --ginkgo.skip="udp|Slow|SCTP" --node-os-distro="windows" --dump-logs-on-failure=false
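
While the reproducer runs, a small watch loop on the Windows node makes the leak easy to see: the agnhost container count should drop back to zero after the test pods are deleted, but with this bug it keeps climbing. A rough sketch building on the counting one-liner above, assuming the default containerd install path; the output file name is arbitrary.

while ($true) {
    # Count containers whose listing mentions the agnhost image
    $count = (& "C:\Program Files\containerd\ctr.exe" -n "k8s.io" c ls | Select-String "agnhost" | Measure-Object).Count
    "$(Get-Date -Format o)  agnhost containers: $count" | Add-Content leaked-agnhost.log
    Start-Sleep -Seconds 60
}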

@jayunit100
Member Author

https://www.youtube.com/watch?v=OnRBkhutGP8&t=37s <-- recording is here

@jayunit100
Member Author

After a while, you can eventually end up with tons of stray agnhost containers...

NAMESPACE     NAME                                            READY   STATUS        RESTARTS   AGE
default       iis-5bb9f7db4-8pjjx                             0/1     Pending       0          38h
default       iis-5bb9f7db4-nb6gf                             1/1     Terminating   2          2d6h
default       windows-server-iis-974757d7f-l7gfs              0/1     Pending       0          38h
default       windows-server-iis-974757d7f-ss6tg              1/1     Terminating   2          2d2h
kube-system   antrea-agent-8cchl                              2/2     Running       0          3d1h
kube-system   antrea-controller-84c7944b8d-2j7wz              1/1     Running       0          3d1h
kube-system   coredns-f9fd979d6-lrh9l                         1/1     Running       0          3d1h
kube-system   coredns-f9fd979d6-zbfnj                         1/1     Running       0          3d1h
kube-system   etcd-clusterapi-peri-2fbts                      1/1     Running       8          3d1h
kube-system   kube-apiserver-clusterapi-peri-2fbts            1/1     Running       10         3d1h
kube-system   kube-controller-manager-clusterapi-peri-2fbts   1/1     Running       51         3d1h
kube-system   kube-proxy-r7jb5                                1/1     Running       0          3d1h
kube-system   kube-scheduler-clusterapi-peri-2fbts            1/1     Running       53         3d1h
kube-system   kube-vip-clusterapi-peri-2fbts                  1/1     Running       55         3d1h
kube-system   vsphere-cloud-controller-manager-n2c62          1/1     Running       51         3d1h
kube-system   vsphere-csi-controller-78bbf545d6-gkskp         5/5     Running       129        3d1h

Only a few pods remain, but there are tons of agnhost containers floating around in containerd:

PS C:\Program Files\containerd> .\ctr.exe -n "k8s.io" c ls  | sls "agn"

0144e24d28195769a22f7038930fe070ab61a017808e846e67ab0f386ca3df6d    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
06e1f4d7d03dc24248cef3baa6df4227a182e65a0cf3a178e8318e9a392a02eb    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
082f4149ee2db96bc726f37210526324b21dec590065b87b82175673fafeef7d    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
098972074a1388c3fe07cf00b410f7496651c655314e275418da389e2707b88d    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
0f3d99d6f802d8740399fa7590464ee8995b2182427b195b36c294e0a524bfb2    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
12656300637b1145466610fbac2b90888770010ac0db3b548ec05193e234f915    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
136f467f1183b9e2c68171fca8c6e8efb713f79ea2466fdd9ff77e3f2249a000    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
1548aa1f2a6615bda20c291aa90a83deb8eda0cec125cca7fb35bdde92dc34da    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
2191dbc663bc1c88833c9f584e595c62bbfd5c4ff9f650d74440596f423a4a61    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
24f81ca7dea57b3a5051fdc03c45c25f6d5d2b84e3daaf073d2ce8a524884cf5    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
2661c54c1b1340c92d98b1051e509aff06555bc4b1e926606db7d60ff3ad0708    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
2b4da7a985e7da911c9177ddccdfd0d1caffef15b542d69a4a3300a6ee5963c6    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
2b6f3f5395f627c81ce99cd11437f0fc2aedf073586dc06fccf3486099c9a93d    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
2f4426b0f7cf40e96bd4334650192e8edd6d9d1e137d1e8c0e749ad633e29be3    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
3327741015923fe50b5c38c34181f742c100cc265080ebf22330971296629f80    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
34e410325c6c99733c3c07da1193f8461e5801d3cfc4de8287d1cdd6ca6ad7ae    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
369721daa1d51884286dd46e6aa459860cf860a9a4457ce340d334246cb2a059    k8s.gcr.io/e2e-test-images/agnhost:2.21
   io.containerd.runhcs.v1
4556c9bf9ac6a6a249b1890ae877f3e8c7f3804358f42927942b20dfc6d23f7b    k8s.gcr.io/e2e-test-images/agnhost:2.21

@jayunit100
Member Author

This might be better off as a containerd issue; not sure yet.

@jayunit100
Member Author

I think I see the issue... there's a TODO saying:

// TODO: Separate log and container lifecycle management.

i.e., log and container lifecycle management are not yet separated :)

@jsturtevant
Contributor

We are seeing this error in the e2e tests as well:

E0125 14:52:20.590073 4520 container_log_manager.go:243] Container "908f039e1dd5fa3246551214bdb2192a04e4631ae5bec53ae0faa1d009d6ff27" log "\\var\\log\\pods\\deployment-6697_test-cleanup-deployment-874dc686f-8jx4g_48c69df9-a146-4b0c-a584-0df7044c775e\\agnhost\\0.log" doesn't exist, reopen container log failed: rpc error: code = Unknown desc = failed to create and open log file: The system cannot find the path specified.

https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-windows-containerd/1353706355919163392/artifacts/2273k8s000/

@jayunit100
Member Author

Ahhhhhhh! Thank you for confirming this :)

@jsturtevant
Contributor

Still seeing this error in the latest test runs. I also saw memory slowly tick up with these errors in the logs when scaling simple IIS images.
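
One way to quantify that memory creep is to sample the working set of the node agents over time while the IIS deployment is scaled up and down. A rough PowerShell sketch; the process names and the CSV file name are assumptions and may need adjusting for the node's actual setup.

while ($true) {
    Get-Process -Name kubelet, containerd, containerd-shim-runhcs-v1 -ErrorAction SilentlyContinue |
        Select-Object @{n='Time';e={Get-Date -Format o}}, Name, Id,
                      @{n='WorkingSetMB';e={[math]::Round($_.WorkingSet64 / 1MB, 1)}} |
        Export-Csv node-memory.csv -Append -NoTypeInformation
    Start-Sleep -Seconds 60
}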

/triage accepted
/assign

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 17, 2021
@jsturtevant
Contributor

I was able to reproduce this:

./hack/ginkgo-e2e.sh --node-os-distro=windows --ginkgo.focus="(\[Conformance\])" --ginkgo.skip="\[LinuxOnly\]|\[Serial\]|Guestbook.application.should.create.and.stop.a.working.application|device.plugin.for.Windows" 
k get nodes -o wide  
NAME                                 STATUS   ROLES    AGE     VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION     CONTAINER-RUNTIME
4745k8s01000000                      Ready    agent    4h39m   v1.21.0-alpha.3   10.240.0.65    <none>        Windows Server 2019 Datacenter   10.0.17763.1637    containerd://1.4.3
k8s-linuxpool1-47451885-vmss000000   Ready    agent    4h41m   v1.21.0-alpha.3   10.240.0.34    <none>        Ubuntu 18.04.5 LTS               5.4.0-1040-azure   containerd://1.3.9+azure
k8s-master-47451885-0                Ready    master   4h41m   v1.21.0-alpha.3   10.255.255.5   <none>        Ubuntu 18.04.5 LTS               5.4.0-1040-azure   containerd://1.3.9+azure

I can see there are now no Windows pods:

k get pods -A -o wide               
NAMESPACE     NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE                                 NOMINATED NODE   READINESS GATES
kube-system   azure-cni-networkmonitor-bcxcl                  1/1     Running   1          4h45m   10.255.255.5   k8s-master-47451885-0                <none>           <none>
kube-system   azure-cni-networkmonitor-wxtpc                  1/1     Running   1          4h45m   10.240.0.34    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
kube-system   azure-ip-masq-agent-f57f8                       1/1     Running   1          4h45m   10.255.255.5   k8s-master-47451885-0                <none>           <none>
kube-system   azure-ip-masq-agent-h6fgn                       1/1     Running   1          4h45m   10.240.0.34    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
kube-system   blobfuse-flexvol-installer-rbhbt                1/1     Running   1          4h45m   10.240.0.46    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
kube-system   coredns-6bbfb77cb5-hwd4b                        1/1     Running   1          4h45m   10.240.0.38    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
kube-system   coredns-autoscaler-87b67c5fd-2r2f4              1/1     Running   1          4h45m   10.240.0.9     k8s-master-47451885-0                <none>           <none>
kube-system   csi-secrets-store-provider-azure-dv7qh          1/1     Running   1          4h45m   10.240.0.34    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
kube-system   csi-secrets-store-z85ps                         3/3     Running   3          4h45m   10.240.0.34    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
kube-system   kube-addon-manager-k8s-master-47451885-0        1/1     Running   1          4h45m   10.255.255.5   k8s-master-47451885-0                <none>           <none>
kube-system   kube-apiserver-k8s-master-47451885-0            1/1     Running   1          4h44m   10.255.255.5   k8s-master-47451885-0                <none>           <none>
kube-system   kube-controller-manager-k8s-master-47451885-0   1/1     Running   1          4h45m   10.255.255.5   k8s-master-47451885-0                <none>           <none>
kube-system   kube-proxy-4pm5c                                1/1     Running   1          4h45m   10.255.255.5   k8s-master-47451885-0                <none>           <none>
kube-system   kube-proxy-7q9jj                                1/1     Running   1          4h45m   10.240.0.34    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
kube-system   kube-scheduler-k8s-master-47451885-0            1/1     Running   1          4h44m   10.255.255.5   k8s-master-47451885-0                <none>           <none>
kube-system   metrics-server-6c8cc7585b-b48dl                 1/1     Running   1          4h45m   10.240.0.49    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
monitoring    alertmanager-main-0                             2/2     Running   0          3h13m   10.240.0.44    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
monitoring    alertmanager-main-1                             2/2     Running   0          3h13m   10.240.0.43    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
monitoring    alertmanager-main-2                             2/2     Running   0          3h13m   10.240.0.52    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
monitoring    blackbox-exporter-556d889b47-jh2zw              3/3     Running   0          3h13m   10.240.0.36    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
monitoring    grafana-577b6c985f-vm9km                        1/1     Running   0          3h12m   10.240.0.39    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
monitoring    kube-state-metrics-544cf87cfd-fxd4t             3/3     Running   0          3h12m   10.240.0.37    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
monitoring    node-exporter-2hdg5                             2/2     Running   0          3h12m   10.255.255.5   k8s-master-47451885-0                <none>           <none>
monitoring    node-exporter-zstbt                             2/2     Running   0          3h12m   10.240.0.34    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
monitoring    prometheus-adapter-767f58977c-lf4fb             1/1     Running   0          3h12m   10.240.0.42    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
monitoring    prometheus-k8s-0                                2/2     Running   1          3h12m   10.240.0.47    k8s-linuxpool1-47451885-vmss000000   <none>           <none>
monitoring    prometheus-operator-59976dc7d5-6qksq            2/2     Running   0          3h14m   10.240.0.63    k8s-linuxpool1-47451885-vmss000000   <none>           <none>

But on the Windows node there is a leftover container still running:

hcsdiag list
de3b7a88cfbc924d87ec9b934993ae238991698f4965a515bb0489920ca1adcd
    Windows Server Container,   Running,                                     , containerd-shim-runhcs-v1.exe

Containerd thinks there are two:

ctr -n k8s.io c ls
CONTAINER                                                           IMAGE                                                                                                             RUNTIME
04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79    k8sprow.azurecr.io/kubernetes-e2e-test-images/agnhost:2.28                                                        io.containerd.runhcs.v1
de3b7a88cfbc924d87ec9b934993ae238991698f4965a515bb0489920ca1adcd    mcr.microsoft.com/oss/kubernetes/pause@sha256:19d1f185e7569fabf91c70fcc4462a506c8c9409ed3f554072346ac4a0ae1542    io.containerd.runhcs.v1

The container is killed at 20:09:47.784715; then the log error starts showing up roughly four minutes later at 20:14:03.256813 and repeats regularly:

sls -Path .\kubelet.err* -Pattern "04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79" | more                                                                                                                                                                
                                                                                                                                                                                                                                                                                    
kubelet.err.log:13683:I0226 20:09:06.799633    4868 kubelet.go:1960] SyncLoop (PLEG): "pod-handle-http-request_container-lifecycle-hook-8655(a63cfea0-8bca-479f-ad37-e486507e4751)", event: &pleg.PodLifecycleEvent{ID:"a63cfea0-8bca-479f-ad37-e486507e4751",
Type:"ContainerStarted", Data:"04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79"}
kubelet.err.log:13958:I0226 20:09:47.784715    4868 kuberuntime_container.go:644] Killing container "containerd://04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79" with a 2 second grace period
kubelet.err.log:13964:I0226 20:09:48.140742    4868 kuberuntime_container.go:644] Killing container "containerd://04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79" with a 2 second grace period
kubelet.err.log:15197:E0226 20:14:03.256813    4868 remote_runtime.go:574] ReopenContainerLog "04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79" from runtime service failed: rpc error: code = Unknown desc = failed to create and open log file: The system     
cannot find the path specified.
kubelet.err.log:15198:E0226 20:14:03.256813    4868 container_log_manager.go:243] Container "04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79" log
"\\var\\log\\pods\\container-lifecycle-hook-8655_pod-handle-http-request_a63cfea0-8bca-479f-ad37-e486507e4751\\agnhost-container\\0.log" doesn't exist, reopen container log failed: rpc error: code = Unknown desc = failed to create and open log file: The system cannot find  
the path specified.

About 10 minutes later another stop-container attempt is issued:

kubelet.err.log:18058:E0226 20:24:49.842055    4868 remote_runtime.go:274] StopContainer "04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
kubelet.err.log:18059:E0226 20:24:49.842055    4868 kuberuntime_container.go:648] Container "containerd://04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79" termination failed with gracePeriod 2: rpc error: code = DeadlineExceeded desc = context deadline       
exceeded
kubelet.err.log:18060:E0226 20:24:49.842055    4868 kuberuntime_container.go:670] killContainer "agnhost-container"(id={"containerd" "04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79"}) for pod
"pod-handle-http-request_container-lifecycle-hook-8655(a63cfea0-8bca-479f-ad37-e486507e4751)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
kubelet.err.log:18061:E0226 20:24:50.197802    4868 remote_runtime.go:274] StopContainer "04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
kubelet.err.log:18062:E0226 20:24:50.197802    4868 kuberuntime_container.go:648] Container "containerd://04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79" termination failed with gracePeriod 2: rpc error: code = DeadlineExceeded desc = context deadline       
exceeded
kubelet.err.log:18063:E0226 20:24:50.197802    4868 kuberuntime_container.go:670] killContainer "agnhost-container"(id={"containerd" "04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79"}) for pod

It continues trying to rotate the logs, and then 30 minutes later there is this error: killContainer "agnhost-container"(id={"containerd" "04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79"}) for pod "<nil>" failed:

kubelet.err.log:18433:E0226 20:54:49.307646    4868 container_log_manager.go:243] Container "04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79" log
"\\var\\log\\pods\\container-lifecycle-hook-8655_pod-handle-http-request_a63cfea0-8bca-479f-ad37-e486507e4751\\agnhost-container\\0.log" doesn't exist, reopen container log failed: rpc error: code = Unknown desc = failed to create and open log file: The system cannot find    
the path specified.
kubelet.err.log:18434:E0226 20:54:54.274324    4868 remote_runtime.go:274] StopContainer "04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
kubelet.err.log:18435:E0226 20:54:54.274692    4868 kuberuntime_container.go:648] Container "containerd://04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79" termination failed with gracePeriod 2: rpc error: code = DeadlineExceeded desc = context deadline       
exceeded
kubelet.err.log:18436:E0226 20:54:54.274692    4868 kuberuntime_container.go:670] killContainer "agnhost-container"(id={"containerd" "04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79"}) for pod "<nil>" failed: rpc error: code = DeadlineExceeded desc =
context deadline exceeded

I initially thought the context deadline errors might be due to CPU, which spiked to 100% during the test run, but I am still seeing issues now that the test has finished and CPU is back to normal:

E0226 22:40:06.666665    4868 remote_runtime.go:143] StopPodSandbox "de3b7a88cfbc924d87ec9b934993ae238991698f4965a515bb0489920ca1adcd" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded                                             
E0226 22:40:06.666665    4868 kuberuntime_manager.go:923] Failed to stop sandbox {"containerd" "de3b7a88cfbc924d87ec9b934993ae238991698f4965a515bb0489920ca1adcd"}
E0226 22:40:06.666665    4868 kubelet_pods.go:1283] Failed killing the pod "pod-handle-http-request": [failed to "KillContainer" for "agnhost-container" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" 
for "a63cfea0-8bca-479f-ad37-e486507e4751" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]                                                                                                                                       
 I0226 22:40:08.656181    4868 kubelet_pods.go:1280] Killing unwanted pod "pod-handle-http-request"
I0226 22:40:08.657171    4868 kuberuntime_container.go:644] Killing container "containerd://04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79" with a 2 second grace period

(Screenshot attached: 2021-02-26 14-52-00)

Logs at 22:40:

kubelet.err.log:19723:E0226 22:40:06.666665    4868 remote_runtime.go:143] StopPodSandbox "de3b7a88cfbc924d87ec9b934993ae238991698f4965a515bb0489920ca1adcd" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
kubelet.err.log:19724:E0226 22:40:06.666665    4868 kuberuntime_manager.go:923] Failed to stop sandbox {"containerd" "de3b7a88cfbc924d87ec9b934993ae238991698f4965a515bb0489920ca1adcd"}
kubelet.err.log:19725:E0226 22:40:06.666665    4868 kubelet_pods.go:1283] Failed killing the pod "pod-handle-http-request": [failed to "KillContainer" for "agnhost-container" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded",  
failed to "KillPodSandbox" for "a63cfea0-8bca-479f-ad37-e486507e4751" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
kubelet.err.log:19726:I0226 22:40:08.656181    4868 kubelet_pods.go:1280] Killing unwanted pod "pod-handle-http-request"
kubelet.err.log:19727:I0226 22:40:08.657171    4868 kuberuntime_container.go:644] Killing container "containerd://04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79" with a 2 second grace period
kubelet.err.log:19728:E0226 22:40:16.065358    4868 remote_runtime.go:574] ReopenContainerLog "04598199c8c667e077857e69de9df263ff3eaaf4e82b027834322dc1b3427b79" from runtime service failed: rpc error: code = Unknown desc = failed to create and open log file: The system       
cannot find the path specified.

@jsturtevant
Contributor

Interestingly, there don't seem to be any containerd tasks running, but HCS and containerd both think the containers are running:

hcsdiag list
de3b7a88cfbc924d87ec9b934993ae238991698f4965a515bb0489920ca1adcd
    Windows Server Container,   Running,                                     , containerd-shim-runhcs-v1.exe
ctr t ls
TASK    PID    STATUS
.\crictl.exe --runtime-endpoint "npipe:////./pipe/containerd-containerd" ps --all
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
04598199c8c66 1fb09998ede64 4 hours ago Running agnhost-container 0 de3b7a88cfbc9
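
A quick way to spot this mismatch on other nodes is to compare the three views side by side: what HCS reports, what containerd has as running tasks, and what the CRI layer still believes is running. A minimal sketch, assuming ctr and crictl are on PATH and containerd is listening on its default named pipe.

# Containers HCS knows about
hcsdiag list

# Tasks containerd thinks are running (normally this should line up with HCS)
ctr -n k8s.io t ls

# Containers the CRI layer still reports, including ones that should be gone
crictl --runtime-endpoint "npipe:////./pipe/containerd-containerd" ps --all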

@jsturtevant
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 1, 2021
@jsturtevant
Contributor

jsturtevant commented Oct 1, 2021

In the case on CAPZ, Windows Defender was consuming high CPU (90%+); adding an exclusion for the offending process fixed the issue. Now that I have a better understanding of the failure, and confirmation that high CPU does indeed cause this issue, I will see if I can track down the bug, which I believe is related to handling containers on the containerd/HCS side.

It seems that if a container starts the removal process, and containerd thinks it has deleted the task, but the container is in an unknown state in HCS, it doesn't know how to properly recover. It may be that the container needs to be properly cleaned up in HCS when it is in this state. I found that hcsdiag kill hangs when the container is in the "unknown" state.
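
When poking at a container stuck like this, it helps to wrap the kill attempt in a timeout so an interactive shell doesn't hang along with it. A small sketch, reusing the sandbox ID from the listings above; the 30-second timeout is an arbitrary choice.

$id  = "de3b7a88cfbc924d87ec9b934993ae238991698f4965a515bb0489920ca1adcd"
$job = Start-Job { param($c) hcsdiag kill $c } -ArgumentList $id
if (-not (Wait-Job $job -Timeout 30)) {
    Write-Warning "hcsdiag kill $id did not return within 30s -- container is likely wedged in HCS"
    Stop-Job $job
}
Remove-Job $job -Force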

@marosset
Contributor

@dcantah do we think this issue has been fixed by any of the recent hcsshim/containerd fixes?

@dcantah
Member

dcantah commented Dec 20, 2021

Mmm, I'd have to do an audit of the recent changes, but not that I know of. Has this been seen lately?

These two changes seem up this issue's alley, though:
microsoft/hcsshim#1249
microsoft/hcsshim#1234

@marosset
Contributor

I don't know if this has been seen recently or not.
I came back to it while cleaning up some old issues.

@dcantah
Member

dcantah commented Dec 22, 2021

Gotcha, but other than those two issues (which aren't checked in) I'm not sure of anything that would've remedied this.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 22, 2022
@dzmitry-lahoda

I got this error message; not sure if the root cause is the same. This is on AKS 1.23.3 allocating Windows pods:


failed to try resolving symlinks in path "\\var\\log\\pods\\default_wwwwwwwww_b6bd1f90-7f9b-4bfc-be31-3a75384f0440\\configure-firewall\\3.log": CreateFile \var\log\pods\default_wwwwwwwww_b6bd1f90-7f9b-4bfc-be31-3a75384f0440\configure-firewall\3.log: The system cannot find the file specified.
unable to retrieve container logs for containerd://a4d05d8fee67ecd68757f7c3ff1235d2b7b635d62ad0f971a707bd3410625d32
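
A quick check for this symptom on the node is to look at the pod log directory the kubelet is complaining about and see whether anything in it is a dangling link. A minimal PowerShell sketch, assuming the \var\log\pods path from the error maps to C:\var\log\pods; the directory name is copied from the error above.

$dir = "C:\var\log\pods\default_wwwwwwwww_b6bd1f90-7f9b-4bfc-be31-3a75384f0440\configure-firewall"
Test-Path $dir   # does the container's log directory still exist at all?

# If it exists, list entries with their link information to spot broken symlinks
Get-ChildItem $dir -ErrorAction SilentlyContinue | Select-Object Name, Length, LinkType, Target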

@vaibhav2107
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 15, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 13, 2022
@vaibhav2107
Member

@jsturtevant Has this issue been fixed?

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
