Skip to content
This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Install drivers-one-shot keeps retrying pull image, > x10 slower than manually docker pull #1226

Closed
hao1939 opened this issue Aug 30, 2018 · 4 comments · Fixed by #1227
Closed
Assignees

Comments

@hao1939
Copy link
Contributor

hao1939 commented Aug 30, 2018

When install drivers-one-shot, kuberenetes keeps retrying pull image.

It takes obviously long (>10x) than manually docker pull the_image.

Failed to pull image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": rpc error: code = Canceled desc = context canceled

image

@hao1939
Copy link
Contributor Author

hao1939 commented Aug 30, 2018

In the kubelet container log, found a lot of Cancel pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8" because of no progress for 1m0s.

...
I0830 13:24:23.011881     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "e1e52a706380: Extracting [==================>                                ]  169.9MB/453MB"
W0830 13:24:31.701612     605 conversion.go:110] Could not get instant cpu stats: different number of cpus
I0830 13:24:33.011856     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "03bdaf98f8ba: Extracting [=================================>                 ]  313.1MB/465.4MB"
I0830 13:24:43.011927     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "03bdaf98f8ba: Extracting [==================================================>]  465.4MB/465.4MB"
I0830 13:24:53.011861     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "03bdaf98f8ba: Extracting [==================================================>]  465.4MB/465.4MB"
I0830 13:25:03.011835     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "03bdaf98f8ba: Extracting [==================================================>]  465.4MB/465.4MB"
I0830 13:25:13.011831     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "03bdaf98f8ba: Extracting [==================================================>]  465.4MB/465.4MB"
I0830 13:25:23.011889     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "03bdaf98f8ba: Extracting [==================================================>]  465.4MB/465.4MB"
I0830 13:25:33.011816     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "03bdaf98f8ba: Extracting [==================================================>]  465.4MB/465.4MB"
E0830 13:25:43.011821     605 kube_docker_client.go:341] Cancel pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8" because of no progress for 1m0s, latest progress: "03bdaf98f8ba: Extracting [==================================================>]  465.4MB/465.4MB"
E0830 13:25:43.012200     605 remote_image.go:108] PullImage "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8" from image service failed: rpc error: code = Canceled desc = context canceled
E0830 13:25:43.012264     605 kuberuntime_image.go:50] Pull image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8" failed: rpc error: code = Canceled desc = context canceled
E0830 13:25:43.012369     605 kuberuntime_manager.go:734] container start failed: ErrImagePull: rpc error: code = Canceled desc = context canceled
E0830 13:25:43.012420     605 pod_workers.go:186] Error syncing pod f8397713-ac57-11e8-b93b-000d3a10c86f ("drivers-one-shot-k5m4f_default(f8397713-ac57-11e8-b93b-000d3a10c86f)"), skipping: failed to "StartContainer" for "nvidia-drivers" with ErrImagePull: "rpc error: code = Canceled desc = context canceled"
I0830 13:25:44.129408     605 kuberuntime_manager.go:514] Container {Name:nvidia-drivers Image:10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8 Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:driver-path ReadOnly:false MountPath:/var/drivers SubPath: MountPropagation:<nil>} {Name:device-path ReadOnly:false MountPath:/dev SubPath: MountPropagation:<nil>} {Name:modules-path ReadOnly:false MountPath:/lib/modules SubPath: MountPropagation:<nil>} {Name:drivers-log ReadOnly:false MountPath:/var/log SubPath: MountPropagation:<nil>} {Name:kernel-head ReadOnly:false MountPath:/usr/src SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:nil ReadinessProbe:&Probe{Handler:Handler{Exec:&ExecAction{Command:[cat /jobstatus/jobok],},HTTPGet:nil,TCPSocket:nil,},InitialDelaySeconds:5,TimeoutSeconds:1,PeriodSeconds:3,SuccessThreshold:1,FailureThreshold:3,} Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:Always SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[ALL],Drop:[],},Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
E0830 13:25:44.130991     605 pod_workers.go:186] Error syncing pod f8397713-ac57-11e8-b93b-000d3a10c86f ("drivers-one-shot-k5m4f_default(f8397713-ac57-11e8-b93b-000d3a10c86f)"), skipping: failed to "StartContainer" for "nvidia-drivers" with ImagePullBackOff: "Back-off pulling image \"10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8\""
I0830 13:25:56.135536     605 container_manager_linux.go:425] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
I0830 13:25:57.435102     605 kuberuntime_manager.go:514] Container {Name:nvidia-drivers Image:10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8 Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:driver-path ReadOnly:false MountPath:/var/drivers SubPath: MountPropagation:<nil>} {Name:device-path ReadOnly:false MountPath:/dev SubPath: MountPropagation:<nil>} {Name:modules-path ReadOnly:false MountPath:/lib/modules SubPath: MountPropagation:<nil>} {Name:drivers-log ReadOnly:false MountPath:/var/log SubPath: MountPropagation:<nil>} {Name:kernel-head ReadOnly:false MountPath:/usr/src SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:nil ReadinessProbe:&Probe{Handler:Handler{Exec:&ExecAction{Command:[cat /jobstatus/jobok],},HTTPGet:nil,TCPSocket:nil,},InitialDelaySeconds:5,TimeoutSeconds:1,PeriodSeconds:3,SuccessThreshold:1,FailureThreshold:3,} Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:Always SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[ALL],Drop:[],},Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
I0830 13:26:07.448853     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "e1e52a706380: Downloading [=>                                                 ]  14.23MB/453MB"
I0830 13:26:17.448924     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "e1e52a706380: Extracting [=======================================>           ]  353.7MB/453MB"
I0830 13:26:27.448857     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "03bdaf98f8ba: Extracting [=================================================> ]  462.4MB/465.4MB"
I0830 13:26:37.448875     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "0799695bf341: Extracting [==============================>                    ]  115.3MB/190MB"
I0830 13:26:47.448989     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "0799695bf341: Extracting [==================================================>]    190MB/190MB"
I0830 13:26:57.448925     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "0799695bf341: Extracting [==================================================>]    190MB/190MB"
I0830 13:27:07.448961     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "0799695bf341: Extracting [==================================================>]    190MB/190MB"
I0830 13:27:17.448917     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "0799695bf341: Extracting [==================================================>]    190MB/190MB"
I0830 13:27:27.448851     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "0799695bf341: Extracting [==================================================>]    190MB/190MB"
I0830 13:27:37.448830     605 kube_docker_client.go:345] Pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8": "0799695bf341: Extracting [==================================================>]    190MB/190MB"
E0830 13:27:47.448939     605 kube_docker_client.go:341] Cancel pulling image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8" because of no progress for 1m0s, latest progress: "0799695bf341: Extracting [==================================================>]    190MB/190MB"
E0830 13:27:47.449356     605 remote_image.go:108] PullImage "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8" from image service failed: rpc error: code = Canceled desc = context canceled
E0830 13:27:47.449422     605 kuberuntime_image.go:50] Pull image "10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8" failed: rpc error: code = Canceled desc = context canceled
...

@hao1939
Copy link
Contributor Author

hao1939 commented Aug 30, 2018

Here's a kubelet option --image-pull-progress-deadline, see https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/

--image-pull-progress-deadline duration

  If no pulling progress is made before this deadline, the image pulling will be cancelled. (default 1m0s)

@YanjieGao
Copy link
Contributor

Hao, the image use local path and is different in peacetime? Why use this image like this
10.0.1.5:5000/openpai/drivers:yife-check_resource-12782c6-8

@hao1939
Copy link
Contributor Author

hao1939 commented Aug 31, 2018

Hi @YanjieGao , it's a local registry, so it's very fast, that's why I could find the root cause.
The slow is not because of network, it's kubectl always quitting in the middle of pull image.

@hao1939 hao1939 reopened this Aug 31, 2018
@hao1939 hao1939 self-assigned this Sep 4, 2018
@hao1939 hao1939 closed this as completed Sep 4, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants