./paictl.py cluster k8s-bootup -p /cluster-configuration randomly failed #1153

Status: Closed · opened by hao1939 on Aug 27, 2018 · 16 comments
Labels: deployment (PAI deployment related), known issue

@hao1939 (Contributor) commented Aug 27, 2018

k8s-bootup randomly fails because the kube-proxy DaemonSet cannot be created:

...
2018-08-27 04:10:31,663 [INFO] - k8sPaiLibrary.maintainlib.kubectl_install : Generate the configuation file of kubectl.
NAME STATUS ROLES AGE VERSION
10.0.1.5 Ready 21s v1.9.4
10.0.1.7 NotReady 4s v1.9.4
10.0.1.8 Ready 20s v1.9.4
2018-08-27 04:10:35,939 [INFO] - k8sPaiLibrary.maintainlib.kubectl_install : Successfully install kubectl and configure it!
2018-08-27 04:10:35,939 [INFO] - k8sPaiLibrary.maintainlib.deploy : Create kube-proxy daemon for kuberentes cluster.
error: unable to recognize "kube-proxy.yaml": no matches for kind "DaemonSet" in version "apps/v1"
2018-08-27 04:10:36,241 [ERROR] - k8sPaiLibrary.maintainlib.common : Failed to create kube-proxy
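One quick check when this happens is whether the apiserver is serving the apps group at all at that moment (the same kind of probe used later in this thread):

kubectl api-versions | grep apps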

@hao1939 (Contributor, Author) commented Aug 27, 2018

It's similar to #841; both are about failing to create kube-proxy.

@YitongFeng (Contributor) commented:

Also #917.

@fanyangCS (Contributor) commented:

Can you close all related issues and keep only one open, for ease of tracking?

@YitongFeng (Contributor) commented:

@fanyangCS OK, the others are closed, but they can't be deleted. The cross-references to the similar issues are kept for easy tracking.

@hao1939 (Contributor, Author) commented Aug 27, 2018

When Failed to create kube-proxy occurs, running ./paictl.py service delete ... immediately afterwards will hang:

2018-08-27 11:07:17,287 [INFO] - paiLibrary.paiService.service_management_delete : ----------------------------------------------------------------------
2018-08-27 11:07:17,288 [INFO] - paiLibrary.paiService.service_management_delete : Begin to generate service zookeeper's template file
2018-08-27 11:07:17,288 [INFO] - paiLibrary.paiService.service_template_generate : Begin to generate the template file in service zookeeper's configuration.
2018-08-27 11:07:17,288 [INFO] - paiLibrary.paiService.service_template_generate : Create template mapper for service zookeeper.
2018-08-27 11:07:17,288 [INFO] - paiLibrary.paiService.service_template_generate : Done. Template mapper for service zookeeper is created.
2018-08-27 11:07:17,289 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/zookeeper/node-label.sh.template.
2018-08-27 11:07:17,289 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/zookeeper/node-label.sh.
2018-08-27 11:07:17,296 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/zookeeper/zookeeper.yaml.template.
2018-08-27 11:07:17,296 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/zookeeper/zookeeper.yaml.
2018-08-27 11:07:17,299 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/zookeeper/stop.sh.template.
2018-08-27 11:07:17,299 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/zookeeper/stop.sh.
2018-08-27 11:07:17,302 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/zookeeper/refresh.sh.template.
2018-08-27 11:07:17,302 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/zookeeper/refresh.sh.
2018-08-27 11:07:17,306 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/zookeeper/delete.yaml.template.
2018-08-27 11:07:17,306 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/zookeeper/delete.yaml.
2018-08-27 11:07:17,309 [INFO] - paiLibrary.paiService.service_template_generate : The template file of service zookeeper is generated.
2018-08-27 11:07:17,309 [INFO] - paiLibrary.paiService.service_management_delete : Begin to delete service: [ zookeeper ]
2018-08-27 11:07:17,310 [INFO] - paiLibrary.paiService.service_delete : Begin to execute service zookeeper's delete script.
Call stop to stop all hadoop service first
No resources found.
label "zookeeper" not found.
node/10.0.1.8 not labeled
Create hadoop-delete configmap for deleting data on the host
configmap/zookeeper-delete created
Create cleaner daemon
daemonset.apps/delete-batch-job-zookeeper created
/usr/local/lib/python2.7/dist-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
warnings.warn(warning, RequestsDependencyWarning)
delete-batch-job-zookeeper is not ready yet. Please wait for a moment!
delete-batch-job-zookeeper is not ready yet. Please wait for a moment!
delete-batch-job-zookeeper is not ready yet. Please wait for a moment!
...

@mzmssg Could you provide more details?

@mzmssg (Member) commented Aug 27, 2018

Update in #1166.

@YitongFeng (Contributor) commented:

Regarding no matches for kind "DaemonSet" in version "apps/v1": it may be a Kubernetes API version problem.
Our Kubernetes version is 1.9. According to the k8s 1.9 API docs, DaemonSet in v1.9 supports "apps/v1", "apps/v1beta1", and "apps/v1beta2". The "beta" versions are pre-release; once a beta version has proven itself, the Kubernetes team promotes it to "apps/v1" and marks it stable.

Our single-node Kubernetes version was 1.9.4, which may not fully support the stable "apps/v1". #1126 changed the version to 1.9.9, and the error has not appeared since. Alternatively, you can use "apps/v1beta1" or "apps/v1beta2" in your DaemonSet YAML file, as in the fragment below.
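The relevant line is the apiVersion at the top of the manifest; a minimal fragment of what such a DaemonSet header could look like (a sketch, not the actual kube-proxy.yaml from the repo):

apiVersion: apps/v1beta2  # or apps/v1 / apps/v1beta1, depending on what the apiserver serves
kind: DaemonSet
metadata:
  name: kube-proxy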

@hao1939 (Contributor, Author) commented Aug 28, 2018

More details:

  • In the dev-box that raises error: unable to recognize "kube-proxy.yaml": no matches for kind "DaemonSet" in version "apps/v1beta2":

root@infra-03:/# kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:17:28Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.9", GitCommit:"57729ea3d9a1b75f3fc7bbbadc597ba707d47c8a", GitTreeState:"clean", BuildDate:"2018-06-29T01:07:01Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
root@infra-03:/# kubectl get daemonset
No resources found.
root@infra-03:/# kubectl api-resources |grep DaemonSet
daemonsets ds apps true DaemonSet
daemonsets ds extensions true DaemonSet

  • In the dev-box that works fine:

root@infra-03:/# kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:17:28Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.9", GitCommit:"57729ea3d9a1b75f3fc7bbbadc597ba707d47c8a", GitTreeState:"clean", BuildDate:"2018-06-29T01:07:01Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
root@infra-03:/# kubectl get daemonset
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
drivers-one-shot 3 3 3 3 3 machinetype=gpu 18m
hadoop-data-node-ds 2 2 0 2 0 hadoop-data-node=true 43s
hadoop-name-node-ds 1 1 1 1 1 hadoop-name-node=true 1m
zookeeper-ds 1 1 1 1 1 zookeeper=true 20m
root@infra-03:/# kubectl api-resources |grep DaemonSet
daemonsets ds apps true DaemonSet
daemonsets ds extensions true DaemonSet

@hao1939 (Contributor, Author) commented Aug 28, 2018

Hi @YitongFeng,

Can we check kubectl api-resources | grep DaemonSet before installing any DaemonSet?
I suspect a race condition: when we try to create the DaemonSet, the api-server is not fully ready. A sketch of such a pre-flight check is below.
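A minimal sketch of that check in Python (a hypothetical helper, not existing paictl code), polling the apiserver until it lists the DaemonSet resource:

import subprocess
import time

def wait_for_daemonset_api(retries=30, interval=5):
    # Poll `kubectl api-resources` until DaemonSet shows up, or give up.
    for _ in range(retries):
        try:
            output = subprocess.check_output(["kubectl", "api-resources"])
        except subprocess.CalledProcessError:
            output = b""  # kubectl returns nonzero while the apiserver is unreachable
        if b"DaemonSet" in output:
            return True
        time.sleep(interval)
    return False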

@YitongFeng (Contributor) commented:

It may be an etcd problem.
When the error occurs, the etcd log shows an endless stream of warnings:
[screenshot: etcd log full of repeated warnings]

When the deployment succeeds, the etcd log is clean:
[screenshot: clean etcd log]

When deploying a pod, the api-server registers the resource group and resource version as a REST API and communicates with etcd, so an etcd timeout may leave the deploy process unable to find the resource type.

A possible cause of the etcd timeout:
https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-apply-entries-took-too-long-mean
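One way to look for that warning directly (assuming etcd runs as a Docker container on the master node; the container lookup is an assumption, and the grep pattern comes from the FAQ entry above):

docker ps | grep etcd
docker logs <etcd-container-id> 2>&1 | grep "took too long"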

@sterowang commented:
Is this possibly caused by the VMs in the qualification bed being too small? One possible action is to increase the VM size in the qualification bed and see whether there is any improvement.

@YitongFeng (Contributor) commented:

@sterowang Slow disk I/O is the usual cause.
https://github.com/etcd-io/etcd/blob/master/Documentation/tuning.md#disk may help with tuning etcd; an example from that page is below. + @hao1939 to help test later.
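For instance, the tuning doc suggests raising etcd's disk I/O priority on the etcd host (command taken from that page):

sudo ionice -c2 -n0 -p `pgrep etcd`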

@hao1939 (Contributor, Author) commented Aug 29, 2018

Tried using a ramdisk for etcd; it works fine:
sudo mount -t tmpfs -o size=2048M tmpfs /var/etcd

@fanyangCS (Contributor) commented:

If this is a disk I/O issue, it could very likely happen in other beds as well. The short-term solution could be to add retries to every etcd operation (don't forget some random backoff between retries; a sketch follows). Long term, we could add an option to support deploying etcd to a disk other than the OS disk. I do not prefer a ramdisk, as it is not a realistic real-world deployment environment.
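A minimal sketch of such a retry wrapper (a hypothetical helper, not existing paictl code), using randomized exponential backoff:

import random
import time

def with_retry(op, attempts=5, base=1.0):
    # Retry a flaky operation, sleeping base * 2**i plus random jitter between attempts.
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base * (2 ** i) + random.uniform(0, 1))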

hao1939 added the deployment (PAI deployment related) label on Aug 29, 2018
@YitongFeng (Contributor) commented:

We may have found the root cause:

kube-proxy was created at 3:31:23:
[screenshot: log showing kube-proxy created at 3:31:23]

But the controller-manager's log shows the daemonset resource was registered at 3:31:24:
[screenshot: controller-manager log registering the daemonset resource at 3:31:24]

So the daemonset resource could not be found at creation time.

We will fix it in our next several PRs.

@hao1939 (Contributor, Author) commented Aug 30, 2018

We got some output from kubectl api-resources:
http://40.76.53.138:33080/blue/organizations/jenkins/pai/detail/hao%2Ftest_k8sbootup/4/pipeline/45

[screenshot: kubectl api-resources output without the DaemonSet resource]

When we try to create the DaemonSet, the DaemonSet resource is not ready yet.
We should wait for all resources to be ready.
