./paictl.py cluster k8s-bootup -p /cluster-configuration randomly failed #1153

Status: Closed · opened by hao1939 on Aug 27, 2018 · 16 comments
Labels: deployment (PAI deployment related), known issue

@hao1939 (Contributor) commented Aug 27, 2018

k8s-bootup randomly fails because the kube-proxy DaemonSet cannot be created:

...
2018-08-27 04:10:31,663 [INFO] - k8sPaiLibrary.maintainlib.kubectl_install : Generate the configuation file of kubectl.
NAME STATUS ROLES AGE VERSION
10.0.1.5 Ready 21s v1.9.4
10.0.1.7 NotReady 4s v1.9.4
10.0.1.8 Ready 20s v1.9.4
2018-08-27 04:10:35,939 [INFO] - k8sPaiLibrary.maintainlib.kubectl_install : Successfully install kubectl and configure it!
2018-08-27 04:10:35,939 [INFO] - k8sPaiLibrary.maintainlib.deploy : Create kube-proxy daemon for kuberentes cluster.
error: unable to recognize "kube-proxy.yaml": no matches for kind "DaemonSet" in version "apps/v1"
2018-08-27 04:10:36,241 [ERROR] - k8sPaiLibrary.maintainlib.common : Failed to create kube-proxy
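One quick check when this happens is whether the apiserver is serving the apps group at all at that moment (the same kind of probe used later in this thread):

kubectl api-versions | grep apps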

@hao1939 (Contributor, Author) commented Aug 27, 2018

It's similar to #841; both are about failing to create kube-proxy.

@YitongFeng (Contributor) commented:

Also #917.

@fanyangCS (Contributor) commented:

Can you close all related issues and keep only one open, for ease of tracking?

@YitongFeng (Contributor) commented:

@fanyangCS OK, the others are closed, but they can't be deleted. The cross-references to the similar issues are kept for easy tracking.

@hao1939 (Contributor, Author) commented Aug 27, 2018

When Failed to create kube-proxy occurs, running ./paictl.py service delete ... immediately afterwards will hang:

2018-08-27 11:07:17,287 [INFO] - paiLibrary.paiService.service_management_delete : ----------------------------------------------------------------------
2018-08-27 11:07:17,288 [INFO] - paiLibrary.paiService.service_management_delete : Begin to generate service zookeeper's template file
2018-08-27 11:07:17,288 [INFO] - paiLibrary.paiService.service_template_generate : Begin to generate the template file in service zookeeper's configuration.
2018-08-27 11:07:17,288 [INFO] - paiLibrary.paiService.service_template_generate : Create template mapper for service zookeeper.
2018-08-27 11:07:17,288 [INFO] - paiLibrary.paiService.service_template_generate : Done. Template mapper for service zookeeper is created.
2018-08-27 11:07:17,289 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/zookeeper/node-label.sh.template.
2018-08-27 11:07:17,289 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/zookeeper/node-label.sh.
2018-08-27 11:07:17,296 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/zookeeper/zookeeper.yaml.template.
2018-08-27 11:07:17,296 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/zookeeper/zookeeper.yaml.
2018-08-27 11:07:17,299 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/zookeeper/stop.sh.template.
2018-08-27 11:07:17,299 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/zookeeper/stop.sh.
2018-08-27 11:07:17,302 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/zookeeper/refresh.sh.template.
2018-08-27 11:07:17,302 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/zookeeper/refresh.sh.
2018-08-27 11:07:17,306 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/zookeeper/delete.yaml.template.
2018-08-27 11:07:17,306 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/zookeeper/delete.yaml.
2018-08-27 11:07:17,309 [INFO] - paiLibrary.paiService.service_template_generate : The template file of service zookeeper is generated.
2018-08-27 11:07:17,309 [INFO] - paiLibrary.paiService.service_management_delete : Begin to delete service: [ zookeeper ]
2018-08-27 11:07:17,310 [INFO] - paiLibrary.paiService.service_delete : Begin to execute service zookeeper's delete script.
Call stop to stop all hadoop service first
No resources found.
label "zookeeper" not found.
node/10.0.1.8 not labeled
Create hadoop-delete configmap for deleting data on the host
configmap/zookeeper-delete created
Create cleaner daemon
daemonset.apps/delete-batch-job-zookeeper created
/usr/local/lib/python2.7/dist-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
warnings.warn(warning, RequestsDependencyWarning)
delete-batch-job-zookeeper is not ready yet. Please wait for a moment!
delete-batch-job-zookeeper is not ready yet. Please wait for a moment!
delete-batch-job-zookeeper is not ready yet. Please wait for a moment!
...

@mzmssg Could you provide more details?

@mzmssg (Member) commented Aug 27, 2018

Update in #1166.

@YitongFeng (Contributor) commented:

Regarding no matches for kind "DaemonSet" in version "apps/v1": it may be a Kubernetes API version problem.
Our Kubernetes version is 1.9. According to the k8s 1.9 API docs, DaemonSet in v1.9 supports "apps/v1", "apps/v1beta1", and "apps/v1beta2". The "beta" versions are pre-release; once a beta version has proven itself, the Kubernetes team promotes it to "apps/v1" and marks it stable.

Our single-node Kubernetes version was 1.9.4, which may not fully support the stable "apps/v1". #1126 changed the version to 1.9.9, and the error has not appeared since. Alternatively, you can use "apps/v1beta1" or "apps/v1beta2" in your DaemonSet YAML file, as in the fragment below.
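The relevant line is the apiVersion at the top of the manifest; a minimal fragment of what such a DaemonSet header could look like (a sketch, not the actual kube-proxy.yaml from the repo):

apiVersion: apps/v1beta2  # or apps/v1 / apps/v1beta1, depending on what the apiserver serves
kind: DaemonSet
metadata:
  name: kube-proxy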

@hao1939 (Contributor, Author) commented Aug 28, 2018

More details:

  • In the dev-box that raises error: unable to recognize "kube-proxy.yaml": no matches for kind "DaemonSet" in version "apps/v1beta2":

root@infra-03:/# kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:17:28Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.9", GitCommit:"57729ea3d9a1b75f3fc7bbbadc597ba707d47c8a", GitTreeState:"clean", BuildDate:"2018-06-29T01:07:01Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
root@infra-03:/# kubectl get daemonset
No resources found.
root@infra-03:/# kubectl api-resources |grep DaemonSet
daemonsets ds apps true DaemonSet
daemonsets ds extensions true DaemonSet

  • In the dev-box that works fine:

root@infra-03:/# kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:17:28Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.9", GitCommit:"57729ea3d9a1b75f3fc7bbbadc597ba707d47c8a", GitTreeState:"clean", BuildDate:"2018-06-29T01:07:01Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
root@infra-03:/# kubectl get daemonset
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
drivers-one-shot 3 3 3 3 3 machinetype=gpu 18m
hadoop-data-node-ds 2 2 0 2 0 hadoop-data-node=true 43s
hadoop-name-node-ds 1 1 1 1 1 hadoop-name-node=true 1m
zookeeper-ds 1 1 1 1 1 zookeeper=true 20m
root@infra-03:/# kubectl api-resources |grep DaemonSet
daemonsets ds apps true DaemonSet
daemonsets ds extensions true DaemonSet

@hao1939 (Contributor, Author) commented Aug 28, 2018

Hi @YitongFeng,

Can we check kubectl api-resources | grep DaemonSet before installing any DaemonSet?
I suspect a race condition: when we try to create the DaemonSet, the api-server is not fully ready. A sketch of such a pre-flight check is below.
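A minimal sketch of that check in Python (a hypothetical helper, not existing paictl code), polling the apiserver until it lists the DaemonSet resource:

import subprocess
import time

def wait_for_daemonset_api(retries=30, interval=5):
    # Poll `kubectl api-resources` until DaemonSet shows up, or give up.
    for _ in range(retries):
        try:
            output = subprocess.check_output(["kubectl", "api-resources"])
        except subprocess.CalledProcessError:
            output = b""  # kubectl returns nonzero while the apiserver is unreachable
        if b"DaemonSet" in output:
            return True
        time.sleep(interval)
    return False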

@YitongFeng (Contributor) commented:

It may be an etcd problem.
When the error occurs, the etcd log shows an endless stream of warnings:
[screenshot: etcd log full of repeated warnings]

When the deployment succeeds, the etcd log is clean:
[screenshot: clean etcd log]

When deploying a pod, the api-server registers the resource group and resource version as a REST API and communicates with etcd, so an etcd timeout may leave the deploy process unable to find the resource type.

A possible cause of the etcd timeout:
https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-apply-entries-took-too-long-mean
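One way to look for that warning directly (assuming etcd runs as a Docker container on the master node; the container lookup is an assumption, and the grep pattern comes from the FAQ entry above):

docker ps | grep etcd
docker logs <etcd-container-id> 2>&1 | grep "took too long"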

@sterowang commented:
Is this possibly caused by the VMs in the qualification bed being too small? One possible action is to increase the VM size in the qualification bed and see whether there is any improvement.

@YitongFeng (Contributor) commented:

@sterowang Slow disk I/O is the usual cause.
https://github.com/etcd-io/etcd/blob/master/Documentation/tuning.md#disk may help with tuning etcd; an example from that page is below. + @hao1939 to help test later.
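For instance, the tuning doc suggests raising etcd's disk I/O priority on the etcd host (command taken from that page):

sudo ionice -c2 -n0 -p `pgrep etcd`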

@hao1939 (Contributor, Author) commented Aug 29, 2018

Tried using a ramdisk for etcd; it works fine:
sudo mount -t tmpfs -o size=2048M tmpfs /var/etcd

@fanyangCS (Contributor) commented:

If this is a disk I/O issue, it could very likely happen in other beds as well. The short-term solution could be to add retries to every etcd operation (don't forget some random backoff between retries; a sketch follows). Long term, we could add an option to support deploying etcd to a disk other than the OS disk. I do not prefer a ramdisk, as it is not a realistic real-world deployment environment.
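A minimal sketch of such a retry wrapper (a hypothetical helper, not existing paictl code), using randomized exponential backoff:

import random
import time

def with_retry(op, attempts=5, base=1.0):
    # Retry a flaky operation, sleeping base * 2**i plus random jitter between attempts.
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base * (2 ** i) + random.uniform(0, 1))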

hao1939 added the deployment (PAI deployment related) label on Aug 29, 2018
@YitongFeng (Contributor) commented:

We may have found the root cause:

kube-proxy was created at 3:31:23:
[screenshot: log showing kube-proxy created at 3:31:23]

But the controller-manager's log shows the daemonset resource was registered at 3:31:24:
[screenshot: controller-manager log registering the daemonset resource at 3:31:24]

So the daemonset resource could not be found at creation time.

We will fix it in our next several PRs.

@hao1939 (Contributor, Author) commented Aug 30, 2018

We got some output from kubectl api-resources:
http://40.76.53.138:33080/blue/organizations/jenkins/pai/detail/hao%2Ftest_k8sbootup/4/pipeline/45

[screenshot: kubectl api-resources output without the DaemonSet resource]

When we try to create the DaemonSet, the DaemonSet resource is not ready yet.
We should wait for all resources to be ready.
