This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

[machine maintain]add machine and remove machine issue #1863

Closed
YanjieGao opened this issue Dec 11, 2018 · 4 comments


YanjieGao commented Dec 11, 2018

I followed the paictl manual to add and remove machine 10.0.0.7 on an existing P100 cluster (PAI v0.8.1), but ran into problems:
https://github.com/Microsoft/pai/blob/pai-0.8.y/docs/paictl/paictl-manual.md#Machine_Nodelist_Example

For add:

machine-list:

  - hostname: next-a-gpu-0001
    hostip: 10.0.0.7
    machine-type: GENERIC
    k8s-role: worker
    pai-worker: "true"
    sshport: 22
    username: xx
    password: xx
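
An entry like the one above can be sanity-checked before running the add operation. The following is only a minimal sketch in Python: treating every key from this example as required is an assumption, and `validate_machine` is a hypothetical helper, not part of paictl.

```python
import ipaddress

# Keys seen in the example entry above. Treating all of them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = {
    "hostname", "hostip", "machine-type", "k8s-role",
    "pai-worker", "sshport", "username", "password",
}


def validate_machine(entry):
    """Return a list of problems found in one machine-list entry (hypothetical helper)."""
    problems = []

    # Report any key from the example that is absent.
    missing = sorted(REQUIRED_KEYS - set(entry))
    if missing:
        problems.append("missing keys: " + ", ".join(missing))

    # hostip must parse as an IP address.
    if "hostip" in entry:
        try:
            ipaddress.ip_address(entry["hostip"])
        except ValueError:
            problems.append("invalid hostip: %r" % (entry["hostip"],))

    # sshport must be an integer in the valid TCP port range.
    if "sshport" in entry:
        port = entry["sshport"]
        if not isinstance(port, int) or not 1 <= port <= 65535:
            problems.append("invalid sshport: %r" % (port,))

    return problems


machine = {
    "hostname": "next-a-gpu-0001",
    "hostip": "10.0.0.7",
    "machine-type": "GENERIC",
    "k8s-role": "worker",
    "pai-worker": "true",
    "sshport": 22,
    "username": "xx",
    "password": "xx",
}

print(validate_machine(machine))
```

An empty list means the entry passed these checks; it says nothing about whether the node is reachable or whether the deployment will accept it.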

But the datanode and namenode show the wrong IP address.
datanode:
image

namenode log:
image

The drivers service hits a readiness probe error, although its log looks correct:
image

For remove: the operation fails with an error:

image

@YanjieGao YanjieGao changed the title add machine and remove machine issue [machine maintain]add machine and remove machine issue Dec 11, 2018

ydye commented Dec 11, 2018

Node-manager issue:

  1. Add the new node to the cluster-configuration.
  2. Refresh the cluster-configuration service.
  3. Refresh the hadoop-data-node and hadoop-node-manager services.

Please refer to this issue:
#253

Drivers issue:
Please provide more details.

Remove problem: this looks like a bug.


YanjieGao commented Dec 11, 2018

For the node-manager issue, following your suggestion from #253:

The node's hostname is the same as the hostname in machinelist.yaml:
image
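
The comparison described here can be sketched in Python. This is only an illustration: it assumes the check runs on the node itself, that the machine-list value is supplied by hand, and that an exact string match is what matters; `hostname_matches` is a hypothetical helper, not part of paictl.

```python
import socket


def hostname_matches(expected, actual=None):
    """Compare the machine-list hostname with the node's own hostname.

    `actual` defaults to what the operating system reports on this node.
    """
    if actual is None:
        actual = socket.gethostname()
    # Exact comparison after trimming whitespace; a short name and a
    # fully qualified name are treated as different in this sketch.
    return expected.strip() == actual.strip()


# Run on the node; "next-a-gpu-0001" is the hostname from the machine list.
print(hostname_matches("next-a-gpu-0001"))
```

A result of `False` would point at a mismatch between the machine list and the name the node reports for itself.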

root@paidevbox:~/PaiDeployment# cd pai-cluster-msr-next-p100/
image

For now I want to remove this node and try deploying again, but I also hit the remove-node error described above.


For #253, my guess is that the issue arises because I used the config file directly instead of generating the config to replace it. But this assumption is not currently mentioned in our docs. If you can tell me how to remove this node without throwing an exception, I could try to redeploy it.

https://github.com/Microsoft/pai/blob/3f7190dd185c75ca42a38a03a3c9def172798106/src/hadoop-data-node/deploy/hadoop-data-node-configuration/hdfs-site.xml#L271


YanjieGao commented Dec 11, 2018

Yundong's fix for k8s remove: #1864

@ydye ydye closed this as completed Dec 11, 2018