
【cherry-pick 1.8】fix port conflicts when using paddlecloud to launch simulated multi-node jobs #27117

Merged
5 commits merged into PaddlePaddle:release/1.8 from launch_port_1.8 on Sep 15, 2020

Conversation

danleifeng
Contributor

PR types

Bug fixes

PR changes

APIs

Describe

When a multi-node GPU job is submitted on Paddlecloud and Paddlecloud schedules several of those nodes onto the same machine, the launch/fleetrun command runs into port conflicts.
This PR adds parsing of the DISTRIBUTED_TRAINER_ENDPOINTS environment variable (provided by Paddlecloud since version 1.8.4).
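A minimal sketch of the intended behavior, assuming a hypothetical helper name (resolve_cur_node_endpoints); only the environment variable, its comma-separated ip:port format, and the started_port fallback come from this PR:

import os

def resolve_cur_node_endpoints(node_ip, started_port, selected_gpus):
    # Prefer the endpoint list handed out by paddlecloud (>= 1.8.4), so two
    # logical nodes scheduled onto one physical machine never pick the same
    # ports; otherwise fall back to consecutive ports from started_port.
    trainer_endpoints = os.getenv("DISTRIBUTED_TRAINER_ENDPOINTS")
    if trainer_endpoints is None:
        ports = range(started_port, started_port + len(selected_gpus))
        return ["%s:%d" % (node_ip, p) for p in ports]
    # e.g. "127.0.0.1:6170,127.0.0.1:6171,127.0.0.2:6170,127.0.0.2:6171"
    return [ep for ep in trainer_endpoints.split(",") if node_ip in ep]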


paddle-bot-old bot commented Sep 7, 2020

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

ports = [x for x in range(started_port, started_port + len(selected_gpus))]
cluster, pod = get_cluster(node_ips, node_ip, ports, selected_gpus)
# DISTRIBUTED_TRAINER_ENDPOINTS: new environment since paddlecloud 1.8.4
trainer_endpoints = os.getenv("DISTRIBUTED_TRAINER_ENDPOINTS")
Contributor

DISTRIBUTED_TRAINER_ENDPOINTS needs an example that illustrates its format.

Contributor Author

Done, thanks!
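For reference, the format is a comma-separated list of every trainer's ip:port across all nodes; the example added to the test script later in this PR is:

export DISTRIBUTED_TRAINER_ENDPOINTS=127.0.0.1:6170,127.0.0.1:6171,127.0.0.2:6170,127.0.0.2:6171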

cluster = Cluster(hdfs=None)
trainer_rank = 0
for node_rank, ip in enumerate(node_ips):
    pod = Pod()
    pod.rank = node_rank
    pod.addr = ip
    cur_node_endpoints = [
        endpoint for endpoint in trainer_endpoints if ip in endpoint
Contributor

Shouldn't the endpoints already come in order? Why check ip in endpoint? What if one node starts multiple pods?

Contributor Author

trainer_endpoints holds the ip:port pairs of all nodes; here we need to pick out the ip:port pairs belonging to the current node's ip. On paddlecloud, the pod_ip also differs across nodes (e.g. node1: job-a-trainer-0.b, node2: job-a-trainer-1.b).
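A small sketch of the filtering being described, using the hostnames from this reply (the port numbers are illustrative only):

trainer_endpoints = [
    "job-a-trainer-0.b:6170", "job-a-trainer-0.b:6171",
    "job-a-trainer-1.b:6170", "job-a-trainer-1.b:6171",
]
ip = "job-a-trainer-0.b"  # pod_ip of the current node
cur_node_endpoints = [
    endpoint for endpoint in trainer_endpoints if ip in endpoint
]
# cur_node_endpoints == ["job-a-trainer-0.b:6170", "job-a-trainer-0.b:6171"]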

Contributor

Regarding "on paddlecloud the pod_ip differs across nodes": they may hand out the physical ip rather than a hostname.

@@ -51,6 +51,7 @@ fi

unset PADDLE_PORT
unset TRAINER_PORTS_NUM
export DISTRIBUTED_TRAINER_ENDPOINTS=127.0.0.1:6170,127.0.0.1:6171,127.0.0.2:6170,127.0.0.2:6171
Contributor

Both cases, with and without DISTRIBUTED_TRAINER_ENDPOINTS, need to be tested.
Also add a note to this unit test describing what it tests.

Contributor Author

Line #21 tests the case where DISTRIBUTED_TRAINER_ENDPOINTS is not set.
Line #54 tests the case where it is set.
Comments added, thanks!

cur_node_endpoints = [
    endpoint for endpoint in trainer_endpoints if ip in endpoint
]
assert len(cur_node_endpoints) >= len(selected_gpus)
Contributor

==

danleifeng (Contributor Author) commented Sep 9, 2020

>= is used because paddlecloud may allocate a full 8-GPU machine while the user runs launch with --selected_gpus using only 4 of them; in that case the number of endpoints is larger than the number of selected_gpus.

Contributor Author

This get_cluster is shared by the cloud and non-cloud paths. The selected_gpus passed into the interface is the user's argument, while the endpoints cover all cards; comparing against the total number of cards would amount to comparing the value with itself, which is meaningless.
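A sketch of the situation described above, with illustrative values:

selected_gpus = ["0", "1", "2", "3"]  # user ran launch with --selected_gpus 0,1,2,3
# paddlecloud allocated the whole 8-card node, so 8 endpoints arrive for this node
cur_node_endpoints = ["127.0.0.1:%d" % port for port in range(6170, 6178)]
assert len(cur_node_endpoints) >= len(selected_gpus)  # an == check would fail here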

gongweibao previously approved these changes Sep 10, 2020
gongweibao (Contributor) left a comment

LGTM

gongweibao (Contributor) left a comment

LGTM

danleifeng merged commit 67f87d6 into PaddlePaddle:release/1.8 on Sep 15, 2020
danleifeng deleted the launch_port_1.8 branch on Sep 21, 2020