
【cherry-pick 1.8】fix port conflicts when using paddlecloud to launch simulated multi-node jobs #27117

Merged
5 commits merged into PaddlePaddle:release/1.8 from launch_port_1.8 on Sep 15, 2020

Conversation

danleifeng
Contributor

PR types

Bug fixes

PR changes

APIs

Describe

When a multi-node GPU job is submitted on Paddlecloud and Paddlecloud schedules several of those nodes onto the same machine, the launch/fleetrun command runs into port conflicts.
This PR adds parsing of the DISTRIBUTED_TRAINER_ENDPOINTS environment variable (provided by Paddlecloud since version 1.8.4).
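A minimal sketch of the intended behavior, assuming a hypothetical helper name (resolve_cur_node_endpoints); only the environment variable, its comma-separated ip:port format, and the started_port fallback come from this PR:

import os

def resolve_cur_node_endpoints(node_ip, started_port, selected_gpus):
    # Prefer the endpoint list handed out by paddlecloud (>= 1.8.4), so two
    # logical nodes scheduled onto one physical machine never pick the same
    # ports; otherwise fall back to consecutive ports from started_port.
    trainer_endpoints = os.getenv("DISTRIBUTED_TRAINER_ENDPOINTS")
    if trainer_endpoints is None:
        ports = range(started_port, started_port + len(selected_gpus))
        return ["%s:%d" % (node_ip, p) for p in ports]
    # e.g. "127.0.0.1:6170,127.0.0.1:6171,127.0.0.2:6170,127.0.0.2:6171"
    return [ep for ep in trainer_endpoints.split(",") if node_ip in ep]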


paddle-bot-old bot commented Sep 7, 2020

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

ports = [x for x in range(started_port, started_port + len(selected_gpus))]
cluster, pod = get_cluster(node_ips, node_ip, ports, selected_gpus)
# DISTRIBUTED_TRAINER_ENDPOINTS: new environment since paddlecloud 1.8.4
trainer_endpoints = os.getenv("DISTRIBUTED_TRAINER_ENDPOINTS")
Contributor

DISTRIBUTED_TRAINER_ENDPOINTS needs an example that illustrates its format.

Contributor Author

Done, thanks!
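For reference, the format is a comma-separated list of every trainer's ip:port across all nodes; the example added to the test script later in this PR is:

export DISTRIBUTED_TRAINER_ENDPOINTS=127.0.0.1:6170,127.0.0.1:6171,127.0.0.2:6170,127.0.0.2:6171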

cluster = Cluster(hdfs=None)
trainer_rank = 0
for node_rank, ip in enumerate(node_ips):
    pod = Pod()
    pod.rank = node_rank
    pod.addr = ip
    cur_node_endpoints = [
        endpoint for endpoint in trainer_endpoints if ip in endpoint
Contributor

Shouldn't the endpoints already come in order? Why check ip in endpoint? What if one node starts multiple pods?

Contributor Author

trainer_endpoints holds the ip:port pairs of all nodes; here we need to pick out the ip:port pairs belonging to the current node's ip. On paddlecloud, the pod_ip also differs across nodes (e.g. node1: job-a-trainer-0.b, node2: job-a-trainer-1.b).
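A small sketch of the filtering being described, using the hostnames from this reply (the port numbers are illustrative only):

trainer_endpoints = [
    "job-a-trainer-0.b:6170", "job-a-trainer-0.b:6171",
    "job-a-trainer-1.b:6170", "job-a-trainer-1.b:6171",
]
ip = "job-a-trainer-0.b"  # pod_ip of the current node
cur_node_endpoints = [
    endpoint for endpoint in trainer_endpoints if ip in endpoint
]
# cur_node_endpoints == ["job-a-trainer-0.b:6170", "job-a-trainer-0.b:6171"]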

Contributor

Regarding "on paddlecloud the pod_ip differs across nodes": they may hand out the physical ip rather than a hostname.

@@ -51,6 +51,7 @@ fi

unset PADDLE_PORT
unset TRAINER_PORTS_NUM
export DISTRIBUTED_TRAINER_ENDPOINTS=127.0.0.1:6170,127.0.0.1:6171,127.0.0.2:6170,127.0.0.2:6171
Contributor

Both cases, with and without DISTRIBUTED_TRAINER_ENDPOINTS, need to be tested.
Also add a note to this unit test describing what it tests.

Contributor Author

Line #21 tests the case where DISTRIBUTED_TRAINER_ENDPOINTS is not set.
Line #54 tests the case where it is set.
Comments added, thanks!

cur_node_endpoints = [
    endpoint for endpoint in trainer_endpoints if ip in endpoint
]
assert len(cur_node_endpoints) >= len(selected_gpus)
Contributor

==

danleifeng (Contributor Author) commented Sep 9, 2020

>= is used because paddlecloud may allocate a full 8-GPU machine while the user runs launch with --selected_gpus using only 4 of them; in that case the number of endpoints is larger than the number of selected_gpus.

Contributor Author

This get_cluster is shared by the cloud and non-cloud paths. The selected_gpus passed into the interface is the user's argument, while the endpoints cover all cards; comparing against the total number of cards would amount to comparing the value with itself, which is meaningless.
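A sketch of the situation described above, with illustrative values:

selected_gpus = ["0", "1", "2", "3"]  # user ran launch with --selected_gpus 0,1,2,3
# paddlecloud allocated the whole 8-card node, so 8 endpoints arrive for this node
cur_node_endpoints = ["127.0.0.1:%d" % port for port in range(6170, 6178)]
assert len(cur_node_endpoints) >= len(selected_gpus)  # an == check would fail here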

gongweibao previously approved these changes Sep 10, 2020
gongweibao (Contributor) left a comment

LGTM

gongweibao (Contributor) left a comment

LGTM

danleifeng merged commit 67f87d6 into PaddlePaddle:release/1.8 on Sep 15, 2020
danleifeng deleted the launch_port_1.8 branch on Sep 21, 2020