【cherry-pick 1.8】fix port conflicts when using paddlecloud to launch simulated multi-node jobs #27117
Conversation
Thanks for your contribution!
ports = [x for x in range(started_port, started_port + len(selected_gpus))]
cluster, pod = get_cluster(node_ips, node_ip, ports, selected_gpus)
# DISTRIBUTED_TRAINER_ENDPOINTS: new environment variable since paddlecloud 1.8.4
trainer_endpoints = os.getenv("DISTRIBUTED_TRAINER_ENDPOINTS")
DISTRIBUTED_TRAINER_ENDPOINTS needs an example showing its format.
Done, thanks!
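For reference, a minimal sketch of how this comma-separated format (taken from the test change further below) can be read and split; only os.getenv and str.split are involved:

import os

# Format, as exported in the test script below: comma-separated ip:port pairs,
# e.g. "127.0.0.1:6170,127.0.0.1:6171,127.0.0.2:6170,127.0.0.2:6171"
trainer_endpoints = os.getenv("DISTRIBUTED_TRAINER_ENDPOINTS")
if trainer_endpoints is not None:
    # split into individual "ip:port" endpoint strings
    endpoints = trainer_endpoints.split(",")
    print(endpoints)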
python/paddle/distributed/utils.py (outdated)
cluster = Cluster(hdfs=None)
trainer_rank = 0
for node_rank, ip in enumerate(node_ips):
    pod = Pod()
    pod.rank = node_rank
    pod.addr = ip
    cur_node_endpoints = [
        endpoint for endpoint in trainer_endpoints if ip in endpoint
Aren't the endpoints already in order? Why check ip in endpoint? And what if multiple pods are launched on one node?
trainer_endpoints holds the ip:port of all trainers across all nodes; here we need to pick out the ip:port entries belonging to the current (local) ip. On paddlecloud the pod_ip also differs between nodes (e.g. node1: job-a-trainer-0.b, node2: job-a-trainer-1.b).
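A small self-contained illustration of that filtering step, using the hypothetical pod addresses from the example above:

# Hypothetical values mirroring the example above: the global endpoint list
# covers both nodes, and each node keeps only the entries matching its own addr.
trainer_endpoints = [
    "job-a-trainer-0.b:6170", "job-a-trainer-0.b:6171",
    "job-a-trainer-1.b:6170", "job-a-trainer-1.b:6171",
]
ip = "job-a-trainer-0.b"  # this node's pod address on paddlecloud
cur_node_endpoints = [e for e in trainer_endpoints if ip in e]
assert cur_node_endpoints == ["job-a-trainer-0.b:6170", "job-a-trainer-0.b:6171"]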
Re "the pod_ip also differs between nodes on paddlecloud": they may provide physical IPs rather than hostnames.
@@ -51,6 +51,7 @@ fi

unset PADDLE_PORT
unset TRAINER_PORTS_NUM
export DISTRIBUTED_TRAINER_ENDPOINTS=127.0.0.1:6170,127.0.0.1:6171,127.0.0.2:6170,127.0.0.2:6171
Both protocols need to be covered: with DISTRIBUTED_TRAINER_ENDPOINTS set and without it. Also add a comment to this unit test explaining what it tests, as sketched below.
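A hypothetical sketch of what such a test could look like; the class and method names are illustrative, not the actual test added in this PR:

import os
import unittest

class TestEndpointProtocols(unittest.TestCase):
    def test_with_distributed_trainer_endpoints(self):
        """New protocol (paddlecloud >= 1.8.4): endpoints come pre-assigned."""
        os.environ["DISTRIBUTED_TRAINER_ENDPOINTS"] = (
            "127.0.0.1:6170,127.0.0.1:6171,127.0.0.2:6170,127.0.0.2:6171")
        endpoints = os.environ["DISTRIBUTED_TRAINER_ENDPOINTS"].split(",")
        self.assertEqual(len(endpoints), 4)

    def test_without_distributed_trainer_endpoints(self):
        """Old protocol: ports are derived from PADDLE_PORT / started_port."""
        os.environ.pop("DISTRIBUTED_TRAINER_ENDPOINTS", None)
        self.assertIsNone(os.getenv("DISTRIBUTED_TRAINER_ENDPOINTS"))

if __name__ == "__main__":
    unittest.main()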
cur_node_endpoints = [
    endpoint for endpoint in trainer_endpoints if ip in endpoint
]
assert len(cur_node_endpoints) >= len(selected_gpus)
Should this be == instead?
It is >= because paddlecloud may allocate a node with 8 GPUs while the user launches with --selected_gpus using only 4 of them; in that case the number of endpoints is greater than the number of selected_gpus.
This get_cluster is shared by the cloud and non-cloud paths. The selected_gpus passed into the interface is the user's argument, while the endpoints cover all GPUs; comparing against the count of all GPUs would amount to comparing the value with itself, which is meaningless.
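A worked illustration of the point above, with hypothetical numbers (endpoints exposed for all 8 GPUs on the node, but only 4 GPUs selected by the user):

# Hypothetical: paddlecloud exposes endpoints for all 8 GPUs on this node,
# while the user launched with --selected_gpus covering only 4 of them.
cur_node_endpoints = ["10.0.0.1:%d" % p for p in range(6170, 6178)]  # 8 entries
selected_gpus = [0, 1, 2, 3]  # 4 GPUs actually in use
assert len(cur_node_endpoints) >= len(selected_gpus)  # 8 >= 4 holds; == would fail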
LGTM
LGTM
PR types
Bug fixes
PR changes
APIs
Describe
When a multi-node GPU job is submitted on Paddlecloud and Paddlecloud internally schedules several nodes onto the same machine, the launch/fleetrun commands run into port conflicts.
This PR adds parsing of the DISTRIBUTED_TRAINER_ENDPOINTS environment variable (provided by Paddlecloud since 1.8.4).
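A minimal sketch of the resulting control flow; started_port and selected_gpus are hypothetical local values here, and the real logic lives in python/paddle/distributed/utils.py:

import os

started_port = 6170           # hypothetical base port
selected_gpus = [0, 1, 2, 3]  # hypothetical GPU selection

trainer_endpoints = os.getenv("DISTRIBUTED_TRAINER_ENDPOINTS")
if trainer_endpoints is None:
    # old behavior: every logical node derives the same port range locally,
    # which collides when paddlecloud places two nodes on one machine
    ports = [started_port + i for i in range(len(selected_gpus))]
else:
    # new behavior (paddlecloud >= 1.8.4): ports come pre-assigned and are
    # distinct per node, so co-located nodes no longer clash
    ports = [int(ep.split(":")[1]) for ep in trainer_endpoints.split(",")]
print(ports)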