[Fleet Executor] Construct runtime graph #37158

LiYuRio · 2021-11-12T09:05:05Z

PR types

New Features

PR changes

Others

Describe

Construct the runtime graph.

paddle-bot-old · 2021-11-12T09:05:08Z

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

paddle/fluid/distributed/fleet_executor/runtime_graph.cc

python/paddle/fluid/executor.py

paddle/fluid/distributed/fleet_executor/runtime_graph.cc

FeixLiu · 2021-11-15T05:39:31Z

Please add some VLOG(3) output at the key places for debugging, for example where the dependencies are derived and where interceptor_id is mapped to task_id, rank, and so on.

python/paddle/fluid/tests/unittests/test_fleet_executor_multi_devices.py

paddle/fluid/distributed/fleet_executor/runtime_graph.cc

wangxicoding · 2021-11-16T04:00:07Z

paddle/fluid/distributed/fleet_executor/fleet_executor_desc.proto

@@ -24,4 +24,7 @@ message FleetExecutorDesc {
  optional string grain = 1 [ default = "coarse" ];
  optional int64 cur_rank = 2 [ default = 0 ]; // Rank id of current processor
  repeated RankInfo cluster_info = 3;
+  optional int32 dp_degree = 4 [ default = 1 ];


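Would it be better to reuse distributed_strategy later on? There may also be a sharding_degree.

distributed_strategy.proto lives under the framework directory, not in the same folder as this proto. The CMakeLists in the current folder calls the proto_library function defined in generic.cmake, which sets the protobuf search path to the current folder, and protobuf's import does not support relative paths. So for now I have not figured out how to reference the definitions in distributed_strategy.proto directly.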

wangxicoding · 2021-11-16T04:04:39Z

paddle/fluid/distributed/fleet_executor/runtime_graph.cc

+  int32_t pp_indice = rank % pp_degree;
+  rank /= mp_degree;
+  int32_t dp_indice = rank % dp_degree;
+  return {dp_indice, pp_indice, mp_indice};


The order of dp, pp, and mp may change in the future.

wangxicoding · 2021-11-16T04:17:33Z

paddle/fluid/distributed/fleet_executor/runtime_graph.cc

+  return {dp_indice, pp_indice, mp_indice};
+}
+
+int64_t PPUpstreamRank(int64_t dp_degree, int64_t pp_degree, int64_t mp_degree,


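I suggest wrapping dp_degree, pp_degree, and mp_degree in a struct and treating it as a Cartesian coordinate system, then adding conversions in both directions between the process rank and the Cartesian coordinates. That might be a bit more concise. The ordering issue is then also easy to solve by adding a mapping.
{x, y, z} = rank2coord(pid);
left_x = (x - 1 + xranks) % xranks;
left_rank = coord2rank({left_x, y, z})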
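
To make the suggestion above concrete, here is a minimal sketch of such a coordinate helper. The names `ParallelGrid`, `RankToCoord`, `CoordToRank`, and `PrevRank` are hypothetical and are not taken from this PR. The sketch assumes the last axis varies fastest and that neighbors wrap around, as in the pseudocode above. Changing the dp/pp/mp order would only mean changing which degree is stored at which index.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper, not the PR's code: treats the parallel degrees as a
// 3-D Cartesian grid. Assumption: degrees[2] is the fastest-varying axis.
struct ParallelGrid {
  std::array<int64_t, 3> degrees;  // e.g. {dp_degree, pp_degree, mp_degree}

  // Convert a flat process rank into per-axis coordinates.
  std::array<int64_t, 3> RankToCoord(int64_t rank) const {
    std::array<int64_t, 3> coord{};
    for (int axis = 2; axis >= 0; --axis) {
      coord[axis] = rank % degrees[axis];
      rank /= degrees[axis];
    }
    return coord;
  }

  // Convert per-axis coordinates back into a flat process rank.
  int64_t CoordToRank(const std::array<int64_t, 3>& coord) const {
    int64_t rank = 0;
    for (int axis = 0; axis < 3; ++axis) {
      rank = rank * degrees[axis] + coord[axis];
    }
    return rank;
  }

  // Rank of the neighbor one step back along `axis`, wrapping around
  // (mirrors left_x = (x - 1 + xranks) % xranks in the comment above).
  int64_t PrevRank(int64_t rank, int axis) const {
    auto coord = RankToCoord(rank);
    coord[axis] = (coord[axis] - 1 + degrees[axis]) % degrees[axis];
    return CoordToRank(coord);
  }
};
```

With degrees {2, 3, 4}, rank 13 maps to the coordinate {1, 0, 1}, and its wrapped predecessor along axis 1 is rank 21. Whether the pipeline upstream should really wrap around at stage 0 is a separate decision that this sketch does not settle.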

paddle/fluid/distributed/fleet_executor/runtime_graph.cc

wangxicoding

LGTM

LiYuRio force-pushed the runtime_graph branch from d3910f3 to c7b7dca Compare November 12, 2021 09:44

FeixLiu requested review from FeixLiu and wangxicoding November 15, 2021 01:34

FeixLiu reviewed Nov 15, 2021

LiYuRio force-pushed the runtime_graph branch from 91ed45e to 26c9f7b Compare November 15, 2021 04:59

FeixLiu reviewed Nov 15, 2021

LiYuRio force-pushed the runtime_graph branch from 26c9f7b to b4e48fd Compare November 15, 2021 08:33

LiYuRio force-pushed the runtime_graph branch from b4e48fd to 1bdb86e Compare November 15, 2021 09:01

LiYuRio force-pushed the runtime_graph branch from 1bdb86e to 451a1ac Compare November 15, 2021 12:31

FeixLiu reviewed Nov 16, 2021

wangxicoding reviewed Nov 16, 2021

Construct runtime graph

849eb85

LiYuRio force-pushed the runtime_graph branch from 2a44f1f to 849eb85 Compare November 16, 2021 09:42

FeixLiu reviewed Nov 16, 2021

refine

43547fb

FeixLiu requested a review from wangxicoding November 16, 2021 10:53

wangxicoding approved these changes Nov 16, 2021

FeixLiu merged commit 0daa69d into PaddlePaddle:develop Nov 17, 2021

LiYuRio deleted the runtime_graph branch November 17, 2021 08:18
