
[Kunlun]Multi xpu dygraph performance optimization , add distributed.spawn support for multi xpu and some bug-fixes #31130

Merged: 15 commits merged into PaddlePaddle:develop on Mar 5, 2021

Conversation

@vslyu (Contributor) commented Feb 22, 2021

PR types

Performance optimization

PR changes

Others

Describe

List of major modifications and bug fixes for Baidu Kunlun XPU:

  1. Multi-XPU dygraph training performance optimization.
    Add new threads for multi-XPU communication in imperative/reducer.
  2. Add distributed.spawn support for multi XPU. PR31032 (merged into this PR).
    Register FLAGS_selected_xpus in pybind/global_value_getter_setter, and add an XPU interface in distributed/spawn and
    distributed/utils. Example at the end.
  3. Export the bkcl_comm_num interface to Python.
    Add the bkcl_comm_num interface in pybind/pybind, and remove the 'num_threads = 1' limit in fluid/compiler.
  4. Fix the bug in fleet/launch_utils where a device id (>10) was extended into multiple strings.
    Replace extend with append in fleet/launch_utils (see the sketch after this list).
  5. Fix some error messages and macro definitions in collective/c_comm_init_op and collective/gen_bkcl_id_op.
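
For illustration, a minimal Python sketch of the bug fixed in item 4, assuming the launcher collects card ids into a Python list (the variable names here are hypothetical, not the actual fleet/launch_utils code):

# extend() iterates over the string, so a two-digit card id is split into
# single characters; append() keeps the id intact.
selected_devices = []
device_id = "12"                      # an XPU card id with more than one digit

selected_devices.extend(device_id)    # buggy: ['1', '2']
print(selected_devices)

selected_devices = []
selected_devices.append(device_id)    # fixed: ['12']
print(selected_devices)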

Use environment variables:
export FLAGS_selected_xpus=0,1,2,3
to specify which cards to run on.

spawn example code:

from __future__ import print_function

import paddle
import paddle.nn as nn
import paddle.optimizer as opt
import paddle.distributed as dist

class LinearNet(nn.Layer):
    def __init__(self):
        super(LinearNet, self).__init__()
        self._linear1 = nn.Linear(10, 10)
        self._linear2 = nn.Linear(10, 1)

    def forward(self, x):
        return self._linear2(self._linear1(x))

def train(print_result=False):
    # 1. initialize parallel environment
    dist.init_parallel_env()

    # 2. create data parallel layer & optimizer
    layer = LinearNet()
    dp_layer = paddle.DataParallel(layer)

    loss_fn = nn.MSELoss()
    adam = opt.Adam(
        learning_rate=0.001, parameters=dp_layer.parameters())

    # 3. run layer
    inputs = paddle.randn([10, 10], 'float32')
    outputs = dp_layer(inputs)
    labels = paddle.randn([10, 1], 'float32')
    loss = loss_fn(outputs, labels)

    if print_result is True:
        print("loss:", loss.numpy())

    loss.backward()

    adam.step()
    adam.clear_grad()

# Usage 1: only pass the function.
# If your training method does not need any arguments, and you want to
# use all visible devices for parallel training.
if __name__ == '__main__':
    dist.spawn(train)

# Usage 2: pass the function and its arguments.
# If your training method needs some arguments, and you want to
# use all visible devices for parallel training.
if __name__ == '__main__':
    dist.spawn(train, args=(True,))

# Usage 3: pass the function, its arguments, and nprocs.
# If your training method needs some arguments, and you want to
# use only part of the visible devices for parallel training.
# If your machine holds 8 cards {0,1,2,3,4,5,6,7},
# this case will use cards {0,1}.
if __name__ == '__main__':
    dist.spawn(train, args=(True,), nprocs=2)

# Usage 4: pass the function, its arguments, nprocs, and xpus.
# If your training method needs some arguments, and you want to
# use only part of the visible devices for parallel training,
# you can pass `xpus` to select the XPU cards you want to use.
# For example, this case will use cards {4,5} if your machine holds more than 6 cards.
if __name__ == '__main__':
    dist.spawn(train, args=(True,), nprocs=2, xpus='4,5')

The only difference between the XPU and GPU spawn interfaces is in usage 4 (pass `xpus` instead of `gpus`).
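
For comparison, a sketch of how usage 4 would look on GPUs with the existing `gpus` option, next to the `xpus` call added by this PR (only the keyword argument changes; the GPU line assumes a CUDA build):

# Usage 4 on GPU (existing interface) vs. XPU (added in this PR).
if __name__ == '__main__':
    dist.spawn(train, args=(True,), nprocs=2, gpus='4,5')    # GPU cards {4,5}
    # dist.spawn(train, args=(True,), nprocs=2, xpus='4,5')  # XPU cards {4,5}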

Model benchmarks:

  • BERT
    Measurement: average over steps 10-110 of the first epoch
    Unit: step/s
    Card type: K200
    Per-card batch size = 16

    Cards / precision   1N1C          1N2C                    1N4C                    1N8C                    1N16C
    FP32                1.46 (1.0x)   1.42*2=2.88 (1.9452x)   0.86*4=3.44 (2.3561x)   0.77*8=6.16 (4.3380x)   0.52*16=8.32 (5.6986x)

  • ResNet
    Dygraph, multi-process multi-card
    Measurement: average over steps 10-110 of the first epoch
    Unit: ips
    Card type: K200
    Per-card batch size = 32

    Cards / precision   1N1C       1N2C                           1N4C                             1N8C                            1N16C
    FP32                52.56127   46.21590*2=92.4318 (1.7585x)   41.66224*4=166.64896 (3.1705x)   36.2901*8=290.32007 (5.5234x)   shm error

@paddle-bot-old commented:
Thanks for your contribution!
Please wait for the result of CI first. See Paddle CI Manual for details.

@@ -640,56 +641,80 @@ void Reducer::MarkGroupReady(size_t group_index) {
return;
}

{
std::lock_guard<std::mutex> lock(mutex_);
multi_device_op_count_ = 0;
Contributor:
Put this in the Reducer constructor.

Contributor Author:
done.

multi_device_op_count_ -= 1; // lock
cv_.notify_all();
}
// cnt -= 1; // lock
Contributor:
delete

Contributor Author:
done.

nprocs = core.get_cuda_device_count()
else:
Contributor:
elif device == 'xpu'

Contributor Author:
done.


// std::vector<std::unique_ptr<::ThreadPool>> pool_;
// ::ThreadPool comm_pool_;
::ThreadPool multi_device_op_pool_;
Contributor:
Just rename it to comm_pool as above, or come up with a better name. Also add some comments: this thread is used to schedule the allreduce communication.

Contributor Author:
just comm_pool_?

// std::vector<std::unique_ptr<::ThreadPool>> pool_;
// ::ThreadPool comm_pool_;
::ThreadPool multi_device_op_pool_;
uint32_t multi_device_op_count_;
Contributor:
Rename this one too, e.g. comm_op_count, or come up with something better.

Contributor Author:
just comm_op_count_?
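
For readers following the reducer changes, here is a minimal, language-agnostic sketch of the scheduling pattern these members implement: a single communication thread plus a counter and condition variable that the main thread can wait on. It is written in Python only for brevity; the names mirror, but are not, the actual C++ members.

import threading
from concurrent.futures import ThreadPoolExecutor

comm_pool = ThreadPoolExecutor(max_workers=1)   # analogous to comm_pool_(1)
comm_op_count = 0                               # analogous to comm_op_count_
mutex = threading.Lock()
cv = threading.Condition(mutex)

def fused_allreduce(group):
    # Placeholder for the real fused allreduce over the group's gradients.
    pass

def schedule_allreduce(group):
    global comm_op_count
    with mutex:
        comm_op_count += 1

    def task():
        global comm_op_count
        fused_allreduce(group)
        with mutex:
            comm_op_count -= 1
            cv.notify_all()

    comm_pool.submit(task)

def wait_all_comm_ops():
    # The main thread blocks until every scheduled allreduce has finished.
    with cv:
        cv.wait_for(lambda: comm_op_count == 0)

schedule_allreduce(group="group-0")
wait_all_comm_ops()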

for (; next_group_ < groups_.size() && groups_[next_group_].pending_ == 0;
++next_group_) {
auto &group = groups_[next_group_];
int run_order = next_group_ % nrings_;

// For CUDA or XPU, compute_stream --> comm_stream.
// For CUDA or XPU, compute_stream --event--> comm_stream.
Contributor:
Was the event added as well?

Contributor:
No, the communication library is still blocking for now, so adding it would have no effect.

cv_.notify_all();
}
});
#else
Contributor:
The GPU WaitCompute got dropped.

Contributor Author:
done, added it back.

(card_id, ",".join(env_devices_list)))

if core.is_compiled_with_xpu():
args.selected_gpus = options.get('xpus', None)
Contributor:
Does XPU also use args.selected_gpus? It makes the code a bit confusing to read.

Contributor Author:
Done, uniformly defined args.selected_devices for multi-GPU and multi-XPU training and deleted args.selected_gpus.

raise ValueError(
"The number of selected gpus(%s) is not equal to "
"the number of spawn processes(%d), please ensure that the "
"correct `nprocs` and `gpus` arguments are passed." %
Contributor:
gpus -> xpus?

Contributor Author:
done.

(len(selected_gpu_list), nprocs))
for card_id in selected_gpu_list:
if card_id not in env_devices_list:
raise ValueError("The selected gpu card %s cannot found in "
Contributor:
gpu -> xpu?

Contributor Author:
done.
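
Both error messages quoted above come from the same validation step; a minimal sketch of that check for the XPU path, with illustrative names rather than the actual Paddle code:

# The number of selected XPU cards must match nprocs, and every card must
# be visible through FLAGS_selected_xpus.
def validate_selected_xpus(xpus, nprocs, env_devices):
    selected_xpu_list = xpus.split(',')
    env_devices_list = env_devices.split(',')
    if len(selected_xpu_list) != nprocs:
        raise ValueError(
            "The number of selected xpus(%s) is not equal to "
            "the number of spawn processes(%d), please ensure that the "
            "correct `nprocs` and `xpus` arguments are passed." %
            (len(selected_xpu_list), nprocs))
    for card_id in selected_xpu_list:
        if card_id not in env_devices_list:
            raise ValueError(
                "The selected xpu card %s cannot be found in "
                "FLAGS_selected_xpus (%s)." % (card_id, env_devices))

validate_selected_xpus('4,5', 2, '0,1,2,3,4,5,6,7')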

@vslyu vslyu changed the title [Kunlun]multi xpu dygraph performance optimization [Kunlun]Multi xpu dygraph performance optimization , add distributed.spawn support for multi xpu and some bug-fixes Mar 2, 2021
}

comm_pool_.enqueue([&] {
parallel_ctx_->WaitCompute(run_order);
Contributor:
What about SetXPUDevice?

Contributor Author:
done, added it.

group_size_limits_(group_size_limits),
find_unused_vars_(find_unused_vars),
comm_pool_(1),
comm_op_count_(0) {
Contributor:
only wrap the added code

Contributor Author:
done.

@@ -645,51 +652,84 @@ void Reducer::MarkGroupReady(size_t group_index) {
auto &group = groups_[next_group_];
int run_order = next_group_ % nrings_;

auto place = parallel_ctx_->GetDeviceContext(run_order)->GetPlace();

Member:
Why do we need to get place here? Isn't place already available? Is mixed GPU/XPU communication involved here?

Contributor:
reducer.h already has _place, so there is no need to fetch it here.

}
}

void Reducer::FusedAllReduceSchedule(int run_order, Group group) {
Member:
Change it to a const reference, otherwise it affects performance.


@@ -645,51 +652,84 @@ void Reducer::MarkGroupReady(size_t group_index) {
auto &group = groups_[next_group_];
int run_order = next_group_ % nrings_;

auto place = parallel_ctx_->GetDeviceContext(run_order)->GetPlace();

// For CUDA or XPU, compute_stream --> comm_stream.
// For CPU, do nothing.
// NOTE. Because concat uses the comm_stream,
// so we expose WaitCompute() interface and call
// it here.
parallel_ctx_->WaitCompute(run_order);
Contributor:
As a follow-up, try whether performance is better with WaitCompute here or inside the allreduce.

} else {
VLOG(3) << "The sparse group[" << next_group_
<< "] has no var to allreduce";
if (paddle::platform::is_xpu_place(place)) {
Contributor:
There is no need to check place here; just put the thread scheduling under the BKCL macro definition and schedule the rest normally.

comm_pool_->enqueue([&] {
auto dev_id = BOOST_GET_CONST(platform::XPUPlace, place).device;
platform::SetXPUDeviceId(dev_id);
FusedAllReduceSchedule(run_order, group);
Copy link
Contributor

@wangxicoding wangxicoding Mar 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

线程调度,记个TOTO,后续加上try cache。否则出异常了主线程不知道还一直在跑
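
The suggestion above is about the C++ lambda queued on comm_pool_; as a language-agnostic illustration of why it matters, a small Python sketch where the worker's exception is surfaced to the waiting thread instead of being silently lost:

from concurrent.futures import ThreadPoolExecutor

def comm_task():
    # Imagine the fused allreduce failing inside the communication thread.
    raise RuntimeError("bkcl allreduce failed")

pool = ThreadPoolExecutor(max_workers=1)
future = pool.submit(comm_task)

try:
    # Without waiting on the future (or a try/catch around the task body),
    # the failure would go unnoticed and the main thread would keep running.
    future.result()
except RuntimeError as err:
    print("communication thread failed:", err)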

Xreki (Contributor) previously approved these changes on Mar 3, 2021:
LGTM for the modification of compiler.py

wangxicoding (Contributor) previously approved these changes on Mar 3, 2021:
LGTM

@ForFishes (Member) left a comment:
LGTM

@wangxicoding wangxicoding merged commit 9ebf05b into PaddlePaddle:develop Mar 5, 2021