support multi node in heterps #31102

Thunderbrook · 2021-02-22T04:47:59Z

PR types

New features

PR changes

Others

Describe

Support multi-node operation in heterps mode.

paddle-bot-old · 2021-02-22T04:48:10Z

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

paddle-bot-old · 2021-02-22T04:48:16Z

✅ This PR's description meets the template requirements!
Please wait for other CI results.

danleifeng · 2021-02-24T08:46:42Z

python/paddle/fluid/transpiler/collective.py

@@ -386,3 +386,27 @@ def __init__(self):
    def _transpile_startup_program(self):
        block = self.startup_program.global_block()
        block.append_op(type='c_comm_init_all', attrs={'ring_id': 0})
+
+
+class MultiThread(GradAllReduce):


MultiThread still needs to be used in minimize.

danleifeng · 2021-02-24T08:47:34Z

paddle/fluid/framework/fleet/heter_ps/heter_comm.h

@@ -111,6 +173,12 @@ class HeterComm {
  CustomGradMerger merger_;
  int topo_aware_{1};
  std::vector<std::vector<Path>> path_;
+  std::vector<LocalStorage> storage_;
+  int feanum_{1800 * 2048};
+  int multi_node_{1};


Make these configurable.
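
One possible shape for this, as a minimal sketch only: the defaults from the diff (`feanum_{1800 * 2048}`, `multi_node_{1}`) move into a small config struct that the caller passes in. `CommConfigSketch` and `HeterCommSketch` are hypothetical names used for illustration; they are not part of Paddle, and the real `HeterComm` may expose configuration differently.

```cpp
#include <cassert>

// Hypothetical config holder (not a Paddle type). Defaults mirror the
// hard-coded values in the diff under review.
struct CommConfigSketch {
  int feanum = 1800 * 2048;  // capacity used for local storage
  int multi_node = 1;        // non-zero enables the multi-node path
};

// Hypothetical stand-in for HeterComm, reduced to the two reviewed members.
class HeterCommSketch {
 public:
  explicit HeterCommSketch(const CommConfigSketch& cfg = CommConfigSketch())
      : feanum_(cfg.feanum), multi_node_(cfg.multi_node) {}

  int feanum() const { return feanum_; }
  bool is_multi_node() const { return multi_node_ != 0; }

 private:
  int feanum_;
  int multi_node_;
};
```

With this layout, a single-node job would construct the object from a config whose `multi_node` is 0 instead of editing the header.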

danleifeng · 2021-02-24T08:48:40Z

paddle/fluid/framework/fleet/heter_ps/heter_ps.cu

@@ -54,7 +54,14 @@ void HeterPs::show_one_table(int gpu_num) { comm_->show_one_table(gpu_num); }

 void HeterPs::push_sparse(int num, FeatureKey* d_keys,
                          FeaturePushValue* d_grads, size_t len) {
-  comm_->push_sparse(num, d_keys, d_grads, len, opt_);
+  // comm_->push_sparse(num, d_keys, d_grads, len, opt_);
+  comm_->push_sparse_multi_node(num, d_keys, d_grads, len, opt_);


This needs a single-node vs. multi-node check that dispatches to either push_sparse or push_sparse_multi_node.
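
The requested check could look roughly like the sketch below. It is an assumption-laden illustration, not the code that was merged: `FakeComm` and `FakeHeterPs` are hypothetical stand-ins, the argument lists are shortened, and the `multi_node` flag is assumed to be set from configuration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for HeterComm; it only records which path ran.
struct FakeComm {
  std::string last_call;
  void push_sparse(int /*num*/, std::size_t /*len*/) {
    last_call = "push_sparse";
  }
  void push_sparse_multi_node(int /*num*/, std::size_t /*len*/) {
    last_call = "push_sparse_multi_node";
  }
};

// Hypothetical stand-in for HeterPs with the single/multi-node branch.
struct FakeHeterPs {
  FakeComm comm;
  bool multi_node = false;  // assumed flag, e.g. filled from configuration

  void push_sparse(int num, std::size_t len) {
    if (multi_node) {
      comm.push_sparse_multi_node(num, len);  // cross-node path
    } else {
      comm.push_sparse(num, len);  // original single-node path
    }
  }
};
```

Keeping the branch inside `push_sparse` means callers do not have to know which mode is active.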

* push multi node
* multi node
* MultiThread
* remove log
* solve bug in 30829

* solve build gpu task core (#30626)
* build gpu task core
* format
* dump to cpu (#30750)
* dump to cpu
* format
* format
* format
* support multi node in heterps (#31102)
* push multi node
* multi node
* MultiThread
* remove log
* solve bug in 30829
* optimizer

Thunderbrook added 4 commits January 13, 2021 11:10

push multi node

875eddd

Merge remote-tracking branch 'upstream/develop' into multi_node

2f70117

multi node

8867f4c

MultiThread

8e67741

Thunderbrook added 2 commits February 22, 2021 12:49

remove log

d300825

solve conflict

6c64609

Thunderbrook changed the title ~~Multi node~~ support multi node in heterps Feb 22, 2021

solve bug in 30829

6ae4989

danleifeng reviewed Feb 24, 2021

danleifeng approved these changes Feb 24, 2021

Thunderbrook merged commit c4f279f into PaddlePaddle:develop Feb 24, 2021

Thunderbrook added a commit to Thunderbrook/Paddle that referenced this pull request Mar 1, 2021

support multi node in heterps (PaddlePaddle#31102)

040f259

* push multi node
* multi node
* MultiThread
* remove log
* solve bug in 30829

Thunderbrook mentioned this pull request Mar 1, 2021

[Cherry pick] cherry-pick #31102 #30750 #30626 #31336

Merged
