optimize trace hang && fix event leak #58707
Conversation
Some comments.
// convert vector to string, concatenate continuous intervals with `:`,
// concatenate discontinuous intervals with `#`, e.g. [1,2,3,4,5,7,8,9] =>
// 1:5#7:9
inline std::string VectorToString(const std::vector<int>& vec) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a unit test for it.
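For reference, a self-contained sketch of what that unit test could cover, assuming the rule stated in the code comment; the inline `VectorToString` here is a stand-in re-implementation, not the one from this diff:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in implementation of the rule described in the code comment:
// collapse each consecutive run to "start:end" and join runs with '#'.
inline std::string VectorToString(const std::vector<int>& vec) {
  std::string result;
  size_t i = 0;
  while (i < vec.size()) {
    size_t j = i;
    while (j + 1 < vec.size() && vec[j + 1] == vec[j] + 1) ++j;  // end of run
    if (!result.empty()) result += "#";
    result += std::to_string(vec[i]);
    if (j > i) result += ":" + std::to_string(vec[j]);
    i = j + 1;
  }
  return result;
}

int main() {
  assert(VectorToString({}) == "");            // empty input
  assert(VectorToString({5}) == "5");          // single element
  assert(VectorToString({1, 2, 3}) == "1:3");  // one continuous run
  assert(VectorToString({1, 3}) == "1#3");     // two single-element runs
  assert(VectorToString({1, 2, 3, 4, 5, 7, 8, 9}) == "1:5#7:9");
  return 0;
}
```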
",seq:" + std::to_string(seq_) + | ||
",started:" + std::to_string(IsStarted()) + | ||
",completed:" + std::to_string(IsCompleted()) + | ||
auto global_ranks = |
Add a unit test for it, and test the limit on the message length.
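A sketch of such a test; `BuildStatusMsg` is a hypothetical stand-in for the string assembled above (its parameters correspond to `seq_`, `IsStarted()` and `IsCompleted()`), and the 64-byte bound is only an illustrative limit:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper mirroring the status string built in the diff.
std::string BuildStatusMsg(int64_t seq, bool started, bool completed) {
  return ",seq:" + std::to_string(seq) +
         ",started:" + std::to_string(started) +
         ",completed:" + std::to_string(completed);
}

int main() {
  assert(BuildStatusMsg(42, true, false) == ",seq:42,started:1,completed:0");
  // Length-limit check: even the largest seq keeps the message bounded.
  assert(BuildStatusMsg(INT64_MAX, true, true).size() < 64);
  return 0;
}
```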
@@ -484,6 +484,7 @@ class ProcessGroup {
  }

 protected:
  int global_rank_;
int global_rank_{-1};
fixed
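For context, the suggestion relies on a C++11 default member initializer, which guarantees a defined "unset" value on every constructor path; a minimal illustration with a simplified stand-in type:

```cpp
#include <cassert>

struct ProcessGroupLike {  // simplified stand-in for ProcessGroup
  int global_rank_{-1};    // defined "unset" value on every constructor path
};

int main() {
  ProcessGroupLike pg;            // no assignment anywhere...
  assert(pg.global_rank_ == -1);  // ...yet the read is well-defined
  return 0;
}
```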
@@ -860,6 +862,44 @@ void ProcessGroupNCCL::CreateNCCLEnvCache(const Place& place,
  auto comm_ctx = std::make_unique<phi::GPUContext>(place);
  comm_ctx->set_nccl_comm(nccl_comm);

  // gather global ranks in current group
  int* gpu_global_rank = nullptr;
- Use a tensor instead of raw data, since the CUDA memory APIs are not efficient?
- If raw data is used, check the result of every CUDA API call.
fixed
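For the second point, the usual pattern is to wrap every runtime call in a status check; the `CUDA_CHECK` macro below is a common convention sketched for illustration (not Paddle's actual helper), applied to a buffer like the `gpu_global_rank` from this diff:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Common CUDA error-checking macro (illustrative, not Paddle's helper):
// abort with file/line context if any runtime call fails.
#define CUDA_CHECK(expr)                                          \
  do {                                                            \
    cudaError_t err_ = (expr);                                    \
    if (err_ != cudaSuccess) {                                    \
      std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                   cudaGetErrorString(err_), __FILE__, __LINE__); \
      std::abort();                                               \
    }                                                             \
  } while (0)

int main() {
  int global_rank = 3;  // placeholder value for this process's rank
  int* gpu_global_rank = nullptr;
  // Every call is checked, so a failed allocation or copy cannot
  // pass silently and corrupt the gathered ranks later.
  CUDA_CHECK(cudaMalloc(&gpu_global_rank, sizeof(int)));
  CUDA_CHECK(cudaMemcpy(gpu_global_rank, &global_rank, sizeof(int),
                        cudaMemcpyHostToDevice));
  CUDA_CHECK(cudaFree(gpu_global_rank));
  return 0;
}
```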
remove useless log
Force-pushed from d96f29b to 0d8bf53
Force-pushed from 0a49939 to 5171f79
Force-pushed from 5171f79 to a3a9e47
LGTM
* add comm async trace module (#56916)
* Fix trace hang (#57536)
  * fix trace hang
  * fix compile error
  * fix code style
  * tinyfix
  * tiny update
  * fix code style
  Co-authored-by: ForFishes <[email protected]>
* Fix nccl trace (#58338)
  * fix nccl_async_trace destruct problem when train finished
  * update
  * format code style
* optimize trace hang && fix event leak (#58707)
  * update
  * fix compile problems
  * fix code style
  * fix logging
  * fix code style
  * remove useless
  * add ut && tinyfix
  * opt cudaMalloc and cudaMemcpy update
  * tinyfix
  Co-authored-by: ForFishes <[email protected]>
PR types: Others
PR changes: Others
Description