[Kunlun] Multi-XPU dygraph performance optimization, add distributed.spawn support for multi-XPU, and some bug fixes #31130
Conversation
Thanks for your contribution!
paddle/fluid/imperative/reducer.cc
Outdated
@@ -640,56 +641,80 @@ void Reducer::MarkGroupReady(size_t group_index) {
    return;
  }

  {
    std::lock_guard<std::mutex> lock(mutex_);
    multi_device_op_count_ = 0;
Put this in the Reducer constructor instead.
done.
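For reference, a minimal sketch of what moving the initialization into the constructor looks like, based on the initializer list shown later in this review (surrounding parameters and members elided):

```cpp
// Initialize the counter once in the constructor initializer list,
// instead of resetting it under a lock in MarkGroupReady().
Reducer::Reducer(/* ... existing parameters ... */)
    : /* ... existing member initializers ..., */
      comm_pool_(1),      // single worker thread for allreduce scheduling
      comm_op_count_(0) { // pending communication ops, guarded by mutex_
  // ... rest of the constructor body ...
}
```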
paddle/fluid/imperative/reducer.cc
Outdated
    multi_device_op_count_ -= 1;  // lock
    cv_.notify_all();
  }
  // cnt -= 1;  // lock
delete
done.
python/paddle/distributed/spawn.py
Outdated
    nprocs = core.get_cuda_device_count()
else:
elif device == 'xpu'
done.
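A minimal sketch of the suggested branch, assuming core.get_xpu_device_count() is available in XPU builds:

```python
# Pick the default process count from the selected device type.
if device == 'gpu':
    nprocs = core.get_cuda_device_count()
elif device == 'xpu':
    nprocs = core.get_xpu_device_count()
```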
paddle/fluid/imperative/reducer.h
Outdated
// std::vector<std::unique_ptr<::ThreadPool>> pool_;
// ::ThreadPool comm_pool_;
::ThreadPool multi_device_op_pool_;
Rename it to comm_pool as above, or come up with something better. Also add a comment: this thread is used to schedule the allreduce communication.
just comm_pool_?
paddle/fluid/imperative/reducer.h
Outdated
// std::vector<std::unique_ptr<::ThreadPool>> pool_;
// ::ThreadPool comm_pool_;
::ThreadPool multi_device_op_pool_;
uint32_t multi_device_op_count_;
Change this one too, to comm_op_count, or come up with something better.
just comm_op_count_?
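A minimal sketch of the renamed declarations in reducer.h with the comment the reviewer asked for (placement among the other members elided):

```cpp
// Thread pool with a single worker, used to schedule the fused
// allreduce communication off the compute thread.
::ThreadPool comm_pool_;
// Number of communication ops still in flight; guarded by mutex_
// and signalled through cv_.
uint32_t comm_op_count_;
```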
paddle/fluid/imperative/reducer.cc
Outdated
for (; next_group_ < groups_.size() && groups_[next_group_].pending_ == 0;
     ++next_group_) {
  auto &group = groups_[next_group_];
  int run_order = next_group_ % nrings_;

  // For CUDA or XPU, compute_stream --> comm_stream.
  // For CUDA or XPU, compute_stream --event--> comm_stream.
Was the event added as well?
No. The communication library is still blocking, so adding it would have no effect.
      cv_.notify_all();
    }
  });
#else
The GPU WaitCompute got dropped.
done, added it back.
python/paddle/distributed/spawn.py
Outdated
            (card_id, ",".join(env_devices_list)))

if core.is_compiled_with_xpu():
    args.selected_gpus = options.get('xpus', None)
Does XPU also use args.selected_gpus? The code reads a bit messily.
Done. Uniformly defined args.selected_devices for both multi-GPU and multi-XPU training, and deleted args.selected_gpus.
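A minimal sketch of the unified option handling described above (the exact control flow in spawn.py may differ):

```python
# One attribute covers card selection for both device types;
# args.selected_gpus is gone.
if core.is_compiled_with_cuda():
    args.selected_devices = options.get('gpus', None)
elif core.is_compiled_with_xpu():
    args.selected_devices = options.get('xpus', None)
```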
python/paddle/distributed/spawn.py
Outdated
raise ValueError(
    "The number of selected gpus(%s) is not equal to "
    "the number of spawn processes(%d), please ensure that the "
    "correct `nprocs` and `gpus` arguments are passed." %
gpus -> xpus?
done.
python/paddle/distributed/spawn.py
Outdated
        (len(selected_gpu_list), nprocs))
for card_id in selected_gpu_list:
    if card_id not in env_devices_list:
        raise ValueError("The selected gpu card %s cannot found in "
gpu -> xpu?
done.
paddle/fluid/imperative/reducer.cc
Outdated
}

comm_pool_.enqueue([&] {
  parallel_ctx_->WaitCompute(run_order);
What about SetXPUDevice?
done, added it.
paddle/fluid/imperative/reducer.cc
Outdated
      group_size_limits_(group_size_limits),
      find_unused_vars_(find_unused_vars),
      comm_pool_(1),
      comm_op_count_(0) {
only wrap the added code
done.
paddle/fluid/imperative/reducer.cc
Outdated
@@ -645,51 +652,84 @@ void Reducer::MarkGroupReady(size_t group_index) {
  auto &group = groups_[next_group_];
  int run_order = next_group_ % nrings_;

  auto place = parallel_ctx_->GetDeviceContext(run_order)->GetPlace();
Why do we need to fetch the place here? Isn't the place already available? Does this involve mixed GPU and XPU communication?
reducer.h already has _place, so there is no need to fetch it.
paddle/fluid/imperative/reducer.cc
Outdated
  }
}

void Reducer::FusedAllReduceSchedule(int run_order, Group group) {
Change it to a const reference, otherwise it hurts performance.
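A minimal sketch of the suggested signature change; passing Group by value copies the whole object on every call:

```cpp
// Before: the Group argument is copied on each call.
// void Reducer::FusedAllReduceSchedule(int run_order, Group group);

// After: pass by const reference to avoid the copy.
void Reducer::FusedAllReduceSchedule(int run_order, const Group &group);
```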
paddle/fluid/imperative/reducer.cc
Outdated
@@ -645,51 +652,84 @@ void Reducer::MarkGroupReady(size_t group_index) {
  auto &group = groups_[next_group_];
  int run_order = next_group_ % nrings_;

  auto place = parallel_ctx_->GetDeviceContext(run_order)->GetPlace();

  // For CUDA or XPU, compute_stream --> comm_stream.
  // For CPU, do nothing.
  // NOTE. Because concat uses the comm_stream,
  // so we expose WaitCompute() interface and call
  // it here.
  parallel_ctx_->WaitCompute(run_order);
As a follow-up, test whether performance is better with WaitCompute here or inside the allreduce.
paddle/fluid/imperative/reducer.cc
Outdated
} else {
  VLOG(3) << "The sparse group[" << next_group_
          << "] has no var to allreduce";
if (paddle::platform::is_xpu_place(place)) {
No need to check the place here; just guard the thread scheduling with the BKCL macro, and schedule everything else normally.
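A minimal sketch of the suggested dispatch, assuming Paddle's PADDLE_WITH_XPU_BKCL compile-time macro rather than a runtime place check:

```cpp
#if defined(PADDLE_WITH_XPU_BKCL)
  // BKCL builds: hand the fused allreduce to the dedicated comm thread.
  comm_pool_.enqueue([&] { FusedAllReduceSchedule(run_order, group); });
#else
  // All other builds: schedule normally on the current thread.
  FusedAllReduceSchedule(run_order, group);
#endif
```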
paddle/fluid/imperative/reducer.cc
Outdated
comm_pool_->enqueue([&] {
  auto dev_id = BOOST_GET_CONST(platform::XPUPlace, place).device;
  platform::SetXPUDeviceId(dev_id);
  FusedAllReduceSchedule(run_order, group);
For the thread scheduling, leave a TODO: add try/catch later. Otherwise, if an exception is thrown, the main thread won't know and will just keep running.
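A minimal sketch of that TODO: wrap the worker body in try/catch and keep the counter bookkeeping outside it, so the main thread waiting on cv_ is always woken even on failure (the error handling shown is hypothetical):

```cpp
comm_pool_->enqueue([&] {
  try {
    auto dev_id = BOOST_GET_CONST(platform::XPUPlace, place).device;
    platform::SetXPUDeviceId(dev_id);
    FusedAllReduceSchedule(run_order, group);
  } catch (const std::exception &e) {
    // Hypothetical handling: record the failure so the main thread can
    // rethrow, instead of silently continuing to run.
    LOG(ERROR) << "allreduce worker failed: " << e.what();
  }
  // Always decrement and notify, or the main thread blocks forever.
  std::lock_guard<std::mutex> lock(mutex_);
  comm_op_count_ -= 1;
  cv_.notify_all();
});
```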
LGTM for the modification of compiler.py
LGTM
LGTM
PR types
Performance optimization
PR changes
Others
Describe
List of the major modifications and bug fixes for Baidu Kunlun XPU:
- add new threads for multi-XPU communication in imperative/reducer
- register FLAGS_selected_xpus in pybind/global_value_getter_setter, and add an XPU interface in distributed/spawn and distributed/utils (example at the end)
- add a bkcl_comm_num interface in pybind/pybind, and remove the 'num_threads = 1' limit in fluid/compiler
- replace extend with append in fleet/launch_utils
Use environment variables: export FLAGS_selected_xpus=0,1,2,3 specifies which cards to run on.
spawn example code: the only difference in the API interface between xpus and gpus is usage 4 (see the sketch below).
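A minimal sketch of that usage, following the documented paddle.distributed.spawn pattern; the model and the selected card ids are illustrative:

```python
import paddle
import paddle.distributed as dist

def train():
    # Initializes the parallel environment (BKCL on XPU, NCCL on GPU).
    dist.init_parallel_env()

    # A toy model, wrapped for data-parallel training.
    model = paddle.DataParallel(paddle.nn.Linear(10, 1))
    opt = paddle.optimizer.SGD(learning_rate=0.01,
                               parameters=model.parameters())

    data = paddle.randn([4, 10], dtype='float32')
    loss = paddle.mean(model(data))
    loss.backward()
    opt.step()
    opt.clear_grad()

if __name__ == '__main__':
    # Usage 4: select specific cards. On GPU this is gpus='0,1';
    # with this PR, XPU training passes xpus='0,1' instead.
    dist.spawn(train, nprocs=2, xpus='0,1')
```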
model benchmarks:
Measurement: average over steps 10-110 of the first epoch.
Unit: step/s
Card type: K200
Per-card batch size = 16

Dygraph multi-process multi-card:
Measurement: average over steps 10-110 of the first epoch.
Unit: ips
Card type: K200
Per-card batch size = 32