RDMA Implementation #151

Merged 42 commits into dmlc:master on Aug 18, 2019

Contributor

changlan commented Jul 16, 2019

As discussed in #124, we have open-sourced our internal RDMA implementation for PS-Lite at Bytedance (https://github.com/bytedance/ps-lite). It is based on the implementation in #124, but we made many optimizations so that it consistently outperforms TCP.

Here are some end-to-end results on distributed training, copied from #124:

We use Tesla V100 GPUs and set the batch size to 32. Each machine (no NVLink) has 8 GPUs, and the machines are interconnected by a 100 Gbps network that supports both TCP and RoCEv2. "TCP" here refers to the vanilla ZeroMQ implementation of ps-lite.

Note: The values are images per second.

ResNet50:

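| #GPU | TCP (images/s) | RoCEv2 (images/s) |
|------|----------------|-------------------|
| 8    | 1008           | 2019              |
| 16   | 2279           | 4037              |
| 32   | 5048           | 7798              |
| 64   | 6954           | 14780             |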

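VGG16:

| #GPU | TCP (images/s) | RDMA (images/s) |
|------|----------------|-----------------|
| 8    | 163            | 303             |
| 16   | 361            | 692             |
| 32   | 694            | 1393            |
| 64   | 1370           | 2777            |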
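
To make the comparison easier to read, here is a small sketch that computes the RDMA-over-TCP speedup for each row of the two tables above. The throughput numbers are copied from the tables; the names `RESULTS` and `rdma_speedup` are made up for this illustration and are not part of ps-lite.

```python
# Throughput in images per second, copied from the tables above.
# Each entry maps #GPU -> (TCP, RDMA/RoCEv2).
RESULTS = {
    "ResNet50": {8: (1008, 2019), 16: (2279, 4037), 32: (5048, 7798), 64: (6954, 14780)},
    "VGG16": {8: (163, 303), 16: (361, 692), 32: (694, 1393), 64: (1370, 2777)},
}


def rdma_speedup(results):
    """Return {model: {gpus: rdma_throughput / tcp_throughput}}."""
    return {
        model: {gpus: rdma / tcp for gpus, (tcp, rdma) in rows.items()}
        for model, rows in results.items()
    }


speedups = rdma_speedup(RESULTS)
for model, rows in speedups.items():
    for gpus, ratio in rows.items():
        print(f"{model:9s} {gpus:3d} GPUs: {ratio:.2f}x")
```

On these numbers the ratio stays between roughly 1.5x and 2.1x across all eight configurations.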

eric-haibin-lin merged commit 2c8ed25 into dmlc:master Aug 18, 2019
byronyi mentioned this pull request Aug 29, 2019

juncgu commented Sep 17, 2019

Hi,

I am testing commit 2c8ed25 with ibverbs (RDMA), but I run into an error when running mxnet/example/image-classification/train_cifar10.py in MXNet (v1.5.0) with

export DMLC_PS_VAN_TYPE='ibverbs'
export DMLC_INTERFACE='ib0'

The error is from the server:

[22:35:10] src/van.cc:307: Bind to role=server, ip=10.255.11.106, port=40243, is_recovery=0
[22:35:10] src/./ibverbs_van.h:568: Connecting to S
[22:35:18] src/./ibverbs_van.h:568: Connecting to S[8]
[22:35:18] src/./ibverbs_van.h:568: Connecting to S[8]
[22:35:18] src/./ibverbs_van.h:568: Connecting to S[8]
[22:35:18] src/van.cc:254: S[8] is connected to others
mlx5: node2: cqn=0x1d, got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000005 00000000 00000000 00000000
00000000 9d005304 09015c74 0c9f25d3
terminate called after throwing an instance of 'dmlc::Error'
  what():  [22:35:24] src/./ibverbs_van.h:814: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status
local protection error 4 70363375509248 83
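
For readers decoding the last line of this log, here is a minimal sketch that splits it into fields and maps the numeric status to its libibverbs name. The field order (status text, status code, wr_id, vendor_err) is an assumption inferred from the log output, not confirmed against the ps-lite source; the helper `parse_failed_status` is hypothetical. The table covers only the first few values of `enum ibv_wc_status`.

```python
import re

# First few values of libibverbs' enum ibv_wc_status (0 means success).
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
}

# Assumed layout: "<status text> <status> <wr_id> <vendor_err>".
LINE = re.compile(r"^(?P<text>.+?) (?P<status>\d+) (?P<wr_id>\d+) (?P<vendor_err>\d+)$")


def parse_failed_status(line):
    """Split a 'Failed status' detail line into named fields."""
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized status line: {line!r}")
    status = int(m["status"])
    return {
        "text": m["text"],
        "status": status,
        "name": IBV_WC_STATUS.get(status, "unknown"),
        "wr_id": int(m["wr_id"]),
        "vendor_err": int(m["vendor_err"]),
    }


info = parse_failed_status("local protection error 4 70363375509248 83")
print(info["name"], hex(info["vendor_err"]))
```

Status 4 is `IBV_WC_LOC_PROT_ERR`, which libibverbs reports when a locally posted buffer does not fall within a memory region valid for the requested operation.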

There is no problem if I use

export DMLC_PS_VAN_TYPE='zmq'

BTW, I can pass the tests in src/tests when DMLC_PS_VAN_TYPE='ibverbs'.

How should I run this version of ps-lite with MXNet?
@changlan

Thank you

@changlan
Contributor Author

@juncgu Would you please test it with MXNet 1.3 or 1.4? Just wanted to make sure it is not a regression issue.


juncgu commented Sep 17, 2019

> @juncgu Would you please test it with MXNet 1.3 or 1.4? Just wanted to make sure it is not a regression issue.

Hi @changlan,

I tried with MXNet 1.4.0, and got the same error:

[00:53:42] src/van.cc:307: Bind to role=server, ip=10.255.11.106, port=45062, is_recovery=0
[00:53:42] src/./ibverbs_van.h:568: Connecting to S
[00:53:44] src/./ibverbs_van.h:568: Connecting to S[8]
[00:53:44] src/./ibverbs_van.h:568: Connecting to S[8]
[00:53:44] src/./ibverbs_van.h:568: Connecting to S[8]
[00:53:44] src/van.cc:254: S[8] is connected to others
mlx5: node2: cqn=0x1d, got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000005 00000000 00000000 00000000
00000000 9d005304 09015c89 0768a5d2
terminate called after throwing an instance of 'dmlc::Error'
  what():  [00:53:49] src/./ibverbs_van.h:814: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status
local protection error 4 70363979491568 83

Would you mind providing the commit of the MXNet version you used? Or are there any configurations inside ibverbs_van that need to be tuned?

Thank you.


zrss commented Oct 8, 2019

Hi @changlan, thanks for this nice PR. Are there any docs on how to enable IB communication?

@bobzhuyb

@zrss set DMLC_PS_VAN_TYPE=ibverbs

See

std::string van_type = GetEnv("DMLC_PS_VAN_TYPE", "zmq");
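
The line above reads the van type from the environment and falls back to `zmq` when the variable is unset. A minimal Python sketch of the same lookup, useful for checking from a launcher script which transport a process would pick up, might look like this. The helper name `get_van_type` is hypothetical and only mirrors the default shown in the C++ line; it is not part of ps-lite or MXNet.

```python
import os


def get_van_type(environ=None):
    """Return the transport name ps-lite would read, defaulting to 'zmq'."""
    # Use the real process environment unless a mapping is passed in.
    environ = os.environ if environ is None else environ
    return environ.get("DMLC_PS_VAN_TYPE", "zmq")


# With the variable unset, the TCP (ZeroMQ) van is selected.
print(get_van_type({}))
# With the variable set as suggested in this thread, the RDMA van is selected.
print(get_van_type({"DMLC_PS_VAN_TYPE": "ibverbs"}))
```

The variable has to be set in the environment of every scheduler, server, and worker process before it starts, since the value is read by each process on its own.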

9 participants