RDMA Implementation #151
1. Simple memory management: per-QP buffers.
2. Reduce the number of control messages. Decouple sending control messages from sending data messages.
Now that we know it's called the Rendezvous Protocol...
i.e. two connections between any pair of PS and worker
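For readers unfamiliar with the term, the general idea of a rendezvous protocol can be sketched as a toy, in-process simulation. This is not the code from this PR: the `Receiver` class, `rendezvous_send`, and the starting address are hypothetical names chosen for illustration, and a plain `bytearray` stands in for registered RDMA memory. The sender first announces the message size over the control path, the receiver replies with a buffer address, and only then does the payload travel over the data path.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Toy receiver; bytearrays stand in for registered RDMA memory (assumption)."""
    buffers: dict = field(default_factory=dict)   # address -> buffer
    addr_of: dict = field(default_factory=dict)   # message id -> address
    next_addr: int = 0x1000                       # hypothetical starting address

    def on_rendezvous_start(self, msg_id: int, length: int) -> int:
        # Control path: allocate a buffer of the announced size
        # and reply with the address the sender should write to.
        addr = self.next_addr
        self.buffers[addr] = bytearray(length)
        self.addr_of[msg_id] = addr
        self.next_addr += length
        return addr

    def on_write(self, addr: int, payload: bytes) -> None:
        # Data path: stands in for a one-sided write into the advertised buffer.
        buf = self.buffers[addr]
        if len(payload) != len(buf):
            raise ValueError("payload does not match the announced length")
        buf[:] = payload


def rendezvous_send(receiver: Receiver, msg_id: int, payload: bytes) -> int:
    # Step 1: small control message announcing the payload size.
    # Step 2: the receiver's reply carries the destination address.
    addr = receiver.on_rendezvous_start(msg_id, len(payload))
    # Step 3: the bulk data goes straight into that buffer.
    receiver.on_write(addr, payload)
    return addr


receiver = Receiver()
first = rendezvous_send(receiver, msg_id=1, payload=b"gradients")
second = rendezvous_send(receiver, msg_id=2, payload=b"weights")
```

Keeping the two steps as separate methods mirrors the note above about decoupling control messages from data messages; in a real transport they would travel over different connections.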
candidate values: zmq, ibverbs, fabric
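As a rough illustration of how a launcher script might validate this setting before starting a job, here is a minimal sketch. The helper `resolve_van_type` is hypothetical and not part of ps-lite; the variable name `DMLC_PS_VAN_TYPE` comes from the comments in this thread, and falling back to `zmq` when the variable is unset is an assumption.

```python
import os

# Candidate values listed in this PR.
CANDIDATE_VAN_TYPES = ("zmq", "ibverbs", "fabric")


def resolve_van_type(env=None, default="zmq"):
    """Return the requested van type, or raise if it is not a known candidate.

    `default="zmq"` is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    value = env.get("DMLC_PS_VAN_TYPE", default).strip()
    if value not in CANDIDATE_VAN_TYPES:
        raise ValueError(
            f"DMLC_PS_VAN_TYPE={value!r} is not one of {CANDIDATE_VAN_TYPES}"
        )
    return value


# Example: check a settings dict before exporting it to worker processes.
selected = resolve_van_type({"DMLC_PS_VAN_TYPE": "ibverbs"})
```

Failing early on an unknown value makes a typo in the environment easier to spot than a transport error later in the run.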
Hi, I am testing commit 2c8ed25 with ibverbs (RDMA). However, I am facing issues when running with:
export DMLC_PS_VAN_TYPE='ibverbs'
export DMLC_INTERFACE='ib0'
The error is from the
There is no problem if I use export DMLC_PS_VAN_TYPE='zmq'. BTW, I can pass the tests in
How do I run this version of ps-lite with MXNet? Thank you.
@juncgu Would you please test it with MXNet 1.3 or 1.4? Just wanted to make sure it is not a regression issue.
Hi @changlan, I tried with MXNet 1.4.0, and got the same error:
Would you mind providing the commit info of the MXNet you used? Or any configurations inside
Thank you.
Hi @changlan, thanks for this nice PR. Are there any docs about how to enable IB communication?
As we discussed in #124, we open sourced our internal RDMA implementation (https://github.com/bytedance/ps-lite) for PS-Lite at Bytedance. It is based on the implementation in #124, but we did many optimizations so that it outperforms TCP consistently.
Here are some end-to-end results on distributed training, copied from #124:
We use Tesla V100 GPUs and set the batch size to 32. Each machine (no NVLink) has 8 GPUs, and the machines are interconnected by a 100 Gbps network that supports both TCP and RoCEv2. "TCP" here refers to the vanilla ZeroMQ implementation of ps-lite.
Note: The values are images per second.
ResNet50:
VGG16