
NCCL integration #8294

Merged 30 commits into apache:master on Nov 21, 2017

Conversation

ptrendx (Member) commented Oct 16, 2017

Description

This PR provides a new KVStore type integrating the NCCL communication library.

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • For user-facing API changes, API doc string has been updated.
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • New nccl type of kvstore, using ncclReduce and ncclBcast
  • test_nccl.py added to tests/python/gpu, but not enabled, since NCCL is not present or enabled in CI

Comments

  • Interesting edge cases to note here:
    • NCCL KVStore requires the same set of devices to be used for all communications (as is the case in typical data parallel training)
    • In the NCCL KVStore, push and pull are implemented in two steps: launching NCCL kernels in the first step and synchronizing in the second. This was done to enable seamless aggregation support; several reductions are scheduled before a synchronization (see the usage sketch below).
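
A minimal usage sketch (assuming the new 'nccl' kvstore is created with mx.kv.create('nccl') and then used exactly like the existing 'device' kvstore shown later in this thread; the key, shape, and device count are illustrative):

```python
import mxnet as mx

# Create the new NCCL-based kvstore (single-node, multi-GPU).
kv = mx.kv.create('nccl')

shape = (2, 3)
gpus = [mx.gpu(i) for i in range(2)]

# Initialize a key; the same set of devices must then be used for every
# subsequent push/pull (first edge case above).
kv.init(3, mx.nd.ones(shape))

# Push one gradient per GPU; the kvstore reduces them (ncclReduce).
grads = [mx.nd.ones(shape, ctx=gpu) for gpu in gpus]
kv.push(3, grads)

# Pull the reduced result back to every GPU (ncclBcast).
weights = [mx.nd.zeros(shape, ctx=gpu) for gpu in gpus]
kv.pull(3, out=weights)
print(weights[0].asnumpy())
```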

ptrendx mentioned this pull request Oct 16, 2017
piiswrong (Contributor): @mli @eric-haibin-lin

eric-haibin-lin self-requested a review October 17, 2017

eric-haibin-lin (Member) left a comment:

Awesome work bringing NCCL to MXNet! I haven't finished reading all the code; a few comments so far.

@@ -162,7 +162,7 @@ class KVStore {
* \param priority Priority of the action.
*/
virtual void Pull(const std::vector<int>& keys,
const std::vector<NDArray*>& values,
const std::vector<NDArray>& values,

Member:

Is it really necessary to change the interface here? Was this causing memory issues in pool_storage_manager?

Member Author:

This change is not really necessary, but I think it makes the interface more consistent (the C API is not changed, so it should not affect frontend languages); for example, it makes it simpler to reuse code between push and pull. It was originally introduced as part of the previous NCCL integration effort (which never got merged) to accommodate an allreduce-style API.

Member:

Is it still required for this PR?

Member Author:

No, and I changed it back (although this makes some of the other functions kind of ugly when you need to support both pointers and references to ndarrays).

# Use aggregation by default only with NCCL
default_batch = 16 if 'nccl' in kvstore.type else 1
batch = int(os.getenv('MXNET_UPDATE_AGGREGATION_SIZE', default_batch))
while(start < size):

Member:

nit: while start < size:

Member Author:

Will change.

size = len(grad_arrays)
start = 0
# Use aggregation by default only with NCCL
default_batch = 16 if 'nccl' in kvstore.type else 1

Member:

Where does the magic number 16 come from?

Member Author:

Performance experiments :-). The user may change that value with an environment variable, though.
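
For context, a condensed sketch of the batching loop being discussed (based on the surrounding diff; the wrapping function and its name are illustrative, the body follows the diff):

```python
import os

def push_pull_in_batches(kvstore, param_names, param_arrays, grad_arrays):
    """Push gradients and pull weights in batches so that several NCCL
    reductions can be scheduled before a synchronization."""
    size = len(grad_arrays)
    # Use aggregation by default only with NCCL.
    default_batch = 16 if 'nccl' in kvstore.type else 1
    batch = int(os.getenv('MXNET_UPDATE_AGGREGATION_SIZE', default_batch))
    start = 0
    while start < size:
        end = min(start + batch, size)
        # Push gradients; priority is the negative start index, so earlier
        # layers keep higher priority.
        kvstore.push(param_names[start:end], grad_arrays[start:end],
                     priority=-start)
        # Pull the reduced weights back for the same slice of keys.
        kvstore.pull(param_names[start:end], param_arrays[start:end],
                     priority=-start)
        start = end
```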

# pull back the weights
kvstore.pull(name, arg_list, priority=-index)
kvstore.pull(param_names[start:end], param_arrays[start:end], priority=-start)
start = end

def _update_params(param_arrays, grad_arrays, updater, num_device,
kvstore=None, param_names=None):

Member:

Is this function not updated with batch aggregation?

Member Author:

This function is not used in the GPU environment (only in the local kvstore), and this aggregation is not "real" aggregation (although it makes it possible to implement actual aggregation in the future, either explicitly by copying to a long buffer and adjusting pointers, or implicitly inside NCCL), so there is no sense in enabling it here.
What this aggregation does is basically delay the synchronization, so that multiple NCCL kernels may work at the same time and have a better chance of saturating the available links.

@@ -58,7 +76,10 @@ class Comm {
*/
virtual void Broadcast(
int key, const NDArray& src,
const std::vector<NDArray*> dst, int priority) = 0;
const std::vector<NDArray> dst, int priority) = 0;

Member:

Could you add brief comments for these two methods? Are they only for nccl? Do we want to declare them only when MXNET_USE_NCCL is set?

@@ -61,7 +61,10 @@ class KVStoreLocal : public KVStore {
}

virtual ~KVStoreLocal() {
delete comm_;

Member:

I think delete nullptr is safe

Member Author:

Ok, I will change that to just add comm_ = nullptr; after deleting. I added the check because the previous version of the destructor did not set comm_ to nullptr, and that gave me a segfault when kvstore_nccl called both its own destructor and the kvstore_local destructor (both trying to delete comm_).

const std::vector<NDArray*>& values,
std::vector<int> *uniq_keys,
std::vector<std::vector<NDArray*>> *grouped_vals) {
virtual void GroupKVPairsPull(const std::vector<int>& keys,

Member:

Why is virtual added here?

Member Author:

Because the validator may be different in inheriting classes. That is the case for the NCCL kvstore: it inherits from the local kvstore to avoid copy-pasted code, but I can't support sparse types there.

@@ -78,6 +78,16 @@ class Storage {
*/
virtual ~Storage() {}
/*!
* \brief Returns mutex used by storage manager
*/
std::mutex& GetMutex(Context::DeviceType dev) {

Member:

Could you add brief description when mutex is required?

Member Author:

Mutex is not really required outside of NCCL (and only for GPU allocations). See discussion from the previous NCCL PR: #5521 (comment)

eric-haibin-lin (Member) left a comment:

Do we have any performance numbers for using nccl in MXNet?
Also adding @rahul003 for review.

for (size_t i = 0; i < src.size(); ++i) {
NCCLEntry cur = nccl_data_[src[i].ctx().dev_id];
if (i == root_id) {
MSHADOW_TYPE_SWITCH(src[i].dtype(), DType,

Member:

nit: Please fix indentation

return ncclDouble;
case mshadow::kUint8:
return ncclChar;
case mshadow::kInt32:

Member:

Should kInt64 also be added?

Member Author:

Ok.

/**
* \brief store data in local machine using NCCL
*/
class KVStoreNCCL : public KVStoreLocal {

Member:

Does it mean multi-machine with GPUs cannot benefit from nccl?

Member Author:

Currently no; it is future work. The biggest problem is how to bootstrap NCCL in a multi-node scenario, and I do not yet understand MXNet's distributed kvstore well enough to use it for that task.

# specific language governing permissions and limitations
# under the License.

# pylint: skip-file

Member:

Could you enable pylint here?

Member Author:

Most current tests have pylint disabled; I copied that part from them. Sure, I will do that.

a = mx.nd.ones(shape, mx.gpu(0))
cur_key = str(key*max(gpus)+n_gpus)
kv_nccl.init(cur_key, a)
arr_list = [mx.nd.ones(shape, mx.gpu(x)) for x in xrange(n_gpus)]

Member:

xrange is not py3 compatible. Can you replace it with range?

Member Author:

Sure.

@@ -32,6 +35,21 @@
#include "mxnet/ndarray.h"
#include "../ndarray/ndarray_function.h"
#include "../operator/tensor/sparse_retain-inl.h"

#if MXNET_USE_NCCL

Member:

Can't we merge the two #if MXNET_USE_NCCL ?

Member Author:

The linter would complain about a system header coming after local headers (I assume those are the two #ifs you would want merged).

Member:

Ok, then we can leave them as is

dev_ids.push_back(e.ctx().dev_id);
}
std::sort(dev_ids.begin(), dev_ids.end());
CHECK(device_ids_ == dev_ids) << "NCCL KVStore supports only single set of devices";

Member:

Do you want to check here that the set of devices doesn't change during training?

Member Author:

Yes. Handling multiple sets of devices could be done, but not with the structure imposed by the Comm class. Basically, in order to keep the benefits of batching I need to ensure that the root of the reduction is the same for the whole batch, but I only know who participates during the actual push/pull, not during Init, and all of the data structures are initialized only once, during the first push. This should, by the way, also be checked in the device kvstore (and currently is not); otherwise you can do something like this:

>>> import mxnet as mx
>>> kv = mx.kv.create("device")
>>> shape = (2,3)
>>> kv.init(4, mx.nd.ones(shape))
>>> gpus = [mx.gpu(i) for i in range(2)]
>>> b = [mx.nd.ones(shape, gpu) for gpu in gpus]
>>> kv.push(4, b)
>>> a = mx.nd.zeros(shape)
>>> kv.pull(4, out = a)
>>> a
[[ 2.  2.  2.]
 [ 2.  2.  2.]]
<NDArray 2x3 @cpu(0)>
>>> gpus = [mx.gpu(i) for i in range(4)]
>>> 
>>> b = [mx.nd.ones(shape, gpu) for gpu in gpus]
>>> kv.push(4, b)
Segmentation fault

using KeyAttrs = std::tuple<int, TShape, int>;
// try to allocate buff on device evenly
void InitMergeBuffer(const std::vector<Context>& devs) {
for (size_t i = 0; i < sorted_key_attrs_.size(); ++i) {

Member:

It doesn't look like sorted_key_attrs_ is actually sorted in this case, and it doesn't look like it needs to be sorted here. If so, can you use a different variable name?

Member Author:

Ok.

auto& buf = merge_buf_[key];
Context ctx;
// use devs[0] as root
ctx = devs[0];

Member:

These three lines are strange. If you want devs[0] to always be the root, please use devs[0] directly as the argument on line 927 instead.

As I understand it, the buffers are no longer evenly allocated across all devices? If so, please remove the comment on line 917.
And why don't we want to do that anymore?

Member Author:

I will remove the comment.
We want to use devs[0] every time because it helps hide some of the latencies and keeps more inter-GPU links occupied if the flow of data is always the same.

}
},
Context::CPU(),
const_vars,

Member:

This can be more compact if const_vars is replaced with {}. Same in below function.

Member Author:

Ok.

#include "../common/cuda_utils.h"

#ifndef NCCL_MAJOR
#define NCCL_MAJOR 1

Member:

Could you add a comment explaining this?

Member Author:

Sure.

virtual void Init(int key, const NDArrayStorageType stype,
const TShape& shape, int dtype = mshadow::kFloat32) = 0;
virtual void Init(int key, const NDArrayStorageType stype, const TShape& shape,
int dtype = mshadow::kFloat32, Context pinned_ctx = Context::CPUPinned(0)) = 0;

rahul003 (Member) commented Nov 3, 2017:

Are you changing the pinned context to something other than CPU in KVStoreNCCL? Or is this change just to generalize the function?

Member Author:

That was a relic of a previous iteration of the code; I will remove it.

}
}
} else {
auto& buf = merge_buf_[key];

Member:

Could you please add some comments explaining the flow of data here?

if (dst.size() == 1) return;
std::vector<Engine::VarHandle> mutable_vars;
for (size_t i = 0; i < dst.size(); ++i) {
if ( i != root_id)

Member:

Please add braces around line 848. This style has potential for future bugs :)

Member Author:

Ok

size_t root_id = -1;
for (size_t i = 0; i < dst.size(); ++i) {
if (dst[i].ctx().dev_id == root) {
root_id = i;

Member:

Could you please wrap such tasks in small wrapper functions so the code becomes more readable?

Member Author:

I did for this snippet. Unfortunately the other small functions like this have slightly different elements so it's not as simple to wrap them and reuse. I also added some comments on what is being done.

eric-haibin-lin (Member):

Thanks for addressing all these review comments. Is anyone helping you to set up the CI test with NCCL?

ptrendx (Member Author) commented Nov 7, 2017:

Not really, no.

# push gradient, priority is negative index
kvstore.push(name, grad_list, priority=-index)
kvstore.push(param_names[start:end], grad_arrays[start:end], priority=-start)

Contributor:

What's the purpose of this? Why should it be done in the frontend?

Member Author:

This enables aggregation of the reductions (in NCCL 2.1 it is not real aggregation, since each reduction is done in its own launch and the reductions in a group merely benefit from the lack of synchronization between them, but NCCL 2.2 will introduce real aggregation support).
It needs to be done in the frontend because the kvstore itself has no information about which gradients should be aggregated, in which order, or how long it is supposed to wait before the real launch; MXNet's dependency engine does not really allow for that. That is why we need to provide data about all of the reductions in a group to benefit from aggregation.

Contributor:

Is it valid to push multiple keys in one call? If it's valid, why not push all keys in one call and decide the aggregation size in the backend?

Contributor:

Also the if grad_list[0] is None: logic is removed.

Member Author:

It is valid to push multiple keys in one call. But you still want to set the priority of those reductions in the frontend; the kvstore should not assume the order in which you push gradients to it, so aggregating everything would get rid of priorities altogether (see the sketch below).
Regarding the missing logic: good point, I forgot about it; will fix.
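
A hypothetical illustration of that trade-off (variable names mirror the surrounding diff; kvstore, param_names, grad_arrays, and batch are assumed to be defined as in that code):

```python
# Valid: push several keys in a single call. The whole group then shares
# one priority, so per-layer priorities chosen by the frontend are lost:
kvstore.push(param_names, grad_arrays, priority=0)

# Batched pushes keep the frontend in control of priorities while still
# letting the NCCL kvstore group the reductions inside each batch:
for start in range(0, len(param_names), batch):
    end = min(start + batch, len(param_names))
    kvstore.push(param_names[start:end], grad_arrays[start:end],
                 priority=-start)
```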

eric-haibin-lin (Member):

@mbaijal We'll need your help setting up a new CI test with NCCL build after this is merged.

/**
* \brief copy from src to dst[i] for every i
*/
virtual void Broadcast(
int key, const NDArray& src,
const std::vector<NDArray*> dst, int priority) = 0;

#if MXNET_USE_NCCL
// Aggregated reductions
virtual void Reduce(const std::vector<int> keys,

Contributor:

If NCCL is going to do everything differently, then it shouldn't inherit the Comm interface. Do this in KVStoreNCCL directly.

Member Author:

Ok, will do.

@@ -88,7 +92,7 @@ class GPUPooledStorageManager final : public StorageManager {
}; // class GPUPooledStorageManager

void GPUPooledStorageManager::Alloc(Storage::Handle* handle) {
std::lock_guard<std::mutex> lock(mutex_);
std::lock_guard<std::mutex> lock(Storage::Get()->GetMutex(Context::kGPU));

Contributor:

So now memory allocation on all GPUs shares the same mutex? This could slow down memory allocation, especially when using Gluon.

Member Author:

Could you suggest a Gluon benchmark to check the performance impact? If the impact is too big, we can move to cooperative launch in NCCL 2 (which would not need a shared mutex anymore), but that would mean compatibility only with NCCL 2 and CUDA 9. Also, cooperative launch is currently slower than parallel launch.

Member:

I may be looking for places to insert many-read/single-write shared mutexes to help performance. Do you think this would be a good candidate, or is there a reason that there needs to be a global lock for this operation?

Member Author:

There is a reason it needs to be a global lock: NCCL needs to finish scheduling all kernels before anybody can start allocating or deallocating GPU memory, otherwise a deadlock will happen.

Member:

For multiple GPUs, would this just apply to the two GPUs involved in the transfer? Or just one GPU? Or all of them?

Member Author:

NCCL is a collective communication library, so all GPUs are involved at the same time.

Member:

Thank you for the clarification

marcoabreu (Contributor):

@eric-haibin-lin Please create an issue and mention me so I can keep track of this case. I've been thinking about adding a p2/p3-specific job to the unit tests; this would cover features which are unsupported by our usual CI machines.

ptrendx (Member Author) commented Nov 21, 2017:

Integrated the last comments from @piiswrong and merged with the 2-bit compression PR (NCCL does not support gradient compression and prints an error message when trying to use it).

cjolivier01 (Member) commented Nov 21, 2017:

Is this ready to go in? (assuming CI passes)

ptrendx (Member Author) commented Nov 21, 2017:

@cjolivier01 As far as I am concerned, yes, it is done. @piiswrong?

piiswrong (Contributor):

Yes, we decided to deal with the aggregation policy later

@cjolivier01 cjolivier01 merged commit cace29f into apache:master Nov 21, 2017
eric-haibin-lin pushed a commit to eric-haibin-lin/mxnet that referenced this pull request Dec 3, 2017
* NCCL integration

* Skipping NCCL test (since it requires NCCL library to be present and
    enabled in build)

* Add Apache header to test_nccl.py

* Fixes from review

* Trigger CI

* Removing API change for Pull

* Fixes

* Fix

* Fix

* Fix

* Fix

* Fix

* Indentation fixes and importing unittest in test_nccl.py

* sorted_key_attrs -> key_attrs

* More fixes from review

* Fix

* Fix lint

* Support for aggregation in NCCL

* Fix typo

* Fix missing logic

* Move from CommNCCL to KVStoreNCCL

* Fix

* Moved nccl update to separate function

* Add message about not supporting gradient compression

* Fix lint

* Trigger CI
zhreshold pushed a commit to zhreshold/mxnet that referenced this pull request Dec 14, 2017
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018