Fix mxnet 2 nightly build #2155
Conversation
use mx.library.compiled_with_cxx11_abi in setup.py; update nightly build path
Signed-off-by: eric-haibin-lin [email protected] <[email protected]>
horovod/mxnet/__init__.py
Outdated
     for i, param in enumerate(self._params):
         if param.grad_req != 'null':
             allreduce_(param.list_grad()[0], average=False,
-                       name=param.name, priority=-i)
+                       name=str(i), priority=-i)
Could you elaborate more on `param.name` being no longer unique in MXNet 2.0? I'd hesitate to revert back to using `str(i)` as the allreduce request names, as it would reintroduce the problem described in #1679. Basically, in cases where users may have multiple `DistributedTrainer`s, this naming scheme will not differentiate between gradients being submitted by the different optimizers, instead producing multiple requests with the same name (i.e. multiple `allreduce.0`, `allreduce.1`, etc.).
I see in your changes to `broadcast_parameters` you use the dictionary key values, but I guess that isn't an option here because `self._params` is just a list of params, rather than the original dictionary.
Maybe one option would be to add an optional `name` argument for the `DistributedTrainer` that gets prepended to the names of the allreduce operations being submitted? That way, users can provide unique trainer names if they need to disambiguate allreduce operations from multiple trainers.
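Roughly what I have in mind, as a sketch only (the `prefix` keyword and its default handling are hypothetical, not existing Horovod API; the class body is abbreviated):

```python
import mxnet as mx
from horovod.mxnet.mpi_ops import allreduce_

# Hypothetical sketch: give each trainer its own namespace for allreduce requests.
class DistributedTrainer(mx.gluon.Trainer):
    def __init__(self, params, optimizer, optimizer_params=None, prefix=None):
        super(DistributedTrainer, self).__init__(
            params, optimizer, optimizer_params, kvstore=None)
        # Prefix (if given) is prepended to each allreduce request name so two
        # trainers never submit requests under the same name.
        self._prefix = (prefix + '.') if prefix else ''

    def _allreduce_grads(self):
        for i, param in enumerate(self._params):
            if param.grad_req != 'null':
                # e.g. 'net1.3' and 'net2.3' no longer collide when two
                # trainers are constructed with different prefixes.
                allreduce_(param.list_grad()[0], average=False,
                           name=self._prefix + str(i), priority=-i)
```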
Thanks for the review. As part of this PR https://github.com/apache/incubator-mxnet/pull/18619/files, the name scope class is removed to avoid the use of thread-local objects in the Python front-end. The parameters are now distinguished by their uuid instead (`idx = self._param2idx[parameter._uuid]`), and the parameter `name` saved in the `Parameter` class can be identical across parameters.
You brought up a valid point about the use case with multiple `DistributedTrainer`s. Adding a `name` parameter to the distributed trainer sounds reasonable.
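To make the uuid point concrete, roughly what the lookup looks like in MXNet 2 (a sketch; the `_uuid` attribute and the `param2idx` mapping name are assumptions on my side):

```python
import mxnet as mx

# Parameter.name can repeat across blocks in MXNet 2, so key parameters by
# their uuid and fall back to a positional index for collective operations.
net = mx.gluon.nn.Dense(10)
net.initialize()

params = list(net.collect_params().values())
param2idx = {p._uuid: i for i, p in enumerate(params)}  # uuid -> position

for p in params:
    idx = param2idx[p._uuid]
    # idx (not p.name, which may collide with another block's parameter)
    # identifies the tensor for allreduce/broadcast.
    print(idx, p.name)
```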
I'm adding a parameter -> name mapping in the constructor (`self._param2name`) in apache/mxnet#18877. Maybe we can use that, too.
I think using `name = self._param2name[p._uuid]` after apache/mxnet#18877 lands might be a cleaner option than adding a global name to the `DistributedTrainer`. It aligns more closely with the existing code and would be less disruptive to users who might be using multiple optimizers already. What do you think?
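i.e. something along these lines inside `DistributedTrainer._allreduce_grads` (sketch only, contingent on apache/mxnet#18877 adding `self._param2name` as proposed):

```python
# Hypothetical drop-in for DistributedTrainer._allreduce_grads: name the
# allreduce request after the trainer's parameter -> name mapping instead of
# a bare positional index.
def _allreduce_grads(self):
    for i, param in enumerate(self._params):
        if param.grad_req != 'null':
            name = self._param2name[param._uuid]
            allreduce_(param.list_grad()[0], average=False,
                       name=name, priority=-i)
```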
cc @leezu to advise on the choice of unique names for parameters.
The names in `self._param2name[p._uuid]` will not be globally unique if the user creates multiple Blocks with the same structure but different parameters. Actually, I'm not convinced we should add `self._param2name[p._uuid]`.
@romerojosh @eric-haibin-lin what's the uniqueness and consistency requirement for this identifier? Does it need to be consistent across different workers?
The `name` of the allreduce operation needs to be consistent across all workers participating in the communication. This name identifier is what is used to match up tensors in the Horovod backend. Concerning uniqueness, names can be reused; however, all operations in flight must have unique names (i.e. a name can only be reused if the previous Horovod operation using that name has completed).
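For illustration, with the public MXNet bindings (the tensor names here are made up):

```python
import mxnet as mx
import horovod.mxnet as hvd

hvd.init()
w_grad = mx.nd.ones((2, 2))
b_grad = mx.nd.ones((2,))

# Two in-flight allreduces must carry distinct names, and every worker must
# use the same name for the tensors that belong to the same reduction.
w_sum = hvd.allreduce(w_grad, average=False, name='trainer0.dense0.weight')
b_sum = hvd.allreduce(b_grad, average=False, name='trainer0.dense0.bias')
# Reusing 'trainer0.dense0.weight' later is fine once w_sum has completed.
```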
In fact, after checking with @leezu, the following two blocks will have the same names for their parameters in MXNet 2:
net1 = nn.Dense(10)
net2 = nn.Dense(10)
assert net1.collect_params().keys() == net2.collect_params().keys()
So for Horovod we probably need to go back to the approach where we add a `name` argument to `DistributedTrainer`.
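For example, usage could then be something like `hvd.DistributedTrainer(net1.collect_params(), 'sgd', prefix='net1')` and `prefix='net2'` for the second trainer (the `prefix` keyword is hypothetical here, matching the sketch earlier in this thread).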
@@ -116,7 +116,7 @@ RUN pip install "Pillow<7.0" --no-deps

 # Install MXNet.
 RUN if [[ ${MXNET_PACKAGE} == "mxnet-nightly" ]]; then \
-        pip install --pre mxnet-cu101mkl -f https://dist.mxnet.io/python/all; \
+        pip install --pre mxnet-cu101 -f https://dist.mxnet.io/python/all; \
The cu101 build is being discontinued, as NVIDIA only supports the latest two major and minor CUDA versions. Can the Horovod project update its dependencies?
LGTM regarding avoiding tensor name collisions. Thank you @eric-haibin-lin!
@@ -138,7 +138,7 @@ RUN pip install "Pillow<7.0" --no-deps

 # Install MXNet.
 RUN if [[ ${MXNET_PACKAGE} == "mxnet-nightly" ]]; then \
-        pip install --pre mxnet-mkl -f https://dist.mxnet.io/python/all; \
+        pip install --pre mxnet -f https://dist.mxnet.io/python/all; \
I wonder if there should be a separate pipeline for 1.x
For the missing symbol, I'm guessing the g++ version difference might be related. In short, different compiler versions can lead to different symbol name mangling behavior, which in turn causes ctypes to look for a different symbol. MXNet is currently built with g++-7 while Horovod CI is using g++-4.8 to build the shared object. Any chance of using g++-7 for building Horovod too?
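Related to this, a minimal sketch of the kind of ABI check the `mx.library.compiled_with_cxx11_abi` change in this PR enables in setup.py; the fallback handling and return-value interpretation below are my assumptions, not the PR's actual implementation:

```python
# Sketch: keep the Horovod MXNet extension's libstdc++ ABI in sync with the
# installed MXNet binary so C++ symbol mangling matches at load time.
import mxnet as mx

def mxnet_cxx11_abi_flag():
    try:
        abi = mx.library.compiled_with_cxx11_abi()
    except AttributeError:
        abi = False  # assumption: older MXNet builds without the helper
    return '-D_GLIBCXX_USE_CXX11_ABI=%d' % (1 if abi else 0)

# Usage: append to the extension's compile args alongside the other flags.
COMPILE_FLAGS = ['-std=c++11', mxnet_cxx11_abi_flag()]
```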
Merged in #2205
Checklist before submitting
Description
MXNet nightly (version 2.0) made a few changes:
- `mx.gluon.parameter.ParameterDict` is dropped in favor of a plain Python `dict`.
- `Parameter.name` is no longer globally unique. This means that we can no longer use `param.name` as the key for collective operations. Instead, this PR reverts back to using the index in the parameter list.
Review process to land