Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix mxnet 2 nightly build #2155

Closed
wants to merge 17 commits into from
Closed

Conversation

eric-haibin-lin
Copy link
Collaborator

@eric-haibin-lin eric-haibin-lin commented Aug 3, 2020

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

MXNet nightly (version 2.0) made a few changes:

  • mxnet-mkl and mxnet-cuxxmkl binaries are no longer published. mkl is turned on by default.
  • mx.gluon.parameter.ParameterDict is dropped in favor of plain python dict
  • Parameter.name is not longer unique globally unique. This means that we can no longer use param.name as the key for collective operations. Instead, this PR reverts back to using the index in the parameter list.

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

use mx.library.compiled_with_cxx11_abi in setup.py

update nightly build path

Signed-off-by: eric-haibin-lin [email protected] <[email protected]>
@eric-haibin-lin eric-haibin-lin marked this pull request as draft August 3, 2020 04:00
eric-haibin-lin and others added 3 commits August 3, 2020 05:27
Signed-off-by: eric-haibin-lin [email protected] <[email protected]>
Signed-off-by: Haibin Lin <[email protected]>
Signed-off-by: Lin <[email protected]>
Lin added 2 commits August 4, 2020 16:54
Signed-off-by: Lin <[email protected]>
Signed-off-by: Haibin Lin <[email protected]>
Lin and others added 4 commits August 5, 2020 16:41
Signed-off-by: Haibin Lin <[email protected]>
Signed-off-by: Haibin Lin <[email protected]>
@eric-haibin-lin eric-haibin-lin marked this pull request as ready for review August 7, 2020 06:49
for i, param in enumerate(self._params):
if param.grad_req != 'null':
allreduce_(param.list_grad()[0], average=False,
name=param.name, priority=-i)
name=str(i), priority=-i)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate more on the param.name being no longer unique in MXNet 2.0? I'd hesitate to revert back to using str(i) as the allreduce request names, as it would reintroduce the problem described in #1679. Basically, in cases where users may have multiple DistributedTrainers, this naming scheme will not differentiate between gradients being submitted by the different optimizers, instead producing multiple requests with the same name (i.e. multiple allreduce.0, allreduce.1, etc.).

I see in your changes to broadcast_parameters you use the dictionary key values, but I guess that isn't an option here because self._params is just a list of params, rather than the original dictionary.

Maybe one option would be to add an optional name argument for the DistributedTrainer that gets prepended to the names of the allreduce operations being submitted? That way, users can provide unique trainer names if they need to disambiguate allreduce operations from multiple trainers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. As part of this PR https://github.com/apache/incubator-mxnet/pull/18619/files, the name scope class is removed to avoid the usage of thread local object in the python front-end. The parameters are now distinguished by their uuid instead (idx = self._param2idx[parameter._uuid]) and the parameter name saved in the Parameter class can be identical.

You brought up a valid point on the use case where multiple DistributedTrainers. Adding a name parameter to distributed trainer sounds reasonable.

Copy link
Collaborator Author

@eric-haibin-lin eric-haibin-lin Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm adding a parameter -> name mapping in the constructor (self._param2name) in apache/mxnet#18877. Maybe we can use that, too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using name = self._param2name[p._uuid] after apache/mxnet#18877 lands might be a cleaner option than adding a global name to the DistributedTrainer. Aligns more closely to the existing code and would be less disruptive to users who might be using multiple optimizers already. What do you think?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @leezu to suggest on the choice of unique names for parameters.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The names in self._param2name[p._uuid] will not be globally unique if the user creates multiple Blocks with the same structure but different parameters. Actually I'm not convinced we should add self._param2name[p._uuid]

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romerojosh @eric-haibin-lin what's the uniqueness and consistency requirement for this identifier? Does it need to be consistent across different workers?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of the allreduce operation needs to be consistent across all workers participating in the communication. This name identifier is what is used to match up tensors in the Horovod backend. Concerning uniqueness, names can be reused; however, all operations in flight must have unique names (i.e. a name can only be reused if the previous Horovod operation using that name has completed).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact after checking with @leezu, the following two blocks will have the same name for parameters in mxnet2:

net1 = nn.Dense(10)
net2 = nn.Dense(10)

assert net1.collect_params().keys() == net2.collect_params().keys()

So for horovod we probably need to go back to the approach where we add a name argument to DistributedTrainer.

@@ -116,7 +116,7 @@ RUN pip install "Pillow<7.0" --no-deps

# Install MXNet.
RUN if [[ ${MXNET_PACKAGE} == "mxnet-nightly" ]]; then \
pip install --pre mxnet-cu101mkl -f https://dist.mxnet.io/python/all; \
pip install --pre mxnet-cu101 -f https://dist.mxnet.io/python/all; \
Copy link

@szha szha Aug 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cu101 build is being discontinued as NVIDIA only supports the latest two major versions and minor versions. can the horovod project update its dependencies?

@eric-haibin-lin eric-haibin-lin force-pushed the mx2-pr branch 2 times, most recently from b6be89f to 9ec553f Compare August 12, 2020 16:06
Copy link
Collaborator

@leezu leezu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM regarding avoiding tensor name collisions. Thank you @eric-haibin-lin!

Signed-off-by: eric-haibin-lin [email protected] <[email protected]>
Signed-off-by: Lin <[email protected]>
Signed-off-by: Haibin Lin <[email protected]>
Haibin Lin added 3 commits August 17, 2020 20:59
Signed-off-by: Haibin Lin <[email protected]>
Signed-off-by: Haibin Lin <[email protected]>
Signed-off-by: Haibin Lin <[email protected]>
Signed-off-by: Lin <[email protected]>
@@ -138,7 +138,7 @@ RUN pip install "Pillow<7.0" --no-deps

# Install MXNet.
RUN if [[ ${MXNET_PACKAGE} == "mxnet-nightly" ]]; then \
pip install --pre mxnet-mkl -f https://dist.mxnet.io/python/all; \
pip install --pre mxnet -f https://dist.mxnet.io/python/all; \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there should be a separate pipeline for 1.x

@szha
Copy link

szha commented Aug 23, 2020

For the missing symbol, I'm guessing g++ version difference might be related. In short, the different versions of compliers used can lead to different symbol name mangling behavior, and in turn causes ctypes to look for a different symbol. mxnet is currently using g++-7 while horovod CI is using g++-4.8 for building the shared object. Any chance of using g++-7 for building horovod too?

@leezu leezu mentioned this pull request Aug 26, 2020
@eric-haibin-lin
Copy link
Collaborator Author

Merged in #2205

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants