Fix mxnet 2 nightly build #2155
Conversation
use mx.library.compiled_with_cxx11_abi in setup.py; update nightly build path
Signed-off-by: eric-haibin-lin [email protected] <[email protected]>
horovod/mxnet/__init__.py
Outdated
     for i, param in enumerate(self._params):
         if param.grad_req != 'null':
             allreduce_(param.list_grad()[0], average=False,
-                       name=param.name, priority=-i)
+                       name=str(i), priority=-i)
Could you elaborate more on `param.name` being no longer unique in MXNet 2.0? I'd hesitate to revert back to using `str(i)` as the allreduce request names, as it would reintroduce the problem described in #1679. Basically, in cases where users may have multiple `DistributedTrainer`s, this naming scheme will not differentiate between gradients being submitted by the different optimizers, instead producing multiple requests with the same name (i.e. multiple `allreduce.0`, `allreduce.1`, etc.).
I see in your changes to `broadcast_parameters` you use the dictionary key values, but I guess that isn't an option here because `self._params` is just a list of params, rather than the original dictionary.
Maybe one option would be to add an optional `name` argument for the `DistributedTrainer` that gets prepended to the names of the allreduce operations being submitted? That way, users can provide unique trainer names if they need to disambiguate allreduce operations from multiple trainers.
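Roughly what I have in mind, as a sketch only (the `prefix` keyword and its default handling are hypothetical, not existing Horovod API; the class body is abbreviated):

```python
import mxnet as mx
from horovod.mxnet.mpi_ops import allreduce_

# Hypothetical sketch: give each trainer its own namespace for allreduce requests.
class DistributedTrainer(mx.gluon.Trainer):
    def __init__(self, params, optimizer, optimizer_params=None, prefix=None):
        super(DistributedTrainer, self).__init__(
            params, optimizer, optimizer_params, kvstore=None)
        # Prefix (if given) is prepended to each allreduce request name so two
        # trainers never submit requests under the same name.
        self._prefix = (prefix + '.') if prefix else ''

    def _allreduce_grads(self):
        for i, param in enumerate(self._params):
            if param.grad_req != 'null':
                # e.g. 'net1.3' and 'net2.3' no longer collide when two
                # trainers are constructed with different prefixes.
                allreduce_(param.list_grad()[0], average=False,
                           name=self._prefix + str(i), priority=-i)
```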
Thanks for the review. As part of this PR https://github.com/apache/incubator-mxnet/pull/18619/files, the name scope class is removed to avoid the use of thread-local objects in the Python front-end. The parameters are now distinguished by their uuid instead (`idx = self._param2idx[parameter._uuid]`), and the parameter `name` saved in the `Parameter` class can be identical across parameters.
You brought up a valid point about the use case with multiple `DistributedTrainer`s. Adding a `name` parameter to the distributed trainer sounds reasonable.
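To make the uuid point concrete, roughly what the lookup looks like in MXNet 2 (a sketch; the `_uuid` attribute and the `param2idx` mapping name are assumptions on my side):

```python
import mxnet as mx

# Parameter.name can repeat across blocks in MXNet 2, so key parameters by
# their uuid and fall back to a positional index for collective operations.
net = mx.gluon.nn.Dense(10)
net.initialize()

params = list(net.collect_params().values())
param2idx = {p._uuid: i for i, p in enumerate(params)}  # uuid -> position

for p in params:
    idx = param2idx[p._uuid]
    # idx (not p.name, which may collide with another block's parameter)
    # identifies the tensor for allreduce/broadcast.
    print(idx, p.name)
```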
I'm adding a parameter -> name mapping in the constructor (`self._param2name`) in apache/mxnet#18877. Maybe we can use that, too.
I think using `name = self._param2name[p._uuid]` after apache/mxnet#18877 lands might be a cleaner option than adding a global name to the `DistributedTrainer`. It aligns more closely with the existing code and would be less disruptive to users who might be using multiple optimizers already. What do you think?
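i.e. something along these lines inside `DistributedTrainer._allreduce_grads` (sketch only, contingent on apache/mxnet#18877 adding `self._param2name` as proposed):

```python
# Hypothetical drop-in for DistributedTrainer._allreduce_grads: name the
# allreduce request after the trainer's parameter -> name mapping instead of
# a bare positional index.
def _allreduce_grads(self):
    for i, param in enumerate(self._params):
        if param.grad_req != 'null':
            name = self._param2name[param._uuid]
            allreduce_(param.list_grad()[0], average=False,
                       name=name, priority=-i)
```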
cc @leezu to advise on the choice of unique names for parameters.
The names in `self._param2name[p._uuid]` will not be globally unique if the user creates multiple Blocks with the same structure but different parameters. Actually, I'm not convinced we should add `self._param2name[p._uuid]`.
@romerojosh @eric-haibin-lin what's the uniqueness and consistency requirement for this identifier? Does it need to be consistent across different workers?
The `name` of the allreduce operation needs to be consistent across all workers participating in the communication. This name identifier is what is used to match up tensors in the Horovod backend. Concerning uniqueness, names can be reused; however, all operations in flight must have unique names (i.e. a name can only be reused if the previous Horovod operation using that name has completed).
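For illustration, with the public MXNet bindings (the tensor names here are made up):

```python
import mxnet as mx
import horovod.mxnet as hvd

hvd.init()
w_grad = mx.nd.ones((2, 2))
b_grad = mx.nd.ones((2,))

# Two in-flight allreduces must carry distinct names, and every worker must
# use the same name for the tensors that belong to the same reduction.
w_sum = hvd.allreduce(w_grad, average=False, name='trainer0.dense0.weight')
b_sum = hvd.allreduce(b_grad, average=False, name='trainer0.dense0.bias')
# Reusing 'trainer0.dense0.weight' later is fine once w_sum has completed.
```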
In fact, after checking with @leezu, the following two blocks will have the same names for their parameters in MXNet 2:
net1 = nn.Dense(10)
net2 = nn.Dense(10)
assert net1.collect_params().keys() == net2.collect_params().keys()
So for Horovod we probably need to go back to the approach where we add a `name` argument to `DistributedTrainer`.
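For example, usage could then be something like `hvd.DistributedTrainer(net1.collect_params(), 'sgd', prefix='net1')` and `prefix='net2'` for the second trainer (the `prefix` keyword is hypothetical here, matching the sketch earlier in this thread).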
@@ -116,7 +116,7 @@ RUN pip install "Pillow<7.0" --no-deps

 # Install MXNet.
 RUN if [[ ${MXNET_PACKAGE} == "mxnet-nightly" ]]; then \
-        pip install --pre mxnet-cu101mkl -f https://dist.mxnet.io/python/all; \
+        pip install --pre mxnet-cu101 -f https://dist.mxnet.io/python/all; \
The cu101 build is being discontinued, as NVIDIA only supports the latest two major and minor CUDA versions. Can the Horovod project update its dependencies?
LGTM regarding avoiding tensor name collisions. Thank you @eric-haibin-lin!
@@ -138,7 +138,7 @@ RUN pip install "Pillow<7.0" --no-deps

 # Install MXNet.
 RUN if [[ ${MXNET_PACKAGE} == "mxnet-nightly" ]]; then \
-        pip install --pre mxnet-mkl -f https://dist.mxnet.io/python/all; \
+        pip install --pre mxnet -f https://dist.mxnet.io/python/all; \
I wonder if there should be a separate pipeline for 1.x
For the missing symbol, I'm guessing the g++ version difference might be related. In short, different compiler versions can lead to different symbol name mangling behavior, which in turn causes ctypes to look for a different symbol. MXNet is currently built with g++-7 while Horovod CI is using g++-4.8 to build the shared object. Any chance of using g++-7 for building Horovod too?
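Related to this, a minimal sketch of the kind of ABI check the `mx.library.compiled_with_cxx11_abi` change in this PR enables in setup.py; the fallback handling and return-value interpretation below are my assumptions, not the PR's actual implementation:

```python
# Sketch: keep the Horovod MXNet extension's libstdc++ ABI in sync with the
# installed MXNet binary so C++ symbol mangling matches at load time.
import mxnet as mx

def mxnet_cxx11_abi_flag():
    try:
        abi = mx.library.compiled_with_cxx11_abi()
    except AttributeError:
        abi = False  # assumption: older MXNet builds without the helper
    return '-D_GLIBCXX_USE_CXX11_ABI=%d' % (1 if abi else 0)

# Usage: append to the extension's compile args alongside the other flags.
COMPILE_FLAGS = ['-std=c++11', mxnet_cxx11_abi_flag()]
```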
Merged in #2205
Checklist before submitting
Description
MXNet nightly (version 2.0) made a few changes:
- `mx.gluon.parameter.ParameterDict` is dropped in favor of a plain Python `dict`.
- `Parameter.name` is no longer globally unique. This means that we can no longer use `param.name` as the key for collective operations. Instead, this PR reverts back to using the index in the parameter list.
Review process to land