
simple_bind elemwise_add with group2ctx fails #7080

Open
eric-haibin-lin opened this issue Jul 17, 2017 · 6 comments

@eric-haibin-lin
Member

For bugs or installation issues, please provide the following information.
The more information you provide, the more likely people will be able to help you.

Environment info

Operating System: AWS Deep Learning AMI

Package used (Python/R/Scala/Julia): python

Or if installed from source:

MXNet commit hash (git rev-parse HEAD): 8c81ee4

If you are using python package, please provide

Python version and distribution: python 2.7

Error Message:

Please paste the full error message, including stack trace.

.[22:46:08] /home/ubuntu/upstream-gpu/dmlc-core/include/dmlc/logging.h:304: [22:46:08] src/executor/graph_executor.cc:340: Check failed: device[nid] == devid (0 vs. 1) device of same output not equal to each other

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/upstream-gpu/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f0bf44d8abc]
[bt] (1) /home/ubuntu/upstream-gpu/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13AssignContextEN4nnvm5GraphERKNS_7ContextERKSt3mapISsS3_St4lessISsESaISt4pairIKSsS3_EEERKSt6vectorIS3_SaIS3_EESK_SK_mm+0x12df) [0x7f0bf51fd25f]
[bt] (2) /home/ubuntu/upstream-gpu/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor9InitGraphEN4nnvm6SymbolERKNS_7ContextERKSt3mapISsS4_St4lessISsESaISt4pairIKSsS4_EEERKSt6vectorIS4_SaIS4_EESL_SL_RKSH_INS_9OpReqTypeESaISM_EE+0xaf) [0x7f0bf5206f4f]
[bt] (3) /home/ubuntu/upstream-gpu/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor4InitEN4nnvm6SymbolERKNS_7ContextERKSt3mapISsS4_St4lessISsESaISt4pairIKSsS4_EEERKSt6vectorIS4_SaIS4_EESL_SL_RKSt13unordered_mapISsNS2_6TShapeESt4hashISsESt8equal_toISsESaISA_ISB_SN_EEERKSM_ISsiSP_SR_SaISA_ISB_iEEERKSH_INS_9OpReqTypeESaIS12_EERKSt13unordered_setISsSP_SR_SaISsEEPSH_INS_7NDArrayESaIS1C_EES1F_S1F_PSM_ISsS1C_SP_SR_SaISA_ISB_S1C_EEEPNS_8ExecutorERKSM_INS2_9NodeEntryES1C_NS2_13NodeEntryHashENS2_14NodeEntryEqualESaISA_IKS1M_S1C_EEE+0xa0) [0x7f0bf5208c30]
[bt] (4) /home/ubuntu/upstream-gpu/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet8Executor10SimpleBindEN4nnvm6SymbolERKNS_7ContextERKSt3mapISsS3_St4lessISsESaISt4pairIKSsS3_EEERKSt6vectorIS3_SaIS3_EESK_SK_RKSt13unordered_mapISsNS1_6TShapeESt4hashISsESt8equal_toISsESaIS9_ISA_SM_EEERKSL_ISsiSO_SQ_SaIS9_ISA_iEEERKSG_INS_9OpReqTypeESaIS11_EERKSt13unordered_setISsSO_SQ_SaISsEEPSG_INS_7NDArrayESaIS1B_EES1E_S1E_PSL_ISsS1B_SO_SQ_SaIS9_ISA_S1B_EEEPS0_+0x194) [0x7f0bf5209934]
[bt] (5) /home/ubuntu/upstream-gpu/python/mxnet/../../lib/libmxnet.so(MXExecutorSimpleBind+0x2221) [0x7f0bf5198991]
[bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f0c080caadc]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7f0c080ca40c]
[bt] (8) /usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-x86_64-linux-gnu.so(_ctypes_callproc+0x21d) [0x7f0c082dc12d]
[bt] (9) /usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-x86_64-linux-gnu.so(+0xf6a3) [0x7f0c082dc6a3]

Minimum reproducible example

If you are using your own code, please provide a short script that reproduces the error.

Steps to reproduce

Or, if you are running standard examples, please provide the commands you have run that lead to the error.

  1. Run the following code (a single-context comparison follows the snippet):
    import mxnet as mx

    with mx.AttrScope(ctx_group='stage1'):
        lhs = mx.symbol.Variable('lhs')
        rhs = mx.symbol.Variable('rhs')
        plus = mx.symbol.elemwise_add(lhs, rhs, name='plus')

    set_stage1 = set(plus.list_arguments())
    with mx.AttrScope(ctx_group='stage2'):
        softmax = mx.symbol.SoftmaxOutput(data=plus, name='softmax')

    set_stage2 = set(softmax.list_arguments()) - set_stage1

    group2ctx = {
        'stage1': mx.cpu(1),
        'stage2': mx.cpu(2)
    }
    texec = softmax.simple_bind(mx.cpu(0), group2ctx=group2ctx, lhs=(1, 200), rhs=(1, 200))
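
As a point of comparison, here is a hedged sanity check (an assumption added for illustration, not part of the original report): binding the same symbol without group2ctx should succeed, since no cross-device gradient copy is needed, which points at the device grouping as the trigger.

    # Hypothetical sanity check, assuming the session above: the same graph
    # bound on a single context, without group2ctx, should bind cleanly.
    texec_single = softmax.simple_bind(mx.cpu(0), lhs=(1, 200), rhs=(1, 200))
    print(texec_single)  # expected: a valid executor, no device check failure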

What have you tried to solve it?

  1. This is due to the recent change to use CloneGradient (src/operator/elemwise_op_common.h) for the elemwise_add backward pass, which reduces the number of copies. However, this fails to copy the gradient across devices. It could probably be solved by registering the inplace_identity attribute for inplace updates. See the sketch of the aliasing problem below.
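
To illustrate the aliasing, here is a hypothetical Python sketch (MXNet's actual logic is C++ in src/operator/elemwise_op_common.h; none of these names exist in MXNet): CloneGradient hands every input the same output-gradient node, while a copy-based gradient creates one copy node per input, which a device-placement pass can then assign to different contexts.

    class Node(object):
        """Toy stand-in for an nnvm graph node (hypothetical)."""
        def __init__(self, op, inputs=()):
            self.op, self.inputs = op, list(inputs)

    def clone_gradient(ograd, num_inputs):
        # Every input gradient aliases the *same* node, so the device
        # assignment pass sees one output that must live on two devices.
        return [ograd] * num_inputs

    def copy_gradient(ograd, num_inputs):
        # One explicit copy node per input; each copy can be placed on
        # its own device, so cross-device gradients remain legal.
        return [Node('_copy', [ograd]) for _ in range(num_inputs)]

    ograd = Node('softmax_backward')
    grads = clone_gradient(ograd, 2)
    assert grads[0] is grads[1]  # aliased: the source of the device conflict
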
@ptrendx
Member

ptrendx commented Jul 19, 2017

This seems tricky :-(... I don't think we have information about device placement during graph construction, and that is when we choose the CloneGradient method. @piiswrong Any ideas on how to work around that? Is there a device-placement pass in NNVM that we could use to make a copy node (or any pass that knows about devices)?

@sergeykolychev
Contributor

@eric-haibin-lin you are right. I just verified that the perl test is failing on the master branch as well, and it is likely related to this issue. I'll remove the perl test from the master branch via another pull request until the issue is fixed. You can go ahead with merging my pull request for your sparse branch.

@eric-haibin-lin
Member Author

@sergeykolychev thanks!

@eric-haibin-lin
Member Author

@ptrendx are you referring to the place_device pass?
https://github.com/dmlc/nnvm/blob/master/src/pass/place_device.cc

@vrakesh
Contributor

vrakesh commented Nov 27, 2018

@eric-haibin-lin requesting an update on this bug: has it been resolved?

@eric-haibin-lin
Member Author

@vrakesh No. The example is already posted above and you should be able to reproduce it.
