
simple_bind elemwise_add with group2ctx fails #7080

Open
eric-haibin-lin opened this issue Jul 17, 2017 · 6 comments

@eric-haibin-lin
Member

For bugs or installation issues, please provide the following information.
The more information you provide, the more likely people will be able to help you.

Environment info

Operating System: AWS Deep Learning AMI

Package used (Python/R/Scala/Julia): python

Or if installed from source:

MXNet commit hash (git rev-parse HEAD): 8c81ee4

If you are using python package, please provide

Python version and distribution: python 2.7

Error Message:

Please paste the full error message, including stack trace.

.[22:46:08] /home/ubuntu/upstream-gpu/dmlc-core/include/dmlc/logging.h:304: [22:46:08] src/executor/graph_executor.cc:340: Check failed: device[nid] == devid (0 vs. 1) device of same output not equal to each other

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/upstream-gpu/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f0bf44d8abc]
[bt] (1) /home/ubuntu/upstream-gpu/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13AssignContextEN4nnvm5GraphERKNS_7ContextERKSt3mapISsS3_St4lessISsESaISt4pairIKSsS3_EEERKSt6vectorIS3_SaIS3_EESK_SK_mm+0x12df) [0x7f0bf51fd25f]
[bt] (2) /home/ubuntu/upstream-gpu/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor9InitGraphEN4nnvm6SymbolERKNS_7ContextERKSt3mapISsS4_St4lessISsESaISt4pairIKSsS4_EEERKSt6vectorIS4_SaIS4_EESL_SL_RKSH_INS_9OpReqTypeESaISM_EE+0xaf) [0x7f0bf5206f4f]
[bt] (3) /home/ubuntu/upstream-gpu/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor4InitEN4nnvm6SymbolERKNS_7ContextERKSt3mapISsS4_St4lessISsESaISt4pairIKSsS4_EEERKSt6vectorIS4_SaIS4_EESL_SL_RKSt13unordered_mapISsNS2_6TShapeESt4hashISsESt8equal_toISsESaISA_ISB_SN_EEERKSM_ISsiSP_SR_SaISA_ISB_iEEERKSH_INS_9OpReqTypeESaIS12_EERKSt13unordered_setISsSP_SR_SaISsEEPSH_INS_7NDArrayESaIS1C_EES1F_S1F_PSM_ISsS1C_SP_SR_SaISA_ISB_S1C_EEEPNS_8ExecutorERKSM_INS2_9NodeEntryES1C_NS2_13NodeEntryHashENS2_14NodeEntryEqualESaISA_IKS1M_S1C_EEE+0xa0) [0x7f0bf5208c30]
[bt] (4) /home/ubuntu/upstream-gpu/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet8Executor10SimpleBindEN4nnvm6SymbolERKNS_7ContextERKSt3mapISsS3_St4lessISsESaISt4pairIKSsS3_EEERKSt6vectorIS3_SaIS3_EESK_SK_RKSt13unordered_mapISsNS1_6TShapeESt4hashISsESt8equal_toISsESaIS9_ISA_SM_EEERKSL_ISsiSO_SQ_SaIS9_ISA_iEEERKSG_INS_9OpReqTypeESaIS11_EERKSt13unordered_setISsSO_SQ_SaISsEEPSG_INS_7NDArrayESaIS1B_EES1E_S1E_PSL_ISsS1B_SO_SQ_SaIS9_ISA_S1B_EEEPS0_+0x194) [0x7f0bf5209934]
[bt] (5) /home/ubuntu/upstream-gpu/python/mxnet/../../lib/libmxnet.so(MXExecutorSimpleBind+0x2221) [0x7f0bf5198991]
[bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f0c080caadc]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7f0c080ca40c]
[bt] (8) /usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-x86_64-linux-gnu.so(_ctypes_callproc+0x21d) [0x7f0c082dc12d]
[bt] (9) /usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-x86_64-linux-gnu.so(+0xf6a3) [0x7f0c082dc6a3]

Minimum reproducible example

If you are using your own code, please provide a short script that reproduces the error.

Steps to reproduce

Or, if you are running standard examples, please provide the commands you have run that lead to the error.

  1. Run the following code (a single-context comparison follows the snippet):
    import mxnet as mx

    with mx.AttrScope(ctx_group='stage1'):
        lhs = mx.symbol.Variable('lhs')
        rhs = mx.symbol.Variable('rhs')
        plus = mx.symbol.elemwise_add(lhs, rhs, name='plus')

    set_stage1 = set(plus.list_arguments())
    with mx.AttrScope(ctx_group='stage2'):
        softmax = mx.symbol.SoftmaxOutput(data=plus, name='softmax')

    set_stage2 = set(softmax.list_arguments()) - set_stage1

    group2ctx = {
        'stage1': mx.cpu(1),
        'stage2': mx.cpu(2)
    }
    texec = softmax.simple_bind(mx.cpu(0), group2ctx=group2ctx, lhs=(1, 200), rhs=(1, 200))
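
As a point of comparison, here is a hedged sanity check (an assumption added for illustration, not part of the original report): binding the same symbol without group2ctx should succeed, since no cross-device gradient copy is needed, which points at the device grouping as the trigger.

    # Hypothetical sanity check, assuming the session above: the same graph
    # bound on a single context, without group2ctx, should bind cleanly.
    texec_single = softmax.simple_bind(mx.cpu(0), lhs=(1, 200), rhs=(1, 200))
    print(texec_single)  # expected: a valid executor, no device check failure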

What have you tried to solve it?

  1. This is due to the recent change to use CloneGradient (src/operator/elemwise_op_common.h) for the elemwise_add backward pass, which reduces the number of copies. However, this fails to copy the gradient across devices. It could probably be solved by registering the inplace_identity attribute for inplace updates. See the sketch of the aliasing problem below.
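
To illustrate the aliasing, here is a hypothetical Python sketch (MXNet's actual logic is C++ in src/operator/elemwise_op_common.h; none of these names exist in MXNet): CloneGradient hands every input the same output-gradient node, while a copy-based gradient creates one copy node per input, which a device-placement pass can then assign to different contexts.

    class Node(object):
        """Toy stand-in for an nnvm graph node (hypothetical)."""
        def __init__(self, op, inputs=()):
            self.op, self.inputs = op, list(inputs)

    def clone_gradient(ograd, num_inputs):
        # Every input gradient aliases the *same* node, so the device
        # assignment pass sees one output that must live on two devices.
        return [ograd] * num_inputs

    def copy_gradient(ograd, num_inputs):
        # One explicit copy node per input; each copy can be placed on
        # its own device, so cross-device gradients remain legal.
        return [Node('_copy', [ograd]) for _ in range(num_inputs)]

    ograd = Node('softmax_backward')
    grads = clone_gradient(ograd, 2)
    assert grads[0] is grads[1]  # aliased: the source of the device conflict
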
@ptrendx
Member

ptrendx commented Jul 19, 2017

This seems tricky :-(... I don't think we have information about device placement during graph construction, and that is when we choose the CloneGradient method. @piiswrong Any ideas on how to work around that? Is there a device-placement pass in NNVM that we could use to make a copy node (or any pass that knows about devices)?

@sergeykolychev
Contributor

@eric-haibin-lin you are right. I just verified that the perl test is failing on the master branch as well, and it is likely related to this issue. I'll remove the perl test from the master branch via another pull request until the issue is fixed. You can go ahead with merging my pull request for your sparse branch.

@eric-haibin-lin
Member Author

@sergeykolychev thanks!

@eric-haibin-lin
Member Author

@ptrendx are you referring to the place_device pass?
https://github.com/dmlc/nnvm/blob/master/src/pass/place_device.cc

@vrakesh
Contributor

vrakesh commented Nov 27, 2018

@eric-haibin-lin requesting an update on this bug: has it been resolved?

@eric-haibin-lin
Member Author

@vrakesh No. The example is already posted above and you should be able to reproduce it.
