
Re-enable the test_gpu_memory_profiler_gluon test case #18704

Merged

2 commits merged into apache:master on Sep 8, 2020

Conversation

ArmageddonKnight
Contributor

Description

This PR is supposed to fix the issue described in #18564

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Removed the setting of continuous_dump when setting the profiler config.

Comments

This PR fixes the issue described in #18564. Specifically, the error happens because the profiler config has the continuous_dump flag set, which propagates down to the core function Profiler::SetContinuousProfileDump (shown below) and triggers the error on line 268.

https://github.com/apache/incubator-mxnet/blob/9d623926d4857a2cfa32515b58cd1398371f97f3/src/profiler/profiler.cc#L258-L268
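
For context, here is a minimal sketch of the kind of profiler setup this change implies. The API calls are the standard mx.profiler ones, but the filename and exact arguments are illustrative and not copied from test_profiler.py:

import mxnet as mx

# Before the fix (illustrative): continuous_dump=True propagates to
# Profiler::SetContinuousProfileDump in the backend and triggers the error above.
# mx.profiler.set_config(profile_all=True, aggregate_stats=True,
#                        continuous_dump=True, filename='profile_output.json')

# After the fix (illustrative): omit continuous_dump and dump explicitly instead.
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='profile_output.json')
mx.profiler.set_state('run')
# ... run the Gluon workload under test ...
mx.profiler.set_state('stop')
mx.profiler.dump()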

FYI, @szha

@mxnet-bot

Hey @ArmageddonKnight, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-gpu, windows-cpu, unix-cpu, sanity, centos-gpu, centos-cpu, website, edge, miscellaneous, clang, unix-gpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@szha
Member

szha commented Jul 15, 2020

FYI I merged #18701

@ArmageddonKnight
Contributor Author

@szha I have updated the changes accordingly.

@ArmageddonKnight
Contributor Author

ArmageddonKnight commented Jul 15, 2020

@mxnet-bot run ci [centos-cpu, centos-gpu, unix-cpu, unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [centos-gpu, unix-cpu, unix-gpu, centos-cpu]

@Zha0q1
Contributor

Zha0q1 commented Jul 15, 2020

Any ideas on what might have caused the error in Profiler::SetContinuousProfileDump?

This same error seems to also cause issues in tests/python/unittest/test_profiler.py, as seen in http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18701/1/pipeline/378/

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu, centos-cpu, unix-gpu, centos-gpu]

@szha
Member

szha commented Aug 23, 2020

Looks like the tagging for the parameter initialization allocations is missing. @leezu it looks like this part of the tagging was lost as a result of the Gluon refactor. Could you help suggest a way of naming these allocations in the new Gluon implementation? https://github.com/apache/incubator-mxnet/pull/17656/files#diff-29da832c2145752f3906a2f71c7b63ba

@leezu
Contributor

leezu commented Aug 24, 2020

@szha the GPU memory profiler tests have never been run; they were disabled until #18701 because they were declared in the wrong file. I'm not sure what the following lines refer to, or if and why they were recorded differently before. As per the comment in the test file, the test was designed to capture 26 lines, but we now capture 33 lines. In particular, the following lines were not recorded before and appear to be the problematic ones. I don't think that recording more events in the memory profiler is related to the name-scope refactor, as the profiler is enabled and disabled separately from the name-scope.

[2020-07-15T20:15:02.236Z] <unk>:_random_uniform,67108864,0,67108864,0

[2020-07-15T20:15:02.236Z] <unk>:_random_uniform,67108864,0,67108864,0

[2020-07-15T20:15:02.236Z] <unk>:_zeros,67108864,0,67108864,0

[2020-07-15T20:15:02.236Z] <unk>:_zeros,67108864,0,67108864,0

[2020-07-15T20:15:02.236Z] <unk>:in_arg:data,640,0,640,0

[2020-07-15T20:15:02.236Z] <unk>:unknown,67108864,0,67108864,0

[2020-07-15T20:15:02.236Z] <unk>:unknown,67108864,0,67108864,0

Also note that the profiler scope implementation has some fundamental threading issues. The scope in the backend is not thread-local, but the frontend "claims" thread safety by using a thread-local variable on the frontend side. The same problem applies to single-threaded but asynchronous code.

@ArmageddonKnight the profiler scope is automatically set in Block and Optimizer. Would you be able to debug the issue and fix it in this PR?

https://github.com/apache/incubator-mxnet/blob/0de7484884292eb028342b1e5669233792429af0/python/mxnet/gluon/block.py#L951-L952

https://github.com/apache/incubator-mxnet/blob/0de7484884292eb028342b1e5669233792429af0/python/mxnet/optimizer/updater.py#L58-L59

Parameter initialization happens separately from Block, so if you want to set the profiler scope there, you can add with _profiler.scope(self._uuid + ':'): in _finish_deferred_init inside parameter.py.
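
For concreteness, a minimal sketch of what wrapping an allocation in such a profiler scope looks like. The _profiler.scope name and the uuid-style prefix are taken from the suggestion above; the set_config arguments and the shapes are illustrative only, and in the actual fix the scope would wrap the initialization body of Parameter._finish_deferred_init rather than a standalone allocation:

import mxnet as mx
from mxnet import profiler as _profiler

# Enable the profiler so that scoped allocations get recorded (arguments illustrative).
mx.profiler.set_config(profile_all=True, aggregate_stats=True)
mx.profiler.set_state('run')

# Allocations made inside the scope are attributed to the given prefix instead of
# showing up as <unk> in the memory profile; the prefix mirrors self._uuid + ':'.
with _profiler.scope('param_277c1286:'):
    weight = mx.nd.random.uniform(shape=(128, 16))

mx.profiler.set_state('stop')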

@szha
Member

szha commented Aug 24, 2020

you can add with _profiler.scope(self._uuid + ':'): in _finish_deferred_init inside parameter.py.

The problem with this is that the result will not be meaningful. The memory profiler output should reflect which block a parameter belongs to in order to properly capture the memory usage of a block. We will need some way of mapping the initialization back to the block and not rely on uuid.

@leezu
Contributor

leezu commented Aug 24, 2020

A parameter may belong to multiple blocks or to no Block at all, so there is no 1:1 mapping. I think this is a separate issue, and we may first need to fix the memory profiler. The first question here is why the _random_uniform calls are recorded now but were not recorded when the test file was written (i.e., 26 vs. 33 recorded calls).

@ArmageddonKnight
Contributor Author

@szha I am looking into this issue:

  • The unknown allocation entries actually come from the symbolic execution backend, rather than Gluon.
  • The parameters in the Gluon backend have weird tags, similar to the following:
"hybridsequential0:arg_grad:param_277c1286_334d_46dc_bbb2_0850c497c7d0_weight","8192","0","8192","0"
"hybridsequential0:arg_grad:param_9c9c45b2_2e7f_4af9_8b60_b80ef0f2dd71_weight","5120","0","8192","0"

@szha
Member

szha commented Aug 24, 2020

The unknown allocation entries actually come from the symbolic execution backend, rather than Gluon

On the master branch, we no longer have a symbolic execution backend, as the graph executor was removed.

The parameters of in the Gluon backend have weird tags

That's the problem I mentioned in the above comment about the usage of uuid in naming.

@ArmageddonKnight
Contributor Author

@szha FYI, I have cleaned up the debugging traces and applied the fixes that we have discussed previously. The sample GPU memory profile for Gluon now looks like the following:

"Attribute Name","Requested Size","Device","Actual Size","Reuse?"
<unk>:in_arg:data,640,0,4096,0
hybridsequential:activation0:hybridsequential_activation0_fwd,2048,0,4096,0
hybridsequential:activation0:hybridsequential_activation0_fwd_backward,8192,0,8192,0
hybridsequential:activation0:hybridsequential_activation0_fwd_head_grad,2048,0,4096,0
hybridsequential:dense0:activation0:hybridsequential_dense0_activation0_fwd,8192,0,8192,0
hybridsequential:dense0:arg_grad:bias,512,0,4096,0
hybridsequential:dense0:arg_grad:weight,5120,0,8192,0
hybridsequential:dense0:hybridsequential_dense0_fwd,8192,0,8192,0
hybridsequential:dense0:in_arg:bias,512,0,4096,0
hybridsequential:dense0:in_arg:weight,5120,0,8192,0
hybridsequential:dense1:activation0:hybridsequential_dense1_activation0_fwd,4096,0,4096,0
hybridsequential:dense1:arg_grad:bias,256,0,4096,0
hybridsequential:dense1:arg_grad:weight,32768,0,32768,0
hybridsequential:dense1:hybridsequential_dense1_fwd,4096,0,4096,0
hybridsequential:dense1:in_arg:bias,256,0,4096,0
hybridsequential:dense1:in_arg:weight,32768,0,32768,0
hybridsequential:dense2:arg_grad:bias,128,0,4096,0
hybridsequential:dense2:arg_grad:weight,8192,0,8192,0
hybridsequential:dense2:hybridsequential_dense2_fwd_backward,4096,0,4096,1
hybridsequential:dense2:in_arg:bias,128,0,4096,0
hybridsequential:dense2:in_arg:weight,8192,0,8192,0
hybridsequential:dropout0:hybridsequential_dropout0_fwd,8192,0,8192,0
hybridsequential:dropout0:hybridsequential_dropout0_fwd,8192,0,8192,0
resource:cudnn_dropout_state (dropout-inl.h +256),1474560,0,1474560,0
resource:temp_space (fully_connected-inl.h +316),15360,0,16384,0

Please let me know if you have any questions or concerns. Thank you.
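
As a side note, tag coverage in a dump like the one above can be checked mechanically; the following is an illustrative sketch of such a check (not the actual test code), assuming the CSV header shown above:

import csv

def find_untagged_rows(csv_path):
    # Scan a memory-profile dump in the format shown above and return the
    # attribute names that were not attributed to any Gluon profiler scope.
    untagged = []
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            attr = row['Attribute Name']
            # '<unk>:in_arg:data' is expected, since the input array is created
            # outside of any block; any other '<unk>' entry indicates a missing scope.
            if attr.startswith('<unk>') and attr != '<unk>:in_arg:data':
                untagged.append(attr)
    return untagged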

@szha
Member

szha commented Sep 7, 2020

@ArmageddonKnight thanks for the update! The change looks good to me. There are a couple of tests that failed due to the storage_handle() calls. Once they are fixed, I think we should be good to go.

@szha merged commit 5b7a6d9 into apache:master on Sep 8, 2020
@szha
Member

szha commented Sep 8, 2020

thanks for the fix!

@leezu
Contributor

leezu left a comment


@ArmageddonKnight please help address the review comments below. Thank you

@leezu
Contributor

leezu commented Sep 14, 2020

@ArmageddonKnight it would be great if you can help address the issues introduced by this PR as discussed above

If you don't have time to address them, please mention it here and either someone else can fix it or we can revert the PR and resubmit with fixes later. WDYT?

@ArmageddonKnight
Contributor Author

@ArmageddonKnight it would be great if you can help address the issues introduced by this PR as discussed above

If you don't have time to address them, please mention it here and either someone else can fix it or we can revert the PR and resubmit with fixes later. WDYT?

@leezu I have been traveling and just came back. I'm addressing those comments now; please give me some time. Thanks.

@leezu
Contributor

leezu commented Sep 14, 2020

No worries. Thank you!
