
Re-enable the test_gpu_memory_profiler_gluon test case #18704

Merged

2 commits merged into apache:master on Sep 8, 2020

Conversation

ArmageddonKnight
Contributor

Description

This PR is supposed to fix the issue described in #18564

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Removed the setting of continuous_dump when setting the profiler config.

Comments

This PR fixes the issue described in #18564. Specifically, the error happens because the profiler config has the continuous_dump flag set, which propagates down to the core function Profiler::SetContinuousProfileDump (shown below) and triggers the error on line 268.

https://github.com/apache/incubator-mxnet/blob/9d623926d4857a2cfa32515b58cd1398371f97f3/src/profiler/profiler.cc#L258-L268
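
For context, here is a minimal sketch of the kind of profiler setup this change implies. The API calls are the standard mx.profiler ones, but the filename and exact arguments are illustrative and not copied from test_profiler.py:

import mxnet as mx

# Before the fix (illustrative): continuous_dump=True propagates to
# Profiler::SetContinuousProfileDump in the backend and triggers the error above.
# mx.profiler.set_config(profile_all=True, aggregate_stats=True,
#                        continuous_dump=True, filename='profile_output.json')

# After the fix (illustrative): omit continuous_dump and dump explicitly instead.
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='profile_output.json')
mx.profiler.set_state('run')
# ... run the Gluon workload under test ...
mx.profiler.set_state('stop')
mx.profiler.dump()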

FYI, @szha

@mxnet-bot

Hey @ArmageddonKnight, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-gpu, windows-cpu, unix-cpu, sanity, centos-gpu, centos-cpu, website, edge, miscellaneous, clang, unix-gpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@szha
Member

szha commented Jul 15, 2020

FYI I merged #18701

@ArmageddonKnight
Contributor Author

@szha I have updated the changes accordingly.

@ArmageddonKnight
Contributor Author

ArmageddonKnight commented Jul 15, 2020

@mxnet-bot run ci [centos-cpu, centos-gpu, unix-cpu, unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [centos-gpu, unix-cpu, unix-gpu, centos-cpu]

@Zha0q1
Contributor

Zha0q1 commented Jul 15, 2020

Any ideas on what might have caused the error in Profiler::SetContinuousProfileDump?

This same error seems to also cause issues in tests/python/unittest/test_profiler.py, as seen in http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18701/1/pipeline/378/

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu, centos-cpu, unix-gpu, centos-gpu]

@szha
Member

szha commented Aug 23, 2020

Looks like the tagging for the parameter initialization allocations is missing. @leezu it looks like this part of the tagging was lost as a result of the Gluon refactor. Could you help suggest a way of naming these allocations in the new Gluon implementation? https://github.com/apache/incubator-mxnet/pull/17656/files#diff-29da832c2145752f3906a2f71c7b63ba

@leezu
Contributor

leezu commented Aug 24, 2020

@szha the GPU memory profiler tests have never been run; they were disabled until #18701 because they were declared in the wrong file. I'm not sure what the following lines refer to, or if and why they were recorded differently before. As per the comment in the test file, the test was designed to capture 26 lines, but we now capture 33 lines. In particular, the following lines were not recorded before and appear to be the problematic ones. I don't think that recording more events in the memory profiler is related to the name-scope refactor, as the profiler is enabled and disabled separately from the name-scope.

[2020-07-15T20:15:02.236Z] <unk>:_random_uniform,67108864,0,67108864,0

[2020-07-15T20:15:02.236Z] <unk>:_random_uniform,67108864,0,67108864,0

[2020-07-15T20:15:02.236Z] <unk>:_zeros,67108864,0,67108864,0

[2020-07-15T20:15:02.236Z] <unk>:_zeros,67108864,0,67108864,0

[2020-07-15T20:15:02.236Z] <unk>:in_arg:data,640,0,640,0

[2020-07-15T20:15:02.236Z] <unk>:unknown,67108864,0,67108864,0

[2020-07-15T20:15:02.236Z] <unk>:unknown,67108864,0,67108864,0

Also note that the profiler scope implementation has some fundamental threading issues. The scope in the backend is not thread-local, but the frontend "claims" thread safety by using a thread-local variable on the frontend side. The same problem applies to single-threaded but asynchronous code.

@ArmageddonKnight the profiler scope is automatically set in Block and Optimizer. Would you be able to debug the issue and fix it in this PR?

https://github.com/apache/incubator-mxnet/blob/0de7484884292eb028342b1e5669233792429af0/python/mxnet/gluon/block.py#L951-L952

https://github.com/apache/incubator-mxnet/blob/0de7484884292eb028342b1e5669233792429af0/python/mxnet/optimizer/updater.py#L58-L59

Parameter initialization happens separately from Block, so if you want to set the profiler scope there, you can add with _profiler.scope(self._uuid + ':'): in _finish_deferred_init inside parameter.py.
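
For concreteness, a minimal sketch of what wrapping an allocation in such a profiler scope looks like. The _profiler.scope name and the uuid-style prefix are taken from the suggestion above; the set_config arguments and the shapes are illustrative only, and in the actual fix the scope would wrap the initialization body of Parameter._finish_deferred_init rather than a standalone allocation:

import mxnet as mx
from mxnet import profiler as _profiler

# Enable the profiler so that scoped allocations get recorded (arguments illustrative).
mx.profiler.set_config(profile_all=True, aggregate_stats=True)
mx.profiler.set_state('run')

# Allocations made inside the scope are attributed to the given prefix instead of
# showing up as <unk> in the memory profile; the prefix mirrors self._uuid + ':'.
with _profiler.scope('param_277c1286:'):
    weight = mx.nd.random.uniform(shape=(128, 16))

mx.profiler.set_state('stop')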

@szha
Member

szha commented Aug 24, 2020

you can add with _profiler.scope(self._uuid + ':'): in _finish_deferred_init inside parameter.py.

The problem with this is that the result will not be meaningful. The memory profiler output should reflect which block a parameter belongs to in order to properly capture the memory usage of a block. We will need some way of mapping the initialization back to the block and not rely on uuid.

@leezu
Contributor

leezu commented Aug 24, 2020

A parameter may belong to multiple blocks or to no Block at all, so there is no 1:1 mapping. I think this is a separate issue, and we may first need to fix the memory profiler. The first question here is why the _random_uniform calls are recorded now but were not recorded when the test file was written (i.e., 26 vs. 33 recorded calls).

@ArmageddonKnight
Contributor Author

@szha I am looking into this issue:

  • The unknown allocation entries actually come from the symbolic execution backend, rather than Gluon.
  • The parameters in the Gluon backend have weird tags, similar to the following:
"hybridsequential0:arg_grad:param_277c1286_334d_46dc_bbb2_0850c497c7d0_weight","8192","0","8192","0"
"hybridsequential0:arg_grad:param_9c9c45b2_2e7f_4af9_8b60_b80ef0f2dd71_weight","5120","0","8192","0"

@szha
Member

szha commented Aug 24, 2020

The unknown allocation entries actually come from the symbolic execution backend, rather than Gluon

On the master branch, we no longer have a symbolic execution backend, as the graph executor was removed.

The parameters of in the Gluon backend have weird tags

That's the problem I mentioned in the above comment about the usage of uuid in naming.

@ArmageddonKnight
Contributor Author

@szha FYI, I have cleaned up the debugging traces and applied the fixes that we have discussed previously. The sample GPU memory profile for Gluon now looks like the following:

"Attribute Name","Requested Size","Device","Actual Size","Reuse?"
<unk>:in_arg:data,640,0,4096,0
hybridsequential:activation0:hybridsequential_activation0_fwd,2048,0,4096,0
hybridsequential:activation0:hybridsequential_activation0_fwd_backward,8192,0,8192,0
hybridsequential:activation0:hybridsequential_activation0_fwd_head_grad,2048,0,4096,0
hybridsequential:dense0:activation0:hybridsequential_dense0_activation0_fwd,8192,0,8192,0
hybridsequential:dense0:arg_grad:bias,512,0,4096,0
hybridsequential:dense0:arg_grad:weight,5120,0,8192,0
hybridsequential:dense0:hybridsequential_dense0_fwd,8192,0,8192,0
hybridsequential:dense0:in_arg:bias,512,0,4096,0
hybridsequential:dense0:in_arg:weight,5120,0,8192,0
hybridsequential:dense1:activation0:hybridsequential_dense1_activation0_fwd,4096,0,4096,0
hybridsequential:dense1:arg_grad:bias,256,0,4096,0
hybridsequential:dense1:arg_grad:weight,32768,0,32768,0
hybridsequential:dense1:hybridsequential_dense1_fwd,4096,0,4096,0
hybridsequential:dense1:in_arg:bias,256,0,4096,0
hybridsequential:dense1:in_arg:weight,32768,0,32768,0
hybridsequential:dense2:arg_grad:bias,128,0,4096,0
hybridsequential:dense2:arg_grad:weight,8192,0,8192,0
hybridsequential:dense2:hybridsequential_dense2_fwd_backward,4096,0,4096,1
hybridsequential:dense2:in_arg:bias,128,0,4096,0
hybridsequential:dense2:in_arg:weight,8192,0,8192,0
hybridsequential:dropout0:hybridsequential_dropout0_fwd,8192,0,8192,0
hybridsequential:dropout0:hybridsequential_dropout0_fwd,8192,0,8192,0
resource:cudnn_dropout_state (dropout-inl.h +256),1474560,0,1474560,0
resource:temp_space (fully_connected-inl.h +316),15360,0,16384,0

Please let me know if you have any questions or concerns. Thank you.
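
As a side note, tag coverage in a dump like the one above can be checked mechanically; the following is an illustrative sketch of such a check (not the actual test code), assuming the CSV header shown above:

import csv

def find_untagged_rows(csv_path):
    # Scan a memory-profile dump in the format shown above and return the
    # attribute names that were not attributed to any Gluon profiler scope.
    untagged = []
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            attr = row['Attribute Name']
            # '<unk>:in_arg:data' is expected, since the input array is created
            # outside of any block; any other '<unk>' entry indicates a missing scope.
            if attr.startswith('<unk>') and attr != '<unk>:in_arg:data':
                untagged.append(attr)
    return untagged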

@szha
Member

szha commented Sep 7, 2020

@ArmageddonKnight thanks for the update! The change looks good to me. There are a couple of tests that failed due to the storage_handle() calls. Once they are fixed, I think we should be good to go.

@szha merged commit 5b7a6d9 into apache:master on Sep 8, 2020
@szha
Member

szha commented Sep 8, 2020

thanks for the fix!

@leezu
Contributor

leezu left a comment


@ArmageddonKnight please help address the review comments below. Thank you

@leezu
Contributor

leezu commented Sep 14, 2020

@ArmageddonKnight it would be great if you can help address the issues introduced by this PR as discussed above

If you don't have time to address them, please mention it here and either someone else can fix it or we can revert the PR and resubmit with fixes later. WDYT?

@ArmageddonKnight
Contributor Author

@ArmageddonKnight it would be great if you can help address the issues introduced by this PR as discussed above

If you don't have time to address them, please mention it here and either someone else can fix it or we can revert the PR and resubmit with fixes later. WDYT?

@leezu I have been traveling and just came back. I'm addressing those comments now; please give me some time. Thanks.

@leezu
Contributor

leezu commented Sep 14, 2020

No worries. Thank you!
