-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Re-enable the test_gpu_memory_profiler_gluon test case #18704
Conversation
Hey @ArmageddonKnight , Thanks for submitting the PR
CI supported jobs: [windows-gpu, windows-cpu, unix-cpu, sanity, centos-gpu, centos-cpu, website, edge, miscellaneous, clang, unix-gpu] Note: |
FYI I merged #18701 |
430de07
to
7018636
Compare
@szha I have updated the changes accordingly. |
@mxnet-bot run ci [centos-cpu, centos-gpu, unix-cpu, unix-gpu] |
Jenkins CI successfully triggered : [centos-gpu, unix-cpu, unix-gpu, centos-cpu] |
Any ideas on what might have caused the error in This same error seems to also cause issues in |
Jenkins CI successfully triggered : [unix-cpu, centos-cpu, unix-gpu, centos-gpu] |
Looks like the parameter initialization allocation is missing. @leezu looks like this part of tagging is missing as a result of gluon refactor. Could you help suggest a way of naming these allocation in the new Gluon implementation? https://github.com/apache/incubator-mxnet/pull/17656/files#diff-29da832c2145752f3906a2f71c7b63ba |
@szha the gpu memory profiler tests have never been run. They were disabled until #18701 due to declaration in the wrong file. I'm not sure what the following lines refer to, and if / why they were recorded differently before. As per the comment in the test file, the test was designed to capture 26 lines, but we now capture 33 lines. In particular, the following lines were not recorded before and appear to be the problematic ones. I don't think that recording more events in the memory profiler is related to the name-scope refactor, as the profiler is enabled and disabled separately from the name-scope.
Also note that the profiler scope implementation has some fundamental threading issues. The scope in the backend is not thread local, but the frontend "claimns" thread safety by using a thread local variable in the frontend. There same problem applies to single-threaded but asynchronous code. @ArmageddonKnight the profiler scope is automatically set in Block and Optimizer. Would you be able to debug the issue and fix it in this PR? Parameter initialization happens separately from Block, and if you want to set the profiler scope based there, you can add |
The problem with this is that the result will not be meaningful. The memory profiler output should reflect which block a parameter belongs to in order to properly capture the memory usage of a block. We will need some way of mapping the initialization back to the block and not rely on uuid. |
A parameter may belong to multiple blocks or to no Block at all, thus there is no 1-1 mapping. I think this is a separate issue and we may first need to fix the memory profiler. The first question here is why the |
@szha I am looking into this issue:
|
ff6496d
to
b6daf27
Compare
On master branch, we no longer have a symbolic execution backend as graph executor was removed.
That's the problem I mentioned in the above comment about the usage of uuid in naming. |
aebc4f3
to
bbda177
Compare
2ec4357
to
cbd35e4
Compare
@szha FYI, I have cleaned up the debugging traces and applied the fixes that we have discussed previously. The sample GPU memory profile for Gluon now looks like the following:
Please let me know if you have any questions or concerns. Thank you. |
@ArmageddonKnight thanks for the update! the change looks good to me. There are a couple of tests that failed due to the |
thanks for the fix! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ArmageddonKnight please help address the review comments below. Thank you
@ArmageddonKnight it would be great if you can help address the issues introduced by this PR as discussed above If you don't have time to address them, please mention it here and either someone else can fix it or we can revert the PR and resubmit with fixes later. WDYT? |
@leezu I have been traveling and just came back. Addressing those comments now. Pls give me some time. Thanks. |
No worries. Thank you! |
Description
This PR is supposed to fix the issue described in #18564
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
continuous_dump
when setting the profiler config.Comments
This PR is for fixing the issue described in #18564 . Specifically, the reason why the error happens is because the profiler config has the flag
continuous_dump
set, which would further propagate to the core functionProfiler::SetContinuousProfileDump
(shown below) and give an error on line 268.https://github.com/apache/incubator-mxnet/blob/9d623926d4857a2cfa32515b58cd1398371f97f3/src/profiler/profiler.cc#L258-L268
FYI, @szha