Fix flaky test test_gluon.test_hybrid_static_memory_switching #11577

zheng-da · 2018-07-06T07:39:38Z

Description

This fixes the flaky test #11171.
There was a race condition when invalidating MKLDNN memory in NDArray. The original code also invalidated MKLDNN memory in the memory planning phase (AsArray), which is called outside the threaded engine. The static memory allocation mode of CachedOp shares the same NDArrays across model executions. Therefore, the NDArrays may still be in use in the threaded engine from the previous model execution while the same CachedOp starts its next execution and uses the same NDArrays in memory planning. With MKLDNN enabled, NDArrays were modified in the memory planning phase, which led to a race condition.
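
The interleaving described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not MXNet code: `ToyArray`, `special_layout`, and `run_once` are hypothetical names, a plain Python thread stands in for the threaded engine, and events force the problematic ordering so the outcome is deterministic.

```python
import threading


class ToyArray:
    """Hypothetical stand-in for an NDArray that is reused across executions."""

    def __init__(self):
        # Plays the role of "this array currently holds MKLDNN memory".
        self.special_layout = True


def run_once(invalidate_in_planning):
    """Return what an in-flight operator sees before and after planning runs."""
    arr = ToyArray()
    started = threading.Event()
    may_finish = threading.Event()
    observed = []

    def engine_op():
        # Operator from the previous execution, still running on the engine thread.
        observed.append(arr.special_layout)
        started.set()
        may_finish.wait()
        observed.append(arr.special_layout)

    worker = threading.Thread(target=engine_op)
    worker.start()
    started.wait()

    # Memory planning for the next execution runs on the caller thread,
    # outside the engine, while the operator above is still in flight.
    if invalidate_in_planning:
        arr.special_layout = False

    may_finish.set()
    worker.join()
    return observed
```

With `invalidate_in_planning=True` the operator observes the array change underneath it (`[True, False]`); with `False` it sees a stable array (`[True, True]`), which corresponds to the "don't invalidate in AsArray" change in this PR.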

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

ZhennanQin · 2018-07-06T12:40:06Z

Hi Zhengda,

After applying this PR on top of recent master, 3 unit tests failed with MKLDNN enabled. They all failed with:

mxnet.base.MXNetError: [20:22:53] src/ndarray/ndarray.cc:706: Check failed: !IsMKLDNNData() We can't generate TBlob for MKLDNN data. Please use Reorder2Default() to generate a new NDArray first

I also have questions about this change. Can you explain why we can't invalidate in those cases? Why do they work well with dynamic allocation but not with static allocation? I suspect there are some bugs between static allocation and the threaded engine, and that this bug isn't related only to MKLDNN, because when I ran this test many times, I saw a lot of double frees and memory corruption without MKLDNN in the stack.

zheng-da · 2018-07-07T01:00:17Z

@ZhennanQin I described how the race condition happens. It occurs in a special case and may not be related to your problem.

ZhennanQin · 2018-07-07T01:47:40Z

Thanks for the explanation, I understand the problem now. If my understanding is correct, we should avoid invalidating MKLDNN memory in the memory planning phase. Can we add a check, or at least some comments, to InvalidateMKLDNNData() to prevent the same problem from happening again? (Developers need to use InvalidateMKLDNNData() carefully once static memory allocation is enabled, and they should be made aware of that.) Or can static memory allocation do more sanity checks to catch this kind of issue early instead of crashing randomly?
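
One possible shape for such a check, as a rough sketch only: the array remembers which thread is allowed to mutate it and fails loudly otherwise. `GuardedArray`, `engine_thread_id`, and `invalidate_on_engine` are hypothetical names, and the sketch assumes a single engine thread, which is simpler than a real multi-worker engine.

```python
import threading


class GuardedArray:
    """Hypothetical array that only allows invalidation on its engine thread."""

    def __init__(self):
        self.engine_thread_id = None  # set by whichever thread acts as the engine
        self.special_layout = True

    def invalidate(self):
        # Fail early instead of silently racing with in-flight operators.
        if threading.get_ident() != self.engine_thread_id:
            raise RuntimeError("invalidate() called outside the engine thread")
        self.special_layout = False


def invalidate_on_engine(arr):
    """Run invalidate() on a fresh thread registered as the engine thread."""
    errors = []

    def work():
        arr.engine_thread_id = threading.get_ident()
        try:
            arr.invalidate()
        except RuntimeError as exc:
            errors.append(exc)

    worker = threading.Thread(target=work)
    worker.start()
    worker.join()
    return errors
```

Calling `invalidate()` directly from the caller thread raises `RuntimeError`, while `invalidate_on_engine(arr)` succeeds and returns an empty error list.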

zheng-da · 2018-07-07T19:20:58Z

This is a general problem. That is, we shouldn't modify NDArray outside the threaded engine. I don't think we should treat InvalidateMKLDNNData() so specially.
InvalidateMKLDNNData() will be removed once we move MKLDNN to subgraphs.
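
As an illustration of the general rule, here is a minimal sketch in which every read and every mutation goes through the same queue, so a mutation can never overtake an operator that was scheduled earlier. `ToyEngine` and `demo` are hypothetical; this toy serializes all tasks on one worker thread, which is a simplification and not how the real engine schedules work.

```python
import queue
import threading


class ToyEngine:
    """Hypothetical single-worker engine: tasks run in the order they are pushed."""

    def __init__(self):
        self._tasks = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            task = self._tasks.get()
            try:
                task()
            finally:
                self._tasks.task_done()

    def push(self, task):
        self._tasks.put(task)

    def wait_all(self):
        # Block until every pushed task has finished.
        self._tasks.join()


def demo():
    engine = ToyEngine()
    state = {"special_layout": True}
    log = []

    # Operator from the previous execution reads the shared state.
    engine.push(lambda: log.append(("read", state["special_layout"])))
    # The mutation is pushed to the engine instead of being applied directly.
    engine.push(lambda: state.update(special_layout=False))
    # Operator from the next execution reads after the mutation.
    engine.push(lambda: log.append(("read", state["special_layout"])))

    engine.wait_all()
    return log
```

Because the mutation is queued behind the first read, `demo()` returns `[("read", True), ("read", False)]` on every run.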

ZhennanQin · 2018-07-08T02:57:31Z

No questions from my side. LGTM.

eric-haibin-lin · 2018-07-09T17:27:26Z

tests/python/unittest/test_gluon.py

@@ -1187,17 +1187,19 @@ def check_hybrid_static_memory_switching(**kwargs):

    x = mx.nd.random.uniform(shape=(4, 3, 32, 32))
    net(x)
+    x.attach_grad()


why is this necessary?

zheng-da · 2018-07-09

Without it, it seems the backward computation couldn't proceed.

eric-haibin-lin · 2018-07-09

Couldn't proceed how? Usually we don't need to compute the grad w.r.t. input x.

zheng-da · 2018-07-09

What I mean is that without this modification, I don't see operators executed in the threaded engine. I don't know why. You can give it a try.


zheng-da added 3 commits July 6, 2018 06:43

enable tests.

8f7b16b

update tests.

77a44d3

don't invalidate in AsArray.

ac494eb

don't invalidate in FC.

13fc0b9

zheng-da force-pushed the fix_mkldnn_racecond1 branch from baa7406 to 13fc0b9 Compare July 6, 2018 18:11

eric-haibin-lin reviewed Jul 9, 2018


fix.

b09b8af

zheng-da requested a review from anirudh2290 as a code owner July 10, 2018 18:11

eric-haibin-lin approved these changes Jul 10, 2018


eric-haibin-lin merged commit b264f6f into apache:master Jul 11, 2018

ZhennanQin mentioned this pull request Jul 16, 2018

Flaky test MKLDNN test_bucket_module #10724

Closed

zheng-da mentioned this pull request Jul 16, 2018

Flaky test: test_gluon.test_hybrid_static_memory_switching #11171

Closed

XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018

Fix flaky test test_gluon.test_hybrid_static_memory_switching (apache…

48d0624

…#11577) * enable tests. * update tests. * don't invalidate in AsArray. * don't invalidate in FC. * fix.

zheng-da deleted the fix_mkldnn_racecond1 branch September 29, 2018 21:32
