Skip to content
This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Fix flaky test test_gluon.test_hybrid_static_memory_switching #11577

Merged
merged 5 commits into from
Jul 11, 2018

Conversation

zheng-da
Copy link
Contributor

@zheng-da zheng-da commented Jul 6, 2018

Description

This is to fix the flaky test #11171
There was a race condition when we invalidate MKLDNN memory in NDArray. The original code invalidates MKLDNN memory in the memory planning phase (AsArray) as well, which is called outside the threaded engine. The static memory allocation mode of CachedOp shares the same NDArrays in different model executions. Therefore, the NDArrays may still be used in the threaded engine from the previous model execution, while the same CachedOp is executed for the next execution and the same NDArrays are used in memory planning. With MKLDNN enabled, NDArrays were modified in the memory planning phase, which leads to race condition.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@ZhennanQin
Copy link
Contributor

Hi Zhengda,

After applying this PR with recent master, 3 unit tests failed when enabling MKLDNN. They all failed with

mxnet.base.MXNetError: [20:22:53] src/ndarray/ndarray.cc:706: Check failed: !IsMKLDNNData() We can't generate TBlob for MKLDNN data. Please use Reorder2Default() to generate a new NDArray first

I also have questions for this change. Can you explain why can't we do invalidate for those cases? Why they work well in dynamic allocation while don't in static allocation? I suspect there're some bugs between static allocation and threaded engine, and this bug doesn't relate to MKLDNN only. Because when I ran this test many times, I saw a lot double free and memory corruption without mkldnn stack.

@zheng-da zheng-da force-pushed the fix_mkldnn_racecond1 branch from baa7406 to 13fc0b9 Compare July 6, 2018 18:11
@zheng-da
Copy link
Contributor Author

zheng-da commented Jul 7, 2018

@ZhennanQin I described how race condition happens. It occurs in a special case and may not be related to your problem.

@ZhennanQin
Copy link
Contributor

Thanks for your declaration, I get the problem now. If my understanding is correct, we should avoid invalidating MKLDNN memory in memory planning phase. Can we add some check or at least some comments for InvalidateMKLDNNData() to prevent same problem happens?(developers should carefully use InvalidateMKLDNNData() after enabling static memory allocation, they should be noticed that) Or, can static memory allocation do more sanity check to capture this kind of issue early instead of random crash?

@zheng-da
Copy link
Contributor Author

zheng-da commented Jul 7, 2018

this is a general problem. That is, we shouldn't modify NDArray outside the threaded engine. I don't think we should treat InvalidateMKLDNNData() so specially.
InvalidateMKLDNNData() will be removed once we move MKLDNN to subgraphs.

@ZhennanQin
Copy link
Contributor

No questions from my side. LGTM.

@@ -1187,17 +1187,19 @@ def check_hybrid_static_memory_switching(**kwargs):

x = mx.nd.random.uniform(shape=(4, 3, 32, 32))
net(x)
x.attach_grad()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without it, it seems the backward computation couldn't proceed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't proceed how? Usually we don't need to compute the grad w.r.t input x.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what i mean is that without this modification, i don't see operators executed in the threaded engine. i don't know why. you can give it a try.

@zheng-da zheng-da requested a review from anirudh2290 as a code owner July 10, 2018 18:11
@eric-haibin-lin eric-haibin-lin merged commit b264f6f into apache:master Jul 11, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
…#11577)

* enable tests.

* update tests.

* don't invalidate in AsArray.

* don't invalidate in FC.

* fix.
@zheng-da zheng-da deleted the fix_mkldnn_racecond1 branch September 29, 2018 21:32
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants