Fix flaky test test_gluon.test_hybrid_static_memory_switching #11577
Conversation
Hi Zhengda, after applying this PR on top of recent master, 3 unit tests failed when MKLDNN is enabled. They all failed with mxnet.base.MXNetError: [20:22:53] src/ndarray/ndarray.cc:706: Check failed: !IsMKLDNNData() We can't generate TBlob for MKLDNN data. Please use Reorder2Default() to generate a new NDArray first. I also have some questions about this change. Can you explain why we can't invalidate in those cases? Why do they work well with dynamic allocation but not with static allocation? I suspect there are bugs in the interaction between static allocation and the threaded engine, and that this bug is not related only to MKLDNN: when I ran this test many times, I saw a lot of double frees and memory corruption without an MKLDNN stack.
Force-pushed from baa7406 to 13fc0b9
@ZhennanQin I described how the race condition happens. It occurs in a special case and may not be related to your problem.
Thanks for the explanation, I get the problem now. If my understanding is correct, we should avoid invalidating MKLDNN memory in the memory planning phase. Can we add some checks, or at least some comments, to InvalidateMKLDNNData() to prevent the same problem from happening again? (Developers should use InvalidateMKLDNNData() carefully after enabling static memory allocation, and they should be made aware of that.) Or can static memory allocation do more sanity checks to catch this kind of issue early, instead of crashing randomly?
This is a general problem. That is, we shouldn't modify an NDArray outside the threaded engine. I don't think we should treat InvalidateMKLDNNData() specially.
No questions from my side. LGTM. |
tests/python/unittest/test_gluon.py
Outdated
@@ -1187,17 +1187,19 @@ def check_hybrid_static_memory_switching(**kwargs):
x = mx.nd.random.uniform(shape=(4, 3, 32, 32))
net(x)
x.attach_grad()
why is this necessary?
Without it, it seems the backward computation can't proceed.
Couldn't proceed how? Usually we don't need to compute the grad w.r.t. the input x.
What I mean is that without this modification, I don't see operators executed in the threaded engine. I don't know why. You can give it a try.
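For context, here is a minimal sketch of the pattern this thread discusses (the nn.Dense(10) model and the input shape are illustrative stand-ins, not the actual test body): x.attach_grad() gives the input a gradient buffer and marks it for autograd, which, per the comment above, is what made operators actually execute in the threaded engine during backward in this test.

```python
import mxnet as mx
from mxnet.gluon import nn

net = nn.Dense(10)          # hypothetical model; the real test builds its own net
net.initialize()
net.hybridize(static_alloc=True)  # static memory allocation mode of CachedOp

x = mx.nd.random.uniform(shape=(4, 3, 32, 32))
net(x)                      # first call triggers static memory planning
x.attach_grad()             # the line under discussion
with mx.autograd.record():
    y = net(x)
y.backward()
mx.nd.waitall()             # block until all asynchronous engine work finishes
```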
…#11577)
* enable tests.
* update tests.
* don't invalidate in AsArray.
* don't invalidate in FC.
* fix.
Description
This fixes the flaky test #11171.
There was a race condition when we invalidate MKLDNN memory in NDArray. The original code invalidated MKLDNN memory in the memory planning phase (AsArray) as well, which is called outside the threaded engine. The static memory allocation mode of CachedOp shares the same NDArrays across model executions, so those NDArrays may still be in use by the threaded engine from the previous execution while the same CachedOp starts the next execution and uses the same NDArrays for memory planning. With MKLDNN enabled, the NDArrays were therefore modified in the memory planning phase outside the engine, which led to the race condition.
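To make the failure mode concrete, here is a hedged sketch of the execution pattern (the model and shapes are illustrative, and the actual invalidation happened inside CachedOp's C++ memory planning, not in Python code): with static_alloc=True, back-to-back calls reuse the same NDArrays, so memory planning for the second call, which runs on the calling thread, can overlap with the first call's operators still running asynchronously in the engine.

```python
import mxnet as mx
from mxnet.gluon import nn

net = nn.Dense(10)                # hypothetical model for illustration
net.initialize()
net.hybridize(static_alloc=True)  # executions share the same static NDArrays

x = mx.nd.random.uniform(shape=(4, 512))
y1 = net(x)  # operators are pushed to the threaded engine asynchronously
y2 = net(x)  # memory planning for this call runs on the calling thread;
             # before this fix, it invalidated MKLDNN memory in NDArrays
             # that the first call's operators might still have been using
mx.nd.waitall()
```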
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.