Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Enable MKL-DNN FullyConnected backward #17318

Merged
merged 27 commits into from
Feb 19, 2020
Merged

Conversation

TaoLv
Copy link
Member

@TaoLv TaoLv commented Jan 15, 2020

Description

Discussed with @rongzha1 offline. This PR incorporates the code changes in #16890 which was reverted previously.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@TaoLv TaoLv requested a review from anirudh2290 as a code owner February 9, 2020 03:13
@TaoLv TaoLv changed the title [WIP] Enable MKL-DNN FullyConnected backward Enable MKL-DNN FullyConnected backward Feb 14, 2020
@TaoLv
Copy link
Member Author

TaoLv commented Feb 14, 2020

This is ready for review now. I also changed the cpp test to address the FullyConnectedOp failure reported before. The main reason is the tensor shape is relative large and the max value in the tensor is 50. It will lead to float number accumulation error. @rongzha1 @ciyongch @pengzhao-intel

@@ -330,7 +330,7 @@ inline void PrintVerifyMsg(const NDArrayAttrs &arr1, const NDArrayAttrs &arr2) {
*/
inline std::vector<NDArrayAttrs> GetTestInputArrays(
int types = ArrayTypes::All, bool rand = false,
std::vector<float> scale = {1}, bool spatial_data_format = false) {
std::vector<float> scale = {1}, bool spatial_data_format = false, int max = 50) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the new parameter for better usability?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@ciyongch ciyongch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, the failure bothered us a lot in the past:)

const TBlob &blob = arr->data();
mshadow::default_real_t *data = blob.dptr<mshadow::default_real_t>();
int size = blob.Size();

for (int i = 0; i < size; i++)
if (is_rand) {
data[i] = (std::rand() % 100) - 50;
data[i] = (std::rand() % (max * 2)) - max;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about change to data[i] = std::rand() * 1.0 / RAND_MAX - 0.5;? As max = 1 will only generate two values: -1.0and 0.0 .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I don't want to affect other test case which still use the default max=50 to generate integers in [50, 50). But For the FullyConnectedOp, I want to generate relative small numbers. With the given code, the elements will be -1 and 0. Any suggestion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've no idea about why the range was set to [-50, 50) previously, and I can't figure out any specific reasons to use this range for the tests (any upper bound test?). It'll be great if you have any background for it.
But anyway, the tensors with only two values (-1 and 0, 50% are 0) might not be a good candidate for the tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, i will change to generate float numbers in [-max, max) rather than integer numbers. Previously I thought sparse (say 50% zeros) is also a way to avoid float number accumulation error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With memcmp to check the results, then the only choice is integer numbers. Is it reasonable to check the results by AssertEqual within a small enough threshold like 1e-6, then we can keep the floating number with better distribution?
Or we can just increase max to filling more different numbers other than only -1 and 0.
What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reverted the changes for floating numbers. Changing ``memcmptoAssertEqual` is out of the scope of this PR, so I will keep it as is.

Or we can just increase max to filling more different numbers other than only -1 and 0.

I was thinking about including number 2 into the generated tensor but found that with the given shapes, there still has chance to get error. That means for the worst case, the intermediate accumulation value will be > 2^24, so the 1 will be ignored when accumulating another 1 to it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's keep it as is for now.

} else {
data[i] = i % 100 - 50;
data[i] = i % (max * 2) - max;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, how about change to something like data[i] = i * 2.0 / size - 1.0 to generate [-1.0, 1.0)?

Copy link
Contributor

@ciyongch ciyongch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pengzhao-intel pengzhao-intel merged commit cf42535 into apache:master Feb 19, 2020
zheyuye pushed a commit to zheyuye/incubator-mxnet that referenced this pull request Feb 19, 2020
* fix mkldnn fc bwd bug due to data inplace

* enable mkldnn fc bwd

* fix cpp tests

* try: fix random seed

* fix cpp test

* loose rtol for fc cpp test

* improve error message

* limit max value for mkldnn tensors

* limit the max value of test tensors

* fix lint

* remove fixed random seed

* address review comments

* Revert "address review comments"

This reverts commit 56d873f.

Co-authored-by: rongzha1 <[email protected]>
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020
* fix mkldnn fc bwd bug due to data inplace

* enable mkldnn fc bwd

* fix cpp tests

* try: fix random seed

* fix cpp test

* loose rtol for fc cpp test

* improve error message

* limit max value for mkldnn tensors

* limit the max value of test tensors

* fix lint

* remove fixed random seed

* address review comments

* Revert "address review comments"

This reverts commit 56d873f.

Co-authored-by: rongzha1 <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants