Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Memory fixes. Resolves #10867, and resolves #14080 #14372

Merged
merged 14 commits into from
Mar 28, 2019

Conversation

andrewfayres
Copy link
Contributor

@andrewfayres andrewfayres commented Mar 8, 2019

Description

There were a few memory leaks in executor and some instances of where ResourceScope was causing problems. This should fix those known issues.

To elaborate on the ResourceScope issue: there are some times when classes such as Executor creates other NativeResources (usually NDArrays) after the Executor has been instantiated. The executor then has a dependency on that new NativeResource. Since these happen at different times, they could potentially be created at different ResourceScope levels. The result is an unstable state where exiting the scope would cause a crash. Fixed this by adding a method to NativeResource which allows moving a resource to the same scope as another. This allows the parent class to control the scope of their dependencies.

Resolves #10867, and resolves #14080.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@andrewfayres andrewfayres changed the title Memory fixes. Resolves #11926, and resolves #14080 Memory fixes. Resolves #10867, and resolves #14080 Mar 8, 2019
@vandanavk
Copy link
Contributor

@mxnet-label-bot add [Memory, Scala, pr-awaiting-review]

@marcoabreu marcoabreu added Memory pr-awaiting-review PR is waiting for code review Scala labels Mar 8, 2019
@lanking520 lanking520 self-requested a review March 9, 2019 00:01
@andrewfayres
Copy link
Contributor Author

NeuralStyle on gpu seems to be failing. I'll have to dig into why. I also think I've thought of a better way to deal with the interdependency of symbol/executor than what is being done right now so I'll update that as well.

@nswamy
Copy link
Member

nswamy commented Mar 9, 2019

If I understood correct(from the description), the problem is Executor creates some NDArrays that gets disposed before the executor?
I am wondering how are these NDArrays getting released before the destruction of the executor?

@andrewfayres
Copy link
Contributor Author

@nswamy This is because there is no guarantee that they are created at the same ResourceScope level. For example, the Executor gets created and returned to the user. The user then opens up a ResourceScope to do inference. In this scope, the executor might end up creating new resources for itself. Those resources are then in the scope level of the user's ResourceScope and get disposed when that closes. This leaves the Executor without the dependencies it created.

The solution is to have objects like executor put any dependencies that they create into the same ResourceScope as themselves. This way they are all coupled together. I thought about using moveToOuterScope but there's no way to know how many levels of scope could be between the two.

@andrewfayres andrewfayres changed the title Memory fixes. Resolves #10867, and resolves #14080 [WIP] Memory fixes. Resolves #10867, and resolves #14080 Mar 9, 2019
@nswamy
Copy link
Member

nswamy commented Mar 11, 2019

@andrewfayres, thanks for taking care of this, Its a really nice find :)

I find the design of executor less than desirable(where it creates states in a method dynamically) after the
the object is created.
Anyway, here is what we discussed offline:

  • moveToScopeOf needs cognizant use of it after the creation of the resources
  • calling moveToScopeOf on a large amount of objects will have performance impact.

My suggestion is to create a couple methods in ResourceScope.
ResourceScope.setScope(scope: ResourceScope)
ResourceScope.resetScope()
We can document these methods to be used when you want to hold NativeResources as state variables inside another NativeResource, in this case the Executor.

Explore to automatically detect(possibly reflection) creation of NativeResource within another NativeResource.
Ideal would have been to pass the scope in which to create to a native resource, unfortunately that would require changing every API of NDArray, Symbol, etc., since most of them are exposed as static methods.

I am open to hear any other interesting ideas that is less intrusive.

@andrewfayres
Copy link
Contributor Author

I came up with a different solution which we (@nswamy and I) discussed offline. Essentially, ResourceScope.using already has an optional parameter for an existing scope. A few changes minor changes should let us reuse that.

This approach should be more consistent and intuitive. I'll update the PR.

@lanking520
Copy link
Member

Any updates?

@andrewfayres
Copy link
Contributor Author

I was on call last week and didn't have time to pick this back up unfortunately. I started working on it again yesterday and finished the code changes. I'm going to do testing today and if all goes well I'll have an update this afternoon.

@@ -179,6 +175,16 @@ object ResourceScope {
}
}

private[mxnet] def usingIfScopeExists[A](scope: Option[ResourceScope])(body: => A): A = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need this new method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah so originally I wanted to use the .using method directly but ran into some problems. There are times when we need to add new native resources to the same scope of the parent but there's no guarantee that the parent is actually in an existing scope.

When this happens it leads to a few issues. First, the parent is in None scope which we cannot pass to ResourceScope.using. Second, we still want to execute the body and allocate all the new resources but don't want to want to make a new ResourceScope because all those new resources will disappear with the new scope.

Alternatives that I thought of were: 1.) to have the caller check whether or not it was in a scope and handle it appropriately. This is ugly and puts the onus on the callers in multiple places. 2.) Changing the using method to work with a None scope. I opted not to do this because it complicates that method and I believe would require changing the method parameters which I didn't want to do. 3.) Changing the default scope to be something other than None. This is probably a reasonable solution. Maybe we have some kind of base scope or something similar. That's likely to be a fairly significant change in both the design and behavior of this class.

@andrewfayres
Copy link
Contributor Author

Found a new bug last night where many of the optimizers cause a seg fault when used with FeedForward in 1.4.0. Surprisingly, no one has filed an issue for this so I went ahead and did so. It's fixed in this PR.

Resolves #14498

@andrewfayres andrewfayres changed the title [WIP] Memory fixes. Resolves #10867, and resolves #14080 Memory fixes. Resolves #10867, and resolves #14080 Mar 21, 2019
@andrewfayres
Copy link
Contributor Author

CI seems to be experiencing unrelated failures. I'll restart this later once they appear to be resolved.

@nswamy
Copy link
Member

nswamy commented Mar 23, 2019

I am surprised that the tests did not catch the seg-faults, I ran the code through several examples for many days and did not see the seg-faults.

@nswamy
Copy link
Member

nswamy commented Mar 23, 2019

Feedforward is to be deprecated and Module should be used instead, we never deprecated that module, may be we should consider deprecating that.

@nswamy
Copy link
Member

nswamy commented Mar 27, 2019

summary of offline discussion:

  • The method is private, this needs to be re-evaluate if we can do without adding scope twice into threadlocal. (ResourceScope:L162)
  • There is a bug in using method that closes the passed in scope(it should not), we will track a JIRA to track and resolve.
  • Also noting that checking the map could impact the performance, with assumption that we won't have too many ResourceScope in action on each thread at at time, at this time this solution is acceptable.

@nswamy
Copy link
Member

nswamy commented Mar 28, 2019

Thanks for nailing down this issue, great Job 💯 Please link the JIRA we discussed for fixing the scope closing if passed

@nswamy nswamy merged commit 102b46f into apache:master Mar 28, 2019
@andrewfayres
Copy link
Contributor Author

Link to Jira for looking into & fixing the ResourceScope.using closing scopes that are passed in.

https://issues.apache.org/jira/browse/MXNET-1376

vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
…e#14372)

* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test
lanking520 pushed a commit to lanking520/incubator-mxnet that referenced this pull request Apr 1, 2019
…e#14372)

* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test
ZhennanQin pushed a commit to ZhennanQin/incubator-mxnet that referenced this pull request Apr 3, 2019
…e#14372)

* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test
nswamy pushed a commit that referenced this pull request Apr 5, 2019
* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test
lanking520 added a commit that referenced this pull request Apr 5, 2019
* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
…e#14372)

* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Memory pr-awaiting-review PR is waiting for code review Scala
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adam Optimizer Memory Leak in Scala Scala Module API resize is leaking memory on the native size.
5 participants