Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Allow clearing gpu cache #14252

Merged
merged 9 commits into from
May 25, 2019
Merged

Conversation

vladoovtcharov
Copy link
Contributor

Allows releasing memory from pool from c and python api.
(Should resolve feature request #13482)

szha
szha previously requested changes Feb 25, 2019
Copy link
Member

@szha szha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exposing this API as is would cause everything to stop for memory release which could be problematic for an asynchronous execution environment. Because of the impact on performance for not caching memory, a much more preferable option is to provide the option to completely turn off memory caching, so that people can use it and be sure that mxnet is not holding extra memory for special use cases.

@szha
Copy link
Member

szha commented Feb 25, 2019

@vladoovtcharov thanks for the contribution. Instead of exposing ReleaseAll, would you mind adding an environment variable for MXNET_USE_GPU_MEM_POOL which defaults to 1, and use this value to decide whether Free should release the memory on the spot or return it to memory pool?

@vladoovtcharov
Copy link
Contributor Author

Hi @szha, thanks for the feedback. I'm not sure if I completely understood though.
If the user wanted the memory to be freed right away couldn't they use a non-pooled storage manager (i.e. storage::NaiveStorageManagerstorage::GPUDeviceStorage()) or did you have something else in mind? (Although currently there is no value for MXNET_GPU_MEM_POOL_TYPE that would do this)

At least for my use case I was hoping to still use the memory pool/caching (for performance) but periodically pause and release the memory (so another process could use the gpu memory). I understand the release_all call could be expensive but it would not/should not be called regularly so overall it seems like it would effect performance less than using non pooled memory.

@szha
Copy link
Member

szha commented Feb 26, 2019

@vladoovtcharov yes, they could use GPUDeviceStorage, and exposing the option in MXNET_GPU_MEM_POOL_TYPE is also fine. My general concern around exposing this type of knob at runtime is exactly to avoid supporting this type of use case, as it's hard to make any promise in this setting. For example, releasing objects in memory pool likely won't solve all your problem as there are other levels of caches. For now I think it's better to keep it as simple as on or off.

@vladoovtcharov
Copy link
Contributor Author

I added an additional MXNET_GPU_MEM_POOL_TYPE (Unpooled) but I am worried that it is not the best thing to do.
As expected the performance is pretty bad when using unpooled memory (about 30% slower in my case for a simple classifier as in https://mxnet.incubator.apache.org/versions/master/tutorials/gluon/mnist.html)

It seems like using a pooled memory type with the ability to call ReleaseAll would be preferable in most use cases.

Also I'm not sure why the pooled memory managers would have any difference in what can be promised.
When using a storage manager that doesn't use pooling, when free gets called the memory is freed.
When using a storage manager that does use a pool, when free gets called the memory is recycled to the pool.
When ReleaseAll is called the recycled memory in the pool is freed.
So using the pooled manager with ReleaseAll would have the exact same guarantees as using an unpooled manager, but has the added benefit of improved performance.

@szha
Copy link
Member

szha commented Feb 26, 2019

The problem is in the statement "the memory is freed". Not all memories are freed, and users shouldn't be concerned with how caching works in mxnet.

@szha
Copy link
Member

szha commented Feb 26, 2019

I'm happy to take this into further consideration in 2.0 design (#9686). For now, let's not add the C API yet.

@vladoovtcharov
Copy link
Contributor Author

@szha ok sounds good

@szha
Copy link
Member

szha commented Feb 27, 2019

@vladoovtcharov thanks for understanding. Shall we get the Unpooled option merged first? If so, it would be great if you could look into the CI problems.

@vladoovtcharov
Copy link
Contributor Author

Would it help to split into another pull request for just the unpooling? (One more formatting error to fix...)

@szha
Copy link
Member

szha commented Feb 27, 2019

Yes, feel free to do so if it's easier :)

@anirudhacharya
Copy link
Member

@mxnet-label-bot add [pr-work-in-progress]

@marcoabreu marcoabreu added the pr-work-in-progress PR is still work in progress label Feb 27, 2019
@pinaraws
Copy link

@vladoovtcharov Did you get a chance to work on changes requested by @szha ?

@piyushghai
Copy link
Contributor

@vladoovtcharov Gentle ping...

@Roshrini
Copy link
Member

@vladoovtcharov Can you please resolve conflicts so that we can move forward with this PR?

@vladoovtcharov vladoovtcharov force-pushed the patch/gpu_release_all branch from 2bc55f2 to b72472b Compare April 17, 2019 04:03
@vladoovtcharov vladoovtcharov changed the title Allow releasing all gpu memory Allow clearing gpu cache Apr 17, 2019
@vladoovtcharov
Copy link
Contributor Author

I went ahead and split into two merge requests, as requested. The second is here (#14716)

@vladoovtcharov vladoovtcharov force-pushed the patch/gpu_release_all branch from b16a96c to 98cbc92 Compare May 7, 2019 18:38
@szha szha self-requested a review May 7, 2019 20:10
python/mxnet/context.py Outdated Show resolved Hide resolved
@szha szha dismissed their stale review May 7, 2019 23:24

proper naming and doc addressed concerns.

@vladoovtcharov
Copy link
Contributor Author

I'll update the c-api naming/documentation as well

@karan6181
Copy link
Contributor

@vladoovtcharov Thanks for your contribution! Did you get a chance to update the docs ? Thanks!

@vladoovtcharov
Copy link
Contributor Author

@karan6181 yes everything should be checked in and ready to be merged

@szha szha merged commit db2295b into apache:master May 25, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Allow releasing all gpu memory

* fix white space

* stuck ci checks

* Fix whitespace

* Rename release_all -> empty_cache and provide documentation

* fix indentation

* Rename c_api's MXStorageReleaseAll -> MXStorageEmptyCache and clarify documention

* nudge ci

* Update context.py
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-work-in-progress PR is still work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants