Scala Module API resize is leaking memory on the native side. #10867
Comments
Also tested on master branch 10ac529 and bug is present |
Hey @jessebrizzi - thanks for reporting the issue! Adding @nswamy who contributes to the Scala package |
@lupesko I updated the example code to use the mxnet 1.2.0 package released on Maven and reran the test and the memory leak is still present. I have also put together a docker container that you can use with nvidia-docker to run my example code with sbt/scala/java/cudnn/cuda all installed here https://hub.docker.com/r/jessebrizzi/dl-dev/ to control for environment differences. |
I spent some time digging into this to see if I could find a fix. The memory leak seems to be exclusive to when an This could be related to the new empty I looked into the Python API to see if it was doing anything different that could be adapted into the Scala interface to solve the problem. It looks like the Python API has changed from resizing the Any help or eyes on this problem would be greatly appreciated. I am using MXNet in a production environment and have a pretty hacky workaround in place to avoid this issue. |
@jessebrizzi thanks for your efforts in trying to solve this memory issue. @andrewfayres started looking into this part of the code(JNI) recently, he might be able to help you. @lanking520 as FYI. |
@jessebrizzi I actually started looking into this a few days ago. I got a bit sidetracked with something else but I think I should have a solution for you in a day or two. |
@jessebrizzi I've been looking at the JNI code you wrote and figured out that the sigsegv is happening here. However, I've been unable to figure out exactly what's causing the error thus far. Presumably at least one of the variables on that line is invalid but I'm not seeing why. I'm going to change gears a bit and attempt to fix the memory leak in the existing scala code to at least resolve the issue for you. |
@jessebrizzi I've got what I believe to be a working fix in my repo. I need to do some more thorough testing and add some automated testing to this before submitting a PR but my preliminary testing is looking good. Feel free to take a look and let me know if you find any issues. I'll work on moving all of this over to the native code as soon as I get a few other things off my backlog. |
Thanks @andrewfayres ! Just catching up on all of this after taking some time off. I'm going to pull down your change and do some testing myself and get back to you. |
Sorry for taking so long to get back to you @andrewfayres. I pulled down your fix from your repo and ran it both on my OSX machine in CPU mode and in an nvidia-docker container (https://hub.docker.com/r/jessebrizzi/dl-dev/) on Linux in GPU mode, and I think a memory leak is still present. I have updated my bug reproduction repository here to show the exact behavior I am testing. The memory leak that I am still observing seems to be independent of the bound network size. I tested this by running through various max batch sizes that I would randomly sample from for my test input, and regardless of what I set, the native memory growth after 10000 iterations is around the same. The fix does seem to address the upsize vs. downsize issue observed earlier. To maintain parity with the Python interface, I would still suggest switching the resize logic to the backend method that the Python interface uses, but I know that changing a JNI interface is a pain. Could you post an example of the code you were running to test the change? |
@jessebrizzi Sorry for the slow reply, I've been out on vacation. As soon as I get a little bit of free time I'll pull down the update you've made to your repository and see if I can reproduce. I mainly checked for the resize issue when I was looking earlier. It's possible there's also a different leak. I definitely agree that moving it to native is our best long-term solution, but at the moment I don't have the bandwidth to dedicate to this. The example code that I've been running is actually the same as what you posted originally. As a side note, there is ongoing work being done by @nswamy to provide a much better solution to Scala memory management. Feel free to take a look at the design. Hopefully, once this work is finished, all the leaking issues will be resolved. |
@jessebrizzi The NativeResource Management was added as part of #12647. You can find some documentation about it at https://github.com/apache/incubator-mxnet/blob/master/scala-package/memory-management.md. Does this look like it fixes your problem? |
@andrewfayres, could you please take a look and test again whether this memory leak issue is fixed with the phantom references implementation? |
@zachgk Sorry for the late response, I had to find some time to run through my tests. I modified my repo (https://github.com/jessebrizzi/MXNet-Bug) with the example code for the memory leak bug to use the snapshots published in the Nightly Repo https://repository.apache.org/#nexus-search;gav~org.apache.mxnet. I think I am still seeing the bug (native memory climbing while JVM and GPU memory stay stable), but I am running into a crashing issue when running my "bugged" reproduction code https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestBug.scala. In CPU mode the script freezes after a few hundred loops with all CPUs pegged, and in GPU mode it freezes with the GPU/CPU idling. In the non-bugged version (https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestNoBug.scala) of my reproduction code this does not happen; it runs fine since it avoids the resize call. |
Updated https://github.com/jessebrizzi/MXNet-Bug to support testing with the released MXNet 1.3.1, 1.4.0-SNAPSHOT, and 1.5.0-SNAPSHOT. The bug exists in 1.3.1 (this was already known): running https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestBug.scala for 10000 forward passes of random input resizes leads to over 6 GB of system memory used and climbing when using the GPU. Both 1.4.0-SNAPSHOT and 1.5.0-SNAPSHOT cannot finish the 10000 forward passes random resize test, as they crash/freeze as described in my previous comment. For all versions, the https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestNoBug.scala test runs fine with only about 1.8 GB of memory used on my machine (no random resizes on the input) for both GPU and CPU backed nets. The dockerfile/README in my bug repo https://github.com/jessebrizzi/MXNet-Bug has the needed instructions if someone wants to try to reproduce my observations. |
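For readers who want the shape of the random-resize test without opening the repository, here is a minimal sketch of such a loop. It is not the code from TestBug.scala: the `ResizeStress` object, its `run` method, and the `forward` callback are hypothetical names, and the callback stands in for whatever builds a DataBatch of the given batch size and runs the Module's forward pass.

```scala
import scala.util.Random

object ResizeStress {
  /**
   * Draws `iterations` batch sizes uniformly from 1..maxBatch and hands each
   * one to `forward`. Returns the sizes that were used so a run can be replayed.
   * `forward` is a placeholder for the real model call (an assumption here).
   */
  def run(iterations: Int, maxBatch: Int, seed: Long)(forward: Int => Unit): Vector[Int] = {
    val rng = new Random(seed)
    Vector.fill(iterations) {
      val batch = rng.nextInt(maxBatch) + 1 // nextInt is 0 until maxBatch, so shift to 1..maxBatch
      forward(batch)
      batch
    }
  }
}
```

Fixing the seed keeps the sequence of batch sizes identical across MXNet versions, which makes memory readings from different runs easier to compare.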
Thanks @jessebrizzi, I'll use your tests on my end to reproduce and try to resolve this. |
I've finally come up with a good fix for this. I'll be submitting a PR shortly. I've run your test on my local laptop and am able to get through all passes without issue and memory looks stable. I've also fixed it so that this code can be wrapped in a ResourceScope block without it crashing. I'm going to start up an instance to run the test over the weekend just to be thorough but everything's looking good. |
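The idea of wrapping code in a scope block can be illustrated with a small, self-contained sketch. This is not MXNet's ResourceScope implementation: `FakeNative`, `Scope`, `register`, and `closeAll` are hypothetical names invented for illustration, and the sketch only shows the general pattern of releasing everything registered in a scope when the block exits.

```scala
import scala.collection.mutable.ArrayBuffer

/** Stand-in for a handle that owns native memory (hypothetical, not an MXNet class). */
final class FakeNative(val name: String) extends AutoCloseable {
  private var closed = false
  def isClosed: Boolean = closed
  override def close(): Unit = { closed = true }
}

/** Minimal scope: remembers what was registered and closes it in reverse order. */
final class Scope {
  private val owned = ArrayBuffer.empty[AutoCloseable]
  def register[A <: AutoCloseable](resource: A): A = { owned += resource; resource }
  def closeAll(): Unit = {
    owned.reverseIterator.foreach(_.close())
    owned.clear()
  }
}

object Scope {
  /** Runs `body` with a fresh scope and releases its resources even if `body` throws. */
  def using[A](body: Scope => A): A = {
    val scope = new Scope
    try body(scope) finally scope.closeAll()
  }
}
```

In this sketch a resource that escapes the block is already closed, which is why a real design needs some notion of ownership to decide what the scope may dispose of.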
@andrewfayres Fantastic! Really excited about this 💯 |
* Fixes for memory leak when reshaping executor
* Fixed Adam Optimizer memory leak
* Cleanup for PR
* Added unit test for new ResourceScope method
* Removing import that was added by overzealous ide
* Add back in an import
* Added flags for executor to know whether or not it owns NDArrays for disposal
* Moving to ResourceScope.using implementation
* Changes to make ResourceScope.using work with existing scope
* Updating ResourceScope to work with existing scopes via usingIfScopeExists method
* Fix clojure unit tests
* Fixes to be compatible with how clojure is using ResourceScope
* Removing some unnecessary changes
* Adding scope assertion in unit test
…e#14372)
Description
Create and bind an MXNet Module with batch size N+1, then loop and pass it DataBatches that require the Module to resize before performing the forward pass. Monitor the system resources (with htop, nvidia-smi, jvmtop) and you will notice that the used system memory in htop starts to grow, but not the JVM heap size (the system memory usage grows beyond the set max JVM heap size) or the GPU memory usage. This continues until your system runs out of memory and there is a crash, or the JVM is killed, clearing all of the leaked system memory with it.
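The same comparison can be logged from inside the process instead of watching htop and jvmtop side by side. The following is a minimal sketch, assuming a Linux host where /proc/self/status is readable; `MemoryProbe` and its methods are hypothetical helper names, not part of MXNet. A resident set size that keeps climbing while the heap figure stays flat is the pattern described above.

```scala
import java.lang.management.ManagementFactory
import scala.io.Source
import scala.util.Try

object MemoryProbe {
  /** JVM heap currently in use, in megabytes. */
  def heapUsedMb(): Long =
    ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed / (1024L * 1024L)

  /** Extracts the VmRSS value (reported in kB) from the lines of a /proc status file. */
  def parseVmRssKb(statusLines: Iterator[String]): Option[Long] =
    statusLines
      .find(_.startsWith("VmRSS:"))
      .flatMap(line => Try(line.split("\\s+")(1).toLong).toOption)

  /** Resident set size of this process in MB; None where /proc is unavailable (e.g. macOS). */
  def rssMb(): Option[Long] = Try {
    val src = Source.fromFile("/proc/self/status")
    try parseVmRssKb(src.getLines()) finally src.close()
  }.toOption.flatten.map(_ / 1024L)

  /** One log line per call; rssMb is -1 when it could not be read. */
  def report(iteration: Int): String =
    s"iter=$iteration heapMb=${heapUsedMb()} rssMb=${rssMb().getOrElse(-1L)}"
}
```

Calling `println(MemoryProbe.report(i))` every few hundred iterations of the forward loop gives a simple trace of both numbers over time.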
Environment info (Required)
Package used (Python/R/Scala/Julia):
Scala. This seems to be specific to the Scala API. Cannot reproduce in Python.
For Scala user, please provide:
Java version (java -version):
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
Maven version (mvn -version):
Apache Maven 3.0.5
Maven home: /usr/share/maven
Java version: 1.8.0_131, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.4.0-31-generic", arch: "amd64", family: "unix"
Scala runtime if applicable (scala -version):
2.11.11
Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio): GCC 4.8.4
MXNet commit hash:
07a83a0
Build config:
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1
make scalainstall -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1
Error Message:
None
Minimum reproducible example
link to simple Scala project/code to reproduce issue
https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestBug.scala
Steps to reproduce
(Paste the commands you ran that produced the error.)
What have you tried to solve it?