This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix memory leaks in Gluon #18328

Merged
merged 2 commits into from
May 15, 2020

Conversation

leezu
Contributor

@leezu leezu commented May 15, 2020

Description

Previously, _BlockScope kept references to the parameter ndarrays, which prevented their memory from being freed once a Block was no longer used. Among other problems, this caused memory usage to grow steadily in the unit tests, which create many different blocks and dispose of them at the end of each test, until the garbage collector kicked in. That can be too late: because the parameter arrays can be large, the system can run out of memory first.
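To illustrate the general mechanism described above, here is a minimal, MXNet-free sketch of how a strong back-reference keeps large data alive until the cyclic garbage collector runs. It is not the actual change in this PR: `Params`, `CyclicBlock`, `WeakBlock`, and `freed_without_gc` are hypothetical names invented for this example, and the behaviour shown relies on CPython's reference counting.

```python
import gc
import weakref


class Params:
    """Hypothetical stand-in for a large parameter array."""

    def __init__(self, nbytes):
        self.data = bytearray(nbytes)


class CyclicBlock:
    """Block whose scope holds a strong reference back to the block."""

    def __init__(self):
        self.params = Params(1024)
        # Strong back-reference: block -> scope -> block forms a cycle.
        self.scope = {"block": self}


class WeakBlock:
    """Same layout, but the scope only holds a weak reference."""

    def __init__(self):
        self.params = Params(1024)
        # Weak back-reference: no cycle, so reference counting suffices.
        self.scope = {"block": weakref.ref(self)}


def freed_without_gc(block_cls):
    """Return True if the params die as soon as the last name is dropped."""
    gc.disable()  # make sure only reference counting can free objects
    try:
        block = block_cls()
        probe = weakref.ref(block.params)  # observes params without owning them
        del block
        return probe() is None
    finally:
        gc.enable()
        gc.collect()  # clean up any cycle left behind


print(freed_without_gc(CyclicBlock))
print(freed_without_gc(WeakBlock))
```

On CPython, the first call is expected to report `False` (the parameters survive until a garbage collection pass) and the second `True` (they are released immediately). Breaking or avoiding the cycle, for example with a weak reference as assumed here, lets memory be reclaimed without waiting for the collector.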

@leezu leezu requested a review from szha as a code owner May 15, 2020 01:05
@mxnet-bot

Hey @leezu , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [miscellaneous, website, centos-gpu, centos-cpu, sanity, unix-cpu, unix-gpu, windows-gpu, edge, windows-cpu, clang]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@leezu leezu mentioned this pull request May 15, 2020
@leezu
Contributor Author

leezu commented May 15, 2020

@mxnet-bot run ci [centos-gpu, windows-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [centos-gpu, windows-gpu]

@sxjscience
Member

@Jerryzcn I think this is also related to your previous benchmark.

@leezu leezu merged commit 3e676fc into apache:master May 15, 2020
@leezu leezu deleted the fixblockmemoryleaks branch May 15, 2020 17:01
@leezu
Contributor Author

leezu commented May 15, 2020

@ciyongch should we backport this to 1.7?

@ciyongch
Contributor

Hi @leezu, if the issue also appears in 1.7, then please help backport it to the 1.7 and 1.x branches and tag me on the new PR. Thanks!

@leezu
Contributor Author

leezu commented May 16, 2020

This issue is present in all versions of Gluon. OK, let's backport the fix.

leezu added a commit to leezu/mxnet that referenced this pull request May 18, 2020
Fix leak of ndarray objects in the frontend due to reference cycle.

Backport of 3e676fc
leezu added a commit to leezu/mxnet that referenced this pull request May 18, 2020
Fix leak of ndarray objects in the frontend due to reference cycle.

Backport of 3e676fc
@leezu
Contributor Author

leezu commented May 18, 2020

@ciyongch I created the backport PRs

@ciyongch
Contributor

Thanks @leezu for helping backport the PR.

TaoLv pushed a commit that referenced this pull request May 19, 2020
Fix leak of ndarray objects in the frontend due to reference cycle.

Backport of 3e676fc
@apeforest
Contributor

apeforest commented May 19, 2020

@ChaiBapchya @access2rohit This may also have fixed our out-of-memory issue in the large tensor nightly tests when running them in sequence.

TaoLv pushed a commit that referenced this pull request May 27, 2020
Fix leak of ndarray objects in the frontend due to reference cycle.

Backport of 3e676fc
AntiZpvoh pushed a commit to AntiZpvoh/incubator-mxnet that referenced this pull request Jul 6, 2020
Fix leak of ndarray objects in the frontend due to reference cycle.
rondogency added a commit to rondogency/incubator-mxnet that referenced this pull request Jul 10, 2020
szha pushed a commit that referenced this pull request Jul 12, 2020
rondogency added a commit to rondogency/incubator-mxnet that referenced this pull request Jul 13, 2020
chinakook pushed a commit to chinakook/mxnet that referenced this pull request Jul 24, 2020
Fix leak of ndarray objects in the frontend due to reference cycle.
ChaiBapchya pushed a commit to ChaiBapchya/mxnet that referenced this pull request Aug 15, 2020
Fix leak of ndarray objects in the frontend due to reference cycle.

Backport of 3e676fc
leezu added a commit that referenced this pull request Sep 18, 2020
samskalicky pushed a commit that referenced this pull request Sep 19, 2020
chinakook added a commit to chinakook/gluon-cv that referenced this pull request Nov 17, 2020
After this commit apache/mxnet#18328, some memory leaks were fixed.
Without this commit, Faster R-CNN training cannot be successfully closed.
These commits can be committed again after this YOLO training fix.
apache/mxnet#18692
apache/mxnet@0496690
zhreshold added a commit to dmlc/gluon-cv that referenced this pull request Nov 24, 2020
* Fix yolo to support a memory leak fix

After this commit apache/mxnet#18328, some memory leaks were fixed.
Without this commit, Faster R-CNN training cannot be successfully closed.
These commits can be committed again after this YOLO training fix.
apache/mxnet#18692
apache/mxnet@0496690

* fix all generator error in windows when training with multiprocessing

* add pylint disable not-callable

* Fix pylint

* Fix pylint

Co-authored-by: Joshua Z. Zhang <[email protected]>