Memory allocation failed out of memory #19420
Comments
Update
To Reproduce
This error can be reproduced by running a small set of tests.
Error Message
Possible memory leak.
There is a possible GPU memory leak when running
|
Here are the logs before and after reverting #19378:
Before Revert
After Revert
|
@barry-jin: To investigate this problem I need to compile MXNet locally. Do you know which set of CMake options I need to use for that? |
From my experience, I just used the following commands to build MXNet locally and reproduce the issue:
|
Thanks a lot for the script! Unfortunately, I am having a linking problem:
The file |
You may try to update 3rdparty modules
|
@barry-jin: Is it true that the script you gave me should reproduce this problem? I tried it, and I don't see it: |
@andrei5055 Thanks for your investigation. I think the warning message should be "TVM is not supported". You can follow the TVM documentation to install TVM. Alternatively, I will provide a test suite without TVM support that will reproduce this issue. |
You can check out gluon-nlp at dmlc/gluon-nlp@7910d6d and run the following test suite.
|
@barry-jin: I still cannot reproduce this problem:
BTW, all warnings are of the following two types:
Type 2:
|
Description
pytest on `mxnet-cu102==2.0.0b20201022` will introduce a threading error (see Error Message). `mxnet-cu102==2.0.0b20201016` will not introduce this error.
Error Message
Run GluonNLP pytest with `mxnet-cu102==2.0.0b20201022`
To Reproduce
Run reproduce.sh
reproduce.sh
What have you tried to solve it?
Some observations:
mx.npx.waitall()
multiprocessing.Pool()
`mxnet-cu102==2.0.0b20201016` and `mxnet-cu102==2.0.0b20201022`, I find the first bad commit is "Remove cleanup on side threads" #19378
Environment
We recommend using our script for collecting the diagnostic information with the following command
curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3
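Since the behaviour differs between the two nightly builds named above, it may also help to record which build is actually installed next to the diagnostic output. The following is only a minimal sketch, assuming the wheel is installed under the distribution name `mxnet-cu102` as in the version specifiers above. The helper names (`nightly_date`, `installed_version`, `classify`) are hypothetical and not part of MXNet, and the result only says where a build sits relative to the two reported nightlies, not whether it is affected.

```python
import re
from importlib import metadata  # standard library, Python 3.8+

# The two nightly builds named in this issue.
KNOWN_GOOD = "2.0.0b20201016"  # reported as not showing the error
KNOWN_BAD = "2.0.0b20201022"   # reported as showing the error


def nightly_date(version):
    """Return the trailing YYYYMMDD of a nightly version string as an int, or None."""
    match = re.search(r"b(\d{8})$", version)
    return int(match.group(1)) if match else None


def installed_version(dist_name="mxnet-cu102"):
    """Return the installed version of dist_name, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def classify(version):
    """Describe where a nightly build sits relative to the two builds in this issue."""
    date = nightly_date(version)
    if date is None:
        return "not a nightly version string"
    if date <= nightly_date(KNOWN_GOOD):
        return "at or before the known-good nightly"
    if date >= nightly_date(KNOWN_BAD):
        return "at or after the known-bad nightly"
    return "between the two nightlies"


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("mxnet-cu102 is not installed")
    else:
        print(version, "->", classify(version))
```

Pasting the printed line together with the diagnose.py output makes it easier to tell whether a failed attempt to reproduce used a build from before or after the suspected change.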
Environment Information