This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Memory allocation failed out of memory #19420

Open
barry-jin opened this issue Oct 23, 2020 · 10 comments

Comments

@barry-jin
Contributor

barry-jin commented Oct 23, 2020

Description

  1. Running the full GluonNLP test suite with pytest on mxnet-cu102==2.0.0b20201022 triggers a threading error (see Error Message).
  2. Running the full test suite on mxnet-cu102==2.0.0b20201016 does not trigger this error.
  3. Running these tests separately does not trigger this error either.
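One way to check the third point systematically is to give every test file its own interpreter, so nothing (GPU memory, threads, worker pools) carries over between files. The sketch below is only an illustration: `build_commands` and `run_separately` are hypothetical helper names, and the `--device` and `--runslow` options are the GluonNLP pytest options used in the reproduce command in this issue.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(test_files, device="gpu"):
    """Return one pytest argv list per test file, so each file gets a fresh process."""
    return [
        [sys.executable, "-m", "pytest", f"--device={device}", "--runslow", str(path)]
        for path in test_files
    ]


def run_separately(test_dir="tests"):
    """Run every test_*.py file under test_dir in its own interpreter.

    Returns a dict mapping file name to the pytest exit code (0 means all passed).
    """
    results = {}
    for path in sorted(Path(test_dir).glob("test_*.py")):
        cmd = build_commands([path])[0]
        results[path.name] = subprocess.run(cmd).returncode
    return results
```

Calling `run_separately()` from the gluon-nlp checkout and comparing the exit codes with a single full-suite run should show whether a failure depends on state left behind by earlier test files.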

Error Message

Run GluonNLP pytest with `mxnet-cu102==2.0.0b20201022`
[2020-10-22T21:15:51.430Z] ============================= test session starts ==============================
[2020-10-22T21:15:51.430Z] platform linux -- Python 3.6.9, pytest-6.1.1, py-1.9.0, pluggy-0.13.1
[2020-10-22T21:15:51.432Z] rootdir: /workspace/gluon-nlp, configfile: pytest.ini
[2020-10-22T21:15:51.432Z] plugins: cov-2.10.1
[2020-10-22T21:15:52.426Z] collected 1283 items
[2020-10-22T21:16:01.630Z] tests/test_attention_cell.py ........................................... [  3%]
[2020-10-22T21:16:06.668Z] ......................................................................   [  8%]
[2020-10-22T21:16:06.796Z] tests/test_data_batchify.py ............................................ [ 12%]
[2020-10-22T21:16:21.672Z] .................................                                        [ 14%]
[2020-10-22T21:16:30.051Z] tests/test_data_filtering.py .....                                       [ 15%]
[2020-10-22T21:16:36.895Z] tests/test_data_loading.py .                                             [ 15%]
[2020-10-22T21:16:37.213Z] tests/test_data_sampler.py ............................................. [ 18%]
[2020-10-22T21:16:38.566Z] ........................................................................ [ 24%]
[2020-10-22T21:16:40.003Z] ........................................................................ [ 30%]
[2020-10-22T21:16:40.579Z] ........................................................................ [ 35%]
[2020-10-22T21:16:41.143Z] ........................................................................ [ 41%]
[2020-10-22T21:16:42.040Z] ........................................................................ [ 46%]
[2020-10-22T21:16:42.299Z] ...............                                                          [ 48%]
[2020-10-22T21:18:34.088Z] tests/test_data_tokenizers.py ..............                             [ 49%]
[2020-10-22T21:18:34.095Z] tests/test_data_vocab.py .                                               [ 49%]
[2020-10-22T21:22:22.268Z] tests/test_embedding.py ..                                               [ 49%]
[2020-10-22T21:22:59.289Z] tests/test_gluon_block.py .....                                          [ 49%]
[2020-10-22T21:22:59.328Z] tests/test_initializer.py ...                                            [ 49%]
[2020-10-22T21:23:00.225Z] tests/test_layers.py ...........................                         [ 52%]
[2020-10-22T21:23:00.312Z] tests/test_loss.py ........................                              [ 53%]
[2020-10-22T21:37:39.851Z] tests/test_models.py ................................................    [ 57%]
[2020-10-22T21:38:46.438Z] tests/test_models_albert.py .................                            [ 59%]
[2020-10-22T21:39:38.599Z] tests/test_models_bart.py ......                                         [ 59%]
[2020-10-22T21:44:18.743Z] tests/test_models_bert.py ............                                   [ 60%]
[2020-10-22T21:46:00.142Z] tests/test_models_electra.py ........                                    [ 61%]
[2020-10-22T21:49:47.086Z] tests/test_models_gpt2.py .......F                                       [ 61%]
[2020-10-22T21:49:57.226Z] tests/test_models_mobilebert.py .....                                    [ 62%]
[2020-10-22T21:51:27.552Z] tests/test_models_roberta.py ....FF                                      [ 62%]
[2020-10-22T21:52:10.783Z] tests/test_models_transformer.py ....................................... [ 65%]
[2020-10-22T21:53:33.876Z] ........................................................................ [ 71%]
[2020-10-22T21:54:26.540Z] ..........................................FFFFF                          [ 74%]
[2020-10-22T21:54:34.975Z] tests/test_models_transformer_xl.py ......                               [ 75%]
[2020-10-22T21:55:47.820Z] tests/test_models_xlmr.py .FF                                            [ 75%]
[2020-10-22T21:55:48.122Z] tests/test_op.py ....................................................... [ 79%]
[2020-10-22T21:55:48.754Z] ........................................................................ [ 85%]
[2020-10-22T21:55:49.195Z] ....                                                                     [ 85%]
[2020-10-22T21:56:20.712Z] tests/test_optimizer.py .                                                [ 85%]
[2020-10-22T21:56:20.716Z] tests/test_pytest.py .                                                   [ 85%]
[2020-10-22T21:56:21.005Z] tests/test_sequence_sampler.py ......................................... [ 89%]
[2020-10-22T21:56:21.522Z] ........................................................................ [ 94%]
[2020-10-22T21:56:33.345Z] .......................................                                  [ 97%]
[2020-10-22T21:56:33.590Z] Fatal Python error: Aborted
[2020-10-22T21:56:33.590Z] Thread 0x00007f92b9fff700 (most recent call first):
[2020-10-22T21:56:33.590Z]   File "/usr/lib/python3.6/threading.py", line 299 in wait
[2020-10-22T21:56:33.590Z]   File "/usr/lib/python3.6/threading.py", line 551 in wait
[2020-10-22T21:56:33.590Z]   File "/usr/local/lib/python3.6/dist-packages/tqdm/_monitor.py", line 59 in run
[2020-10-22T21:56:33.590Z]   File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
[2020-10-22T21:56:33.590Z]   File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap
[2020-10-22T21:56:33.590Z] Current thread 0x00007f9457153740 (most recent call first):
[2020-10-22T21:56:33.590Z]   File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 66 in _launch
[2020-10-22T21:56:33.590Z]   File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19 in __init__
[2020-10-22T21:56:33.590Z]   File "/usr/lib/python3.6/multiprocessing/context.py", line 277 in _Popen
[2020-10-22T21:56:33.590Z]   File "/usr/lib/python3.6/multiprocessing/process.py", line 105 in start
[2020-10-22T21:56:33.590Z]   File "/usr/lib/python3.6/multiprocessing/pool.py", line 239 in _repopulate_pool
[2020-10-22T21:56:33.591Z]   File "/usr/lib/python3.6/multiprocessing/pool.py", line 174 in __init__
[2020-10-22T21:56:33.591Z]   File "/usr/lib/python3.6/multiprocessing/context.py", line 119 in Pool
[2020-10-22T21:56:33.591Z]   File "/workspace/gluon-nlp/tests/test_utils_misc.py", line 87 in verify_download
[2020-10-22T21:56:33.591Z]   File "/workspace/gluon-nlp/tests/test_utils_misc.py", line 102 in test_download_s3
[2020-10-22T21:56:33.591Z]   File "/root/.local/lib/python3.6/site-packages/_pytest/python.py", line 184 in pytest_pyfunc_call
[2020-10-22T21:56:33.591Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-10-22T21:56:33.591Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 87 in <lambda>
[2020-10-22T21:56:33.591Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
[2020-10-22T21:56:33.591Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
[2020-10-22T21:56:33.591Z]   File "/root/.local/lib/python3.6/site-packages/_pytest/python.py", line 1627 in runtest
[2020-10-22T21:56:33.591Z]   File "/root/.local/lib/python3.6/site-packages/_pytest/runner.py", line 163 in pytest_runtest_call
[2020-10-22T21:56:33.591Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-10-22T21:56:33.591Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 87 in <lambda>
[2020-10-22T21:56:33.592Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
[2020-10-22T21:56:33.592Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
[2020-10-22T21:56:33.592Z]   File "/root/.local/lib/python3.6/site-packages/_pytest/runner.py", line 256 in <lambda>
[2020-10-22T21:56:33.592Z]   File "/root/.local/lib/python3.6/site-packages/_pytest/runner.py", line 310 in from_call
[2020-10-22T21:56:33.592Z]   File "/root/.local/lib/python3.6/site-packages/_pytest/runner.py", line 256 in call_runtest_hook
[2020-10-22T21:56:33.592Z]   File "/root/.local/lib/python3.6/site-packages/_pytest/runner.py", line 216 in call_and_report
[2020-10-22T21:56:33.592Z]   File "/root/.local/lib/python3.6/site-packages/_pytest/runner.py", line 127 in runtestprotocol
[2020-10-22T21:56:33.592Z]   File "/root/.local/lib/python3.6/site-packages/_pytest/runner.py", line 110 in pytest_runtest_protocol
[2020-10-22T21:56:33.592Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-10-22T21:56:33.592Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 87 in <lambda>
[2020-10-22T21:56:33.592Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
[2020-10-22T21:56:33.592Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
[2020-10-22T21:56:33.593Z]   File "/root/.local/lib/python3.6/site-packages/_pytest/main.py", line 338 in pytest_runtestloop
[2020-10-22T21:56:33.593Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-10-22T21:56:33.593Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 87 in <lambda>
[2020-10-22T21:56:33.593Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
[2020-10-22T21:56:33.593Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
[2020-10-22T21:56:33.593Z]   File "/root/.local/lib/python3.6/site-packages/_pytest/main.py", line 313 in _main
[2020-10-22T21:56:33.593Z]   File "/root/.local/lib/python3.6/site-packages/_pytest/main.py", line 257 in wrap_session
[2020-10-22T21:56:33.593Z]   File "/root/.local/lib/python3.6/site-packages/_pytest/main.py", line 306 in pytest_cmdline_main
[2020-10-22T21:56:33.593Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-10-22T21:56:33.593Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 87 in <lambda>
[2020-10-22T21:56:33.593Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
[2020-10-22T21:56:33.593Z]   File "/root/.local/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
[2020-10-22T21:56:33.594Z]   File "/root/.local/lib/python3.6/site-packages/_pytest/config/__init__.py", line 165 in main
[2020-10-22T21:56:33.594Z]   File "/root/.local/lib/python3.6/site-packages/_pytest/config/__init__.py", line 187 in console_main
[2020-10-22T21:56:33.594Z]   File "/root/.local/lib/python3.6/site-packages/pytest/__main__.py", line 5 in <module>
[2020-10-22T21:56:33.594Z]   File "/usr/lib/python3.6/runpy.py", line 85 in _run_code
[2020-10-22T21:56:33.594Z]   File "/usr/lib/python3.6/runpy.py", line 193 in _run_module_as_main
[2020-10-22T22:00:07.664Z] ./gluon_nlp_job.sh: line 39:    44 Aborted                 (core dumped) /bin/bash -o pipefail -c "$COMMAND"
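The progress log above is long, so a small parser helps to see which files contain `F` marks before the abort. This is a rough sketch that assumes the default (non-verbose) pytest progress format shown in this log and the CI timestamp prefix; `failures_per_file` is a hypothetical helper, not part of pytest or GluonNLP.

```python
import re

# Strips the "[2020-10-22T21:49:47.086Z] " CI timestamp prefix, if present.
TIMESTAMP = re.compile(r"^\[\d{4}-\d{2}-\d{2}T[\d:.]+Z\]\s*")
# A progress line: optional "tests/<file>.py ", result characters, then "[ NN%]".
PROGRESS = re.compile(r"^(?:(?P<file>\S+\.py)\s+)?(?P<marks>[.FEsxX]+)\s*\[\s*\d+%\]$")


def failures_per_file(log_lines):
    """Count 'F' marks per test file.

    Lines without a file name are continuation lines and are attributed
    to the most recently seen file. Files with no failures are omitted.
    """
    counts = {}
    current = None
    for raw in log_lines:
        line = TIMESTAMP.sub("", raw.rstrip())
        match = PROGRESS.match(line)
        if not match:
            continue  # not a progress line (e.g. the thread dump)
        if match.group("file"):
            current = match.group("file")
        if current is None:
            continue
        counts[current] = counts.get(current, 0) + match.group("marks").count("F")
    return {name: n for name, n in counts.items() if n}
```

Applied to the log above, this should attribute one failure to tests/test_models_gpt2.py, two to tests/test_models_roberta.py, five to tests/test_models_transformer.py, and two to tests/test_models_xlmr.py.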

To Reproduce

Compute Environment: 
Instance type: g4dn.4x
vCPUs: 16 

run reproduce.sh

reproduce.sh
#!/bin/bash
python3 -m pip install -U --quiet --pre "mxnet-cu102==2.0.0b20201022" -f https://dist.mxnet.io/python
git clone https://github.com/dmlc/gluon-nlp; cd gluon-nlp
git checkout master
python3 -m pip install --quiet -e .[extras]
python3 -m pytest --cov=. --cov-config=./.coveragerc --cov-report=xml --durations=50 --device="gpu" --runslow ./tests/
$ chmod +x reproduce.sh
$ ./reproduce.sh

What have you tried to solve it?

Some observations:

  1. The failed tests all call mx.npx.waitall().
  2. The test run aborted inside multiprocessing.Pool().
  3. After bisecting the commits between the nightly builds mxnet-cu102==2.0.0b20201016 and mxnet-cu102==2.0.0b20201022, I found that the first bad commit is "Remove cleanup on side threads" (#19378).
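For readers unfamiliar with bisecting, the idea behind the third observation can be sketched as a binary search over the ordered commit list. The issue does not say which tool was used (`git bisect` is the usual choice), so the code below only illustrates the search itself. `first_bad` and `is_bad` are hypothetical names; in practice `is_bad` would build MXNet at the given commit and run the failing tests.

```python
def first_bad(commits, is_bad):
    """Binary-search an ordered commit list for the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is also bad (the same assumption
    git bisect makes).
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # first bad commit is at mid or earlier
        else:
            lo = mid  # first bad commit is after mid
    return commits[hi]
```

With N commits between the two nightly builds, this needs about log2(N) build-and-test rounds instead of N.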

Environment

We recommend using our script for collecting the diagnostic information with the following command
curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3

Environment Information
[2020-10-27T16:59:27.002Z] ----------Python Info----------
[2020-10-27T16:59:27.002Z] Version      : 3.6.9
[2020-10-27T16:59:27.002Z] Compiler     : GCC 8.4.0
[2020-10-27T16:59:27.002Z] Build        : ('default', 'Oct  8 2020 12:12:24')
[2020-10-27T16:59:27.003Z] Arch         : ('64bit', '')
[2020-10-27T16:59:27.003Z] ------------Pip Info-----------
[2020-10-27T16:59:27.004Z] Version      : 20.2.4
[2020-10-27T16:59:27.004Z] Directory    : /usr/local/lib/python3.6/dist-packages/pip
[2020-10-27T16:59:27.004Z] ----------MXNet Info-----------
[2020-10-27T16:59:28.271Z] Version      : 2.0.0
[2020-10-27T16:59:28.271Z] Directory    : /root/.local/lib/python3.6/site-packages/mxnet
[2020-10-27T16:59:28.271Z] Commit hash file "/root/.local/lib/python3.6/site-packages/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
[2020-10-27T16:59:28.271Z] Library      : ['/root/.local/lib/python3.6/site-packages/mxnet/libmxnet.so']
[2020-10-27T16:59:28.271Z] Build features:
[2020-10-27T16:59:28.271Z] ✔ CUDA
[2020-10-27T16:59:28.271Z] ✔ CUDNN
[2020-10-27T16:59:28.271Z] ✖ NCCL
[2020-10-27T16:59:28.271Z] ✖ TENSORRT
[2020-10-27T16:59:28.271Z] ✖ CUTENSOR
[2020-10-27T16:59:28.271Z] ✔ CPU_SSE
[2020-10-27T16:59:28.271Z] ✔ CPU_SSE2
[2020-10-27T16:59:28.271Z] ✔ CPU_SSE3
[2020-10-27T16:59:28.271Z] ✖ CPU_SSE4_1
[2020-10-27T16:59:28.271Z] ✖ CPU_SSE4_2
[2020-10-27T16:59:28.271Z] ✖ CPU_SSE4A
[2020-10-27T16:59:28.271Z] ✖ CPU_AVX
[2020-10-27T16:59:28.271Z] ✖ CPU_AVX2
[2020-10-27T16:59:28.271Z] ✔ OPENMP
[2020-10-27T16:59:28.271Z] ✖ SSE
[2020-10-27T16:59:28.271Z] ✖ F16C
[2020-10-27T16:59:28.271Z] ✖ JEMALLOC
[2020-10-27T16:59:28.271Z] ✔ BLAS_OPEN
[2020-10-27T16:59:28.271Z] ✖ BLAS_ATLAS
[2020-10-27T16:59:28.271Z] ✖ BLAS_MKL
[2020-10-27T16:59:28.271Z] ✖ BLAS_APPLE
[2020-10-27T16:59:28.271Z] ✔ LAPACK
[2020-10-27T16:59:28.271Z] ✔ MKLDNN
[2020-10-27T16:59:28.271Z] ✔ OPENCV
[2020-10-27T16:59:28.271Z] ✔ DIST_KVSTORE
[2020-10-27T16:59:28.271Z] ✖ INT64_TENSOR_SIZE
[2020-10-27T16:59:28.271Z] ✔ SIGNAL_HANDLER
[2020-10-27T16:59:28.271Z] ✖ DEBUG
[2020-10-27T16:59:28.271Z] ✖ TVM_OP
[2020-10-27T16:59:28.271Z] ----------System Info----------
[2020-10-27T16:59:28.272Z] Platform     : Linux-4.14.186-146.268.amzn2.x86_64-x86_64-with-Ubuntu-18.04-bionic
[2020-10-27T16:59:28.272Z] system       : Linux
[2020-10-27T16:59:28.272Z] node         : ip-10-20-91-122.ec2.internal
[2020-10-27T16:59:28.272Z] release      : 4.14.186-146.268.amzn2.x86_64
[2020-10-27T16:59:28.272Z] version      : #1 SMP Tue Jul 14 18:16:52 UTC 2020
[2020-10-27T16:59:28.272Z] ----------Hardware Info----------
[2020-10-27T16:59:28.272Z] machine      : x86_64
[2020-10-27T16:59:28.272Z] processor    : x86_64
[2020-10-27T16:59:28.297Z] Architecture:        x86_64
[2020-10-27T16:59:28.297Z] CPU op-mode(s):      32-bit, 64-bit
[2020-10-27T16:59:28.297Z] Byte Order:          Little Endian
[2020-10-27T16:59:28.297Z] CPU(s):              16
[2020-10-27T16:59:28.297Z] On-line CPU(s) list: 0-15
[2020-10-27T16:59:28.297Z] Thread(s) per core:  2
[2020-10-27T16:59:28.297Z] Core(s) per socket:  8
[2020-10-27T16:59:28.297Z] Socket(s):           1
[2020-10-27T16:59:28.297Z] NUMA node(s):        1
[2020-10-27T16:59:28.297Z] Vendor ID:           GenuineIntel
[2020-10-27T16:59:28.297Z] CPU family:          6
[2020-10-27T16:59:28.297Z] Model:               85
[2020-10-27T16:59:28.297Z] Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
[2020-10-27T16:59:28.297Z] Stepping:            7
[2020-10-27T16:59:28.297Z] CPU MHz:             3103.458
[2020-10-27T16:59:28.297Z] BogoMIPS:            4999.99
[2020-10-27T16:59:28.297Z] Hypervisor vendor:   KVM
[2020-10-27T16:59:28.297Z] Virtualization type: full
[2020-10-27T16:59:28.297Z] L1d cache:           32K
[2020-10-27T16:59:28.297Z] L1i cache:           32K
[2020-10-27T16:59:28.297Z] L2 cache:            1024K
[2020-10-27T16:59:28.297Z] L3 cache:            36608K
[2020-10-27T16:59:28.297Z] NUMA node0 CPU(s):   0-15
[2020-10-27T16:59:28.297Z] Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
[2020-10-27T16:59:28.298Z] ----------Network Test----------
[2020-10-27T16:59:28.298Z] Setting timeout: 10
[2020-10-27T16:59:28.766Z] Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0007 sec, LOAD: 0.4678 sec.
[2020-10-27T16:59:29.018Z] Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0861 sec, LOAD: 0.1656 sec.
[2020-10-27T16:59:29.168Z] Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>, DNS finished in 0.11675071716308594 sec.
[2020-10-27T16:59:29.307Z] Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0076 sec, LOAD: 0.1308 sec.
[2020-10-27T16:59:29.489Z] Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0034 sec, LOAD: 0.1785 sec.
[2020-10-27T16:59:29.564Z] Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.02842235565185547 sec.
[2020-10-27T16:59:29.564Z] ----------Environment----------
@barry-jin
Contributor Author

Update

To Reproduce

This error can also be reproduced by running a small subset of the tests.

python3 -m pip install -U --quiet --pre "mxnet-cu102==2.0.0b20201022" -f https://dist.mxnet.io/python
git clone https://github.com/dmlc/gluon-nlp; cd gluon-nlp
git checkout master
python3 -m pip install --quiet -e .[extras]
python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py tests/test_models_albert.py tests/test_models_bart.py tests/test_models_bert.py
Error Message
Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1362441855 to reproduce.
============================== test session starts ===============================
platform linux -- Python 3.6.9, pytest-6.1.2, py-1.9.0, pluggy-0.13.1 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /workspace/gluon-nlp, configfile: pytest.ini
plugins: cov-2.10.1
collected 95 items                                                               

tests/test_models.py::test_list_backbone_names PASSED                      [  1%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_base_v2] PASSED [  2%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_large_v2] PASSED [  3%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_xlarge_v2] PASSED [  4%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_xxlarge_v2] PASSED [  5%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_base] PASSED [  6%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_large] PASSED [  7%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_wwm_large] PASSED [  8%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_base] PASSED [  9%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_large] PASSED [ 10%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 11%]
tests/test_models.py::test_get_backbone[ctx0-google_multi_cased_bert_base] PASSED [ 12%]
tests/test_models.py::test_get_backbone[ctx0-google_zh_bert_base] PASSED   [ 13%]
tests/test_models.py::test_get_backbone[ctx0-gluon_electra_small_owt] PASSED [ 14%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_base] PASSED   [ 15%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_large] PASSED  [ 16%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_small] PASSED  [ 17%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_124M] PASSED             [ 18%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_1558M] PASSED            [ 20%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_355M] PASSED             [ 21%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_774M] PASSED             [ 22%]
tests/test_models.py::test_get_backbone[ctx0-google_uncased_mobilebert] PASSED [ 23%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_base] PASSED  [ 24%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_large] PASSED [ 25%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_base] PASSED     [ 26%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_large] PASSED    [ 27%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_base] PASSED     [ 28%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_large] PASSED    [ 29%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_albert_base_v2] PASSED [ 30%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_en_cased_bert_base] PASSED [ 31%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_electra_small] PASSED [ 32%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-fairseq_bart_base] PASSED [ 33%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_albert_base_v2] PASSED [ 34%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_en_cased_bert_base] PASSED [ 35%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_electra_small] PASSED [ 36%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-fairseq_bart_base] PASSED [ 37%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_albert_base_v2] PASSED [ 38%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_en_cased_bert_base] PASSED [ 40%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_electra_small] PASSED [ 41%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-fairseq_bart_base] PASSED [ 42%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_albert_base_v2] PASSED [ 43%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_en_cased_bert_base] PASSED [ 44%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_electra_small] PASSED [ 45%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-fairseq_bart_base] PASSED [ 46%]
tests/test_models_albert.py::test_albert_backbone[auto-False-False] PASSED [ 47%]
tests/test_models_albert.py::test_albert_backbone[auto-True-True] PASSED   [ 48%]
tests/test_models_albert.py::test_albert_backbone[NT-False-False] PASSED   [ 49%]
tests/test_models_albert.py::test_albert_backbone[NT-True-True] PASSED     [ 50%]
tests/test_models_albert.py::test_albert_backbone[TN-False-False] PASSED   [ 51%]
tests/test_models_albert.py::test_albert_backbone[TN-True-True] PASSED     [ 52%]
tests/test_models_albert.py::test_albert_for_mlm_model[auto] PASSED        [ 53%]
tests/test_models_albert.py::test_albert_for_mlm_model[NT] PASSED          [ 54%]
tests/test_models_albert.py::test_albert_for_mlm_model[TN] PASSED          [ 55%]
tests/test_models_albert.py::test_albert_for_pretrain_model[auto] PASSED   [ 56%]
tests/test_models_albert.py::test_albert_for_pretrain_model[NT] PASSED     [ 57%]
tests/test_models_albert.py::test_albert_for_pretrain_model[TN] PASSED     [ 58%]
tests/test_models_albert.py::test_list_pretrained_albert PASSED            [ 60%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_base_v2] PASSED [ 61%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_large_v2] PASSED [ 62%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xlarge_v2] PASSED [ 63%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xxlarge_v2] PASSED [ 64%]
tests/test_models_bart.py::test_list_pretrained_bart PASSED                [ 65%]
tests/test_models_bart.py::test_bart[fairseq_bart_base] PASSED             [ 66%]
tests/test_models_bart.py::test_bart[fairseq_bart_large] PASSED            [ 67%]
tests/test_models_bart.py::test_bart_cfg_registry PASSED                   [ 68%]
tests/test_models_bart.py::test_bart_cfg[bart_base] PASSED                 [ 69%]
tests/test_models_bart.py::test_bart_cfg[bart_large] PASSED                [ 70%]
tests/test_models_bert.py::test_list_pretrained_bert PASSED                [ 71%]
tests/test_models_bert.py::test_bert_small_cfg[ctx0-auto] PASSED           [ 72%]
tests/test_models_bert.py::test_bert_small_cfg[ctx0-NT] PASSED             [ 73%]
tests/test_models_bert.py::test_bert_small_cfg[ctx0-TN] PASSED             [ 74%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_base] PASSED [ 75%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_large] PASSED [ 76%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_wwm_large] PASSED [ 77%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_base] PASSED [ 78%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_large] PASSED [ 80%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 81%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_multi_cased_bert_base] PASSED [ 82%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_zh_bert_base] PASSED [ 83%]
tests/test_models_electra.py::test_list_pretrained_electra PASSED          [ 84%]
tests/test_models_electra.py::test_electra_model[ctx0-auto] PASSED         [ 85%]
tests/test_models_electra.py::test_electra_model[ctx0-NT] PASSED           [ 86%]
tests/test_models_electra.py::test_electra_model[ctx0-TN] PASSED           [ 87%]
tests/test_models_electra.py::test_electra_get_pretrained[ctx0-gluon_electra_small_owt] PASSED [ 88%]
tests/test_models_electra.py::test_electra_get_pretrained[ctx0-google_electra_base] PASSED [ 89%]
tests/test_models_electra.py::test_electra_get_pretrained[ctx0-google_electra_large] PASSED [ 90%]
tests/test_models_electra.py::test_electra_get_pretrained[ctx0-google_electra_small] PASSED [ 91%]
tests/test_models_gpt2.py::test_list_pretrained_gpt2 PASSED                [ 92%]
tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-auto] PASSED        [ 93%]
tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-TN] PASSED          [ 94%]
tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-NT] PASSED          [ 95%]
tests/test_models_gpt2.py::test_gpt2_incremental_states[ctx0] PASSED       [ 96%]
tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_124M] PASSED                [ 97%]
tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_355M] PASSED                                                        [ 98%]
tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M] FAILED                                                                    [100%]

============================================================== FAILURES ==============================================================
_____________________________________________________ test_gpt2[ctx0-gpt2_774M] ______________________________________________________

model_name = 'gpt2_774M', ctx = gpu(0)

    @pytest.mark.slow
    @pytest.mark.remote_required
    @pytest.mark.parametrize('model_name', ['gpt2_124M', 'gpt2_355M', 'gpt2_774M'])
    def test_gpt2(model_name, ctx):
        # test from pretrained
        assert len(list_pretrained_gpt2()) > 0
        with tempfile.TemporaryDirectory() as root, ctx:
            cfg, tokenizer, params_path, lm_params_path =\
                get_pretrained_gpt2(model_name, load_backbone=True, load_lm=True, root=root)
            assert cfg.MODEL.vocab_size == len(tokenizer.vocab)
            # test backbone
            gpt2_model = GPT2Model.from_cfg(cfg)
            gpt2_model.load_parameters(params_path)
            # test lm model
            gpt2_lm_model = GPT2ForLM(cfg)
            gpt2_lm_model.load_parameters(lm_params_path)
    
            # test forward
            batch_size = 3
            seq_length = 32
            vocab_size = len(tokenizer.vocab)
            input_ids = mx.np.array(
                np.random.randint(
                    2,
                    vocab_size,
                    (batch_size, seq_length)
                ),
                dtype=np.int32,
                ctx=ctx
            )
            logits, _ = gpt2_lm_model(
                input_ids,
                gpt2_lm_model.init_states(batch_size, ctx)
            )
>           mx.npx.waitall()

tests/test_models_gpt2.py:142: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py:240: in waitall
    check_call(_LIB.MXNDArrayWaitAll())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ret = -1

    def check_call(ret):
        """Check the return value of C API call.
    
        This function will raise an exception when an error occurs.
        Wrap every API call with this function.
    
        Parameters
        ----------
        ret : int
            return value from API calls.
        """
        if ret != 0:
>           raise get_last_ffi_error()
E           mxnet.base.MXNetError: Traceback (most recent call last):
E             File "../src/storage/./pooled_storage_manager.h", line 192
E           MXNetError: Memory allocation failed out of memory

/usr/local/lib/python3.6/dist-packages/mxnet/base.py:246: MXNetError
-------------------------------------------------------- Captured stdout call --------------------------------------------------------
Downloading /tmp/tmpbj080s2v/gpt2_774M/gpt2-9dc62091.vocab from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/gpt2-9dc62091.vocab...
Downloading /tmp/tmpbj080s2v/gpt2_774M/gpt2-396d4d8e.merges from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/gpt2-396d4d8e.merges...
Downloading /tmp/tmpbj080s2v/gpt2_774M/model-9917e24e.params from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/model-9917e24e.params...
Downloading /tmp/tmpbj080s2v/gpt2_774M/model_lm-cfbfa641.params from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/model_lm-cfbfa641.params...
-------------------------------------------------------- Captured stderr call --------------------------------------------------------
100%|██████████| 558k/558k [00:00<00:00, 7.15MiB/s]
100%|██████████| 456k/456k [00:00<00:00, 6.39MiB/s]
100%|██████████| 3.10G/3.10G [01:16<00:00, 40.5MiB/s]
100%|██████████| 3.10G/3.10G [01:20<00:00, 38.6MiB/s]
========================================================== warnings summary ==========================================================
src/gluonnlp/attention_cell.py:715
  /workspace/gluon-nlp/src/gluonnlp/attention_cell.py:715: DeprecationWarning: invalid escape sequence \s
    """

src/gluonnlp/op.py:226
  /workspace/gluon-nlp/src/gluonnlp/op.py:226: DeprecationWarning: invalid escape sequence \p
    """

tests/test_models_albert.py: 6 warnings
tests/test_models_bart.py: 2 warnings
tests/test_models_bert.py: 3 warnings
tests/test_models_gpt2.py: 3 warnings
  /usr/local/lib/python3.6/dist-packages/mxnet/gluon/block.py:572: UserWarning: Parameter 'weight' is already initialized, ignoring. Set force_reinit=True to re-initialize.
    v.initialize(None, ctx, init, force_reinit=force_reinit)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
====================================================== short test summary info =======================================================
FAILED tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M] - mxnet.base.MXNetError: Traceback (most recent call last):
======================================= 1 failed, 94 passed, 16 warnings in 1990.67s (0:33:10) =======================================

Possible memory leak.

There is a possible GPU memory leak when running tests/test_models.py::test_tvm_integration on the 10.22 nightly release:

python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py::test_tvm_integration
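A leak like this can be checked by sampling GPU memory while the tests run. The following is a minimal sketch, not part of the GluonNLP test suite: the helper names (`parse_used_mib`, `query_used_mib`, `growth`) are hypothetical, and it assumes `nvidia-smi` is on the PATH. It avoids `capture_output` so that it also runs on the Python 3.6 interpreter used in the logs.

```python
import subprocess


def parse_used_mib(smi_output):
    """Parse nvidia-smi CSV output into used-memory values (MiB), one per GPU."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_used_mib():
    """Ask nvidia-smi how much memory is in use on each visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text mode, compatible with Python 3.6
    )
    return parse_used_mib(result.stdout)


def growth(samples):
    """Return the per-GPU difference between the last and the first sample."""
    first, last = samples[0], samples[-1]
    return [after - before for before, after in zip(first, last)]
```

Calling `query_used_mib()` before and after each test (for example from a pytest fixture) and passing the collected samples to `growth` gives a rough per-GPU trend. Note that the numbers may include memory that MXNet's pooled storage manager has cached for reuse, so steady growth across many tests is more telling than a single high reading.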

[Four screenshots, captured 2020-11-04 at 9:39:48, 9:44:52, 9:40:09, and 9:45:13 AM]

@barry-jin barry-jin changed the title [BUG] Fatal Python error when running GluonNLP pytest on MXNet linux nightly build Memory allocation failed out of memory Nov 4, 2020
@barry-jin
Contributor Author

Here are the logs before and after reverting #19378 ("Remove cleanup on side threads"):

Before Revert
root@6a1ad75b3392:/workspace/incubator-mxnet# git log -1
commit 43750c8bfed6ca91fc47fd1fa6d620197e26c84c (HEAD)
Author: Przemyslaw Tredak <[email protected]>
Date:   Wed Oct 21 11:50:12 2020 -0700

    Remove cleanup on side threads (#19378)
    
    * Remove cleanup on side threads
    
    * removed comment
root@6a1ad75b3392:/workspace/incubator-mxnet# cd ../gluon-nlp/ ; python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py tests/test_models_albert.py tests/test_models_bart.py tests/test_models_bert.py tests/test_models_gpt2.py
Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1033001789 to reproduce.
=================================== test session starts ====================================
platform linux -- Python 3.6.9, pytest-6.1.2, py-1.9.0, pluggy-0.13.1 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /workspace/gluon-nlp, configfile: pytest.ini
plugins: cov-2.10.1
collected 87 items                                                                         

tests/test_models.py::test_list_backbone_names PASSED                                [  1%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_base_v2] PASSED           [  2%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_large_v2] PASSED          [  3%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_xlarge_v2] PASSED         [  4%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_xxlarge_v2] PASSED        [  5%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_base] PASSED       [  6%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_large] PASSED      [  8%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_wwm_large] PASSED  [  9%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_base] PASSED     [ 10%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_large] PASSED    [ 11%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 12%]
tests/test_models.py::test_get_backbone[ctx0-google_multi_cased_bert_base] PASSED    [ 13%]
tests/test_models.py::test_get_backbone[ctx0-google_zh_bert_base] PASSED             [ 14%]
tests/test_models.py::test_get_backbone[ctx0-gluon_electra_small_owt] PASSED         [ 16%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_base] PASSED             [ 17%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_large] PASSED            [ 18%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_small] PASSED            [ 19%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_124M] PASSED                       [ 20%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_1558M] PASSED                      [ 21%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_355M] PASSED                       [ 22%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_774M] PASSED                       [ 24%]
tests/test_models.py::test_get_backbone[ctx0-google_uncased_mobilebert] PASSED       [ 25%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_base] PASSED            [ 26%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_large] PASSED           [ 27%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_base] PASSED               [ 28%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_large] PASSED              [ 29%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_base] PASSED               [ 31%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_large] PASSED              [ 32%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_albert_base_v2] PASSED [ 33%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_en_cased_bert_base] PASSED [ 34%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_electra_small] PASSED  [ 35%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-fairseq_bart_base] PASSED     [ 36%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_albert_base_v2] PASSED [ 37%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_en_cased_bert_base] PASSED [ 39%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_electra_small] PASSED  [ 40%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-fairseq_bart_base] PASSED     [ 41%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_albert_base_v2] PASSED [ 42%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_en_cased_bert_base] PASSED [ 43%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_electra_small] PASSED  [ 44%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-fairseq_bart_base] PASSED     [ 45%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_albert_base_v2] PASSED [ 47%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_en_cased_bert_base] PASSED [ 48%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_electra_small] PASSED  [ 49%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-fairseq_bart_base] PASSED     [ 50%]
tests/test_models_albert.py::test_albert_backbone[auto-False-False] PASSED           [ 51%]
tests/test_models_albert.py::test_albert_backbone[auto-True-True] PASSED             [ 52%]
tests/test_models_albert.py::test_albert_backbone[NT-False-False] PASSED             [ 54%]
tests/test_models_albert.py::test_albert_backbone[NT-True-True] PASSED               [ 55%]
tests/test_models_albert.py::test_albert_backbone[TN-False-False] PASSED             [ 56%]
tests/test_models_albert.py::test_albert_backbone[TN-True-True] PASSED               [ 57%]
tests/test_models_albert.py::test_albert_for_mlm_model[auto] PASSED                  [ 58%]
tests/test_models_albert.py::test_albert_for_mlm_model[NT] PASSED                    [ 59%]
tests/test_models_albert.py::test_albert_for_mlm_model[TN] PASSED                    [ 60%]
tests/test_models_albert.py::test_albert_for_pretrain_model[auto] PASSED             [ 62%]
tests/test_models_albert.py::test_albert_for_pretrain_model[NT] PASSED               [ 63%]
tests/test_models_albert.py::test_albert_for_pretrain_model[TN] PASSED               [ 64%]
tests/test_models_albert.py::test_list_pretrained_albert PASSED                      [ 65%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_base_v2] PASSED [ 66%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_large_v2] PASSED [ 67%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xlarge_v2] PASSED [ 68%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xxlarge_v2] PASSED [ 70%]
tests/test_models_bart.py::test_list_pretrained_bart PASSED                          [ 71%]
tests/test_models_bart.py::test_bart[fairseq_bart_base] PASSED                       [ 72%]
tests/test_models_bart.py::test_bart[fairseq_bart_large] PASSED                      [ 73%]
tests/test_models_bart.py::test_bart_cfg_registry PASSED                             [ 74%]
tests/test_models_bart.py::test_bart_cfg[bart_base] PASSED                           [ 75%]
tests/test_models_bart.py::test_bart_cfg[bart_large] PASSED                          [ 77%]
tests/test_models_bert.py::test_list_pretrained_bert PASSED                          [ 78%]
tests/test_models_bert.py::test_bert_small_cfg[ctx0-auto] PASSED                     [ 79%]
tests/test_models_bert.py::test_bert_small_cfg[ctx0-NT] PASSED                       [ 80%]
tests/test_models_bert.py::test_bert_small_cfg[ctx0-TN] PASSED                       [ 81%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_base] PASSED [ 82%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_large] PASSED [ 83%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_wwm_large] PASSED [ 85%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_base] PASSED [ 86%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_large] PASSED [ 87%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 88%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_multi_cased_bert_base] PASSED [ 89%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_zh_bert_base] PASSED [ 90%]
tests/test_models_gpt2.py::test_list_pretrained_gpt2 PASSED                          [ 91%]
tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-auto] PASSED                  [ 93%]
tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-TN] PASSED                    [ 94%]
tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-NT] PASSED                    [ 95%]
tests/test_models_gpt2.py::test_gpt2_incremental_states[ctx0] PASSED                 [ 96%]
tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_124M] PASSED                          [ 97%]
tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_355M] PASSED                          [ 98%]
tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M] FAILED                          [100%]

========================================= FAILURES =========================================
________________________________ test_gpt2[ctx0-gpt2_774M] _________________________________

model_name = 'gpt2_774M', ctx = gpu(0)

    @pytest.mark.slow
    @pytest.mark.remote_required
    @pytest.mark.parametrize('model_name', ['gpt2_124M', 'gpt2_355M', 'gpt2_774M'])
    def test_gpt2(model_name, ctx):
        # test from pretrained
        assert len(list_pretrained_gpt2()) > 0
        with tempfile.TemporaryDirectory() as root, ctx:
            cfg, tokenizer, params_path, lm_params_path =\
                get_pretrained_gpt2(model_name, load_backbone=True, load_lm=True, root=root)
            assert cfg.MODEL.vocab_size == len(tokenizer.vocab)
            # test backbone
            gpt2_model = GPT2Model.from_cfg(cfg)
            gpt2_model.load_parameters(params_path)
            # test lm model
            gpt2_lm_model = GPT2ForLM(cfg)
            gpt2_lm_model.load_parameters(lm_params_path)
    
            # test forward
            batch_size = 3
            seq_length = 32
            vocab_size = len(tokenizer.vocab)
            input_ids = mx.np.array(
                np.random.randint(
                    2,
                    vocab_size,
                    (batch_size, seq_length)
                ),
                dtype=np.int32,
                ctx=ctx
            )
            logits, _ = gpt2_lm_model(
                input_ids,
                gpt2_lm_model.init_states(batch_size, ctx)
            )
>           mx.npx.waitall()

tests/test_models_gpt2.py:142: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../incubator-mxnet/python/mxnet/ndarray/ndarray.py:240: in waitall
    check_call(_LIB.MXNDArrayWaitAll())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ret = -1

    def check_call(ret):
        """Check the return value of C API call.
    
        This function will raise an exception when an error occurs.
        Wrap every API call with this function.
    
        Parameters
        ----------
        ret : int
            return value from API calls.
        """
        if ret != 0:
>           raise get_last_ffi_error()
E           mxnet.base.MXNetError: Traceback (most recent call last):
E             File "../src/storage/./pooled_storage_manager.h", line 192
E           MXNetError: Memory allocation failed out of memory

../incubator-mxnet/python/mxnet/base.py:246: MXNetError
----------------------------------- Captured stdout call -----------------------------------
Downloading /tmp/tmpzxj5da72/gpt2_774M/gpt2-9dc62091.vocab from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/gpt2-9dc62091.vocab...
Downloading /tmp/tmpzxj5da72/gpt2_774M/gpt2-396d4d8e.merges from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/gpt2-396d4d8e.merges...
Downloading /tmp/tmpzxj5da72/gpt2_774M/model-9917e24e.params from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/model-9917e24e.params...
Downloading /tmp/tmpzxj5da72/gpt2_774M/model_lm-cfbfa641.params from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/model_lm-cfbfa641.params...
----------------------------------- Captured stderr call -----------------------------------
100%|██████████| 558k/558k [00:00<00:00, 3.45MiB/s]
100%|██████████| 456k/456k [00:00<00:00, 4.16MiB/s]
100%|██████████| 3.10G/3.10G [01:07<00:00, 45.9MiB/s]
100%|██████████| 3.10G/3.10G [01:18<00:00, 39.4MiB/s]
===================================== warnings summary =====================================
../incubator-mxnet/python/mxnet/contrib/onnx/mx2onnx/_op_translations.py:67
  /workspace/incubator-mxnet/python/mxnet/contrib/onnx/mx2onnx/_op_translations.py:67: DeprecationWarning: invalid escape sequence \(
    tuple_re = re.compile('\([0-9L|,| ]+\)')

src/gluonnlp/attention_cell.py:715
  /workspace/gluon-nlp/src/gluonnlp/attention_cell.py:715: DeprecationWarning: invalid escape sequence \s
    """

src/gluonnlp/op.py:226
  /workspace/gluon-nlp/src/gluonnlp/op.py:226: DeprecationWarning: invalid escape sequence \p
    """

tests/test_models_albert.py: 6 warnings
tests/test_models_bart.py: 2 warnings
tests/test_models_bert.py: 3 warnings
tests/test_models_gpt2.py: 3 warnings
  /workspace/incubator-mxnet/python/mxnet/gluon/block.py:572: UserWarning: Parameter 'weight' is already initialized, ignoring. Set force_reinit=True to re-initialize.
    v.initialize(None, ctx, init, force_reinit=force_reinit)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
================================= short test summary info ==================================
FAILED tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M] - mxnet.base.MXNetError: Trac...
================== 1 failed, 86 passed, 17 warnings in 1718.22s (0:28:38) ==================
root@6a1ad75b3392:/workspace/gluon-nlp# 
After Revert
root@6a1ad75b3392:/workspace/incubator-mxnet# git log -1
commit d786518725ebfdfceeea7b09d3ecb8edf6bbbfaa (HEAD)
Author: barry-jin <[email protected]>
Date:   Tue Dec 8 21:42:28 2020 +0000

    Revert "Remove cleanup on side threads (#19378)"
    
    This reverts commit 43750c8bfed6ca91fc47fd1fa6d620197e26c84c.
root@6a1ad75b3392:/workspace/incubator-mxnet# cd ../gluon-nlp/ ; python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py tests/test_models_albert.py tests/test_models_bart.py tests/test_models_bert.py tests/test_models_gpt2.py
Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1725596454 to reproduce.
=================================== test session starts ====================================
platform linux -- Python 3.6.9, pytest-6.1.2, py-1.9.0, pluggy-0.13.1 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /workspace/gluon-nlp, configfile: pytest.ini
plugins: cov-2.10.1
collected 87 items                                                                         

tests/test_models.py::test_list_backbone_names PASSED                                [  1%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_base_v2] PASSED           [  2%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_large_v2] PASSED          [  3%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_xlarge_v2] PASSED         [  4%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_xxlarge_v2] PASSED        [  5%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_base] PASSED       [  6%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_large] PASSED      [  8%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_wwm_large] PASSED  [  9%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_base] PASSED     [ 10%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_large] PASSED    [ 11%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 12%]
tests/test_models.py::test_get_backbone[ctx0-google_multi_cased_bert_base] PASSED    [ 13%]
tests/test_models.py::test_get_backbone[ctx0-google_zh_bert_base] PASSED             [ 14%]
tests/test_models.py::test_get_backbone[ctx0-gluon_electra_small_owt] PASSED         [ 16%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_base] PASSED             [ 17%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_large] PASSED            [ 18%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_small] PASSED            [ 19%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_124M] PASSED                       [ 20%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_1558M] PASSED                      [ 21%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_355M] PASSED                       [ 22%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_774M] PASSED                       [ 24%]
tests/test_models.py::test_get_backbone[ctx0-google_uncased_mobilebert] PASSED       [ 25%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_base] PASSED            [ 26%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_large] PASSED           [ 27%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_base] PASSED               [ 28%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_large] PASSED              [ 29%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_base] PASSED               [ 31%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_large] PASSED              [ 32%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_albert_base_v2] PASSED [ 33%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_en_cased_bert_base] PASSED [ 34%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_electra_small] PASSED  [ 35%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-fairseq_bart_base] PASSED     [ 36%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_albert_base_v2] PASSED [ 37%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_en_cased_bert_base] PASSED [ 39%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_electra_small] PASSED  [ 40%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-fairseq_bart_base] PASSED     [ 41%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_albert_base_v2] PASSED [ 42%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_en_cased_bert_base] PASSED [ 43%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_electra_small] PASSED  [ 44%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-fairseq_bart_base] PASSED     [ 45%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_albert_base_v2] PASSED [ 47%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_en_cased_bert_base] PASSED [ 48%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_electra_small] PASSED  [ 49%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-fairseq_bart_base] PASSED     [ 50%]
tests/test_models_albert.py::test_albert_backbone[auto-False-False] PASSED           [ 51%]
tests/test_models_albert.py::test_albert_backbone[auto-True-True] PASSED             [ 52%]
tests/test_models_albert.py::test_albert_backbone[NT-False-False] PASSED             [ 54%]
tests/test_models_albert.py::test_albert_backbone[NT-True-True] PASSED               [ 55%]
tests/test_models_albert.py::test_albert_backbone[TN-False-False] PASSED             [ 56%]
tests/test_models_albert.py::test_albert_backbone[TN-True-True] PASSED               [ 57%]
tests/test_models_albert.py::test_albert_for_mlm_model[auto] PASSED                  [ 58%]
tests/test_models_albert.py::test_albert_for_mlm_model[NT] PASSED                    [ 59%]
tests/test_models_albert.py::test_albert_for_mlm_model[TN] PASSED                    [ 60%]
tests/test_models_albert.py::test_albert_for_pretrain_model[auto] PASSED             [ 62%]
tests/test_models_albert.py::test_albert_for_pretrain_model[NT] PASSED               [ 63%]
tests/test_models_albert.py::test_albert_for_pretrain_model[TN] PASSED               [ 64%]
tests/test_models_albert.py::test_list_pretrained_albert PASSED                      [ 65%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_base_v2] PASSED [ 66%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_large_v2] PASSED [ 67%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xlarge_v2] PASSED [ 68%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xxlarge_v2] PASSED [ 70%]
tests/test_models_bart.py::test_list_pretrained_bart PASSED                          [ 71%]
tests/test_models_bart.py::test_bart[fairseq_bart_base] PASSED                       [ 72%]
tests/test_models_bart.py::test_bart[fairseq_bart_large] PASSED                      [ 73%]
tests/test_models_bart.py::test_bart_cfg_registry PASSED                             [ 74%]
tests/test_models_bart.py::test_bart_cfg[bart_base] PASSED                           [ 75%]
tests/test_models_bart.py::test_bart_cfg[bart_large] PASSED                          [ 77%]
tests/test_models_bert.py::test_list_pretrained_bert PASSED                          [ 78%]
tests/test_models_bert.py::test_bert_small_cfg[ctx0-auto] PASSED                     [ 79%]
tests/test_models_bert.py::test_bert_small_cfg[ctx0-NT] PASSED                       [ 80%]
tests/test_models_bert.py::test_bert_small_cfg[ctx0-TN] PASSED                       [ 81%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_base] PASSED [ 82%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_large] PASSED [ 83%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_wwm_large] PASSED [ 85%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_base] PASSED [ 86%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_large] PASSED [ 87%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 88%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_multi_cased_bert_base] PASSED [ 89%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_zh_bert_base] PASSED [ 90%]
tests/test_models_gpt2.py::test_list_pretrained_gpt2 PASSED                          [ 91%]
tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-auto] PASSED                  [ 93%]
tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-TN] PASSED                    [ 94%]
tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-NT] PASSED                    [ 95%]
tests/test_models_gpt2.py::test_gpt2_incremental_states[ctx0] PASSED                 [ 96%]
tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_124M] PASSED                          [ 97%]
tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_355M] PASSED                          [ 98%]
tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M] PASSED                          [100%]

===================================== warnings summary =====================================
../incubator-mxnet/python/mxnet/contrib/onnx/mx2onnx/_op_translations.py:67
  /workspace/incubator-mxnet/python/mxnet/contrib/onnx/mx2onnx/_op_translations.py:67: DeprecationWarning: invalid escape sequence \(
    tuple_re = re.compile('\([0-9L|,| ]+\)')

src/gluonnlp/attention_cell.py:715
  /workspace/gluon-nlp/src/gluonnlp/attention_cell.py:715: DeprecationWarning: invalid escape sequence \s
    """

src/gluonnlp/op.py:226
  /workspace/gluon-nlp/src/gluonnlp/op.py:226: DeprecationWarning: invalid escape sequence \p
    """

tests/test_models_albert.py: 6 warnings
tests/test_models_bart.py: 2 warnings
tests/test_models_bert.py: 3 warnings
tests/test_models_gpt2.py: 3 warnings
  /workspace/incubator-mxnet/python/mxnet/gluon/block.py:572: UserWarning: Parameter 'weight' is already initialized, ignoring. Set force_reinit=True to re-initialize.
    v.initialize(None, ctx, init, force_reinit=force_reinit)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
======================= 87 passed, 17 warnings in 1928.37s (0:32:08) =======================
root@6a1ad75b3392:/workspace/gluon-nlp# 
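With logs this long, the before/after comparison is easier to check mechanically. The sketch below is an illustration only: the function names `outcomes` and `changed` are hypothetical, and it assumes both logs were produced with `pytest --verbose` so that each test has its own result line.

```python
import re

# Matches per-test result lines such as:
#   tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M] FAILED   [100%]
RESULT_RE = re.compile(r"^(\S+::\S+)\s+(PASSED|FAILED|SKIPPED|ERROR|XFAIL|XPASS)\b")


def outcomes(log_text):
    """Map each test id in a verbose pytest log to its reported outcome."""
    found = {}
    for line in log_text.splitlines():
        match = RESULT_RE.match(line.strip())
        if match:
            found[match.group(1)] = match.group(2)
    return found


def changed(before_log, after_log):
    """Return {test id: (before, after)} for tests whose outcome differs."""
    before, after = outcomes(before_log), outcomes(after_log)
    return {
        test: (before[test], after[test])
        for test in before
        if test in after and before[test] != after[test]
    }
```

Going by the two summaries above (1 failed and 86 passed before the revert, 87 passed after), the only entry expected from `changed` is `tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M]`.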

@andrei5055
Contributor

@barry-jin: To investigate this problem I need to compile MXNet locally. Do you know which set of CMake options I need to use for that?

@barry-jin
Contributor Author

barry-jin commented Jan 19, 2021

In my experience, the following commands are enough to build MXNet locally and reproduce the issue:

$ git clone --recursive https://github.com/apache/incubator-mxnet
$ cd incubator-mxnet
$ git checkout 43750c8bfed6ca91fc47fd1fa6d620197e26c84c
$ cp config/linux_gpu.cmake config.cmake
$ mkdir build; cd build
$ cmake -GNinja -DCMAKE_BUILD_TYPE=Debug ..; ninja
$ cd ..
$ python3 -m pip install --user -e ./python
$ cd ~/workspace
$ git clone https://github.com/dmlc/gluon-nlp
$ cd ~/workspace/gluon-nlp
$ git checkout 8c8b0c9cda0853caa88fdbf4e0544986fbef243c
$ python3 -m pip install --quiet -e .[extras]
$ python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py tests/test_models_albert.py tests/test_models_bart.py tests/test_models_bert.py tests/test_models_gpt2.py

@andrei5055
Contributor

Thanks a lot for the script! Unfortunately, I am having a linking problem:

root@28b3a2b8de7a:/opt/mxnet/build# ninja
[1/3] Linking CXX shared library libmxnet.so
FAILED: libmxnet.so 
. . .
Error copying file "/opt/mxnet/build/3rdparty/mkldnn/include/dnnl_config.h" to "/opt/mxnet/include/mkldnn/".
ninja: build stopped: subcommand failed.

The file dnnl_config.h is not present anywhere in incubator-mxnet.

@barry-jin
Contributor Author

barry-jin commented Jan 19, 2021

You could try updating the 3rdparty submodules:

$ git clean -ffxd
$ git submodule update --init --recursive

@andrei5055
Contributor

@barry-jin: Should the script you gave me reproduce this problem? I tried it and I don't see the failure:
==== 71 passed, 16 skipped, 17 warnings in 1528.46s (0:25:28) ====
Just in case: the 16 tests were skipped because "JVM is not supported". I'm not sure whether the memory problem would show up in one of these tests.

@barry-jin
Contributor Author

@andrei5055 Thanks for investigating. I think the message should be "TVM is not supported". You can follow the TVM documentation to install TVM. Alternatively, I will provide a test suite without TVM support that reproduces this issue.
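Before rerunning the suite, it may help to confirm whether TVM is visible to the interpreter that runs pytest. This is a small sketch under the assumption that the 16 skips come from the tests being unable to import the `tvm` package; the exact skip condition in gluon-nlp is not shown in this thread, and `module_available` is a hypothetical helper name.

```python
import importlib.util


def module_available(name):
    """Return True if the import system can locate the top-level module `name`."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # If this prints False, the TVM integration tests cannot import tvm.
    print("tvm importable:", module_available("tvm"))
```

Run it with the same `python3` that is used for `python3 -m pytest`, since a TVM installation in a different environment would not be picked up.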

@barry-jin
Contributor Author

You can check out gluon-nlp at dmlc/gluon-nlp@7910d6d and run the following test suite:

git checkout 7910d6d247ec9cb1b51cd49d79e3d474b087b188
python3 -m pytest --device='gpu' --verbose --runslow tests/test_attention_cell.py tests/test_data_batchify.py tests/test_data_filtering.py tests/test_data_sampler.py tests/test_data_tokenizers.py tests/test_embedding.py tests/test_gluon_block.py tests/test_initializer.py tests/test_layers.py tests/test_loss.py tests/test_models.py tests/test_models_albert.py tests/test_models_bart.py tests/test_models_bert.py tests/test_models_electra.py tests/test_models_gpt2.py tests/test_models_roberta.py tests/test_models_transformer.py
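The original report notes that the failure does not appear when the tests are run separately. To compare against the combined run above, each file can be started in its own process. The sketch below is illustrative: `pytest_command` and `run_isolated` are hypothetical helpers, and the default arguments simply mirror the flags used in this thread.

```python
import subprocess
import sys


def pytest_command(test_file, extra_args=("--device=gpu", "--verbose", "--runslow")):
    """Build the command that runs a single test file in a fresh interpreter."""
    return [sys.executable, "-m", "pytest", *extra_args, test_file]


def run_isolated(test_files):
    """Run each file in its own process and return {file: pytest exit code}."""
    return {
        test_file: subprocess.run(pytest_command(test_file)).returncode
        for test_file in test_files
    }
```

Because every file gets a new process, GPU memory is released when that process exits. If the out-of-memory error only shows up in the combined run, that points to memory accumulating across tests rather than to a single test exceeding the device's capacity.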

@andrei5055
Contributor

@barry-jin: I still cannot reproduce this problem:
========== 933 passed, 847 warnings in 2932.28s (0:48:52) =======

BTW, all warnings are of the following two types:
Type 1:

  /opt/mxnet/python/mxnet/gluon/block.py:1098: UserWarning: Parameter 0b7a2e74_c816_4146_bbb2_7973d2ca9112_gamma, 0af6619c_7075_430a_9226_8458e6ca733a_bias, c75fe6d3_81e7_4748_9894_f49abf4b5f2a_bias, 53661f2f_d20f_4c90_a539_173394b859d3_weight, 2b4ce060_94a7_4cd1_ac29_4bdc41789888_weight, e19ccd3d_cc61_44b2_ab1a_20e88f571877_bias, 8f53b519_069f_415a_bd05_c8b4ec58dd24_const, 99d015d6_eeca_4ad6_9fc6_1fb55e43b0f7_weight, 711c0a20_91e2_43c3_ba41_48f5fd2a3398_gamma, d852d48d_ca52_408a_83f3_2c11bf3a01b8_beta, e0417d39_d73a_4101_a440_f992b45a176e_weight, 3f5329d5_0903_448a_8c7a_65536aa507a1_bias, d08c8d34_3bca_4006_9843_aa5d069767cf_beta is not used by any computation. Is this intended?
    self._build_cache(*args)

Type 2:

  /opt/mxnet/python/mxnet/registry.py:108: UserWarning: New initializer mxnet.gluon.parameter.Init registered with name constant_140658119590520 isoverriding existing initializer mxnet.gluon.parameter.Init
    register(klass, name)
