[Master Test Failure] Python3: TensorRT GPU #14626

Open · Chancebair opened this issue Apr 5, 2019 · 19 comments
@Chancebair (Contributor)

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/master/496/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/master/497/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/master/498/pipeline

The following stages are failing:

Python3: TensorRT GPU

===========================================
Model: cifar_resnet20_v1
===========================================
*** Running inference using pure MXNet ***


Model file is not found. Downloading.
Downloading /home/jenkins_slave/.mxnet/models/cifar_resnet20_v1-121e1579.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/cifar_resnet20_v1-121e1579.zip...
Downloading /home/jenkins_slave/.mxnet/datasets/cifar10/cifar-10-binary.tar.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/cifar10/cifar-10-binary.tar.gz...

/work/mxnet/python/mxnet/gluon/block.py:420: UserWarning: load_params is deprecated. Please use load_parameters.
  warnings.warn("load_params is deprecated. Please use load_parameters.")
[07:15:55] /work/mxnet/src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[2019-04-04 07:16:00   ERROR] Cuda initialization failure with error 35. Please check cuda installation:  http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html.
terminate called after throwing an instance of 'std::runtime_error'
  what():  Failed to create object
/work/runtime_functions.sh: line 827:    52 Aborted                 (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_trt_gpu.xml --verbose --nocapture tests/python/tensorrt/
build.py: 2019-04-04 07:16:01,058Z INFO Waiting for status of container e82e1b5998f4 for 600 s.
build.py: 2019-04-04 07:16:01,256Z INFO Container exit status: {'Error': None, 'StatusCode': 134}
@mxnet-label-bot (Contributor)

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Test

@Chancebair (Contributor, Author)

@mxnet-label-bot add Test

@Chancebair (Contributor, Author)

@KellenSunderland would you have any ideas on this issue?

@lebeg (Contributor) commented Apr 5, 2019

I think this is due to the GluonCV model zoo trying to download CIFAR, which is not available. Taking a deeper look...
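
A quick way to test that hypothesis outside of CI is to trigger the same downloads locally; a minimal sketch, assuming GluonCV and MXNet are installed and using the model/dataset names from the log above:

# Illustrative local check: pull the same pretrained model and dataset the
# TensorRT test downloads, to see whether the S3 links resolve.
import mxnet as mx
from gluoncv import model_zoo
from mxnet.gluon.data.vision import CIFAR10

# Downloads cifar_resnet20_v1-*.zip into ~/.mxnet/models if the link works.
net = model_zoo.get_model('cifar_resnet20_v1', pretrained=True, ctx=mx.cpu())

# Downloads cifar-10-binary.tar.gz into ~/.mxnet/datasets/cifar10.
val_data = CIFAR10(train=False)
print(net.name, len(val_data))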

@lebeg (Contributor) commented Apr 5, 2019

Links seem to be alright

@lebeg (Contributor) commented Apr 5, 2019

Might be an issue with the CUDA driver; need to check the driver version on the hosts in the AMIs.
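
One way to check the hosts non-interactively would be something like the following; a minimal sketch, assuming nvidia-smi is on the PATH of each host (the helper name is just illustrative):

# Illustrative helper: query the installed NVIDIA driver version via nvidia-smi.
import subprocess

def host_driver_version():
    # --query-gpu=driver_version prints one line per GPU, e.g. "410.73"
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
    )
    return out.decode().strip().splitlines()[0]

print(host_driver_version())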

@lanking520 (Member)

#14618

@abhinavs95 (Contributor)

@mxnet-label-bot add [Test, CI]

@KellenSunderland (Contributor)

Strange, did this just start happening with a cuda driver update? Did we change instance type or anything else at the same time?

The documentation shows that this error code happens when the driver is running an older version than the runtime, but we've been running this runtime for quite some time in CI, and presumably if we changed the driver it would have been to a newer version, so I'm not sure how we could have hit that regression.

For reference, the docs are at:
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038

This indicates that the installed NVIDIA CUDA driver is older than the CUDA runtime library. This is not a supported configuration. Users should install an updated NVIDIA display driver to allow the application to run.

I wonder if there was an update to the base image that now requires a newer driver on the host? If so we can probably pin to an older base image. What driver version is running on the host?
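
For what it's worth, error 35 corresponds to cudaErrorInsufficientDriver in the CUDA 10.0 runtime, which is exactly the driver-older-than-runtime case quoted above. The mismatch can be checked directly inside the failing container; a minimal sketch, assuming libcudart from the CUDA 10.0 toolkit is loadable (versions come back encoded as 1000*major + 10*minor):

# Illustrative diagnostic: compare the CUDA driver version reported by the
# installed driver against the CUDA runtime version linked into the container.
import ctypes

cudart = ctypes.CDLL("libcudart.so")  # may need the full soname, e.g. libcudart.so.10.0
driver, runtime = ctypes.c_int(), ctypes.c_int()
cudart.cudaDriverGetVersion(ctypes.byref(driver))
cudart.cudaRuntimeGetVersion(ctypes.byref(runtime))

# CUDA encodes versions as 1000 * major + 10 * minor, e.g. 10010 for 10.1.
print("driver:", driver.value, "runtime:", runtime.value)
if driver.value < runtime.value:
    print("driver is older than the runtime -> cudaErrorInsufficientDriver (35)")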

@reminisce (Contributor)

Also seeing this most of the time in
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14622/15/pipeline/274

The error message shows it's a CUDA driver initialization problem.

@Chancebair (Contributor, Author)

Created this PR #14642 to unblock PRs while we fix this issue

@Chancebair (Contributor, Author)

Driver on the host:

~$ nvidia-smi
Mon Apr  8 13:43:24 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.73       Driver Version: 410.73       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   46C    P0    43W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@Chancebair (Contributor, Author)

And in the Docker container (nvidia/cuda:10.0-cudnn7-devel):

# nvidia-smi
Mon Apr  8 14:20:56 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.73       Driver Version: 410.73       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   47C    P0    43W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@Chancebair (Contributor, Author)

In the Docker container, the cuDNN version appears to be 7.5.0:

# cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 5
#define CUDNN_PATCHLEVEL 0
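
The header only shows what the image was built against; the cuDNN library that actually gets loaded can also be queried at runtime. A minimal sketch, assuming libcudnn.so.7 is resolvable inside the container:

# Illustrative runtime check of the loaded cuDNN library version.
# cudnnGetVersion() returns e.g. 7500 for cuDNN 7.5.0.
import ctypes

cudnn = ctypes.CDLL("libcudnn.so.7")
cudnn.cudnnGetVersion.restype = ctypes.c_size_t
print("cuDNN version:", cudnn.cudnnGetVersion())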

@KellenSunderland (Contributor)

Good info. As a heads up, I'm using NVIDIA-SMI 418.40.04, Driver Version 418.40.04, CUDA Version 10.1 in a service and so far haven't seen any issues with this version.

@Chancebair (Contributor, Author)

Updating our host AMI with that driver version and giving it a test

@Chancebair (Contributor, Author) commented Apr 9, 2019

That appears to have fixed it; will be deploying to prod shortly: http://jenkins.mxnet-ci-dev.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fchancebair-unix-gpu/detail/chanbair-disable-tensorrtgpu/4/pipeline/276
