[Master Test Failure] Python3: TensorRT GPU #14626
Comments
Hey, this is the MXNet Label Bot. |
@mxnet-label-bot add Test |
@KellenSunderland would you have any ideas on this issue? |
I think this is due to the GluonCV model zoo trying to download CIFAR, which is not available. Taking a deeper look... |
Links seem to be alright |
Might be an issue with the CUDA driver; need to check the driver version on the hosts in the AMIs |
@mxnet-label-bot add [Test, CI] |
Strange, did this just start happening with a CUDA driver update? Did we change the instance type or anything else at the same time? The documentation shows that this error code occurs when the driver is older than the runtime, but we've been running this runtime in CI for quite some time, and presumably if we changed the driver it would have been to a newer version, so I'm not sure how we could have such a regression. For reference docs are at:
I wonder if there was an update to the base image that now requires a newer driver on the host? If so, we can probably pin to an older base image. What driver version is running on the host? |
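The "driver older than runtime" condition described above can be sketched as a plain version comparison. This is a minimal illustration, not code from MXNet or the CI scripts. It assumes the integer encoding that the CUDA calls `cudaDriverGetVersion` and `cudaRuntimeGetVersion` report (1000 * major + 10 * minor); the helper names and sample values are hypothetical and are not readings from the CI hosts.

```python
def decode_cuda_version(encoded):
    """Decode CUDA's integer version encoding (1000 * major + 10 * minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def driver_is_sufficient(driver_encoded, runtime_encoded):
    """Return True if the driver supports a CUDA version >= the runtime's."""
    return decode_cuda_version(driver_encoded) >= decode_cuda_version(runtime_encoded)


if __name__ == "__main__":
    # Illustrative values only: a 10.1-capable driver with a 10.0 runtime,
    # then a 9.2-capable driver with the same runtime.
    print(driver_is_sufficient(10010, 10000))
    print(driver_is_sufficient(9020, 10000))
```

If the second case were what the hosts looked like, the runtime in the container would be newer than what the host driver supports, which is the situation the documentation describes.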
Also see this: most of the time, the error message shows that it's a CUDA driver initialization problem. |
Created this PR #14642 to unblock PRs while we fix this issue |
Driver on the host:
|
And on the docker container (nvidia/cuda:10.0-cudnn7-devel):
|
On the docker container the CUDNN version appears to be 7.5.0:
|
Good info, as a heads up I'm using NVIDIA-SMI 418.40.04 Driver Version: 418.40.04 CUDA Version: 10.1 in a service and so far haven't seen any issues with this version. |
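For comparing hosts, the header quoted above can be parsed mechanically. The following is a rough sketch: it assumes the header text has the same shape as the string in the comment above, and the function names are made up for illustration. The `10.0` runtime value is taken from the `nvidia/cuda:10.0-cudnn7-devel` image tag mentioned earlier in the thread.

```python
import re

# Header text as quoted in the comment above.
HEADER = "NVIDIA-SMI 418.40.04 Driver Version: 418.40.04 CUDA Version: 10.1"


def parse_smi_header(text):
    """Extract (driver_version, cuda_version) strings; None where not found."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return (
        driver.group(1) if driver else None,
        cuda.group(1) if cuda else None,
    )


def as_tuple(version):
    """Turn a dotted version such as '10.1' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


if __name__ == "__main__":
    driver_version, cuda_version = parse_smi_header(HEADER)
    container_runtime = "10.0"  # from the nvidia/cuda:10.0-cudnn7-devel tag
    print(driver_version, cuda_version)
    print(as_tuple(cuda_version) >= as_tuple(container_runtime))
```

Running the same parse against the header reported by each CI host would show whether any host advertises a CUDA version below the container's runtime.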
Updating our host AMI with that driver version and giving it a test |
That appears to have fixed it, will be deploying to prod shortly: http://jenkins.mxnet-ci-dev.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fchancebair-unix-gpu/detail/chanbair-disable-tensorrtgpu/4/pipeline/276 |
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/master/496/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/master/497/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/master/498/pipeline
The following stages are failing:
Python3: TensorRT GPU