Failed to import MXNet built with TensorRT #12142
Comments
Hi @Faldict, thanks for reporting this issue. @haojin2 could you please take a look? I remember somebody is already working on this; can you point this issue to that PR? @mxnet-label-bot could you please label this as [build, backend]? |
Hey @Faldict. The problem is that you don't have protobuf on your LD_LIBRARY_PATH. I'd recommend setting your path like the following:
Where MXNET_PATH is the root directory of your MXNet folder, and PROTOBUF_PATH is the root directory of your protobuf files. In general, when you run into these runtime issues, one step that usually helps me is to run
which should show all the libraries that have been referenced during compilation, but which are not currently on my library path. I then search my filesystem for those libs and append their folders into my LD_LIBRARY_PATH. Note: Right now building from source for this feature is admittedly quite complicated. Thank you very much for being an early adopter. We're working together this week to provide some detailed information about how to install and run this feature. Those docs will hopefully make the process a bit easier. |
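The "find the unresolved libraries" step described above can be sketched in Python. This is only an illustration, assuming the listing being filtered has the same shape as `ldd` output, where unresolved entries end in `=> not found`. The function name `missing_libraries` and the sample text are made up for the example and are not taken from this thread.

```python
def missing_libraries(ldd_output):
    """Return the names of shared libraries that an ldd-style listing marks as unresolved."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libfoo.so.1 => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


# Hypothetical sample listing, for illustration only.
sample = """\
    linux-vdso.so.1 (0x00007ffd2a5f2000)
    libprotobuf.so.15 => not found
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c2a000000)
"""

print(missing_libraries(sample))
```

Each name returned is a library to locate on the filesystem; the folder that contains it is what would be appended to LD_LIBRARY_PATH.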
@KellenSunderland Thanks for your useful reply! I ran
In fact, I have built
|
@Faldict It looks like everything is linked and resolved correctly to me. That is a little strange. I'd like to statically link the protobuf lib in the future which should solve this. The only advice I could give at this point would be to try to closely copy the installation process the CI is taking. I'll hope to have docker images and/or pip packages next week if you're ok with either of those solutions. Edit: one thing you could try and do is ensure you only have a single version of protobuf on your machine (i.e. uninstall any that may have been included from package managers), then clean and rebuild. |
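To check whether more than one protobuf shared library is present, a directory scan along the following lines may help. It is a rough sketch: the helper name `find_protobuf_sonames` and the directories you pass in are assumptions, and it only recognises files named like `libprotobuf.so.<number>...`, grouping them by the first number after `.so.`.

```python
import os
import re

# Matches e.g. "libprotobuf.so.15" or "libprotobuf.so.15.0.1" (not libprotobuf-lite).
_LIB_RE = re.compile(r"^libprotobuf\.so\.(\d+)(?:\.\d+)*$")


def find_protobuf_sonames(lib_dirs):
    """Map each soname major number to the matching library paths found in lib_dirs."""
    found = {}
    for directory in lib_dirs:
        if not os.path.isdir(directory):
            continue  # silently skip directories that do not exist
        for name in sorted(os.listdir(directory)):
            match = _LIB_RE.match(name)
            if match:
                path = os.path.join(directory, name)
                found.setdefault(match.group(1), []).append(path)
    return found


# Example call; these directories are only common defaults, adjust for your system.
print(find_protobuf_sonames(["/usr/lib", "/usr/local/lib", "/usr/lib/x86_64-linux-gnu"]))
```

If the result has more than one key, several protobuf builds are installed side by side, which is the situation the advice above suggests eliminating before a clean rebuild.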
@KellenSunderland I uninstalled protobuf 3.5.1 and rebuilt the whole toolchain. MXNet can now be imported successfully. It seems that you should constrain the protobuf version strictly. Furthermore, I tried to run a TensorRT baseline. I used the test code
As I set some breakpoints, I found this error occurs when executing this line:
where the symbol and parameters are trained by running EDIT: when I dug deeper, I found that the error occurs during the execution of |
@Faldict: I suspect it's still something to do with the build, but it could be some missing validation. Do the other tests run properly? I'd like to do two things to help troubleshoot the problem. First, use a pre-built package to rule out build issues. Second, let's gather some more information by getting a full stack dump. Would you be able to run the diagnose script so I can see what OS distro you're running? I'm working on an installer package for TRT at the moment. Since you're one of the earlier adopters, maybe you can give it a shot and see if it fixes your issues? What to do:
Could you also try to run the test using gdb? You would need to run something like:
gdb python3 incubator-mxnet/tests/python/tensorrt/test_tensorrt_lenet5.py
then from within gdb
c
# to continue, it should then crash and allow you to enter this command:
thread apply all bt
# dumps the stack of all threads
If you could then paste the results here that would help me understand where the crash is coming from. |
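For anyone who prefers a non-interactive run, the same backtrace can be collected in one shot using gdb's batch mode. The sketch below only builds the argument list; the helper name `gdb_backtrace_command` is invented for this example, and it assumes `gdb` and `python3` are on the PATH. The flags used (`-batch`, `-ex`, `--args`) are standard gdb options.

```python
def gdb_backtrace_command(script_path, python_exe="python3"):
    """Build a gdb command line that runs a Python script and dumps all thread stacks."""
    return [
        "gdb",
        "-batch",                      # exit once the listed commands have run
        "-ex", "run",                  # start the program under gdb
        "-ex", "thread apply all bt",  # after it stops, print every thread's stack
        "--args", python_exe, script_path,
    ]


cmd = gdb_backtrace_command(
    "incubator-mxnet/tests/python/tensorrt/test_tensorrt_lenet5.py"
)
print(" ".join(cmd))

# To actually run it and capture the output (requires gdb to be installed):
#   import subprocess
#   result = subprocess.run(cmd, capture_output=True, text=True)
#   print(result.stdout)
```

The captured output can then be pasted into the issue in place of a manually copied gdb session.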
@KellenSunderland I'm glad to do something that would benefit your work. Firstly, I ran the diagnosis script and pasted the output below:
What's more, as I mentioned here, my PC has a GTX 1060 GPU. Then I used gdb to run the test, which crashed with the following message:
Next, I entered the dump command and selected the most important segment, which is Thread 1 in this case, and pasted it here:
At this time I could clearly affirm that this crash occurs during the execution of |
Hey @Faldict I've updated the version of onnx-trt in our repo. I don't think it'll address your issue yet, but you can give the new version a shot. |
Hey @Faldict. (1) Nice machine. (2) I was wondering if you'd be able to test a pre-release version of MXNet 1.3 from a pip package? Could you try a pip install mxnet-tensorrt-cu90 ? |
Hi @KellenSunderland. I have installed mxnet-tensorrt-cu90, but failed to use the GPU. While running code with a GPU context, I hit some errors:
The CUDA version is indeed 9.0, so I wonder which cuDNN version the package is built against. By the way, this pip package depends on protobuf 3.5. I wish you could point out critical dependencies. (I reinstalled protobuf 3.5.1 again.) |
I'm trying to link as many packages as possible statically, but have been unable to do so with protobuf yet. The "no kernel image available" message is a CUDA error, but it's not specific to CUDA versions. It's actually saying the package doesn't include object code compatible with your GPU (which should be compute capability 6.1). |
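The compatibility rule behind that message can be written down as a small check. This is a simplified model, not the CUDA runtime's actual logic: it assumes native object code built for compute capability X.y runs on devices with the same major version X and a minor version of at least y, and that embedded PTX can be JIT-compiled on any device whose compute capability is at least the PTX target. The function name `can_run` and the architecture lists are hypothetical; the thread does not say which architectures the package was built for.

```python
def can_run(device_cc, native_archs, ptx_archs):
    """Return True if a binary with the given targets can run on a device.

    device_cc:    (major, minor) compute capability of the GPU, e.g. (6, 1)
    native_archs: list of (major, minor) targets with precompiled object code
    ptx_archs:    list of (major, minor) targets shipped as PTX for JIT compilation
    """
    major, minor = device_cc
    # Native code: same major version, built for an equal or lower minor version.
    native_ok = any(m == major and n <= minor for m, n in native_archs)
    # PTX: forward compatible with any device at or above the PTX target.
    jit_ok = any((m, n) <= (major, minor) for m, n in ptx_archs)
    return native_ok or jit_ok


gtx_1060 = (6, 1)
# Hypothetical build with object code for 3.7 and 7.0 only: nothing matches 6.1.
print(can_run(gtx_1060, [(3, 7), (7, 0)], []))        # False
# Adding PTX for an older target lets the driver JIT kernels at load time.
print(can_run(gtx_1060, [(3, 7), (7, 0)], [(5, 2)]))  # True
```

The second case is also why shipping JIT-compilable kernels trades a smaller package for a delay the first time the library is loaded.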
@KellenSunderland +1, I currently have decent code that can build all binaries except protobuf...
@Faldict for CU90 build:
|
Alright, I'm a little limited in what I can ship at the moment due to maximum file sizes in PyPI. I just pushed a version with static protobuf and JIT compilable GPU operators for Pascal cards. This may introduce a small delay when you first load the library as CUDA kernels are JIT'd. This should get you past the errors you're currently seeing though, so give it an update. A regular pip upgrade should work, but if not try: I'm working with the PyPI maintainers to up our limits there, and then I'll be able to make the package more portable. pypi/warehouse#4686 |
The diligent PyPI maintainers have enabled extra storage space for our two packages, and I've uploaded a version that has both Pascal and Volta support included. I've also statically compiled a number of libraries to make the lib more portable. Give the new version a shot and see if it addresses your issues. |
@KellenSunderland That problem was probably due to the mismatch of nvidia driver versions. After I fixed the problems and installed the latest pip package
Seems that it works fine! Thanks for your awesome efforts! |
Hi @KellenSunderland, sorry to bother you again. |
@Faldict @KellenSunderland
Following is the error message
|
FYI build is quite close to what's in CI under the ci/docker/runtime_function.sh file. Hope it helps. |
I pulled the latest source code from the master branch and built MXNet successfully with USE_TENSORRT = 1. But I failed to import mxnet. Here is the error log:
I use protobuf 3.5.1.
@mkolod Could you please take a look at this?