Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add robust dependency check for Python package #6436

Merged
merged 15 commits into from
Feb 21, 2021

Conversation

ivanst0
Copy link
Member

@ivanst0 ivanst0 commented Jan 25, 2021

Description:

  • Use the correct version of CUDA when multiple versions are installed on the machine.
    • A .py file is generated during CMake build and then used at runtime to check for the appropriate version of dependencies.
    • cuDNN is expected to be installed at same location as CUDA Toolkit, as described in the official installation instructions from NVIDIA.
    • Python 3.6 and 3.7 rely on PATH environment variable for locating CUDA/cuDNN DLLs, while in Python 3.8 and 3.9 a new API (os.add_dll_directory) is used to specify DLL search location.
    • The new method of locating CUDA is also more robust: previously it could happen that everything works just fine if import torch is placed before import onnxruntime, but an error occurs if the order of these imports is switched (onnxruntime could end up using CUDA DLLs distributed with PyTorch).
  • Provide an informative error message to the user in case one of dependencies is missing.
    • onnxruntime-gpu package depends on CUDA, cuDNN and C++ Runtime (if built with VS 2019).
    • The missing dependency and the required version are printed as part of the error message.
    • The check for VS 2019 C++ Runtime is now more robust.
  • Build onnxruntime-gpu Python package for Python 3.8/3.9 in CI.

Motivation and Context:

  • Having more than one version of CUDA installed on the machine is a quite common scenario. onnxruntime-gpu package should work out-of-the-box on such a machine, without the need for manual configuration.
  • The users are often confused when import onnxruntime complains about not being able to find a DLL - it's not obvious which dependency is missing.
  • Python 3.8 edition of onnxruntime-gpu package is still missing on PyPI. This change should help producing robust packages for all supported versions of Python.

Affected components:

  • onnxruntime and onnxruntime-gpu Python packages for Windows for all versions of Python.

fixes #5697
fixes #5963
fixes #6433
fixes #6435

@ivanst0 ivanst0 requested a review from a team as a code owner January 25, 2021 19:54
@snnn
Copy link
Member

snnn commented Jan 25, 2021

/azp run orttraining-distributed

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ivanst0
Copy link
Member Author

ivanst0 commented Jan 27, 2021

@snnn, @jywu-msft
Could you please place cuDNN DLLs in the same directory as CUDA DLLs on machines in Win-GPU-2019 agent pool? This follows the official cuDNN installation instructions from NVIDIA and better represents the setup that end users are likely to have on their machines.

This should solve the only remaining test failure (in Windows GPU CI Pipeline).

@snnn
Copy link
Member

snnn commented Jan 27, 2021

@snnn, @jywu-msft
Could you please place cuDNN DLLs in the same directory as CUDA DLLs on machines in Win-GPU-2019 agent pool? This follows the official cuDNN installation instructions from NVIDIA and better represents the setup that end users are likely to have on their machines.

This should solve the only remaining test failure (in Windows GPU CI Pipeline).

I can do it. But now I'm busily working on Security Development Lifecycle (SDL) items. I won't have extra time to work on this in this week or the next week.

@ivanst0
Copy link
Member Author

ivanst0 commented Feb 4, 2021

@jywu-msft, could you please take a look at this PR?

@jywu-msft
Copy link
Member

@jywu-msft, could you please take a look at this PR?

thanks. Will take a look. was OOF so still catching up on things.

@ivanst0
Copy link
Member Author

ivanst0 commented Feb 5, 2021

Merged the latest master to pick up the recent fix for Windows build (caused by update to numpy 1.20).
Please kick off the builds again.

@faxu
Copy link
Contributor

faxu commented Feb 6, 2021

/azp run Linux CPU CI Pipeline, Linux CPU x64 NoContribops CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, MacOS NoContribops CI Pipeline, Windows CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline

@faxu
Copy link
Contributor

faxu commented Feb 6, 2021

/azp run Windows GPU CI Pipeline, WIndows GPU TensorRT CI Pipeline, centos7_cpu, centos7_cpu (linux_centos_ci Debug), centos7_cpu (linux_centos_ci Release), orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 5 pipeline(s).

@azure-pipelines
Copy link

Azure Pipelines successfully started running 9 pipeline(s).

@snnn
Copy link
Member

snnn commented Feb 6, 2021

/azp run orttraining-distributed, Linux Nuphar CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

Copy link
Member

@jywu-msft jywu-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please test the code as I documented. it doesn't work.

@ivanst0
Copy link
Member Author

ivanst0 commented Feb 6, 2021

please test the code as I documented. it doesn't work.

"Windows GPU CI Pipeline" completed successfully. Do you believe something is missing in the pipeline?

@snnn
Copy link
Member

snnn commented Feb 6, 2021

please test the code as I documented. it doesn't work.

"Windows GPU CI Pipeline" completed successfully. Do you believe something is missing in the pipeline?

That pipeline is using python 3.7.

@ivanst0
Copy link
Member Author

ivanst0 commented Feb 7, 2021

/azp run Windows GPU CI Pipeline

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 6436 in repo microsoft/onnxruntime

@jywu-msft
Copy link
Member

please test the code as I documented. it doesn't work.

"Windows GPU CI Pipeline" completed successfully. Do you believe something is missing in the pipeline?

yes, the fix must be tested with python 3.8

@snnn
Copy link
Member

snnn commented Feb 17, 2021

So, all the builds passed.

@ivanst0
Copy link
Member Author

ivanst0 commented Feb 17, 2021

please test the code as I documented. it doesn't work.

Python packaging pipeline succeeded for all supported operating systems and Python versions.

@ivanst0
Copy link
Member Author

ivanst0 commented Feb 18, 2021

@jywu-msft, @snnn,
Could you please advise what are the next steps for merging this PR?
All the pipelines completed successfully.
Further waiting only generates additional work (such as resolving a conflict in tools/ci_build/build.py which popped up recently).

@ivanst0 ivanst0 requested review from snnn and jywu-msft February 20, 2021 12:03
@jywu-msft
Copy link
Member

@jywu-msft, @snnn,
Could you please advise what are the next steps for merging this PR?
All the pipelines completed successfully.
Further waiting only generates additional work (such as resolving a conflict in tools/ci_build/build.py which popped up recently).

sorry for the delay. things have been busy as of late.
I kicked off python packaging pipeline. after that passes, will kick off the rest of the CI pipelines.

@jywu-msft
Copy link
Member

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed

@azure-pipelines
Copy link

Azure Pipelines successfully started running 9 pipeline(s).

@jywu-msft
Copy link
Member

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 9 pipeline(s).

@jywu-msft
Copy link
Member

/azp run orttraining-amd-gpu-ci-pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment