PyTorch 2.4.0 Package Not Installable w/ CUDA 12 on Python 3.12 Linux x86_64 #254

Closed
1 task done
iamthebot opened this issue Aug 28, 2024 · 12 comments
Labels
bug Something isn't working

Comments

@iamthebot

Solution to issue cannot be found in the documentation.

  • I checked the documentation.

Issue

On a Linux x86_64 machine:

CONDA_OVERRIDE_CUDA=12 conda install pytorch
...
Channels:
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/alfredo_luque/.airconda-environments/devel--demo--alfredo_luque--airconda_tutorial--v0.0.1

  added / updated specs:
    - pytorch


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    fsspec-2024.6.1            |     pyhff2d567_0         130 KB  conda-forge
    numpy-2.1.0                |  py312h1103770_0         8.0 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         8.1 MB

The following NEW packages will be INSTALLED:

  _sysroot_linux-64~ conda-forge/noarch::_sysroot_linux-64_curr_repodata_hack-3-h69a702a_16
  cuda-version       conda-forge/noarch::cuda-version-11.8-h70ddcb2_3
  cudatoolkit        conda-forge/linux-64::cudatoolkit-11.8.0-h4ba93d1_13
  cudnn              conda-forge/linux-64::cudnn-8.9.7.29-hbc23b4c_3
  filelock           conda-forge/noarch::filelock-3.15.4-pyhd8ed1ab_0
  fsspec             conda-forge/noarch::fsspec-2024.6.1-pyhff2d567_0
  gmp                conda-forge/linux-64::gmp-6.3.0-hac33072_2
  gmpy2              conda-forge/linux-64::gmpy2-2.1.5-py312h1d5cde6_1
  icu                conda-forge/linux-64::icu-75.1-he02047a_0
  jinja2             conda-forge/noarch::jinja2-3.1.4-pyhd8ed1ab_0
  kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-h4a8ded7_16
  libabseil          conda-forge/linux-64::libabseil-20240116.2-cxx17_he02047a_1
  libblas            conda-forge/linux-64::libblas-3.9.0-23_linux64_openblas
  libcblas           conda-forge/linux-64::libcblas-3.9.0-23_linux64_openblas
  libgfortran        conda-forge/linux-64::libgfortran-14.1.0-h69a702a_1
  libgfortran-ng     conda-forge/linux-64::libgfortran-ng-14.1.0-h69a702a_1
  libgfortran5       conda-forge/linux-64::libgfortran5-14.1.0-hc5f4f2c_1
  libhwloc           conda-forge/linux-64::libhwloc-2.11.1-default_hecaa2ac_1000
  libiconv           conda-forge/linux-64::libiconv-1.17-hd590300_2
  liblapack          conda-forge/linux-64::liblapack-3.9.0-23_linux64_openblas
  libmagma           conda-forge/linux-64::libmagma-2.8.0-hfdb99dd_0
  libmagma_sparse    conda-forge/linux-64::libmagma_sparse-2.8.0-h9ddd185_0
  libopenblas        conda-forge/linux-64::libopenblas-0.3.27-pthreads_hac2b453_1
  libprotobuf        conda-forge/linux-64::libprotobuf-4.25.3-h08a7969_0
  libstdcxx          conda-forge/linux-64::libstdcxx-14.1.0-hc0a3c3a_1
  libstdcxx-ng       conda-forge/linux-64::libstdcxx-ng-14.1.0-h4852527_1
  libtorch           conda-forge/linux-64::libtorch-2.4.0-cuda118_h8db9d67_301
  libuv              conda-forge/linux-64::libuv-1.48.0-hd590300_0
  libxml2            conda-forge/linux-64::libxml2-2.12.7-he7c6b58_4
  llvm-openmp        conda-forge/linux-64::llvm-openmp-18.1.8-hf5423f3_1
  markupsafe         conda-forge/linux-64::markupsafe-2.1.5-py312h98912ed_0
  mkl                conda-forge/linux-64::mkl-2023.2.0-h84fe81f_50496
  mpc                conda-forge/linux-64::mpc-1.3.1-h24ddda3_0
  mpfr               conda-forge/linux-64::mpfr-4.2.1-h38ae2d0_2
  mpmath             conda-forge/noarch::mpmath-1.3.0-pyhd8ed1ab_0
  nccl               conda-forge/linux-64::nccl-2.22.3.1-hee583db_1
  networkx           conda-forge/noarch::networkx-3.3-pyhd8ed1ab_1
  numpy              conda-forge/linux-64::numpy-2.1.0-py312h1103770_0
  python_abi         conda-forge/linux-64::python_abi-3.12-5_cp312
  pytorch            conda-forge/linux-64::pytorch-2.4.0-cuda118_py312h3690e1b_301
  sleef              conda-forge/linux-64::sleef-3.6.1-h1b44611_3
  sympy              conda-forge/noarch::sympy-1.13.2-pypyh2585a3b_103
  sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h4a8ded7_16
  tbb                conda-forge/linux-64::tbb-2021.12.0-h434a139_3
  typing_extensions  conda-forge/noarch::typing_extensions-4.12.2-pyha770c72_0
  zstd               conda-forge/linux-64::zstd-1.5.6-ha6fb4c9_0

The following packages will be DOWNGRADED:

  _openmp_mutex                                   4.5-2_gnu --> 4.5-2_kmp_llvm


Proceed ([y]/n)?

Interestingly, the CUDA 11.8 variant is picked when using this solve. I ran this using the libmamba solver but it's also an issue with the classic solver (which ends up ignoring CONDA_OVERRIDE_CUDA and picks the cpu_generic_py312 variant).

2.3.1 does not have this issue. That is, if I run CONDA_OVERRIDE_CUDA=12 conda install "pytorch<2.4.0" I get a CUDA 12 version of PyTorch in the solve.
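
To see at a glance which variant a solve picked, the build string from the package plan can be classified. This is a minimal sketch that assumes the build strings follow the patterns seen in this thread (`cuda118_...`, `cuda120_...`, `cpu_generic_...`); `classify_build` is a hypothetical helper, not part of conda.

```python
import re


def classify_build(build: str) -> str:
    """Classify a pytorch build string as 'cuda <major>.<minor>', 'cpu', or 'unknown'.

    Assumption: CUDA builds start with 'cuda' followed by the version digits,
    where the last digit is the minor version (cuda118 -> 11.8), and CPU
    builds start with 'cpu'.
    """
    match = re.match(r"cuda(\d+)(\d)", build)
    if match:
        return f"cuda {match.group(1)}.{match.group(2)}"
    if build.startswith("cpu"):
        return "cpu"
    return "unknown"


# Build string taken from the package plan above.
print(classify_build("cuda118_py312h3690e1b_301"))
```

A check like this could run in CI against the solver output, so that an unexpected fallback to a CUDA 11.8 or CPU build is caught before the environment is used.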

Installed packages

# packages in environment at /home/alfredo_luque/.airconda-environments/devel--demo--alfredo_luque--airconda_tutorial--v0.0.1:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bzip2                     1.0.8                h4bc722e_7    conda-forge
ca-certificates           2024.7.4             hbcca054_0    conda-forge
ld_impl_linux-64          2.40                 hf3520f5_7    conda-forge
libexpat                  2.6.2                h59595ed_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc                    14.1.0               h77fa898_1    conda-forge
libgcc-ng                 14.1.0               h69a702a_1    conda-forge
libgomp                   14.1.0               h77fa898_1    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libsqlite                 3.46.0               hde9e2c9_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libzlib                   1.3.1                h4ab18f5_1    conda-forge
ncurses                   6.5                  he02047a_1    conda-forge
openssl                   3.3.1                hb9d3cd8_3    conda-forge
pip                       24.2               pyhd8ed1ab_0    conda-forge
python                    3.12.5          h2ad013b_0_cpython    conda-forge
readline                  8.2                  h8228510_1    conda-forge
setuptools                72.2.0             pyhd8ed1ab_0    conda-forge
tk                        8.6.13          noxft_h4845f30_101    conda-forge
tzdata                    2024a                h8827d51_1    conda-forge
wheel                     0.44.0             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge

Environment info

active environment : devel--demo--alfredo_luque--airconda_tutorial--v0.0.1
    active env location : /home/alfredo_luque/.airconda-environments/devel--demo--alfredo_luque--airconda_tutorial--v0.0.1
            shell level : 2
       user config file : /home/alfredo_luque/.condarc
 populated config files : /opt/conda/.condarc
                          /home/alfredo_luque/.condarc
          conda version : 24.7.1
    conda-build version : 24.5.1
         python version : 3.10.14.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=zen2
                          __conda=24.7.1=0
                          __cuda=12.4=0
                          __glibc=2.35=0
                          __linux=5.15.149=0
                          __unix=0=0
       base environment : /opt/conda  (writable)
      conda av data dir : /opt/conda/etc/conda
  conda av metadata url : None
           channel URLs : https://artifactory.d.musta.ch/artifactory/api/conda/conda-airbnb/linux-64
                          https://artifactory.d.musta.ch/artifactory/api/conda/conda-airbnb/noarch
                          https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /opt/conda/pkgs
                          /home/alfredo_luque/.conda/pkgs
       envs directories : /home/alfredo_luque/.airconda-environments
                          /opt/conda/envs
                          /home/alfredo_luque/.conda/envs
               platform : linux-64
             user-agent : conda/24.7.1 requests/2.32.3 CPython/3.10.14 Linux/5.15.149-99.162.amzn2.x86_64 ubuntu/22.04.4 glibc/2.35 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.8
                UID:GID : 7331:7331
             netrc file : None
           offline mode : False
@iamthebot iamthebot added the bug Something isn't working label Aug 28, 2024
@hmaarrfk
Contributor

I think it might be because our builds stalled...

@hmaarrfk
Contributor

image

@hmaarrfk
Contributor

I expect it to take like 13 hours. Please check and report! thanks!

@iamthebot
Author

> I expect it to take like 13 hours. Please check and report! thanks!

No problem, thanks for the quick response! Will test tomorrow.

@jakirkham
Member

Thanks Mark! 🙏

Looks like one failed. Unfortunately this appears to be after the build, but during the conda-build DSO checking phase

Are these kinds of CI issues common here? If so, what things would you recommend (say to a provider) to address the reliability issues?

@hmaarrfk
Contributor

> but during the conda-build DSO checking phase

not sure if that is true, the other seemed to have failed during the building phase.

I had to restart the aarch64 jobs.

@jakirkham
Member

jakirkham commented Aug 29, 2024

That was the last part of the log that I could see in GitHub last night. Perhaps they had trouble loading? The log files are quite long

Looking today using the raw log to get them to load fully (attached in compressed form below to meet size limitations), am seeing the following in those jobs


From the CUDA 12 Linux ARM job ( attached compressed log ):

+ python -c 'import torch; torch.tensor(1).to('\''cpu'\'').numpy(); print('\''numpy support enabled!!!'\'')'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/conda/feedstock_root/build_artifacts/libtorch_1724888760332/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib/python3.8/site-packages/torch/__init__.py", line 290, in <module>
    from torch._C import *  # noqa: F403
ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /home/conda/feedstock_root/build_artifacts/libtorch_1724888760332/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib/python3.8/site-packages/torch/../../.././libcurand.so.10)

Unfortunately some CUDA libraries are moving to EL8: conda-forge/cuda-feedstock#28

So to run this test we likely need to use the AlmaLinux 8 image. An example would be PR: conda-forge/faiss-split-feedstock#75

Alternatively we could just skip this test on CUDA ARM. Presumably if the CPU one passes, this is a pretty good indication of whether this one will pass
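
The mismatch behind this error can be made explicit by comparing the glibc version the library asks for with the one the image provides. This is a rough sketch with hypothetical helper names. It assumes the test image ships glibc 2.17 (matching the `sysroot_linux-64-2.17` package in the plan above) and relies on EL8 distributions such as AlmaLinux 8 shipping glibc 2.28.

```python
import re


def required_glibc(error_message: str):
    """Return the (major, minor) GLIBC version named in a loader error, or None."""
    match = re.search(r"GLIBC_(\d+)\.(\d+)", error_message)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def can_load(error_message: str, system_glibc: tuple) -> bool:
    """True if system_glibc is new enough for the symbol version in the message."""
    needed = required_glibc(error_message)
    return needed is None or system_glibc >= needed


# Shortened form of the ImportError from the CUDA 12 Linux ARM log.
msg = "/lib64/libm.so.6: version `GLIBC_2.27' not found"

print(can_load(msg, (2, 17)))  # assumed glibc of the current test image
print(can_load(msg, (2, 28)))  # glibc shipped by EL8 (e.g. AlmaLinux 8)
```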


From the CPU-only Linux ARM job ( attached compressed log ):

+ python -c 'import torch; torch.tensor(1).to('\''cpu'\'').numpy(); print('\''numpy support enabled!!!'\'')'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: PyTorch was compiled without NumPy support

Though it looks like the CPU ARM test doesn't pass atm. Think you understand this better than I. Guessing we need to broaden this workaround to cover ARM: #252 ?
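
For anyone who wants to reproduce the check outside the recipe, the one-liner from the logs can be wrapped so that it reports the failure mode instead of a raw traceback. This is only a sketch: `check_numpy_support` and its return labels are made up for illustration, and the matched message text is the one shown in the CPU-only Linux ARM log.

```python
def check_numpy_support() -> str:
    """Run the recipe's smoke test and return a short status label."""
    try:
        import torch
    except (ImportError, OSError):
        # Covers both a missing package and loader errors such as the
        # GLIBC_2.27 ImportError from the CUDA ARM job.
        return "torch-unimportable"
    try:
        torch.tensor(1).to("cpu").numpy()
    except RuntimeError as exc:
        if "compiled without NumPy support" in str(exc):
            return "no-numpy-support"
        return "other-runtime-error"
    return "ok"


print(check_numpy_support())
```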

@hmaarrfk
Contributor

> Think you understand this better than I. Guessing we need to broaden this workaround to cover ARM: #252 ?

I'm glad my test worked....

@jcwomack

jcwomack commented Aug 29, 2024

Possibly relevant: We encountered a "PyTorch was compiled without NumPy support" error when running on Linux aarch64 + CUDA (on NVIDIA GH200) using the conda-forge build of PyTorch 2.4.0.

Relevant output from conda list for the environment in which the error was encountered:

pytorch                   2.4.0           cuda120_py312haadfe8f_200    conda-forge
pytorch-gpu               2.4.0           cuda120py312hecaec72_200    conda-forge

Rolling back to 2.3.0 remedied this issue. Looking at the build number, it seems that the build we installed preceded merging PR #252.

@jakirkham
Member

Thanks James! Yep this is expected

In PR ( #252 ), Mark worked around a bug in CMake to ensure PyTorch builds with NumPy and tested it in the recipe. These packages would show up with a build/number of 201 (instead of 200 as your example shows). CMake has since also integrated a fix, but it is not yet released

As noted above ( #254 (comment) ), this test appears to be working correctly. However it shows that the Linux ARM builds are failing. So no packages are available with build/number of 201 yet. So we may need to extend Mark's workaround for Linux ARM
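
As a quick way to tell whether an installed package predates the fix, the build number can be read off a `conda list` row. This is a sketch under assumptions: the row has four whitespace-separated columns as in the output quoted above, the build number is the integer after the last underscore, and the threshold of 201 only applies to the 200-series builds discussed here. The helper name is hypothetical.

```python
def parse_conda_list_line(line: str) -> dict:
    """Split one `conda list` row into name, version, build, and channel."""
    name, version, build, channel = line.split()
    return {
        "name": name,
        "version": version,
        "build": build,
        # The build number is the integer after the last underscore.
        "build_number": int(build.rsplit("_", 1)[1]),
        "channel": channel,
    }


# Threshold taken from the comment above; an assumption for other variants.
FIRST_FIXED_BUILD = 201

row = parse_conda_list_line(
    "pytorch                   2.4.0           cuda120_py312haadfe8f_200    conda-forge"
)
predates_fix = row["build_number"] < FIRST_FIXED_BUILD
print(row["build"], "predates the NumPy fix:", predates_fix)
```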

Am guessing fixing this would be taking this code

- cmake !=3.30.0,!=3.30.1,!=3.30.2 # [osx and blas_impl == "mkl"]
- cmake # [not (osx and blas_impl == "mkl")]

...and changing it like so...

-    - cmake !=3.30.0,!=3.30.1,!=3.30.2        # [osx and blas_impl == "mkl"]
-    - cmake                                   # [not (osx and blas_impl == "mkl")]
+    - cmake !=3.30.0,!=3.30.1,!=3.30.2        # [unix]
+    - cmake                                   # [not unix]

@jcwomack is this something you would be willing to try in a new PR? 🙂
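
To illustrate why the existing selectors leave Linux ARM unconstrained, the two variants of the recipe lines can be evaluated against a build configuration. This is a simplified model of selector evaluation, not conda-build's actual implementation. The `linux_aarch64` dictionary is an assumed configuration (including `blas_impl` being `generic` there), and the helper names are hypothetical.

```python
def selector_matches(selector: str, config: dict) -> bool:
    """Evaluate a selector expression with the config values as variables."""
    # Selectors are Python-like boolean expressions; builtins are disabled
    # because only the config names are needed here.
    return bool(eval(selector, {"__builtins__": {}}, dict(config)))


def active_specs(lines, config):
    """Return the requirement specs whose selector is true for this config."""
    return [spec for spec, selector in lines if selector_matches(selector, config)]


PINNED = "cmake !=3.30.0,!=3.30.1,!=3.30.2"

# (spec, selector) pairs before and after the proposed change.
OLD = [
    (PINNED, 'osx and blas_impl == "mkl"'),
    ("cmake", 'not (osx and blas_impl == "mkl")'),
]
NEW = [
    (PINNED, "unix"),
    ("cmake", "not unix"),
]

# Assumed configuration for a Linux ARM build.
linux_aarch64 = {"osx": False, "linux": True, "unix": True, "win": False,
                 "blas_impl": "generic"}

print(active_specs(OLD, linux_aarch64))  # unpinned cmake is selected
print(active_specs(NEW, linux_aarch64))  # affected cmake releases are excluded
```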

@jcwomack

Hi @jakirkham, thanks for the quick response!

Apologies, but I've got quite limited availability for the next week or so, so would not be able to work on a PR myself at this time.

@hmaarrfk
Contributor

The original issue is resolved. I opened #266 to track the aarch64 + NumPy issue.

rapids-bot bot pushed a commit to rapidsai/cugraph-gnn that referenced this issue Nov 22, 2024
As the issue around PyTorch being built without NumPy was fixed in conda-forge, we can now relax these upper bounds to allow PyTorch 2.4.

xref: conda-forge/pytorch-cpu-feedstock#254
xref: conda-forge/pytorch-cpu-feedstock#266
xref: rapidsai/cugraph#4615
xref: rapidsai/cugraph#4703
xref: #59

Authors:
  - https://github.com/jakirkham

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)
  - Tingyu Wang (https://github.com/tingyu66)

URL: #75
rapids-bot bot pushed a commit to rapidsai/cugraph that referenced this issue Nov 22, 2024
As the issue around PyTorch being built without NumPy was fixed in conda-forge, we can now relax these upper bounds to allow PyTorch 2.4.

xref: conda-forge/pytorch-cpu-feedstock#254
xref: conda-forge/pytorch-cpu-feedstock#266
xref: #4615

Authors:
  - https://github.com/jakirkham
  - Alex Barghi (https://github.com/alexbarghi-nv)

Approvers:
  - Alex Barghi (https://github.com/alexbarghi-nv)
  - James Lamb (https://github.com/jameslamb)

URL: #4703