
[BUG] Pin Test Version of PyTorch to 2.1 to Resolve NCCL Error #4464

Merged: 19 commits into rapidsai:branch-24.06 on Jun 13, 2024

Conversation

@alexbarghi-nv (Member) commented Jun 5, 2024:

PyTorch 2.2+ is incompatible with the NCCL version on our containers. Normally, this would not be an issue, but there is a bug in CuPy that loads the system NCCL instead of the user NCCL. This PR pins the PyTorch test dependency version to work around this issue.
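For illustration, a minimal sketch of the kind of test-time pin involved (the command, channels, and exact bounds here are assumptions for this sketch, not the actual diff):

# Hypothetical test-environment install: cap PyTorch below 2.2 so the
# solver cannot select a build that trips the system-NCCL incompatibility
# through CuPy's lazy-loading bug.
conda install -c pytorch -c conda-forge 'pytorch>=2.1,<2.2'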

@alexbarghi-nv alexbarghi-nv self-assigned this Jun 5, 2024
@alexbarghi-nv alexbarghi-nv added the 'CRITICAL BUG! (BUG that needs to be FIX NOW !!!!)' label Jun 5, 2024
@alexbarghi-nv alexbarghi-nv added this to the 24.06 milestone Jun 5, 2024
@alexbarghi-nv alexbarghi-nv added the 'breaking (Breaking change)', 'non-breaking (Non-breaking change)', and 'bug (Something isn't working)' labels and removed the 'breaking (Breaking change)' and 'python' labels Jun 5, 2024
@alexbarghi-nv alexbarghi-nv marked this pull request as ready for review June 5, 2024 19:36
@alexbarghi-nv alexbarghi-nv requested a review from a team as a code owner June 5, 2024 19:36
@alexbarghi-nv alexbarghi-nv changed the title from "[BUG] Pin tensordict to 0.3.2 to Resolve NCCL Error" to "[BUG] Pin tensordict to 0.3.0 to Resolve NCCL Error" Jun 5, 2024
@github-actions github-actions bot added the python label Jun 5, 2024
@raydouglass (Member) left a comment:

Does the conda recipe need to be updated as well? https://github.com/rapidsai/cugraph/blob/branch-24.06/conda/recipes/cugraph-pyg/meta.yaml#L37

Though, the dependency is not present in the wheel: https://github.com/rapidsai/cugraph/blob/branch-24.06/python/cugraph-pyg/pyproject.toml#L29-L34

Is this a runtime or test-only dependency?

@alexbarghi-nv (Member, Author) replied:

> Does the conda recipe need to be updated as well? https://github.com/rapidsai/cugraph/blob/branch-24.06/conda/recipes/cugraph-pyg/meta.yaml#L37
>
> Though, the dependency is not present in the wheel: https://github.com/rapidsai/cugraph/blob/branch-24.06/python/cugraph-pyg/pyproject.toml#L29-L34
>
> Is this a runtime or test-only dependency?

tensordict is a runtime dependency. I'll resolve this.

@alexbarghi-nv alexbarghi-nv requested a review from a team as a code owner June 6, 2024 17:01
@github-actions github-actions bot added the conda label Jun 6, 2024
@alexbarghi-nv alexbarghi-nv requested a review from tingyu66 June 6, 2024 17:01
@alexbarghi-nv (Member, Author):

@raydouglass I think I've addressed your concerns.

@tingyu66 (Member) left a comment:

LGTM; we should be able to remove this constraint once the lazy-load behavior in CuPy is fixed.

@jameslamb jameslamb self-requested a review June 6, 2024 18:01
@jameslamb (Member) left a comment:

👋🏻 hi, I'm here because @raydouglass asked me to come take a look.

Based on this statement:

> tensordict v0.4 requires PyTorch 2.3+ which is incompatible with the NCCL version on our containers

I don't think this fix will be enough to protect cugraph users. They could still install tensordict==0.3.0 together with pytorch>2.3.0, couldn't they?

If the direct issue is that cugraph is incompatible with pytorch >=2.3.0, I think the more reliable fix would be to put an upper bound like pytorch >=2.0,<2.3.0 on its pytorch dependency.

I'm leaving a non-blocking "comment" review in case I've misunderstood, and this incompatibility only shows up here in CI but wouldn't affect users.

The one other comment I left is minor and also doesn't need to block the PR.

Review threads on ci/test_python.sh and dependencies.yaml (outdated; resolved)
@alexbarghi-nv (Member, Author) replied:

> 👋🏻 hi, I'm here because @raydouglass asked me to come take a look.
>
> Based on this statement:
>
> > tensordict v0.4 requires PyTorch 2.3+ which is incompatible with the NCCL version on our containers
>
> I don't think this fix will be enough to protect cugraph users. They could still install tensordict==0.3.0 together with pytorch>2.3.0, couldn't they?
>
> If the direct issue is that cugraph is incompatible with pytorch >=2.3.0, I think the more reliable fix would be to put an upper bound like pytorch >=2.0,<2.3.0 on its pytorch dependency.
>
> I'm leaving a non-blocking "comment" review in case I've misunderstood, and this incompatibility only shows up here in CI but wouldn't affect users.
>
> The one other comment I left is minor and also doesn't need to block the PR.

This is really a CI issue; cuGraph supports PyTorch up to v2.3 except in environments with an older NCCL version (which is the result of a CuPy bug). But even when that bug gets fixed, we still need to pin the tensordict version, because we still want compatibility with PyTorch 2.1+. Sorry if that's hard to follow.

I guess we could make pytorch<2.2 a test dependency? Would that be an acceptable solution here?

Co-authored-by: Bradley Dice <[email protected]>
@@ -31,6 +31,7 @@ dependencies = [
     "numba>=0.57",
     "numpy>=1.23,<2.0a0",
     "pylibcugraphops==24.6.*",
+    "tensordict>=0.1.2,<0.3.1a0",
@tingyu66 (Member) commented:

I didn't catch this in my first pass, but this does not really enforce an upper bound for pytorch:

tensordict 0.3.0: torch>=2.1.0
tensordict 0.3.1: torch>=2.2.1
tensordict 0.3.2: torch==2.2.2

In addition, this line now introduces pytorch as a hard dependency for cugraph-pyg, and it will pull pytorch from PyPI rather than from PyTorch's own index URL (as described at https://pytorch.org/get-started/locally/). I am not clear about the consequences, such as whether it might pull a CPU-only version from PyPI.

If you look at torch_geometric's pyproject.toml, they don't even include pytorch as a hard dependency. One alternative is to run a version check at import time in cugraph-pyg, as sketched below.
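For illustration only (the helper name and warning text below are hypothetical, not cugraph-pyg's actual code), such an import-time check could look like:

# Hypothetical import-time guard: warn early and clearly rather than
# failing later with an opaque NCCL error.
import warnings

def _warn_on_untested_torch(max_exclusive=(2, 3)):
    import torch

    major_minor = tuple(
        int(part) for part in torch.__version__.split("+")[0].split(".")[:2]
    )
    if major_minor >= max_exclusive:
        warnings.warn(
            f"PyTorch {torch.__version__} is newer than the tested range; "
            "NCCL-related failures are possible."
        )

This keeps pytorch out of the hard requirements entirely, the way torch_geometric does, while still surfacing the constraint to users.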

@alexbarghi-nv (Member, Author) replied:

Can't you just manually install the GPU version (as we do now)?

@tingyu66 (Member) replied Jun 6, 2024:

Yes, of course: you can always manually install the correct GPU version to override whatever is installed by pip install cugraph-pyg.

@alexbarghi-nv (Member, Author) replied:

I know it's not ideal, but otherwise we would be requiring users to manually install tensordict, which would be unexpected. Right now I assume most of them do a manual install of torch anyway.

@jameslamb (Member) replied Jun 6, 2024:

I was just about to post something similar after reading @alexbarghi-nv's explanation in #4464 (comment) (thanks for that!). I saw these same version constraints looking through tensordict releases (e.g. https://github.com/pytorch/tensordict/blob/7dbb8649f13b17be973918ec3a2dae10c985b5c4/setup.py#L70).

If the core issue is "this project's CI setup is incompatible with testing against PyTorch >= 2.3.0", then yeah similar to what was suggested in that comment, I think a better immediate fix would be:

  • put a pin of the form pytorch <2.3.0 in the testing dependencies
  • don't modify anything about the project's tensordict dependency

That should lead the pip and conda solvers to avoid tensordict 0.4.0, because its requirement on torch>=2.3.0 wouldn't be satisfiable.

And it'd still make it possible for cugraph users to install cugraph==24.6.0 alongside a newer NCCL and newer PyTorch.

tensordict's 0.4.0 release happened to be the thing that introduced the pytorch >=2.3.0 requirement here, but some other dependency could do that in the future. Putting a ceiling on pytorch in CI is the most reliable way to prevent that from happening again... and hopefully for the next release you can work out a way to test against pytorch >=2.3.0 here in CI (maybe by upgrading beyond the cupy / nccl versions that are leading to the issue you've observed).
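To make the solver mechanics concrete, here is a sketch (channels and bounds assumed; this command is not from the PR): with a ceiling on pytorch, the solver is forced away from tensordict 0.4.0 on its own, because that release declares torch>=2.3.0.

conda create --name pin-demo --dry-run \
  -c conda-forge \
  -c pytorch \
    'pytorch>=2.0,<2.3.0' \
    'tensordict'

# Expected: the solver selects tensordict 0.3.x or older, since
# tensordict 0.4.0 is unsatisfiable under the pytorch ceiling.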

@alexbarghi-nv (Member, Author) replied:

We still want the tensordict pin change, though. Otherwise we are locking out users on PyTorch 2.2 and 2.1, which we do not want.

@jameslamb (Member) replied:

I don't think so?

The pin is currently tensordict>=0.1.2.

- tensordict >=0.1.2

That's sufficiently old to be installed alongside pytorch==2.1.0.

I just tried this:

conda create --name pytorch-test --dry-run \
  -c conda-forge \
  -c pytorch \
    'tensordict>=0.1.2' \
    'pytorch==2.1.0'

And got a successful solution like this:

libtorch           conda-forge/linux-64::libtorch-2.1.0-cuda118_h7595d50_303
...
pytorch            conda-forge/linux-64::pytorch-2.1.0-cuda118_py312hffed1d1_303
...
tensordict         conda-forge/noarch::tensordict-0.1.2-pyh6074d0b_0
Output of 'conda info':
     active environment : None
       user config file : /home/nfs/jlamb/.condarc
 populated config files : /raid/jlamb/miniforge/.condarc
                          /home/nfs/jlamb/.condarc
          conda version : 24.3.0
    conda-build version : not installed
         python version : 3.10.12.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=broadwell
                          __conda=24.3.0=0
                          __cuda=12.2=0
                          __glibc=2.31=0
                          __linux=5.4.0=0
                          __unix=0=0
       base environment : /raid/jlamb/miniforge  (writable)
      conda av data dir : /raid/jlamb/miniforge/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /raid/jlamb/miniforge/pkgs
                          /home/nfs/jlamb/.conda/pkgs
       envs directories : /raid/jlamb/miniforge/envs
                          /home/nfs/jlamb/.conda/envs
               platform : linux-64
             user-agent : conda/24.3.0 requests/2.31.0 CPython/3.10.12 Linux/5.4.0-182-generic ubuntu/20.04.6 glibc/2.31 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.8
                UID:GID : 10349:10004
             netrc file : None
           offline mode : False

@alexbarghi-nv (Member, Author) replied:

Sorry, you're right. I'll revert it.

@jameslamb (Member) replied:

The latest changes look good to me! Thanks for hearing me out.

@alexbarghi-nv (Member, Author) replied:

Sure, no problem

@alexbarghi-nv alexbarghi-nv changed the title from "[BUG] Pin tensordict to 0.3.0 to Resolve NCCL Error" to "[BUG] Pin Test Version of PyTorch to 2.1 to Resolve NCCL Error" Jun 6, 2024
@alexbarghi-nv (Member, Author):

Also, @tingyu66, I should point out that we were already making PyTorch a dependency for Conda, so this just matches it on pip.

@jameslamb jameslamb self-requested a review June 6, 2024 20:24
@trxcllnt (Collaborator) commented Jun 11, 2024:

If I understand correctly, cugraph-pyg depends on tensordict, and tensordict depends on torch?

If that's accurate, this is going to be difficult to build in DLFW. We don't build/install torch in the RAPIDS stages, because the DLFW PyTorch team is downstream from us.

I don't know how we can both create a cugraph-pyg wheel with the fully-specified set of runtime dependencies in its metadata, and not include torch in the set of wheels we ship to them.

@alexbarghi-nv (Member, Author) replied:

> If I understand correctly, cugraph-pyg depends on tensordict, and tensordict depends on torch?
>
> If that's accurate, this is going to be difficult to build in DLFW. We don't build/install torch in the RAPIDS stages, because the DLFW PyTorch team is downstream from us.
>
> I don't know how we can both create a cugraph-pyg wheel with the fully-specified set of runtime dependencies in its metadata, and not include torch in the set of wheels we ship to them.

I forgot about this. In that case I guess we need to make tensordict an optional dependency and lazily load it at runtime. That isn't going to be a great user experience, but I don't want to block you guys. I assume the PyTorch team can add tensordict when they do their build.

@alexbarghi-nv (Member, Author):

So @trxcllnt, it looks like I already used import_optional for tensordict, so I changed it to a test dependency for pip. I think that should resolve your issue.
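A minimal sketch of the import_optional pattern (the real helper in cugraph may differ in name and behavior):

# Hypothetical lazy-import helper: return the module if it is installed,
# otherwise a placeholder that raises only when actually used.
import importlib

class _MissingModule:
    def __init__(self, name, err):
        self._name, self._err = name, err

    def __getattr__(self, attr):
        raise ImportError(
            f"{self._name} is required for this feature but is not installed"
        ) from self._err

def import_optional(name):
    try:
        return importlib.import_module(name)
    except ImportError as err:
        return _MissingModule(name, err)

tensordict = import_optional("tensordict")  # no error until first attribute access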

@rlratzel (Contributor) left a comment:

Latest changes LGTM

@raydouglass raydouglass merged commit f4c519e into rapidsai:branch-24.06 Jun 13, 2024
131 of 132 checks passed
@alexbarghi-nv alexbarghi-nv deleted the tensordict-pin branch June 24, 2024 18:14
seberg added a commit to seberg/cugraph that referenced this pull request Aug 24, 2024
See rapidsai#4486 for adding torchdata/pydantic and rapidsai#4464 for introducing the torch pin due to NCCL difficulties.

It is entirely possible that this still fails due to the NCCL issue!
Labels: bug (Something isn't working), ci, conda, CRITICAL BUG! (BUG that needs to be FIX NOW !!!!), non-breaking (Non-breaking change), python

8 participants