
[BUG] Pin Test Version of PyTorch to 2.1 to Resolve NCCL Error #4464

Merged: 19 commits into rapidsai:branch-24.06 on Jun 13, 2024

Conversation

@alexbarghi-nv (Member) commented Jun 5, 2024:

PyTorch 2.2+ is incompatible with the NCCL version on our containers. Normally, this would not be an issue, but there is a bug in CuPy that loads the system NCCL instead of the user NCCL. This PR pins the PyTorch test dependency version to work around this issue.
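For illustration, a minimal sketch of the kind of test-time pin involved (the command, channels, and exact bounds here are assumptions for this sketch, not the actual diff):

# Hypothetical test-environment install: cap PyTorch below 2.2 so the
# solver cannot select a build that trips the system-NCCL incompatibility
# through CuPy's lazy-loading bug.
conda install -c pytorch -c conda-forge 'pytorch>=2.1,<2.2'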

@alexbarghi-nv alexbarghi-nv self-assigned this Jun 5, 2024
@alexbarghi-nv alexbarghi-nv added the 'CRITICAL BUG! (BUG that needs to be FIX NOW !!!!)' label Jun 5, 2024
@alexbarghi-nv alexbarghi-nv added this to the 24.06 milestone Jun 5, 2024
@alexbarghi-nv alexbarghi-nv added the 'breaking (Breaking change)', 'non-breaking (Non-breaking change)', and 'bug (Something isn't working)' labels and removed the 'breaking (Breaking change)' and 'python' labels Jun 5, 2024
@alexbarghi-nv alexbarghi-nv marked this pull request as ready for review June 5, 2024 19:36
@alexbarghi-nv alexbarghi-nv requested a review from a team as a code owner June 5, 2024 19:36
@alexbarghi-nv alexbarghi-nv changed the title from "[BUG] Pin tensordict to 0.3.2 to Resolve NCCL Error" to "[BUG] Pin tensordict to 0.3.0 to Resolve NCCL Error" Jun 5, 2024
@github-actions github-actions bot added the python label Jun 5, 2024
@raydouglass (Member) left a comment:

Does the conda recipe need to be updated as well? https://github.com/rapidsai/cugraph/blob/branch-24.06/conda/recipes/cugraph-pyg/meta.yaml#L37

Though, the dependency is not present in the wheel: https://github.com/rapidsai/cugraph/blob/branch-24.06/python/cugraph-pyg/pyproject.toml#L29-L34

Is this a runtime or test-only dependency?

@alexbarghi-nv (Member, Author) replied:

> Does the conda recipe need to be updated as well? https://github.com/rapidsai/cugraph/blob/branch-24.06/conda/recipes/cugraph-pyg/meta.yaml#L37
>
> Though, the dependency is not present in the wheel: https://github.com/rapidsai/cugraph/blob/branch-24.06/python/cugraph-pyg/pyproject.toml#L29-L34
>
> Is this a runtime or test-only dependency?

tensordict is a runtime dependency. I'll resolve this.

@alexbarghi-nv alexbarghi-nv requested a review from a team as a code owner June 6, 2024 17:01
@github-actions github-actions bot added the conda label Jun 6, 2024
@alexbarghi-nv alexbarghi-nv requested a review from tingyu66 June 6, 2024 17:01
@alexbarghi-nv (Member, Author):

@raydouglass I think I've addressed your concerns.

@tingyu66 (Member) left a comment:

LGTM; we should be able to remove this constraint once the lazy-load behavior in CuPy is fixed.

@jameslamb jameslamb self-requested a review June 6, 2024 18:01
@jameslamb (Member) left a comment:

👋🏻 hi, I'm here because @raydouglass asked me to come take a look.

Based on this statement:

> tensordict v0.4 requires PyTorch 2.3+ which is incompatible with the NCCL version on our containers

I don't think this fix will be enough to protect cugraph users. They could still install tensordict==0.3.0 together with pytorch>2.3.0, couldn't they?

If the direct issue is that cugraph is incompatible with pytorch >=2.3.0, I think the more reliable fix would be to put an upper bound like pytorch >=2.0,<2.3.0 on its pytorch dependency.

I'm leaving a non-blocking "comment" review in case I've misunderstood, and this incompatibility only shows up here in CI but wouldn't affect users.

The one other comment I left is minor and also doesn't need to block the PR.

Review threads on ci/test_python.sh and dependencies.yaml (outdated; resolved)
@alexbarghi-nv (Member, Author) replied:

> 👋🏻 hi, I'm here because @raydouglass asked me to come take a look.
>
> Based on this statement:
>
> > tensordict v0.4 requires PyTorch 2.3+ which is incompatible with the NCCL version on our containers
>
> I don't think this fix will be enough to protect cugraph users. They could still install tensordict==0.3.0 together with pytorch>2.3.0, couldn't they?
>
> If the direct issue is that cugraph is incompatible with pytorch >=2.3.0, I think the more reliable fix would be to put an upper bound like pytorch >=2.0,<2.3.0 on its pytorch dependency.
>
> I'm leaving a non-blocking "comment" review in case I've misunderstood, and this incompatibility only shows up here in CI but wouldn't affect users.
>
> The one other comment I left is minor and also doesn't need to block the PR.

This is really a CI issue; cuGraph supports PyTorch up to v2.3 except in environments with an older NCCL version (which is the result of a CuPy bug). But even when that bug gets fixed, we still need to pin the tensordict version, because we still want compatibility with PyTorch 2.1+. Sorry if that's hard to follow.

I guess we could make pytorch<2.2 a test dependency? Would that be an acceptable solution here?

Co-authored-by: Bradley Dice <[email protected]>
@@ -31,6 +31,7 @@ dependencies = [
     "numba>=0.57",
     "numpy>=1.23,<2.0a0",
     "pylibcugraphops==24.6.*",
+    "tensordict>=0.1.2,<0.3.1a0",
@tingyu66 (Member) commented:

I didn't catch this in my first pass, but this does not really enforce an upper bound for pytorch:

tensordict 0.3.0: torch>=2.1.0
tensordict 0.3.1: torch>=2.2.1
tensordict 0.3.2: torch==2.2.2

In addition, this line now introduces pytorch as a hard dependency for cugraph-pyg, and it will pull pytorch from PyPI rather than from PyTorch's own index URL (as described at https://pytorch.org/get-started/locally/). I am not clear about the consequences, such as whether it might pull a CPU-only version from PyPI.

If you look at torch_geometric's pyproject.toml, they don't even include pytorch as a hard dependency. One alternative is to run a version check at import time in cugraph-pyg, as sketched below.
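For illustration only (the helper name and warning text below are hypothetical, not cugraph-pyg's actual code), such an import-time check could look like:

# Hypothetical import-time guard: warn early and clearly rather than
# failing later with an opaque NCCL error.
import warnings

def _warn_on_untested_torch(max_exclusive=(2, 3)):
    import torch

    major_minor = tuple(
        int(part) for part in torch.__version__.split("+")[0].split(".")[:2]
    )
    if major_minor >= max_exclusive:
        warnings.warn(
            f"PyTorch {torch.__version__} is newer than the tested range; "
            "NCCL-related failures are possible."
        )

This keeps pytorch out of the hard requirements entirely, the way torch_geometric does, while still surfacing the constraint to users.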

@alexbarghi-nv (Member, Author) replied:

Can't you just manually install the GPU version (as we do now)?

@tingyu66 (Member) replied Jun 6, 2024:

Yes, of course: you can always manually install the correct GPU version to override whatever is installed by pip install cugraph-pyg.

@alexbarghi-nv (Member, Author) replied:

I know it's not ideal, but otherwise we would be requiring users to manually install tensordict, which would be unexpected. Right now I assume most of them do a manual install of torch anyway.

@jameslamb (Member) replied Jun 6, 2024:

I was just about to post something similar after reading @alexbarghi-nv's explanation in #4464 (comment) (thanks for that!). I saw these same version constraints looking through tensordict releases (e.g. https://github.com/pytorch/tensordict/blob/7dbb8649f13b17be973918ec3a2dae10c985b5c4/setup.py#L70).

If the core issue is "this project's CI setup is incompatible with testing against PyTorch >= 2.3.0", then yeah similar to what was suggested in that comment, I think a better immediate fix would be:

  • put a pin of the form pytorch <2.3.0 in the testing dependencies
  • don't modify anything about the project's tensordict dependency

That should lead the pip and conda solvers to avoid tensordict 0.4.0, because its requirement on torch>=2.3.0 wouldn't be satisfiable.

And it'd still make it possible for cugraph users to install cugraph==24.6.0 alongside a newer NCCL and newer PyTorch.

tensordict's 0.4.0 release happened to be the thing that introduced the pytorch >=2.3.0 requirement here, but some other dependency could do that in the future. Putting a ceiling on pytorch in CI is the most reliable way to prevent that from happening again... and hopefully for the next release you can work out a way to test against pytorch >=2.3.0 here in CI (maybe by upgrading beyond the cupy / nccl versions that are leading to the issue you've observed).
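To make the solver mechanics concrete, here is a sketch (channels and bounds assumed; this command is not from the PR): with a ceiling on pytorch, the solver is forced away from tensordict 0.4.0 on its own, because that release declares torch>=2.3.0.

conda create --name pin-demo --dry-run \
  -c conda-forge \
  -c pytorch \
    'pytorch>=2.0,<2.3.0' \
    'tensordict'

# Expected: the solver selects tensordict 0.3.x or older, since
# tensordict 0.4.0 is unsatisfiable under the pytorch ceiling.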

@alexbarghi-nv (Member, Author) replied:

We still want the tensordict pin change, though. Otherwise we are locking out users on PyTorch 2.2 and 2.1, which we do not want.

@jameslamb (Member) replied:

I don't think so?

The pin is currently tensordict>=0.1.2.

- tensordict >=0.1.2

That's sufficiently old to be installed alongside pytorch==2.1.0.

I just tried this:

conda create --name pytorch-test --dry-run \
  -c conda-forge \
  -c pytorch \
    'tensordict>=0.1.2' \
    'pytorch==2.1.0'

And got a successful solution like this:

libtorch           conda-forge/linux-64::libtorch-2.1.0-cuda118_h7595d50_303
...
pytorch            conda-forge/linux-64::pytorch-2.1.0-cuda118_py312hffed1d1_303
...
tensordict         conda-forge/noarch::tensordict-0.1.2-pyh6074d0b_0
Output of 'conda info':
     active environment : None
       user config file : /home/nfs/jlamb/.condarc
 populated config files : /raid/jlamb/miniforge/.condarc
                          /home/nfs/jlamb/.condarc
          conda version : 24.3.0
    conda-build version : not installed
         python version : 3.10.12.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=broadwell
                          __conda=24.3.0=0
                          __cuda=12.2=0
                          __glibc=2.31=0
                          __linux=5.4.0=0
                          __unix=0=0
       base environment : /raid/jlamb/miniforge  (writable)
      conda av data dir : /raid/jlamb/miniforge/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /raid/jlamb/miniforge/pkgs
                          /home/nfs/jlamb/.conda/pkgs
       envs directories : /raid/jlamb/miniforge/envs
                          /home/nfs/jlamb/.conda/envs
               platform : linux-64
             user-agent : conda/24.3.0 requests/2.31.0 CPython/3.10.12 Linux/5.4.0-182-generic ubuntu/20.04.6 glibc/2.31 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.8
                UID:GID : 10349:10004
             netrc file : None
           offline mode : False

@alexbarghi-nv (Member, Author) replied:

Sorry, you're right. I'll revert it.

@jameslamb (Member) replied:

The latest changes look good to me! Thanks for hearing me out.

@alexbarghi-nv (Member, Author) replied:

Sure, no problem

@alexbarghi-nv alexbarghi-nv changed the title from "[BUG] Pin tensordict to 0.3.0 to Resolve NCCL Error" to "[BUG] Pin Test Version of PyTorch to 2.1 to Resolve NCCL Error" Jun 6, 2024
@alexbarghi-nv (Member, Author):

Also, @tingyu66, I should point out that we were already making PyTorch a dependency for Conda, so this just matches it on pip.

@jameslamb jameslamb self-requested a review June 6, 2024 20:24
@trxcllnt (Collaborator) commented Jun 11, 2024:

If I understand correctly, cugraph-pyg depends on tensordict, and tensordict depends on torch?

If that's accurate, this is going to be difficult to build in DLFW. We don't build/install torch in the RAPIDS stages, because the DLFW PyTorch team is downstream from us.

I don't know how we can both create a cugraph-pyg wheel with the fully-specified set of runtime dependencies in its metadata, and not include torch in the set of wheels we ship to them.

@alexbarghi-nv (Member, Author) replied:

> If I understand correctly, cugraph-pyg depends on tensordict, and tensordict depends on torch?
>
> If that's accurate, this is going to be difficult to build in DLFW. We don't build/install torch in the RAPIDS stages, because the DLFW PyTorch team is downstream from us.
>
> I don't know how we can both create a cugraph-pyg wheel with the fully-specified set of runtime dependencies in its metadata, and not include torch in the set of wheels we ship to them.

I forgot about this. In that case I guess we need to make tensordict an optional dependency and lazily load it at runtime. That isn't going to be a great user experience, but I don't want to block you guys. I assume the PyTorch team can add tensordict when they do their build.

@alexbarghi-nv (Member, Author):

So @trxcllnt, it looks like I already used import_optional for tensordict, so I changed it to a test dependency for pip. I think that should resolve your issue.
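A minimal sketch of the import_optional pattern (the real helper in cugraph may differ in name and behavior):

# Hypothetical lazy-import helper: return the module if it is installed,
# otherwise a placeholder that raises only when actually used.
import importlib

class _MissingModule:
    def __init__(self, name, err):
        self._name, self._err = name, err

    def __getattr__(self, attr):
        raise ImportError(
            f"{self._name} is required for this feature but is not installed"
        ) from self._err

def import_optional(name):
    try:
        return importlib.import_module(name)
    except ImportError as err:
        return _MissingModule(name, err)

tensordict = import_optional("tensordict")  # no error until first attribute access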

@rlratzel (Contributor) left a comment:

Latest changes LGTM

@raydouglass raydouglass merged commit f4c519e into rapidsai:branch-24.06 Jun 13, 2024
131 of 132 checks passed
@alexbarghi-nv alexbarghi-nv deleted the tensordict-pin branch June 24, 2024 18:14
seberg added a commit to seberg/cugraph that referenced this pull request Aug 24, 2024
See rapidsai#4486 for adding torchdata/pydantic and rapidsai#4464 for introducing the torch pin due to NCCL difficulties.

It is entirely possible that this still fails due to the NCCL issue!
Labels: bug (Something isn't working), ci, conda, CRITICAL BUG! (BUG that needs to be FIX NOW !!!!), non-breaking (Non-breaking change), python

8 participants