[BUG] Pin Test Version of PyTorch to 2.1 to Resolve NCCL Error #4464
Conversation
Does the conda recipe need to be updated as well? https://github.com/rapidsai/cugraph/blob/branch-24.06/conda/recipes/cugraph-pyg/meta.yaml#L37
The dependency is not present on the wheel, though: https://github.com/rapidsai/cugraph/blob/branch-24.06/python/cugraph-pyg/pyproject.toml#L29-L34
Is this a runtime or test-only dependency?
@raydouglass I think I've addressed your concerns.
LGTM. We should be able to remove this constraint once the lazy-load behavior in cupy is fixed.
👋🏻 Hi, I'm here because @raydouglass asked me to come take a look.
Based on this statement:
"tensordict v0.4 requires PyTorch 2.3+ which is incompatible with the NCCL version on our containers"
I don't think this fix will be enough to protect cugraph users. They could still install tensordict==0.3.0 together with pytorch>2.3.0, couldn't they?
If the direct issue is that cugraph is incompatible with pytorch >=2.3.0, I think the more reliable fix would be to put an upper bound like pytorch >=2.0,<2.3.0 on its pytorch dependency.
I'm leaving a non-blocking "comment" review in case I've misunderstood, and this incompatibility only shows up here in CI but wouldn't affect users.
The one other comment I left is minor and also doesn't need to block the PR.
This is really a CI issue; cuGraph supports PyTorch up to v2.3 except on environments with an older NCCL version (which is the result of a cupy bug). But even when that bug gets fixed, we still need to pin the test version. I guess we could make it a test dependency of cugraph-pyg.
Co-authored-by: Bradley Dice <[email protected]>
python/cugraph-pyg/pyproject.toml (outdated)
@@ -31,6 +31,7 @@ dependencies = [
     "numba>=0.57",
     "numpy>=1.23,<2.0a0",
     "pylibcugraphops==24.6.*",
+    "tensordict>=0.1.2,<0.3.1a0",
I didn't catch this in my first pass, but this does not really enforce an upper bound for pytorch:
- tensordict 0.3.0: torch>=2.1.0
- tensordict 0.3.1: torch>=2.2.1
- tensordict 0.3.2: torch==2.2.2
In addition, this line now introduces pytorch as a hard dependency for cugraph-pyg, and it will pull pytorch from PyPI, as opposed to using its own index-url (as mentioned in https://pytorch.org/get-started/locally/). I am not clear about the consequences, such as whether it might pull a CPU-only version from PyPI.
If you look at torch_geometric's pyproject.toml, they don't even include pytorch as a hard dependency. One alternative is to run a version check at import time in cugraph-pyg.
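For illustration only (not code from this PR): a minimal sketch of what such an import-time check could look like. The helper name _check_torch_version and the version bounds below are assumptions, not cugraph-pyg's actual API or constraints.

```python
# Hypothetical sketch of an import-time PyTorch version check; the helper name
# and the version bounds are assumptions for illustration only.
import warnings

from packaging.version import Version


def _check_torch_version(min_version="2.0.0", max_exclusive="2.3.0"):
    try:
        import torch
    except ImportError as err:
        raise ImportError(
            "PyTorch is required but not installed; install the appropriate GPU "
            "build from https://pytorch.org/get-started/locally/."
        ) from err

    # Drop any local build tag such as "+cu121" before comparing.
    found = Version(torch.__version__.split("+")[0])
    if not (Version(min_version) <= found < Version(max_exclusive)):
        warnings.warn(
            f"This package was tested with torch>={min_version},<{max_exclusive}; "
            f"found {torch.__version__}. Behavior may be untested."
        )
```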
Can't you just manually install the GPU version (as we do now)?
Yes, of course you can always manually install the correct GPU version to override whatever is installed by pip install cugraph-pyg.
I know it's not ideal, but otherwise we would be requiring users to manually install tensordict, which would be unexpected. Right now I assume most of them do a manual install of torch anyway.
I was just about to post something similar after reading @alexbarghi-nv's explanation in #4464 (comment) (thanks for that!). I saw these same version constraints looking through tensordict releases (e.g. https://github.com/pytorch/tensordict/blob/7dbb8649f13b17be973918ec3a2dae10c985b5c4/setup.py#L70).
If the core issue is "this project's CI setup is incompatible with testing against PyTorch >= 2.3.0", then yeah, similar to what was suggested in that comment, I think a better immediate fix would be:
- put a pin of the form pytorch <2.3.0 in the testing dependencies
- don't modify anything about the project's tensordict dependency
That should lead the pip and conda solvers to avoid tensordict 0.4.0, because its requirement on torch>=2.3.0 wouldn't be satisfiable. And it'd still make it possible for cugraph users to install cugraph==24.6.0 alongside a newer nccl and newer pytorch.
tensordict's 0.4.0 release happened to be the thing that introduced pytorch >=2.3.0 here, but some other dependency could do that in the future. Putting a ceiling in CI on pytorch is the most reliable way to prevent that from happening again... and hopefully for the next release you can work out a way to test against pytorch >=2.3.0 here in CI (maybe by upgrading beyond the cupy / nccl versions that are leading to the issue you've observed).
We still want the tensordict pin change, though. Otherwise we are locking out users on PyTorch 2.2 and 2.1, which we do not want.
I don't think so? The pin is currently tensordict>=0.1.2:
- tensordict >=0.1.2
That's sufficiently old to be installed alongside pytorch==2.1.0.
I just tried this:
conda create --name pytorch-test --dry-run \
  -c conda-forge \
  -c pytorch \
  'tensordict>=0.1.2' \
  'pytorch==2.1.0'
And got a successful solution like this:
libtorch conda-forge/linux-64::libtorch-2.1.0-cuda118_h7595d50_303
...
pytorch conda-forge/linux-64::pytorch-2.1.0-cuda118_py312hffed1d1_303
...
tensordict conda-forge/noarch::tensordict-0.1.2-pyh6074d0b_0
Output of 'conda info':
active environment : None
user config file : /home/nfs/jlamb/.condarc
populated config files : /raid/jlamb/miniforge/.condarc
/home/nfs/jlamb/.condarc
conda version : 24.3.0
conda-build version : not installed
python version : 3.10.12.final.0
solver : libmamba (default)
virtual packages : __archspec=1=broadwell
__conda=24.3.0=0
__cuda=12.2=0
__glibc=2.31=0
__linux=5.4.0=0
__unix=0=0
base environment : /raid/jlamb/miniforge (writable)
conda av data dir : /raid/jlamb/miniforge/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
package cache : /raid/jlamb/miniforge/pkgs
/home/nfs/jlamb/.conda/pkgs
envs directories : /raid/jlamb/miniforge/envs
/home/nfs/jlamb/.conda/envs
platform : linux-64
user-agent : conda/24.3.0 requests/2.31.0 CPython/3.10.12 Linux/5.4.0-182-generic ubuntu/20.04.6 glibc/2.31 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.8
UID:GID : 10349:10004
netrc file : None
offline mode : False
Sorry, you're right. I'll revert it.
The latest changes look good to me! Thanks for hearing me out.
Sure, no problem.
Also @tingyu66 I should point out that we were already making PyTorch a dependency for Conda, so this just matches it on pip.
If I understand correctly, ... If that's accurate, this is going to be difficult to build in DLFW. We don't build/install torch in the RAPIDS stages, because the DLFW PyTorch team is downstream from us. I don't know how we can both create a ...
I forgot about this. In that case I guess we need to make ...
So @trxcllnt it looks like I already used ...
Latest changes LGTM
See rapidsai gh-4486 for adding torchdata/pydantic and rapidsai gh-4464 for introducing the torch pin due to NCCL difficulties. It is entirely possible that this still fails due to the NCCL issue!
PyTorch 2.2+ is incompatible with the NCCL version on our containers. Normally, this would not be an issue, but there is a bug in CuPy that loads the system NCCL instead of the user-installed NCCL. This PR pins the PyTorch test dependency version to get around this issue.
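As a hedged aside (not part of this PR): the mismatch described above can be spotted at runtime by comparing the NCCL version PyTorch was built against with the one CuPy loads. The calls below exist in recent CUDA builds of torch and cupy, but return formats vary across versions.

```python
# Hedged diagnostic sketch: report the NCCL versions seen by PyTorch and CuPy.
# Assumes CUDA builds of both libraries; return formats differ by version.
import torch

print("torch NCCL:", torch.cuda.nccl.version())  # e.g. (2, 18, 1), or an int on older torch

try:
    from cupy.cuda import nccl
    print("cupy NCCL:", nccl.get_version())  # integer code, e.g. 21803
except ImportError:
    print("CuPy was built without NCCL support")
```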