Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove NumPy <2 pin #4615

Merged
merged 27 commits into from
Sep 30, 2024
Merged

Remove NumPy <2 pin #4615

merged 27 commits into from
Sep 30, 2024

Conversation

seberg
Copy link
Contributor

@seberg seberg commented Aug 19, 2024

This PR removes the NumPy<2 pin which is expected to work for
RAPIDS projects once CuPy 13.3.0 is released (CuPy 13.2.0 had
some issues preventing the use with NumPy 2).

@seberg seberg added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Aug 19, 2024
@seberg
Copy link
Contributor Author

seberg commented Aug 23, 2024

Updating branch to trigger CI. I think RAPIDS dependencies should now be resolved sufficiently to actually get NumPy 2 (not sure about all runs, as torch may get in the way depending on the version).

EDIT: No, the cugraph-equivariant picks up NumPy 2, but we will need wholegraph first for most runs, I think.

@seberg
Copy link
Contributor Author

seberg commented Aug 23, 2024

Sorry, this comment was meant for the wholegraph PR. (Although it still means wholegraph doesn't matter for this one...

@jakirkham
Copy link
Member

Updating branch to pull in the latest upstream changes and restart CI now that cuDF is done: rapidsai/cudf#16300

@jakirkham jakirkham marked this pull request as ready for review August 24, 2024 05:51
@jakirkham jakirkham requested a review from a team as a code owner August 24, 2024 05:51
@jakirkham jakirkham requested a review from bdice August 24, 2024 05:51
@jakirkham
Copy link
Member

Documenting for posterity

Seeing some unrelated C++ failures on CI:

The following tests FAILED:
	 50 - EDGE_TRIANGLE_COUNT_TEST (Failed)
	 51 - LOOKUP_SRC_DST_TEST (Failed)
	 52 - K_HOP_NBRS_TEST (Failed)

Will restart the failed job after CI completes

@seberg
Copy link
Contributor Author

seberg commented Aug 24, 2024

Just to note, the test failures seem all due to an out of memory.

terminate called after throwing an instance of 'rmm::out_of_memory'
  what():  std::bad_alloc: out_of_memory: RMM failure at:/opt/conda/envs/test/include/rmm/mr/device/pool_memory_resource.hpp:257: Maximum pool size exceeded

@jakirkham
Copy link
Member

So potentially related to general CI issues we have seen: rapidsai/cuml#6031 (comment)

See rapidsaigh-4486 for adding torchdata/pydantic and rapidsaigh-4464 for introducing
the torch pin due to NCCL difficulties.

It is entirely possible that this still fails due to the NCCL issue!
@seberg seberg requested a review from a team as a code owner August 24, 2024 12:11
Comment on lines 713 to 743
depends_on_pytorch:
common:
- output_types: [conda]
packages:
- &pytorch_unsuffixed pytorch>=2.0,<2.2.0a0
- torchdata
- pydantic

specific:
- output_types: [requirements]
matrices:
- matrix: {cuda: "12.*"}
packages:
- --extra-index-url=https://download.pytorch.org/whl/cu121
- matrix: {cuda: "11.*"}
packages:
- --extra-index-url=https://download.pytorch.org/whl/cu118
- {matrix: null, packages: null}
- output_types: [requirements, pyproject]
matrices:
- matrix: {cuda: "12.*"}
packages:
- &pytorch_pip torch>=2.0,<2.2.0a0
- *tensordict
- matrix: {cuda: "11.*"}
packages:
- *pytorch_pip
- *tensordict
- {matrix: null, packages: [*pytorch_pip, *tensordict]}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think it is ok to keep this. We just want to relax the upper bound. At a minimum we need PyTorch 2.3.0 for NumPy 2 support ( pytorch/pytorch#107302 ). Though wonder if this upper bound can be dropped altogether

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we have cupy>13.2 in our test environment we are ok to remove the upper bound

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also ok if we enforce pytorch >=2.3 for both test and runtime

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the discussion about what to do with the pytorch dependency in one thread, for ease of linking.

@hcho3 would >=2.3.0, as @alexbarghi-nv suggested here, work? A floor would be preferable to a ceiling, I think...otherwise this will have to be changed again in the next release.

Note: we should match whatever's done here in rapidsai/wholegraph#218

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because 2.4.0 has issues with Conda-forge packaging. See conda-forge/pytorch-cpu-feedstock#254 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Sorry, I did see you'd posted that link but misunderstood what it contained. Ok great, I think a ceiling on 2.4.0 is ok here. Let's see what @alexbarghi-nv says.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was recently resolved. So have submitted PR ( #4703 ) to relax this in 24.12

@@ -662,10 +658,12 @@ dependencies:
- output_types: [conda, pyproject]
packages:
- pandas
- pydantic
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what pydantic is being used for

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DGL requires it. It tries to import it and raises an exception if it isn't available. It's used by GraphBolt.

@jakirkham
Copy link
Member

Saw a couple test failures, which look like this on CI:

FAILED tests/data_store/test_property_graph.py::bench_extract_subgraph_for_rmat[df_type=cudf.DataFrame] - NameError: name 'np' is not defined
FAILED tests/data_store/test_property_graph.py::bench_extract_subgraph_for_rmat[df_type=pandas.DataFrame] - NameError: name 'np' is not defined

However the test file referenced has this definition

So not sure why that NameError occurs

@jakirkham
Copy link
Member

Have gone ahead and restarted CI

@seberg
Copy link
Contributor Author

seberg commented Aug 27, 2024

Seems those failures did not clear up, although it isn't clear to me why (or if) they are related to these changes.

@jakirkham
Copy link
Member

jakirkham commented Aug 27, 2024

Including the full traceback of one such error with cuDF on CI

Note there is also a Pandas failure, which looks identical

___________ bench_extract_subgraph_for_rmat[df_type=cudf.DataFrame] ____________

gpubenchmark = <pytest_benchmark.fixture.BenchmarkFixture object at 0x7fe8f938f950>
rmat_PropertyGraph = (<cugraph.experimental.PropertyGraph object at 0x7fe90cf59410>,              src      dst    weight
0         632844  ... 543777  0.816715
16777214  218373   734889  0.940356
16777215  762125  1012603  0.426682

[16777216 rows x 3 columns])

    def bench_extract_subgraph_for_rmat(gpubenchmark, rmat_PropertyGraph):
        from cugraph.experimental import PropertyGraph
    
        (pG, generated_df) = rmat_PropertyGraph
        scn = PropertyGraph.src_col_name
        dcn = PropertyGraph.dst_col_name
    
        verts = []
        for i in range(0, 10000, 10):
            verts.append(generated_df["src"].iloc[i])
    
>       selected_edges = pG.select_edges(f"{scn}.isin({verts}) | {dcn}.isin({verts})")

tests/data_store/test_property_graph.py:2583: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/conda/envs/test/lib/python3.11/site-packages/cugraph/structure/property_graph.py:1555: in select_edges
    selected_col = eval(expr, globals, locals)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   NameError: name 'np' is not defined

<string>:1: NameError

Seems to be pointing to this line, which uses eval. Unfortunately it is hard to debug past there

selected_col = eval(expr, globals, locals)

Best guess is the globals need to include some imported modules (like np)


Edit: To provide summary of the resolution

The issue was NumPy 2 changed the scalar representation as part of NEP 51. As a result, some of the values passed to this eval call looked like np.int64(2), which resulted in the np undefined error.

To fix this Rick, added logic to cast these to Python ints where this isn't an issue. This was done below in commit: 21b689b


Edit 2: To give an idea of how this occurred, this overly simplified example may be instructive

In [1]: import numpy as np

In [2]: L = list(np.arange(3))
   ...: print(L)
[np.int64(0), np.int64(1), np.int64(2)]

In [3]: eval_globals = {}
   ...: eval_locals = {}
   ...: 
   ...: eval(f"sum({L})", eval_globals, eval_locals)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[3], line 4
      1 eval_globals = {}
      2 eval_locals = {}
----> 4 eval(f"sum({L})", eval_globals, eval_locals)

File <string>:1

NameError: name 'np' is not defined

Some resolutions to this would be

  1. Add np to eval_globals
  2. Rewrite the eval expression as sum(L) and include L in eval_locals
  3. Cast the values in L to Python ints first
  4. Use some alternate repr mode from NumPy

Rick went with option 3 below

@jakirkham
Copy link
Member

Looks like cugraph.structure.property_graph (and select_edges specifically) originates in PR: #1999

The test that fails, bench_extract_subgraph_for_rmat, comes from PR: #2056

Both authored by @rlratzel. Perhaps Rick can advise 🙂

ci/test_python.sh Outdated Show resolved Hide resolved
@jakirkham jakirkham requested a review from jameslamb September 26, 2024 20:24
@jakirkham
Copy link
Member

Thanks Philip! 🙏

Think this is looking pretty good

Would be great to hear from James and Alex as well 🙂

Copy link
Member

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving based on what I see in the diff and my own search through the repo... I don't see any references that were missed, and I do believe the new floors on pytorch + NCCL will be enough to fix the issues we saw.

But let's please not merge until we see NumPy 2 in the logs of some test jobs (especially conda-python-tests).

@jakirkham
Copy link
Member

Two of the test jobs errored during source checkout. One of the other builds got hung. The rest of the jobs passed. Canceled to clear out the hung job. Have now restarted those 3 jobs

@jakirkham
Copy link
Member

That said, one of the Conda test jobs does show NumPy 2 is installed

numpy                     2.0.2           py310hd6e36ab_0    conda-forge

@jameslamb
Copy link
Member

jameslamb commented Sep 27, 2024

This wheel test job was failing, so I restarted it:

wheel-tests-cugraph / 12.5.1, 3.10, amd64, ubuntu22.04, v100, latest-driver, oldest-deps

(build link)

It's been running for 3+ hours and I see a ton of error there (no logs with specific tracebacks reported yet, but it's 98% done so close). e.g. like this:

python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[16-DataFrame0-dataset0] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[16-DataFrame0-dataset1] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[16-DataFrame0-dataset2] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[16-DataFrame1-dataset0] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[16-DataFrame1-dataset1] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[16-DataFrame1-dataset2] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[32-DataFrame0-dataset0] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[32-DataFrame0-dataset1] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[32-DataFrame0-dataset2] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[32-DataFrame1-dataset0] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[32-DataFrame1-dataset1] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[32-DataFrame1-dataset2] ERROR [ 83%]

The other wheel-tests-cugraph job succeeded in under 9 minutes.

wheel-tests-cugraph / 11.8.0, 3.12, arm64, ubuntu20.04, a100, latest-driver, latest-deps

(build link)

We'll know more once the job completes and we can see tracebacks, but I think that needs some investigation.

@alexbarghi-nv
Copy link
Member

This wheel test job was failing, so I restarted it:

wheel-tests-cugraph / 12.5.1, 3.10, amd64, ubuntu22.04, v100, latest-driver, oldest-deps

(build link)

It's been running for 3+ hours and I see a ton of error there (no logs with specific tracebacks reported yet, but it's 98% done so close). e.g. like this:

python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[16-DataFrame0-dataset0] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[16-DataFrame0-dataset1] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[16-DataFrame0-dataset2] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[16-DataFrame1-dataset0] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[16-DataFrame1-dataset1] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[16-DataFrame1-dataset2] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[32-DataFrame0-dataset0] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[32-DataFrame0-dataset1] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[32-DataFrame0-dataset2] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[32-DataFrame1-dataset0] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[32-DataFrame1-dataset1] ERROR [ 83%]
python/cugraph/cugraph/tests/sampling/test_uniform_neighbor_sample_mg.py::test_uniform_neighbor_sample_batched[32-DataFrame1-dataset2] ERROR [ 83%]

The other wheel-tests-cugraph job succeeded in under 9 minutes.

wheel-tests-cugraph / 11.8.0, 3.12, arm64, ubuntu20.04, a100, latest-driver, latest-deps

(build link)

We'll know more once the job completes and we can see tracebacks, but I think that needs some investigation.

I'm seeing similar failures in some of the other PRs. Usually these pass and it's the PyG tests that fail. But the failure doesn't seem PyG specific. Is there any chance we pulled the wrong version of dask somewhere?

@jameslamb
Copy link
Member

Maybe! The errors do look Dask-y

E           Exception: Communicator is already initialized
/pyenv/versions/3.10.15/lib/python3.10/site-packages/cugraph/dask/comms/comms.py:156: Exception
_ ERROR at setup of test_dask_mg_all_pairs_sorensen[graph_file:/__w/cugraph/cugraph/datasets/email-Eu-core.csv-directed:False-has_vertex_pair:False-has_vertices:True-has_topk:False-is_weighted:True] _
    @pytest.fixture(scope="module")
    def dask_client():
        # start_dask_client will check for the SCHEDULER_FILE and
        # DASK_WORKER_DEVICES env vars and use them when creating a client if
        # set. start_dask_client will also initialize the Comms singleton.
>       dask_client, dask_cluster = start_dask_client(
            worker_class=IncreasedCloseTimeoutNanny
        )

It seems that in that job, we got these:

dask 2024.9.0
distributed 2024.9.0
dask-expr 1.1.14

(build link)

Same versions I see in the successful wheel-testing job :/

@jakirkham
Copy link
Member

jakirkham commented Sep 27, 2024

Well before that we had a wheel build that was taking a really long time. Honestly am wondering if there is just a bad node in the GHA pool that needs a reboot or perhaps exclusion until it can be repaired

@jakirkham
Copy link
Member

Rerunning the failed job again

The fact that there are two test with slightly different configurations running this test suite with one passing and the other failing is odd. It would be easier to believe Dask was the issue if both jobs were failing (or even more jobs were affected)

Given the flakiness we have seen here with CI, suspect there is an issue in the CI pool itself. Flagging this offline for follow up with the appropriate team

@jakirkham
Copy link
Member

Am noticing this looks a little out-of-date relative to branch-24.10. Going to try merging in latest in case that helps

Copy link
Member

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving. Seeing all CI passing again on a re-run, and NumPy 2 being used in all the wheel and conda test jobs, has me feeling confident that this is working as intended.

I also git grep'd around one more time to check for missed references to NumPy, torch/PyTorch, and NCCL... didn't find any.

From my perspective, this can be merged.

@jameslamb
Copy link
Member

There are several other PRs in cugraph trying to get into 24.10 in the next few days. So if this really did introduce flakiness (instead of it, say, being the result of some issue with the CI runners), we'll find out about it soon.

Given that and @alexbarghi-nv 's written approval of the main changes here in the threads on this PR, I'm confident we can merge this, so I'm going to merge it.

@alexbarghi-nv please @ me any time if you see other issues like #4615 (comment) on the other PRs, I'd be happy to come help investigate.

@jameslamb
Copy link
Member

/merge

@rapids-bot rapids-bot bot merged commit 0f4fe8f into rapidsai:branch-24.10 Sep 30, 2024
132 checks passed
@seberg seberg deleted the my_new_branch branch September 30, 2024 13:43
@jakirkham
Copy link
Member

Thanks all! 🙏

rapids-bot bot pushed a commit to rapidsai/cugraph-gnn that referenced this pull request Nov 22, 2024
As the issue around PyTorch being built without NumPy was fixed in conda-forge, we can now relax these upper bounds to allow PyTorch 2.4.

xref: conda-forge/pytorch-cpu-feedstock#254
xref: conda-forge/pytorch-cpu-feedstock#266
xref: rapidsai/cugraph#4615
xref: rapidsai/cugraph#4703
xref: #59

Authors:
  - https://github.com/jakirkham

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)
  - Tingyu Wang (https://github.com/tingyu66)

URL: #75
rapids-bot bot pushed a commit that referenced this pull request Nov 22, 2024
As the issue around PyTorch being built without NumPy was fixed in conda-forge, we can now relax these upper bounds to allow PyTorch 2.4.

xref: conda-forge/pytorch-cpu-feedstock#254
xref: conda-forge/pytorch-cpu-feedstock#266
xref: #4615

Authors:
  - https://github.com/jakirkham
  - Alex Barghi (https://github.com/alexbarghi-nv)

Approvers:
  - Alex Barghi (https://github.com/alexbarghi-nv)
  - James Lamb (https://github.com/jameslamb)

URL: #4703
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci conda improvement Improvement / enhancement to an existing function non-breaking Non-breaking change python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants