Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relax Arrow pin #16681

Merged
merged 7 commits into from
Aug 28, 2024
Merged

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Aug 28, 2024

Description

With this change, cudf users can install any version of pyarrow greater than 14. This is the minimum version supporting the C Data Interface, which is a requirement for us (it may be possible to relax in principle, but would require changes to the cudf/pylibcudf code). A few tests are skipped due to bugs or missing features in older versions of pyarrow.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vyasr vyasr added feature request New feature or request 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package cuDF (Python) labels Aug 28, 2024
@vyasr vyasr self-assigned this Aug 28, 2024
@vyasr vyasr requested review from a team as code owners August 28, 2024 16:27
@github-actions github-actions bot removed the pylibcudf Issues specific to the pylibcudf package label Aug 28, 2024
@vyasr vyasr mentioned this pull request Aug 28, 2024
3 tasks
Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my local testing I found that the minimum required version might be 15.0.0 because __arrow_c_stream__ seems to be added in 15.0.0:

apache/arrow#38717

But it shows up in changelog of 16.0.0 only(and not 15.0.0):

Screenshot 2024-08-28 at 11 46 11 AM

However, if I install a 14.0.0 version of pyarrow, I still have __arrow_c_stream__ - It might have been possibly back-ported. is that the case? I'm not able to find history of this being back-ported.

These changes look good. But should be rather keep an upper bound pin on the pyarrow version because of the same reasons we do for other libraries?

Locally verified cudf & dask-cudf work quite well for all of the following versions (barring the failures that you already covered with pytest skips in this PR):

  • 14.0.0
  • 15.0.0
  • 15.0.1
  • 16.0.0
  • 16.1.0
  • 17.0.0

conda/environments/all_cuda-118_arch-x86_64.yaml Outdated Show resolved Hide resolved
conda/environments/all_cuda-125_arch-x86_64.yaml Outdated Show resolved Hide resolved
conda/recipes/cudf/meta.yaml Outdated Show resolved Hide resolved
conda/recipes/pylibcudf/meta.yaml Outdated Show resolved Hide resolved
dependencies.yaml Outdated Show resolved Hide resolved
python/cudf/pyproject.toml Outdated Show resolved Hide resolved
python/pylibcudf/pyproject.toml Outdated Show resolved Hide resolved
@vyasr
Copy link
Contributor Author

vyasr commented Aug 28, 2024

The original capsule interface was added in apache/arrow#37797. apache/arrow#39455 looks to be specifically providing the stream capsule for chunked arrays. I believe that we use the __arrow_c_array__ protocol for those instead and rely on combining chunks, hence why Arrow 14 seems to be sufficient. Let me double check what's happening on 14 to be sure.

@vyasr
Copy link
Contributor Author

vyasr commented Aug 28, 2024

Well, the answer is not quite as good as I had hoped. I was right that our current conversion function for ChunkedArray tries to use the __arrow_c_array__ attribute, but that doesn't actually exist so it will just error out. I do appear to be correct that we are always combining chunks as needed before attempting such a conversion though because I was able to run the entire test suite with pylibcudf modified to raise an exception if we ever pass a ChunkedArray to from_arrow. Therefore, I propose that we leave the pinning on this PR as is (allow arrow 14), but remove the erroneous overload of from_arrow for ChunkedArray. If we choose to implement it correctly in the future, we can either bump the minimum pyarrow version or gate just that bit of functionality on the right attributes existing. Does that sound OK?

@galipremsagar
Copy link
Contributor

Well, the answer is not quite as good as I had hoped. I was right that our current conversion function for ChunkedArray tries to use the __arrow_c_array__ attribute, but that doesn't actually exist so it will just error out. I do appear to be correct that we are always combining chunks as needed before attempting such a conversion though because I was able to run the entire test suite with pylibcudf modified to raise an exception if we ever pass a ChunkedArray to from_arrow. Therefore, I propose that we leave the pinning on this PR as is (allow arrow 14), but remove the erroneous overload of from_arrow for ChunkedArray. If we choose to implement it correctly in the future, we can either bump the minimum pyarrow version or gate just that bit of functionality on the right attributes existing. Does that sound OK?

Yup, sounds good to me. 👍

@vyasr
Copy link
Contributor Author

vyasr commented Aug 28, 2024

Cool. Added upper bounds as well.

@vyasr vyasr requested a review from galipremsagar August 28, 2024 17:52
Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to go! Thanks @vyasr !

Copy link
Member

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one comment, but marking this "approve" so you don't need to wait on another review from me after addressing that. @galipremsagar 's testing in #16681 (review) and the discussion after that makes me very confident in this change.

dependencies.yaml Show resolved Hide resolved
@vyasr
Copy link
Contributor Author

vyasr commented Aug 28, 2024

For the record in case it helps inspire future confidence, independently of Prem I also ran the test suite on all the Python versions that he listed with similar results (which is what revealed the skips that I added to the test suite).

@vyasr
Copy link
Contributor Author

vyasr commented Aug 28, 2024

/merge

@rapids-bot rapids-bot bot merged commit 925530a into rapidsai:branch-24.10 Aug 28, 2024
87 checks passed
@vyasr vyasr deleted the feat/loosen_pyarrow_pinning branch August 28, 2024 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants