Zero-copy buffer protocol data import #204

Merged
merged 9 commits into from
Oct 1, 2024

kylebarron
Owner

Change list

  • Add AnyBufferProtocol enum, with variants for all the Element types. In the FromPyObject impl on AnyBufferProtocol, we simply check for each Element type.
  • Implement into_arrow_array for zero-copy conversion of buffer protocol objects. This relies on the Buffer::from_custom_allocation method in arrow. This is safe because the original PyBuffer is wrapped in an Arc that the Arrow buffer holds as its owner. Dropping the Arrow array decrements the Arc's reference count; when the count reaches zero, the Arc<PyBuffer> is dropped, which in turn calls the original buffer's release method.
  • Check for the buffer protocol by default in the FromPyObject impl on PyArray.
    • This means that all places in the Python API that accept a PyArray now also accept buffer protocol input
  • Remove the custom vendoring of PyAnyBuffer that was added in #156 (Support for python buffer protocol). That was a lot of low-level, relatively unsafe code that is not ideal to maintain here.
  • Add ArrayInput type union, which is a union of ArrowArrayExportable and buffer protocol objects.
  • Update type hints to use ArrayInput instead of ArrowArrayExportable where appropriate
  • Update README to note zero-copy buffer protocol support
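
The lifetime reasoning in the change list can be illustrated from the Python side. The following is only a minimal sketch of general buffer protocol behavior, not arro3's Rust implementation: `array.array` stands in for a numpy array (an assumption made so the example needs only the standard library), and `memoryview` plays the role of the consumer that holds the buffer.

```python
import array

# Any buffer protocol exporter would do; array.array is a stand-in for a
# numpy array here. "Q" is the format code for unsigned 64-bit integers.
source = array.array("Q", range(5))

# Acquiring a view does not copy the data; it borrows the exporter's memory.
view = memoryview(source)

# The consumer learns the element type from the exported format string,
# which is the kind of information a per-element-type check can dispatch on.
fmt = view.format        # format code of the elements
nbytes = view.nbytes     # total size of the borrowed memory in bytes

# Because the memory is shared, a write through the exporter is visible
# through the view.
source[0] = 99
shared = view[0] == 99

# While the view is held, the exporter refuses to resize its memory, so the
# borrowed pointer cannot be invalidated.
try:
    source.append(1)
    resize_blocked = False
except BufferError:
    resize_blocked = True

# Releasing the view is the Python-level analogue of the buffer's release
# call that happens once the last owner is dropped.
view.release()
source.append(1)  # allowed again after release
```

The same pattern is what makes a zero-copy import sound: the consumer must keep the acquired buffer alive for as long as it reads from the memory, and release it exactly once when it is done.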

We're now able to do computations on numpy arrays by default!

import numpy as np
import arro3.compute as ac

np_arr = np.arange(10_000_000, dtype=np.uint64)

%timeit np.max(np_arr)
# 1.37 ms ± 46 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit ac.max(np_arr)
# 1.69 ms ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

This really isn't bad, especially as numpy might be using SIMD here (the arrow crate is SIMD-capable, but we aren't enabling SIMD optimizations in our wheel builds).

@kylebarron kylebarron merged commit 9a01c80 into main Oct 1, 2024
kylebarron added a commit to developmentseed/lonboard that referenced this pull request Oct 3, 2024
### Change list

- With some updates to arro3, now we can use numpy numeric arrays as backing buffers for Arrow arrays. See kylebarron/arro3#204, kylebarron/arro3#208
- This also simplifies some code, because we can pass numpy arrays directly into arro3 functions that expect arrays.

These tests will probably fail until the latest arro3 beta is published