GH-45457: [Python] Add pyarrow.ArrayStatistics
#45550
@github-actions crossbow submit -g python
@pitrou @jorisvandenbossche Could you take a look at this?
I'll merge this in a few days if nobody objects.
Thanks a lot @kou! Some minor comments below, but LGTM in general.
    if null_count.has_value():
        return null_count.value()
    else:
        return None
For the record, I've opened a Cython feature request to make this more automatic.
Thanks! I've added a comment that refers to the issue.
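For readers on the Python side, the snippet above means `null_count` can come back as `None` when the underlying `std::optional` is empty. A minimal sketch of what that implies for callers follows; `describe_null_count` is a hypothetical helper for illustration, not part of pyarrow. The point is that `0` and `None` are both falsy, so the check should be `is None` rather than a truthiness test.

```python
from typing import Optional


def describe_null_count(null_count: Optional[int]) -> str:
    """Hypothetical helper: distinguish "unknown" from "known to be zero"."""
    # `if not null_count` would wrongly treat 0 the same as None,
    # so compare against None explicitly.
    if null_count is None:
        return "null count unknown"
    return f"{null_count} null(s)"


print(describe_null_count(None))
print(describe_null_count(0))
print(describe_null_count(1))
```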
    std::optional<arrow::ArrayStatistics::ValueType>> data.

    arrow::ArrayStatistics::ValueType is
    std::variant<bool, int64_t, uint64_t, double, std::string>.
`uint64_t` isn't handled below; should the docstring or the code be fixed?
Oh... The code was wrong... I've added the `uint64_t` case.
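As a side note on why a separate `uint64_t` branch is worth having: the upper half of the unsigned 64-bit range has no `int64_t` equivalent. The sketch below uses only the standard `ctypes` module to show what happens when such a value is reinterpreted as signed. It illustrates the general C integer behaviour and makes no claim about what the earlier revision of this PR did with `uint64_t` values; `as_int64` is a hypothetical name.

```python
import ctypes

INT64_MAX = 2**63 - 1


def as_int64(value: int) -> int:
    """Reinterpret the low 64 bits of `value` as a signed 64-bit integer."""
    # ctypes integer types do no overflow checking, so this wraps
    # the same way a C cast to int64_t would.
    return ctypes.c_int64(value).value


big = 2**63 + 5  # representable as uint64_t, but larger than INT64_MAX

assert big > INT64_MAX
assert ctypes.c_uint64(big).value == big   # round-trips as unsigned
assert as_int64(big) == -(2**63) + 5       # wraps to a negative number as signed
```

Python integers are arbitrary precision, so once the unsigned alternative is converted through its own branch the value is preserved as-is.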
python/pyarrow/array.pxi
Outdated
        raise TypeError("Do not call {}'s constructor directly"
                        .format(self.__class__.__name__))

    cdef void init(self, const shared_ptr[CArrayStatistics]& sp_statistics) except *:
`except *` means it could raise Python exceptions, but it doesn't here, so perhaps you can remove that annotation (though it's not really a problem either).
Thanks! I didn't know much about `except` in Cython...
It's the bindings of `arrow::ArrayStatistics`. You can get it by `pyarrow.Array.statistics()`.
Co-authored-by: Antoine Pitrou <[email protected]>
@github-actions crossbow submit -g python
@github-actions crossbow submit -g python
    assert statistics.min == -1
    assert statistics.is_min_exact
    assert statistics.max == 3
    assert statistics.is_max_exact
Can we have a test for `repr(statistics)` to make sure that the string representation works?
It's a good idea. I've added it.
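For context, here is a rough sketch of how the new bindings might be exercised end to end, based on the assertions quoted above and on the PR description. Several things are assumptions rather than confirmed API: that an array read back from Parquet carries statistics, that the accessor is named `statistics` (the code accepts either a method or a property), and that the sample data `[1, None, -1, 3]` matches the data used by the test. `read_statistics` is a hypothetical helper, and it returns `None` whenever pyarrow, its Parquet support, or the feature itself is unavailable.

```python
def read_statistics():
    """Return the statistics of a Parquet-read array as a dict, or None."""
    try:
        import pyarrow as pa
        import pyarrow.parquet as pq
    except ImportError:
        return None  # pyarrow or its Parquet support is not installed

    # Round-trip a small column through an in-memory Parquet file.
    sink = pa.BufferOutputStream()
    pq.write_table(pa.table({"data": pa.array([1, None, -1, 3])}), sink)
    table = pq.read_table(pa.BufferReader(sink.getvalue()))
    chunk = table["data"].chunk(0)

    # Assumption: the accessor is called `statistics`; older pyarrow
    # versions do not have it at all.
    stats = getattr(chunk, "statistics", None)
    if callable(stats):
        stats = stats()
    if stats is None:
        return None  # no statistics attached to this array

    return {
        "null_count": stats.null_count,
        "min": stats.min,
        "is_min_exact": stats.is_min_exact,
        "max": stats.max,
        "is_max_exact": stats.is_max_exact,
        "repr": repr(stats),
    }


print(read_statistics())
```

Including `repr(stats)` in the result mirrors the review suggestion: it at least confirms that building the string representation does not raise.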
@github-actions crossbow submit -g python
Revision: e3a20b5 Submitted crossbow builds: ursacomputing/crossbow @ actions-747dbaddf2
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 631fa0a. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 12 possible false positives for unstable benchmarks that are known to sometimes produce them.
### Rationale for this change

Apache Arrow C++ can attach statistics read from Apache Parquet data to `arrow::Array`. If we have the bindings of the feature in Python, Python users can also use attached statistics.

### What changes are included in this PR?

* Add `pyarrow.ArrayStatistics`
* Add `pyarrow.Array.statistics()`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: apache#45457

Lead-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

Apache Arrow C++ can attach statistics read from Apache Parquet data to `arrow::Array`. If we have the bindings of the feature in Python, Python users can also use attached statistics.

### What changes are included in this PR?

* Add `pyarrow.ArrayStatistics`
* Add `pyarrow.Array.statistics()`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: `arrow::ArrayStatistics` bindings #45457