Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: don't raise NotImplementedError when reading empty array from Parquet #2194

Merged
merged 3 commits into from
Feb 3, 2023

Conversation

dsavoiu
Copy link
Contributor

@dsavoiu dsavoiu commented Feb 2, 2023

This PR proposes an implementation for reading an zero-length ak.RecordArray / Arrow Table from Parquet.

Since for zero-length there are no batches to use for conversion, an empty RecordBatch is created matching the input schema, and is propagated through the handle_arrow function.

There might be better ways to do this (maybe avoiding the list-of-empty lists?), so let me know.

Closes #2193.

@agoose77 agoose77 requested a review from jpivarski February 2, 2023 19:44
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a different solution from the one I had been thinking of, and if it were not for the fact that the extensionarray=True case leads to

  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 2460, in pyarrow.lib.RecordBatch.from_arrays
  File "pyarrow/table.pxi", line 1406, in pyarrow.lib._sanitize_arrays
  File "pyarrow/array.pxi", line 343, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 317, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: extension

I would have gone with it (with the minor correction of passing the arguments generate_bitmasks and pass_empty_field through the recursive call to handle_arrow). It's a good idea.

The other method, #2193 (comment), doesn't involve Arrow, so at least we won't run into any ArrowNotImplementedError that way.

Also, the PR needs tests. Other Arrow and Parquet conversion tests are parameterized for extensionarray in (False, True) so this would have been caught.

@jpivarski
Copy link
Member

It might be enough to just return this:

form_handle_arrow(obj.schema, pass_empty_field=pass_empty_field).length_zero_array()

@codecov
Copy link

codecov bot commented Feb 2, 2023

Codecov Report

Merging #2194 (76abfd7) into main (4cc61df) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/_connect/pyarrow.py 91.00% <100.00%> (+0.20%) ⬆️

@jpivarski
Copy link
Member

Here is a test that triggers the error:

ak.from_arrow(ak.to_arrow_table(ak.Array([{"x": 1, "y": 2.2}])[0:0], extensionarray=extensionarray))

(not actually Parquet-related). Tests would either go in a file named

tests/test_2194_fix_read_empty_parquet.py

or into the existing

tests/test_1125_to_arrow_from_arrow.py

Your choice, although we usually just make new files for new issues, and you can use the above as a template for parameterizing by extensionarray. The 1125 file has more bells and whistles than you'll probably want.

@dsavoiu
Copy link
Contributor Author

dsavoiu commented Feb 2, 2023

It might be enough to just return this:

form_handle_arrow(obj.schema, pass_empty_field=pass_empty_field).length_zero_array()

That does seems to do the trick! Looks like it needs to be length_zero_array(highlevel=False) though, otherwise the wrapping fails at the end.

Also, the PR needs tests. [...] Tests would either go in a file named tests/test_2194_fix_read_empty_parquet.py or into the existing tests/test_1125_to_arrow_from_arrow.py. Your choice, although we usually just make new files for new issues, and you can use the above as a template for parameterizing by extensionarray. The 1125 file has more bells and whistles than you'll probably want.

Right, but putting the test in 1125 could make use of the arrow_round_trip and parquet_round_trip helper functions without needing to reimplement them or import them across tests. I've added a short test case there using the array you suggested.

@jpivarski
Copy link
Member

You're right about both points! Since I think you're at an inflection point, I'm going to do an "update branch" to make sure that it will work when up-to-date on main. Be sure to git pull before making any more changes.

@jpivarski
Copy link
Member

I've approved the changes, so if it passes tests, it's ready to be merged!

@jpivarski jpivarski merged commit d8e7fc8 into scikit-hep:main Feb 3, 2023
@jpivarski
Copy link
Member

@all-contributors please add @dsavoiu for code

@allcontributors
Copy link
Contributor

@jpivarski

I've put up a pull request to add @dsavoiu! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Schema is not preserved when reading an empty RecordArray from parquet
2 participants