-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: support conversion to arrow and back with non-option Unknown type #3085
fix: support conversion to arrow and back with non-option Unknown type #3085
Conversation
The unit tests required checking for nullable sub-arrays, as only these would be valid for Parquet etc. To properly disambiguate the nullable Arrow arrays, we require adding more metadata. In this case, I'm adding "is_empty_array" to node_parameters. I'm not yet confident that all code paths are tested and are correct yet. Other enclosing array types likely need support.
* EmptyArray no longer abuses node_parameters but instead gets its own ArrowArrayType property, _is_nonnullable_nulltype. On conversion to arrow, instead of <type>.is_option we test <type>._arrow_needs_option_type()
This change seems to be about code consistency only. I did not find any way to test this change. A table column must be a list type capable of having a non-zero number of columns. An EmptyArray column type doesn't make any sense.
Added unit test for this. And ammending previous commit with a similar unit test for ak_to_arrow_array!
This intentionally fails with this commit. Also dressing up RecordArray tests a little just for fun.
(Even this makes zero sense!)
It might be better to test pyarrow version instead of hasattr(...) though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only really important issue is on emptyarray.py new line 376: you're setting is_nonnullable_nulltype=True
for every EmptyArray.
If it's originally a non-optional unknowntype
, you want to wrap it in UnmaskedArray and set is_nonnullable_nulltype=True
so that when it gets read back from Arrow, the fake UnmaskedArray gets taken off again.
If it's originally a ?unknowntype
(the ?
is option-type), you want to not wrap it in UnmaskedArray (it's already in an option-tyep) and set is_nonnullable_nulltype=True
so that when it gets read back from Arrow, nothing gets taken off again.
To ensure that this behavior gets tested, you should have some originally ?unknowntype
examples. You're currently testing
[]
ofunknowntype
[[], [], []]
ofvar * unknowntype
Add to that
[]
of?unknowntype
: you can make it withak.Array([None])[0:0]
[[], [], []]
ofvar * ?unknowntype
: you can make it withak.Array([[], [], [], [None]])[0:3]
.
The rest of my comments are minor coding style preferences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done! I checked the code carefully and I tested a few manually constructed arrays:
>>> import numpy as np
>>> import awkward as ak
>>> from awkward.contents import EmptyArray, UnmaskedArray, ListOffsetArray
>>> from awkward.index import Index64
>>>
>>> ak.from_arrow(ak.to_arrow(EmptyArray(), extensionarray=False), highlevel=False)
<IndexedOptionArray len='0'>
<index><Index dtype='int64' len='0'>[]</Index></index>
<content><EmptyArray len='0'/></content>
</IndexedOptionArray>
>>>
>>> ak.from_arrow(ak.to_arrow(EmptyArray(), extensionarray=True), highlevel=False)
<EmptyArray len='0'/>
That's good: with extensionarray=False
, it can't round-trip: the EmptyArray comes back as an option-type (IndexedOptionArray) EmptyArray, but that's because it has nowhere to store the metadata. Meanwhile, the extensionarray=True
case does round-trip: it's using that metadata.
Same thing inside of a list:
>>> ak.from_arrow(ak.to_arrow(ListOffsetArray(Index64(np.array([0])), EmptyArray()), extensionarray=False), highlevel=False)
<ListOffsetArray len='0'>
<offsets><Index dtype='int64' len='1'>[0]</Index></offsets>
<content><IndexedOptionArray len='0'>
<index><Index dtype='int64' len='0'>
[]
</Index></index>
<content><EmptyArray len='0'/></content>
</IndexedOptionArray></content>
</ListOffsetArray>
>>>
>>> ak.from_arrow(ak.to_arrow(ListOffsetArray(Index64(np.array([0])), EmptyArray()), extensionarray=True), highlevel=False)
<ListOffsetArray len='0'>
<offsets><Index dtype='int64' len='1'>[0]</Index></offsets>
<content><EmptyArray len='0'/></content>
</ListOffsetArray>
Without an extensionarray
, it can't store the metadata to know that the EmptyArray originally wasn't wrapped in any option-type node, but with an extensionarray
, it knows that and reproduces the original type.
Now if we wrap the EmptyArray in an option-type (UnmaskedArray), it always comes back with the option-type, regardless of the extensionarray
setting.
>>> ak.from_arrow(ak.to_arrow(UnmaskedArray(EmptyArray()), extensionarray=False), highlevel=False)
<IndexedOptionArray len='0'>
<index><Index dtype='int64' len='0'>[]</Index></index>
<content><EmptyArray len='0'/></content>
</IndexedOptionArray>
>>>
>>> ak.from_arrow(ak.to_arrow(UnmaskedArray(EmptyArray()), extensionarray=True), highlevel=False)
<IndexedOptionArray len='0'>
<index><Index dtype='int64' len='0'>[]</Index></index>
<content><EmptyArray len='0'/></content>
</IndexedOptionArray>
>>>
>>> ak.from_arrow(ak.to_arrow(ListOffsetArray(Index64(np.array([0])), UnmaskedArray(EmptyArray())), extensionarray=False), highlevel=False)
<ListOffsetArray len='0'>
<offsets><Index dtype='int64' len='1'>[0]</Index></offsets>
<content><IndexedOptionArray len='0'>
<index><Index dtype='int64' len='0'>
[]
</Index></index>
<content><EmptyArray len='0'/></content>
</IndexedOptionArray></content>
</ListOffsetArray>
>>>
>>> ak.from_arrow(ak.to_arrow(ListOffsetArray(Index64(np.array([0])), UnmaskedArray(EmptyArray())), extensionarray=True), highlevel=False)
<ListOffsetArray len='0'>
<offsets><Index dtype='int64' len='1'>[0]</Index></offsets>
<content><IndexedOptionArray len='0'>
<index><Index dtype='int64' len='0'>
[]
</Index></index>
<content><EmptyArray len='0'/></content>
</IndexedOptionArray></content>
</ListOffsetArray>
The unit tests test these cases as well, but not explicitly with UnmaskedArray. (Constructing it from Python lists makes the option-type with IndexedOptionArray, but the code isn't hard-coded to any particular node type, just the option-ness.)
So I think this is ready to merge!
…e-but-arrow-doesnt
When we convert an Awkward array of type Unknown, this becomes
an arrow array (even prior to this) with zero length, nullable-null type,
but with Awkward metadata indicating no mask (non-option-type).
I believe in this case we can be sure that the original Awkward array was an EmptyArray.