fixed size list type is not retained when writing to parquet #957
Comments
Research note: |
I think you want to set
In my opinion that should be the default, otherwise no Arrow-only types will be retained I think. |
hi @kylebarron , I amended
but the fixed size list type was still not retained:
|
In this case it's an issue with writing the Parquet file, not reading, which you can see if you try to read the file back with pyarrow:
In [23]: import pyarrow.parquet as pq
In [24]: pq.read_schema(FILENAME)
Out[24]:
array: list<item: float>
  child 0, item: float
In this case it's actually because the writing side doesn't correctly propagate the Arrow metadata either. Here's how
In [32]: pq.write_table(table, "test.parquet")
In [33]: meta2 = pq.read_metadata('test.parquet')
In [34]: meta2.metadata
Out[34]: {b'ARROW:schema': b'/////6gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAEAAAAzP///wAAARAUAAAAIAAAAAQAAAABAAAALAAAAAUAAABhcnJheQAGAAgABAAGAAAAAgAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABAxAAAAAcAAAABAAAAAAAAAAEAAAAaXRlbQAABgAIAAYABgAAAAAAAQA='}
However there's no embedded Arrow schema in the Parquet file written by DataFusion:
|
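For anyone who wants to sanity-check what that `ARROW:schema` value actually holds, here is a minimal stdlib-only sketch. It assumes the value is a base64-encoded Arrow IPC schema message, framed as a 0xFFFFFFFF continuation marker followed by a little-endian int32 metadata length. `inspect_arrow_schema` is a hypothetical helper name for illustration, not part of pyarrow or DataFusion.

```python
import base64
import struct

# Value of the b'ARROW:schema' key, copied from the pyarrow-written file in this thread.
ARROW_SCHEMA_B64 = (
    "/////6gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAI"
    "AAAABAAAAAEAAAAEAAAAzP///wAAARAUAAAAIAAAAAQAAAABAAAALAAAAAUA"
    "AABhcnJheQAGAAgABAAGAAAAAgAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAAB"
    "AxAAAAAcAAAABAAAAAAAAAAEAAAAaXRlbQAABgAIAAYABgAAAAAAAQA="
)


def inspect_arrow_schema(value_b64: str) -> dict:
    """Hypothetical helper: decode the value and read the 8-byte framing prefix."""
    raw = base64.b64decode(value_b64)
    # Assumed layout: uint32 continuation marker, then int32 metadata length,
    # both little-endian, followed by the serialized schema itself.
    continuation, metadata_len = struct.unpack_from("<Ii", raw, 0)
    return {
        "total_bytes": len(raw),
        "continuation": continuation,
        "metadata_len": metadata_len,
        # Field names are stored as plain strings, so they are visible in the raw bytes.
        "mentions_array": b"array" in raw,
        "mentions_item": b"item" in raw,
    }


info = inspect_arrow_schema(ARROW_SCHEMA_B64)
print(info)
```

The same check on a file's key/value metadata is a quick way to tell whether a writer embedded the Arrow schema at all. Decoding the actual types needs a real Arrow reader such as pyarrow.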
IMO not writing the Arrow schema to Parquet is a big bug. Trying to track this down...
datafusion-python/src/dataframe.rs, lines 510 to 520 in 79c22d6
This just calls I'm not sure where on the datafusion side this fails. |
Looks like it's this bug: apache/datafusion#11770 |
The core underlying bug was fixed in apache/datafusion#11770 (comment) |
When I create a parquet file from an arrow table with a fixed size array as one of the columns, then read back the resulting parquet, the column is no longer a fixed size array, but instead a dynamically sized array.
Example:
Output:
As the output demonstrates, the datafusion dataframe that is written out has the proper schema. Nevertheless, the file that is read back does not.
If instead of datafusion, I use pyarrow to write the parquet file, I do get the expected schema when I read it back using datafusion.
Output: