fixed size list type is not retained when writing to parquet #957

Open
matko opened this issue Nov 25, 2024 · 7 comments
Labels
bug Something isn't working

Comments


matko commented Nov 25, 2024

When I write an Arrow table that has a fixed size list column to Parquet with datafusion and then read the resulting file back, the column is no longer a fixed size list but a variable size list.

Example:

import datafusion as df
import pyarrow as pa

FILENAME = "/tmp/fixed_array_example.parquet"
ctx = df.SessionContext()

array = pa.array([[1.0, 2.0], [3.0, 4.0]], type=pa.list_(pa.float32(), 2))
table = pa.Table.from_pydict({"array": array})
df_table = ctx.from_arrow(table)
print("original schema:")
print(df_table.schema())

df_table.write_parquet(FILENAME)
print("roundtrip schema:")
print(ctx.read_parquet(FILENAME).schema())

Output:

original schema:
array: fixed_size_list<item: float>[2]
  child 0, item: float
roundtrip schema:
array: list<item: float>
  child 0, item: float

As the output demonstrates, the datafusion dataframe that is written out has the proper schema. Nevertheless, the file that is read back does not.

If instead of datafusion, I use pyarrow to write the parquet file, I do get the expected schema when I read it back using datafusion.

import datafusion as df
import pyarrow as pa
import pyarrow.parquet as pq

FILENAME = "/tmp/fixed_array_example_pyarrow.parquet"
ctx = df.SessionContext()

array = pa.array([[1.0, 2.0], [3.0, 4.0]], type=pa.list_(pa.float32(), 2))
table = pa.Table.from_pydict({"array": array})

print("original schema:")
print(table.schema)

pq.write_table(table, FILENAME)
print("roundtrip schema:")
print(ctx.read_parquet(FILENAME).schema())

Output:

original schema:
array: fixed_size_list<item: float>[2]
  child 0, item: float
roundtrip schema:
array: fixed_size_list<element: float>[2]
  child 0, element: float

matko added the bug label Nov 25, 2024
Contributor

kosiew commented Dec 3, 2024

Contributor

kylebarron commented Dec 17, 2024

I think you want to set skip_metadata=False. The current default is True:

skip_metadata: bool = True,

In my opinion False should be the default; otherwise I think no Arrow-only types will be retained.

Contributor

kosiew commented Dec 18, 2024

Hi @kylebarron,

I changed the default to skip_metadata: bool = False and ran:

import datafusion as df
import pyarrow as pa

FILENAME = "/tmp/fixed_array_example.parquet"
ctx = df.SessionContext()

array = pa.array([[1.0, 2.0], [3.0, 4.0]], type=pa.list_(pa.float32(), 2))
table = pa.Table.from_pydict({"array": array})
df_table = ctx.from_arrow(table)
print("original schema:")
print(df_table.schema())

df_table.write_parquet(FILENAME)

ctx.register_parquet("test_fixed_list", FILENAME)

print("register parquet schema:")
print(ctx.table("test_fixed_list").schema())

but the fixed size list type was still not retained:

original schema:
array: fixed_size_list<item: float>[2]
  child 0, item: float
==> register_parquet with skip_metadata:  False
register parquet schema:
array: list<item: float>
  child 0, item: float

@kylebarron
Contributor

In this case it's an issue with writing the Parquet file, not reading, which you can see if you try to read the file back with pyarrow:

In [23]: import pyarrow.parquet as pq

In [24]: pq.read_schema(FILENAME)
Out[24]:
array: list<item: float>
  child 0, item: float

The cause is that the writing side doesn't propagate the Arrow schema metadata either.

Here's how pyarrow.parquet correctly propagates the Arrow schema within the Parquet metadata:

In [32]: pq.write_table(table, "test.parquet")

In [33]: meta2 = pq.read_metadata('test.parquet')

In [34]: meta2.metadata
Out[34]: {b'ARROW:schema': b'/////6gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAEAAAAzP///wAAARAUAAAAIAAAAAQAAAABAAAALAAAAAUAAABhcnJheQAGAAgABAAGAAAAAgAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABAxAAAAAcAAAABAAAAAAAAAAEAAAAaXRlbQAABgAIAAYABgAAAAAAAQA='}
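
For reference, the ARROW:schema value above is a base64-encoded Arrow IPC schema message. Below is a minimal stdlib-only sketch that decodes the blob copied from that output and inspects its framing. The helper name describe_arrow_schema_blob is made up for illustration, and the header layout (a 0xFFFFFFFF continuation marker followed by a little-endian int32 metadata length) is an assumption based on the Arrow IPC encapsulated message format:

```python
import base64
import struct

# Value of b'ARROW:schema' copied from the pyarrow-written file above.
ARROW_SCHEMA_B64 = (
    b"/////6gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAA"
    b"BAAAAAEAAAAEAAAAzP///wAAARAUAAAAIAAAAAQAAAABAAAALAAAAAUAAABhcnJh"
    b"eQAGAAgABAAGAAAAAgAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABAxAAAAAcAAAA"
    b"BAAAAAAAAAAEAAAAaXRlbQAABgAIAAYABgAAAAAAAQA="
)


def describe_arrow_schema_blob(blob_b64: bytes) -> dict:
    """Decode the base64 blob and report a few facts about its framing.

    Hypothetical helper for illustration; not part of pyarrow or datafusion.
    """
    raw = base64.b64decode(blob_b64)
    # Assumed layout: uint32 continuation marker, then int32 metadata length,
    # both little-endian, followed by the flatbuffer-encoded schema.
    continuation, metadata_len = struct.unpack_from("<Ii", raw, 0)
    return {
        "total_bytes": len(raw),
        "has_continuation_marker": continuation == 0xFFFFFFFF,
        "metadata_len": metadata_len,
        # Field names are stored as plain strings inside the flatbuffer.
        "mentions_array_field": b"array" in raw,
        "mentions_item_field": b"item" in raw,
    }


info = describe_arrow_schema_blob(ARROW_SCHEMA_B64)
print(info)
```

The column name "array" and the child field name "item" are visible in the decoded bytes, which is how a reader can rebuild the exact Arrow type instead of falling back to the plain Parquet list.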

However there's no embedded Arrow schema in the Parquet file written by DataFusion:

In [35]: df_table.write_parquet(FILENAME)

In [36]: meta = pq.read_metadata(FILENAME)

In [37]: meta.metadata # None

@kylebarron
Contributor

IMO not writing the Arrow schema to Parquet is a big bug.

Trying to track this down...

let mut options = TableParquetOptions::default();
options.global.compression = Some(compression_string);
wait_for_future(
    py,
    self.df.as_ref().clone().write_parquet(
        path,
        DataFrameWriteOptions::new(),
        Option::from(options),
    ),
)?;

This just calls datafusion::dataframe::DataFrame::write_parquet with the default options.

I'm not sure where on the datafusion side this fails.

@kylebarron
Contributor

Looks like it's this bug: apache/datafusion#11770

@kylebarron
Contributor

The core underlying bug has been fixed; see apache/datafusion#11770 (comment)
