read_parquet cannot read fixed size array cell correctly with use_pyarrow=True #16614

Closed
2 tasks done
datainfer opened this issue May 31, 2024 · 5 comments · Fixed by #17058
Assignees
Labels
A-interop-arrow Area: interoperability with other Arrow implementations (such as pyarrow) accepted Ready for implementation bug Something isn't working P-medium Priority: medium python Related to Python Polars

Comments

@datainfer

datainfer commented May 31, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

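import polars
import numpy
import pyarrow
import pyarrow.parquet  # explicit submodule import, so pyarrow.parquet.read_table resolves on its own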

n = 500000
name = "coord"
filename = "tmp.parquet"
coord = numpy.zeros((n,2), dtype=numpy.float32)
original = polars.DataFrame({name: polars.Series(coord, dtype=polars.Array(polars.Float32, 2))})
for write in (True, False):
    for read in (True, False):
        original.write_parquet(filename, use_pyarrow=write)
        backtest = polars.read_parquet(filename, use_pyarrow=read)
        arrow = pyarrow.parquet.read_table(filename)
        comment = "Good"
        if original[name].shape[0] != n:
            comment = "Bug!"
        if backtest[name].shape[0] != n:
            comment = "Bug!"
        if arrow[name].length() != n:
            comment = "Bug!"
        print(write, read, original[name].shape, backtest[name].shape, arrow[name].length(), comment)

Log output

True True (500000,) (2000000,) 500000 Bug!
True False (500000,) (500000,) 500000 Good
False True (500000,) (2000000,) 500000 Bug!
False False (500000,) (500000,) 500000 Good

Issue description

For columns of fixed-size arrays, read_parquet returns the wrong column length when use_pyarrow=True: 2,000,000 rows instead of 500,000 in the example above.

Expected behavior

For columns of fixed-size arrays, read_parquet should return the correct column length when use_pyarrow=True.

Installed versions

0.20.23
@datainfer datainfer added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels May 31, 2024
@datainfer
Author

>>> import polars
>>> polars.__version__
'0.20.23'
>>> import pyarrow
>>> pyarrow.__version__
'16.1.0'

@datainfer datainfer changed the title from "read_parquet cannot write/read fixed size array cell correctly with use_pyarrow=True" to "read_parquet cannot read fixed size array cell correctly with use_pyarrow=True" May 31, 2024
@deanm0000
Collaborator

In the future, please put your code in backticks so that it is more readable.

This is a weird one. I changed the test a bit, to this:

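import polars as pl
import pyarrow as pa
import pyarrow.parquet  # explicit submodule import, so pa.parquet.read_table resolves on its own
n = 500000
filename = "tmp.parquet"
original = pl.select(a=pl.repeat(pl.int_ranges(0, 2).cast(pl.Array(pl.Int64, 2)), n))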
for write in (True, False):
    original.write_parquet(filename, use_pyarrow=write)
    for read in (True, False):
        backtest = pl.read_parquet(filename, use_pyarrow=read)
        arrow = pa.parquet.read_table(filename)
        print(f"write={'pa' if write else 'pl'}, read={'pa' if read else 'pl'}, read_shape={backtest.shape[0]} arrow_read_shape={arrow.shape[0]} from_arrow_shape {pl.from_arrow(arrow).shape[0]}")

write=pa, read=pa, read_shape=2000000 arrow_read_shape=500000 from_arrow_shape 2000000
write=pa, read=pl, read_shape=500000 arrow_read_shape=500000 from_arrow_shape 2000000
write=pl, read=pa, read_shape=2000000 arrow_read_shape=500000 from_arrow_shape 2000000
write=pl, read=pl, read_shape=500000 arrow_read_shape=500000 from_arrow_shape 2000000

If you focus on arrow_read_shape, you can see that pyarrow always reads the table with 500,000 rows. When Polars reads natively, it also gets 500,000 rows. Regardless of the reader or writer, from_arrow_shape is 2M, so the problem seems to lie not in pyarrow's ability to read the parquet file but in pl.from_arrow(...)'s ability to convert the result. That said, pl.from_arrow(original.to_arrow()) does give back the proper shape.

However, if we make the pyarrow table from scratch like this

tab=pa.Table.from_arrays([pa.FixedSizeListArray.from_arrays(pa.array([i + x for i in range(500000) for x in [0, 500000]]), 2)], ['a'])

then it works.

If we then notice that

arrow.to_batches()
### returns 4 batches

and that the 2M rows we were getting back is 4x the original, we can guess that we're getting the same result back 4 times.

Also

tab.to_batches()
### returns 1 batch

It seems that if the arrow array is in chunks then it'll fail but...

tbl=pa.Table.from_arrays([pa.chunked_array([
    pa.FixedSizeListArray.from_arrays(pa.array(range(i, i+int(1000000/4)), pa.int64()),2)
    for i in range(0,1000000,int(1000000/4))
])],['a'])

tbl['a'].num_chunks
##4
tbl.shape[0]
##500000
pl.from_arrow(tbl).shape[0]
##500000

So it seems to have to do with parquet serialization somehow.

@deanm0000 deanm0000 added P-medium Priority: medium A-interop-arrow Area: interoperability with other Arrow implementations (such as pyarrow) and removed needs triage Awaiting prioritization by a maintainer labels Jun 5, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jun 5, 2024
@deanm0000 deanm0000 self-assigned this Jun 5, 2024
@coastalwhite coastalwhite self-assigned this Jun 17, 2024
@coastalwhite
Collaborator

This seems like an upstream issue with PyArrow. The ffi::ArrowArray::length does not match the num_rows for FixedSizeLists.

@coastalwhite
Collaborator

I looked into this a bit more and it seems we also have a problem on the Polars side: we don't seem to handle sliced arrays. Since I cannot reproduce the issue for other datatypes, I fixed this for sliced FixedSizeList arrays only. However, other datatypes may suffer from the same problem.
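
For readers following along, here is a toy sketch in plain Python of how unhandled slices could produce exactly the 4x row count seen above. It is an assumption about the mechanism, not the actual Polars or PyArrow code: `FixedSizeListChunk` and both reader functions are hypothetical names, and the sketch assumes each of the 4 batches is a slice (offset plus length) over one shared child buffer.

```python
from dataclasses import dataclass


@dataclass
class FixedSizeListChunk:
    """Hypothetical stand-in for one exported FixedSizeList array."""
    child: list   # flat child values, shared by every slice
    width: int    # number of values per row
    offset: int   # first row covered by this slice
    length: int   # number of rows covered by this slice


def rows_ignoring_slice(chunk):
    # Assumed faulty behaviour: rows are derived from the whole child buffer,
    # so the chunk's offset and length are never consulted.
    return [chunk.child[i:i + chunk.width]
            for i in range(0, len(chunk.child), chunk.width)]


def rows_respecting_slice(chunk):
    # Slice-aware behaviour: only the child values that belong to this
    # chunk's row range are turned into rows.
    start = chunk.offset * chunk.width
    stop = (chunk.offset + chunk.length) * chunk.width
    return [chunk.child[i:i + chunk.width]
            for i in range(start, stop, chunk.width)]


WIDTH = 2
N_ROWS = 8  # small stand-in for the 500,000 rows in the report
child = list(range(N_ROWS * WIDTH))

# Four slices of two rows each, all pointing at the same child buffer.
chunks = [FixedSizeListChunk(child, WIDTH, offset, 2)
          for offset in range(0, N_ROWS, 2)]

naive_total = sum(len(rows_ignoring_slice(c)) for c in chunks)
correct_total = sum(len(rows_respecting_slice(c)) for c in chunks)
print(naive_total, correct_total)  # 32 8
```

In this model every chunk returns the full table when the slice is ignored, so the total is the number of chunks times the true row count, which matches the 2M vs. 500,000 observation with 4 batches.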

@deanm0000
Collaborator

Here's an issue about structs that was fixed on the Python side only. Some low-frequency contributor ;) fixed it there, so there haven't been repeated complaints.

coastalwhite added a commit to coastalwhite/polars that referenced this issue Jun 25, 2024
This fixes the slicing behavior of FixedSizeLists when loaded with PyArrow. I am not sure if this behavior is also faulty at other places (I especially suspect structs), but as long as there are no reported problems there I think this fix is okay for now.

Fixes pola-rs#16614.
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Jun 28, 2024
@c-peters c-peters added the accepted Ready for implementation label Jul 1, 2024
4 participants