read_parquet cannot read fixed size array cell correctly with use_pyarrow=True #16614
Comments
In the future, please make sure to put your code in backticks so that it is more readable. This is a weird one. I changed the test a bit to this
If you focus on the arrow_read_shape, we can see that pyarrow always reads the table with 500,000 rows. When polars reads natively, it also reads 500,000 rows. Regardless of the reader or writer, the from_arrow_shape is 2M, so it seems there's nothing wrong with pyarrow's ability to read the parquet but in However, if we make the pyarrow table from scratch like this
then it works. If we then notice that
and that the 2M rows we were getting back is 4x the original, we can guess that we're getting the same result back 4 times. Also
It seems that if the arrow array is in chunks then it'll fail but...
So it seems to have to do with parquet serialization somehow. |
This seems like an upstream issue with PyArrow. The |
I looked into this a bit more, and it seems we also have a problem on the Polars side: we don't seem to handle sliced arrays. Since I cannot reproduce the issue for other datatypes, I fixed this for sliced |
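To make the "sliced arrays" point concrete, here is a minimal pure-Python sketch of how ignoring a slice could inflate the row count. It is a toy model, not Polars or PyArrow code: `FixedSizeListView`, `rows_ignoring_slice`, and `rows_respecting_slice` are hypothetical names, and the assumption that every chunk is a zero-copy slice over one shared child buffer is only a guess consistent with the 4x row count seen above.

```python
from dataclasses import dataclass


@dataclass
class FixedSizeListView:
    """Toy stand-in for a fixed-size-list array (hypothetical, not a real API)."""
    child: list   # flat child values, shared between all slices
    width: int    # number of child values per row
    offset: int   # first row of this view, counted in rows
    length: int   # number of rows in this view

    def slice(self, offset, length):
        # Zero-copy slice: same child buffer, only offset and length change.
        return FixedSizeListView(self.child, self.width, self.offset + offset, length)


def rows_ignoring_slice(view):
    # Faulty consumer: walks the whole child buffer and ignores offset/length.
    w = view.width
    return [view.child[i:i + w] for i in range(0, len(view.child), w)]


def rows_respecting_slice(view):
    # Correct consumer: only reads the window described by offset/length.
    w = view.width
    start = view.offset * w
    stop = start + view.length * w
    return [view.child[i:i + w] for i in range(start, stop, w)]


# 8 rows of width 2, exposed as 4 chunks of 2 rows each (assumed layout).
full = FixedSizeListView(list(range(16)), width=2, offset=0, length=8)
chunks = [full.slice(i, 2) for i in range(0, 8, 2)]

buggy_total = sum(len(rows_ignoring_slice(c)) for c in chunks)
correct_total = sum(len(rows_respecting_slice(c)) for c in chunks)
print(buggy_total, correct_total)
```

In this model each of the 4 chunks returns all 8 rows when the slice is ignored, so the total is 4 times the true row count, which mirrors the 500,000 versus 2M shapes reported in this thread.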
Here's an issue about structs that was fixed on the Python side only. Some low frequency contributor ;) fixed it there, so there haven't been repeated complaints. |
This fixes the slicing behavior of FixedSizeLists when loaded with PyArrow. I am not sure if this behavior is also faulty at other places (I especially suspect structs), but as long as there are no reported problems there I think this fix is okay for now. Fixes pola-rs#16614.
Checks
Reproducible example
Log output
Issue description
For fixed size array cells, read_parquet does not read the column length correctly if use_pyarrow=True.
Expected behavior
For fixed size array cells, read_parquet should read the column length correctly if use_pyarrow=True.
Installed versions