Deactivating extension types or making ObjectId mapped type polars-compliant? #236
Comments
Here are the casts we do internally:

def _cast_away_extension_type(field: pa.field) -> pa.field:
    if isinstance(field.type, pa.ExtensionType):
        field_without_extension = pa.field(field.name, field.type.storage_type)
    elif isinstance(field.type, pa.StructType):
        field_without_extension = pa.field(
            field.name,
            pa.struct([_cast_away_extension_type(nested_field) for nested_field in field.type]),
        )
    elif isinstance(field.type, pa.ListType):
        field_without_extension = pa.field(
            field.name, pa.list_(_cast_away_extension_type(field.type.value_field))
        )
    else:
        field_without_extension = field
    return field_without_extension


def _arrow_to_polars(arrow_table: pa.Table):
    """Helper function that converts an Arrow Table to a Polars DataFrame.

    Note: Polars lacks ExtensionTypes. We cast them to their base arrow classes.
    """
    if pl is None:
        msg = "polars is not installed. Try pip install polars."
        raise ValueError(msg)
    schema_without_extensions = pa.schema(
        [_cast_away_extension_type(field) for field in arrow_table.schema]
    )
    arrow_table_without_extensions = arrow_table.cast(schema_without_extensions)
    return pl.from_arrow(arrow_table_without_extensions)


def find_polars_all(collection, query, *, schema=None, **kwargs):
    return _arrow_to_polars(find_arrow_all(collection, query, schema=schema, **kwargs))

We could offer a feature to customize the BSON type conversion, so that a cast wouldn't be needed.
Hi @ShaneHarvey, amazing, thanks for the snippet!
Is that expected? Should I not be able to specify a schema that casts an ObjectId field directly to its storage type? This might be what you were getting at with your final comment.
That is expected, as we have not implemented direct support for pa.binary() yet (https://jira.mongodb.org/browse/ARROW-214). Looking into it more, I believe it should be more straightforward to implement support for fixed-size binary, so I opened a new ticket here: https://jira.mongodb.org/browse/ARROW-251
I have a use case where I want to extract data as an Arrow table, save it as Parquet, and later load it with polars.
My problem is that I cannot figure out how to avoid extension types, in particular for ObjectId. At the moment the Parquet file gets stored with the extension type, which polars cannot read. Pymongoarrow functions like find_polars_all somehow manage this casting, but can it be achieved with find_arrow_all?
Is it possible to specify a datatype for ObjectId fields in the pymongoarrow schema, so that it gets recorded in the Arrow table and Parquet file as a binary data type that polars can read out of the box?