Arrow: Infer the types when reading #1669

Fokko · 2025-02-16T22:45:20Z

Time to give this another go 😆

When reading a Parquet file using PyArrow, metadata stored in the Parquet file determines whether a column is read as a large type (e.g. large_string) or a normal type (string). The difference is that the large types use 64-bit offsets to encode their arrays. This is not always needed, so we could first check the types in which the data is stored, and let PyArrow decide here:

iceberg-python/pyiceberg/io/pyarrow.py

Line 1579 in 300b840

result = pa.concat_tables(tables, promote_options="permissive")

In PyArrow today we just bump everything to a large type, which might lead to additional memory consumption because it allocates an int64 array for the offsets instead of an int32 array.
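
To put a rough number on the offset overhead described above, here is a minimal back-of-the-envelope sketch. It does not call PyArrow; it only assumes the Arrow variable-size layout, where a column of N values carries N + 1 offsets, and it ignores buffer padding. The helper names are made up for illustration.

```python
# Toy arithmetic only: this models the offsets buffer, it does not call PyArrow.
INT32_BYTES = 4  # offset width for string / binary
INT64_BYTES = 8  # offset width for large_string / large_binary


def offsets_buffer_bytes(num_values: int, large: bool) -> int:
    """Approximate size of the offsets buffer for a variable-length column.

    The layout stores num_values + 1 offsets; padding is ignored here.
    """
    width = INT64_BYTES if large else INT32_BYTES
    return (num_values + 1) * width


def max_data_bytes(large: bool) -> int:
    """Largest value-data size that a signed offset of this width can address."""
    bits = 64 if large else 32
    return 2 ** (bits - 1) - 1


rows = 1_000_000
small = offsets_buffer_bytes(rows, large=False)
big = offsets_buffer_bytes(rows, large=True)
print(f"string offsets:       {small} bytes")
print(f"large_string offsets: {big} bytes")
print(f"extra per column:     {big - small} bytes")
```

The large variant doubles the offsets buffer regardless of how small the actual values are, which is why inferring the type from the files can save memory when the data fits within 32-bit offsets.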

I thought we would be good to go with this now that the lower bound of PyArrow is 17. But it looks like we still have to wait for Arrow 18 to fix the issue with the date types:

apache/arrow#43183

Fixes: #1049

Fokko · 2025-02-18T14:24:34Z

pyiceberg/table/__init__.py

@@ -1750,7 +1750,7 @@ def to_arrow_batch_reader(self) -> pa.RecordBatchReader:
        return pa.RecordBatchReader.from_batches(
            target_schema,
            batches,
-        )
+        ).cast(target_schema)


This will still return large types if you stream the batches because we don't want to fetch all the schemas upfront.

Fokko · 2025-02-18T15:06:33Z

pyiceberg/io/pyarrow.py

+        if property_as_bool(self._io.properties, PYARROW_USE_LARGE_TYPES_ON_READ, False):
+            result = result.cast(arrow_schema)


I left this in, but I would be leaning toward deprecating this, since I don't think we want to trouble the user. I think it should be an implementation detail based on how large the buffers are.

Fokko · 2025-02-18T15:06:57Z

@sungwy Thoughts? :D

sungwy

Hi @Fokko - thank you for pinging me for review! The change looks good to me, but I have a reservation about introducing this change without a deprecation warning.

Firstly, without the PyIceberg code base having a properly defined list of public classes, we assume all our classes to be public facing unless they start with an underscore. I'd argue that removing an input parameter to the ArrowProjectionVisitor __init__ method is an API change.

Secondly, changing the default value of PYARROW_USE_LARGE_TYPES_ON_READ to True for the to_table method also seems like a breaking change for users reading Iceberg tables through PyIceberg. Their large_string columns will change to string columns on upgrade without a warning.

Would it make sense to introduce this change in two stages:

1. First, by introducing a new config variable like PYICEBERG_INFER_LARGE_TYPES_ON_READ, setting it to False by default, and raising a deprecation warning when the flag is set to False?
2. Then, by removing PYICEBERG_INFER_LARGE_TYPES_ON_READ and PYARROW_USE_LARGE_TYPES_ON_READ in the next major version?
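
The first stage of that proposal could look roughly like the sketch below. This is only an illustration under stated assumptions: PYICEBERG_INFER_LARGE_TYPES_ON_READ is the name proposed in the comment, not an existing setting, and `should_infer_types` is a hypothetical helper rather than PyIceberg code.

```python
import warnings

# Name proposed in the review comment; hypothetical, not an existing PyIceberg setting.
INFER_LARGE_TYPES_ON_READ = "PYICEBERG_INFER_LARGE_TYPES_ON_READ"


def should_infer_types(properties: dict) -> bool:
    """Stage 1: keep the old behaviour by default, but warn that it will go away.

    Hypothetical helper: returns True when the reader should infer
    string vs. large_string from the files instead of always upcasting.
    """
    raw = properties.get(INFER_LARGE_TYPES_ON_READ, "false")
    infer = raw.strip().lower() in ("true", "1")
    if not infer:
        warnings.warn(
            f"{INFER_LARGE_TYPES_ON_READ}=False is deprecated; "
            "types will be inferred on read in the next major version.",
            DeprecationWarning,
            stacklevel=2,
        )
    return infer


# Users who have not opted in keep large types and see a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(should_infer_types({}), len(caught))
```

Stage two would then delete the flag and the warning together, leaving inference as the only behaviour.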

sungwy · 2025-02-18T20:56:51Z

pyiceberg/io/pyarrow.py


    def __init__(
        self,
        file_schema: Schema,
        downcast_ns_timestamp_to_us: bool = False,
        include_field_ids: bool = False,
-        use_large_types: bool = True,


I've always dreaded the process of updating our code base in these internal classes and functions because we do not yet have a properly defined list of public classes 😞

Would this change require a deprecation notice first?

sungwy · 2025-02-18T20:58:54Z

pyiceberg/io/pyarrow.py


        result = pa.concat_tables(tables, promote_options="permissive")

+        if property_as_bool(self._io.properties, PYARROW_USE_LARGE_TYPES_ON_READ, False):


Should we update this to align with the current default value?

Suggested change

-        if property_as_bool(self._io.properties, PYARROW_USE_LARGE_TYPES_ON_READ, False):
+        if property_as_bool(self._io.properties, PYARROW_USE_LARGE_TYPES_ON_READ, True):
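
The practical effect of the default argument only shows up for users who never set the property. The sketch below illustrates that with a simplified stand-in for a boolean property lookup. It is not PyIceberg's `property_as_bool`, and the key string is illustrative rather than the real property name.

```python
from typing import Mapping, Optional

# Illustrative key only; not necessarily the real property string.
KEY = "example.use-large-types-on-read"


def as_bool(properties: Mapping[str, str], name: str, default: bool) -> bool:
    """Simplified stand-in for a boolean property lookup with a default."""
    value: Optional[str] = properties.get(name)
    if value is None:
        return default  # property not set: the default decides
    return value.strip().lower() == "true"


unset: dict = {}
explicit_off = {KEY: "false"}

# With default=False, users who never set the property stop getting the cast.
print(as_bool(unset, KEY, default=False))
# With default=True, the same users keep the current large-type behaviour.
print(as_bool(unset, KEY, default=True))
# An explicit setting wins regardless of the default.
print(as_bool(explicit_off, KEY, default=True))
```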

Fokko added 2 commits February 16, 2025 23:43

Less is more 😍

0384b4e

Fokko modified the milestones: PyIceberg 1.0.0, PyIceberg 0.10.0 Feb 18, 2025

Reinstate the table property

6dd9308

Cleanup

2817c61

sungwy reviewed Feb 18, 2025

Fokko mentioned this pull request Feb 26, 2025

fix(table/scanner): Fix nested field scan apache/iceberg-go#311

Merged
