
fix(table/scanner): Fix nested field scan #311

Merged
11 commits merged into apache:main on Mar 1, 2025

Conversation

@zeroshade (Member) commented on Feb 20, 2025

Fixes #309

There was a combination of factors that caused the initial problem:

  1. The arrow-go/v18/parquet/pqarrow library wasn't properly propagating the PARQUET:field_id metadata for children of List- or Map-typed fields.
  2. When selecting column indexes, we only iterated the top-level fields and skipped list/map types, which caused us to miss their children. Instead we need to iterate all of the field IDs; this change updates that.
  3. When pruning parquet fields, we were not propagating the correct ColIndex for map-typed columns; we want the leaves, so we need the ColIndex of the children.
  4. Creating the output arrays during ToRequestedSchema led to a memory leak for list/map columns, which needed to be fixed.

A unit test has been added to ensure we are properly able to read the test_all_types table and get the rows without error.
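For illustration only (a sketch, not code from this PR): with arrow-go v18, carrying a field ID through to a list's element means attaching the PARQUET:field_id metadata to the child field itself, roughly as below. The field names and IDs here are made up.

package main

import (
	"fmt"

	"github.com/apache/arrow-go/v18/arrow"
)

func main() {
	// The element field carries its own PARQUET:field_id metadata; losing this
	// metadata for List/Map children is the propagation problem in item 1 above.
	elem := arrow.Field{
		Name:     "element",
		Type:     arrow.PrimitiveTypes.Int64,
		Nullable: true,
		Metadata: arrow.NewMetadata([]string{"PARQUET:field_id"}, []string{"2"}),
	}
	listField := arrow.Field{
		Name:     "ids",
		Type:     arrow.ListOfField(elem),
		Metadata: arrow.NewMetadata([]string{"PARQUET:field_id"}, []string{"1"}),
	}
	fmt.Println(listField.Type, listField.Metadata)
}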

@@ -589,9 +592,9 @@ func SchemaToArrowSchema(sc *iceberg.Schema, metadata map[string]string, include
 // TypeToArrowType converts a given iceberg type, into the equivalent Arrow data type.
 // For dealing with nested fields (List, Struct, Map) if includeFieldIDs is true, then
 // the child fields will contain a metadata key PARQUET:field_id set to the field id.
-func TypeToArrowType(t iceberg.Type, includeFieldIDs bool) (arrow.DataType, error) {
+func TypeToArrowType(t iceberg.Type, includeFieldIDs bool, useLargeTypes bool) (arrow.DataType, error) {
Contributor:
nit: add useLargeTypes to the docstring above
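For context, a hedged sketch (not the PR's actual implementation) of how a useLargeTypes flag conventionally chooses between the 32-bit-offset and 64-bit-offset Arrow string types; the helper name and package are hypothetical.

package conversion // hypothetical package name, for illustration only

import "github.com/apache/arrow-go/v18/arrow"

// stringArrowType picks the Arrow string type based on the flag:
// LargeString uses 64-bit offsets, String uses 32-bit offsets.
func stringArrowType(useLargeTypes bool) arrow.DataType {
	if useLargeTypes {
		return arrow.BinaryTypes.LargeString
	}
	return arrow.BinaryTypes.String
}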

@@ -497,7 +497,10 @@ func (c convertToArrow) Field(field iceberg.NestedField, result arrow.Field) arr

 func (c convertToArrow) List(list iceberg.ListType, elemResult arrow.Field) arrow.Field {
 	elemField := c.Field(list.ElementField(), elemResult)
-	return arrow.Field{Type: arrow.LargeListOfField(elemField)}
+	if c.useLargeTypes {
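The hunk above is truncated at the new if statement. A plausible completion, assuming the non-large branch simply falls back to a regular 32-bit-offset list (a sketch, not the verbatim diff):

	if c.useLargeTypes {
		return arrow.Field{Type: arrow.LargeListOfField(elemField)}
	}
	// Assumed fallback when large types are not requested.
	return arrow.Field{Type: arrow.ListOfField(elemField)}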
Contributor:

I was pretty vocal on this one on the Python side. I strongly believe that we should not expose the large-types option to the user, and that's the direction we're heading with PyIceberg. In the end, it is up to Arrow to decide whether you need large types or whether small types are sufficient.

@zeroshade (Member, Author):

That's actually pretty difficult to do, and would make schema conversion inconsistent.

Whether to use large types cannot be determined until you have the data, at which point, if you're streaming, it's too late to switch: you can't safely change the schema during a stream. For example:

  1. Start reading a parquet file with a string column; the total column data in that file is only 100MB, so we use a regular String (small type) and start streaming record batches.
  2. One of the last files has 3GB of raw data in the string column, so we would need LargeString for that column from that file.
  3. We can't cast the LargeString to String, and we can't change the schema of the stream to switch the column to LargeString, so now we're stuck.

The same problem can occur for List/LargeList depending on the total number of elements in a given column across the lists.

We also can't determine ahead of time, from the stats in the iceberg metadata alone, whether we should use large types. The only way to know ahead of time is to read the parquet file metadata for every data file before we start producing record batches and then reconcile whether we should use large types before we start processing the files.

It looks like in the pyiceberg PR you linked, if I'm reading it correctly, you just automatically promote everything to large types for streaming to avoid the problems I mentioned above? That kind of defeats the benefit if the goal was to avoid using LargeTypes when they aren't needed (and in most cases they aren't).
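A minimal runnable sketch of the constraint described above (an assumed example, not from this PR): String and LargeString are distinct Arrow types with different offset widths, so a stream whose schema was declared with one cannot start emitting batches of the other.

package main

import (
	"fmt"

	"github.com/apache/arrow-go/v18/arrow"
)

func main() {
	// String uses 32-bit offsets (roughly 2 GiB of character data per array);
	// LargeString uses 64-bit offsets. The two schemas are not equal, so batches
	// of one cannot be sent on a stream declared with the other.
	small := arrow.NewSchema([]arrow.Field{{Name: "s", Type: arrow.BinaryTypes.String}}, nil)
	large := arrow.NewSchema([]arrow.Field{{Name: "s", Type: arrow.BinaryTypes.LargeString}}, nil)
	fmt.Println(small.Equal(large)) // false
}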

@kevinjqliu (Contributor) left a comment:

LGTM! I think we can revisit the discussion around large types later since the config is optional

@zeroshade merged commit c43f0ed into apache:main on Mar 1, 2025
10 checks passed
@zeroshade deleted the fix-nested-field-scan branch on Mar 1, 2025 at 17:22
Successfully merging this pull request may close these issues: Support for list types?