Support for list types? #309

Closed · GabrielM98 opened this issue Feb 19, 2025 · 4 comments · Fixed by #311

GabrielM98 commented Feb 19, 2025

Apache Iceberg version

v0.1.0

Please describe the bug 🐞

Does the library support scanning tables with fields of type list?

I'm seeing some strange behaviour whilst attempting to scan a table (with all fields selected and no row filters applied) with the following schema:

{
    "type": "struct",
    "schema-id": 0,
    "fields": [
        {
            "id": 1,
            "name": "uuid",
            "required": false,
            "type": "string"
        },
        {
            "id": 2,
            "name": "source",
            "required": false,
            "type": {
                "type": "struct",
                "fields": [
                    {
                        "id": 5,
                        "name": "type",
                        "required": false,
                        "type": "string"
                    },
                    {
                        "id": 6,
                        "name": "serviceId",
                        "required": false,
                        "type": "string"
                    }
                ]
            }
        },
        {
            "id": 3,
            "name": "subjects",
            "required": false,
            "type": {
                "type": "list",
                "element-id": 7,
                "element": {
                    "type": "struct",
                    "fields": [
                        {
                            "id": 8,
                            "name": "type",
                            "required": false,
                            "type": "string"
                        },
                        {
                            "id": 9,
                            "name": "id",
                            "required": false,
                            "type": "string"
                        }
                    ]
                },
                "element-required": false
            }
        },
        {
            "id": 4,
            "name": "timing",
            "required": false,
            "type": {
                "type": "struct",
                "fields": [
                    {
                        "id": 10,
                        "name": "createdAt",
                        "required": false,
                        "type": "timestamptz"
                    },
                    {
                        "id": 11,
                        "name": "emittedAt",
                        "required": false,
                        "type": "timestamptz"
                    }
                ]
            }
        }
    ]
}

When I call (*table.Scan).ToArrowRecords and attempt to loop over the resulting iterator, the loop yields nothing.
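
For reference, this is roughly how I'm driving the scan (a simplified sketch with the table already loaded; exact signatures may differ slightly between iceberg-go versions):

```go
package scanexample

import (
	"context"
	"fmt"

	"github.com/apache/iceberg-go/table"
)

// scanAll mirrors the scan described above: every field selected and no row
// filter applied. Table loading is omitted.
func scanAll(ctx context.Context, tbl *table.Table) error {
	scan := tbl.Scan()

	_, records, err := scan.ToArrowRecords(ctx)
	if err != nil {
		return err
	}

	for rec, err := range records {
		if err != nil {
			// With the race described below, this error is sometimes
			// swallowed and the loop simply yields nothing.
			return err
		}
		fmt.Println(rec.NumRows())
		rec.Release()
	}
	return nil
}
```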

Hooking a debugger up to my code, I can see that an error is returned by (*table.Scan).recordsFromTask (here), which causes the context to be cancelled; hence the iterator returns without yielding anything. On some occasions, however, it does yield the error, which suggests a race condition between the write to the done channel of the context.Context and the write to the out channel in (*table.Scan).recordsFromTask (here).
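
To illustrate what I mean by the race, here is a generic sketch (self-contained stand-in types, not the library's actual code): unless each send to out is guarded by a select on ctx.Done(), the send races the cancellation and the error can be dropped.

```go
package raceexample

import "context"

// result carries only an error here; the real code also carries a record.
type result struct {
	err error
}

// produce guards every send with a select on ctx.Done(), which makes the
// outcome deterministic: the consumer either receives the error or observes
// a clean cancellation, instead of racing the two writes.
func produce(ctx context.Context, out chan<- result, work []func(context.Context) error) {
	defer close(out)
	for _, fn := range work {
		err := fn(ctx)
		select {
		case out <- result{err: err}:
		case <-ctx.Done():
			return
		}
		if err != nil {
			return
		}
	}
}
```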

Race condition aside, the error being returned is the following...

error encountered during arrow schema visitor: invalid schema: cannot convert list: type=struct<type: utf8, id: utf8>, nullable to Iceberg field, missing field_id

I've been doing a bit of digging and noticed some intriguing behaviour with regard to the projected field IDs. It appears that if a field is of type map or list, it doesn't get added to the set of projected field IDs (see the switch statement here). Is this functionality yet to be implemented, or is this the intended behaviour? Thanks.
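
For clarity, this is the behaviour I'd expect, sketched with hypothetical stand-in types rather than the library's actual ones: collecting projected field IDs should recurse into list (and map) children as well as structs.

```go
package main

import "fmt"

// Hypothetical stand-in types, not iceberg-go's actual ones.
type fieldType interface{ isType() }

type primitive string

func (primitive) isType() {}

type structType struct{ fields []field }

func (structType) isType() {}

type listType struct {
	elementID int
	element   fieldType
}

func (listType) isType() {}

type field struct {
	id  int
	typ fieldType
}

// collectIDs records the field's own ID and recurses into nested types, so
// list elements (and their struct children) end up in the projected set too.
func collectIDs(f field, ids map[int]struct{}) {
	ids[f.id] = struct{}{}
	switch t := f.typ.(type) {
	case structType:
		for _, child := range t.fields {
			collectIDs(child, ids)
		}
	case listType:
		collectIDs(field{id: t.elementID, typ: t.element}, ids)
	}
}

func main() {
	// The "subjects" column from the schema above: list<struct<type, id>>.
	subjects := field{id: 3, typ: listType{elementID: 7, element: structType{fields: []field{
		{id: 8, typ: primitive("string")},
		{id: 9, typ: primitive("string")},
	}}}}

	ids := map[int]struct{}{}
	collectIDs(subjects, ids)
	fmt.Println(ids) // expect 3, 7, 8 and 9 all to be present
}
```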

@zeroshade (Member)

This is definitely a bug: nested types like map and list are supposed to be included and scanned properly. I think I've identified the issue and am working on a fix now; it's a bit more complex than just modifying the projectedFieldIds function. That said, I do need to address that race condition 😄

zeroshade added a commit to apache/arrow-go that referenced this issue Feb 21, 2025
### Rationale for this change
Discovered while fixing apache/iceberg-go#309: we didn't correctly propagate the field-id metadata to children of List or Map fields, only to structs.

### What changes are included in this PR?
A new MapType creator for constructing MapTypes from Arrow fields for the Key and Items, and a fix to the `pqarrow` schema manifest creation so that the field-id metadata is correctly propagated to the child fields of nested types.

### Are these changes tested?
Unit test is added.

### Are there any user-facing changes?
Schemas produced by pqarrow when reading List/Map-typed fields will now correctly contain the `PARQUET:field_id` metadata key.
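
To make the user-facing change concrete, here is a hand-built sketch (not code from the PR) of the "subjects" column from the schema above as an Arrow field, with the `PARQUET:field_id` metadata attached to the list's children as well as to the list itself. The child field name "element" and the module path are assumptions.

```go
package main

import (
	"fmt"

	"github.com/apache/arrow-go/v18/arrow"
)

// fieldID attaches the PARQUET:field_id metadata key to a field, which is
// what the fixed schema manifest creation now does for list/map children.
func fieldID(f arrow.Field, id string) arrow.Field {
	f.Metadata = arrow.NewMetadata([]string{"PARQUET:field_id"}, []string{id})
	return f
}

func main() {
	// Hand-built equivalent of the "subjects" column, for illustration only.
	elem := fieldID(arrow.Field{
		Name: "element",
		Type: arrow.StructOf(
			fieldID(arrow.Field{Name: "type", Type: arrow.BinaryTypes.String, Nullable: true}, "8"),
			fieldID(arrow.Field{Name: "id", Type: arrow.BinaryTypes.String, Nullable: true}, "9"),
		),
		Nullable: true,
	}, "7")

	subjects := fieldID(arrow.Field{
		Name:     "subjects",
		Type:     arrow.ListOfField(elem),
		Nullable: true,
	}, "3")

	fmt.Println(subjects)
}
```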
@zeroshade (Member)

@GabrielM98 could you please take a look at the linked PR and confirm that it solves your problem?

@GabrielM98 (Author)

LGTM @zeroshade 👍 I'm getting some repeated warn level logs from the AWS SDK (see below), but other than that it works as expected. Thanks for the quick fix!

SDK 2025/02/24 08:45:52 WARN Response has no supported checksum. Not validating response payload.

@zeroshade (Member)

Yeah, I'm seeing the same warnings. I think it's related to apache/iceberg#12264, and I might have to disable the strong integrity checksum.
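
For anyone else hitting that warning, something along these lines should quiet it by only validating response checksums when the service requires it. This assumes a recent aws-sdk-go-v2 that exposes the checksum configuration options added alongside the S3 default integrity protections; the option names may differ in other versions, and how the config gets threaded into the table's FileIO depends on your setup.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
)

func main() {
	// Only validate response checksums when required, instead of the
	// "when supported" default that logs a warning for responses that
	// carry no supported checksum.
	cfg, err := config.LoadDefaultConfig(context.Background(),
		config.WithResponseChecksumValidation(aws.ResponseChecksumValidationWhenRequired),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = cfg // pass cfg to the S3 client / FileIO used by the table scan
}
```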
