
Add upsert docs #1665

Merged: 2 commits merged into apache:main, Feb 16, 2025
Conversation

@Fokko (Contributor) commented Feb 15, 2025

And make the join-cols optional using the identifier fields.

@soumilshah1995:
lovely

@kevinjqliu (Contributor) left a comment:

LGTM, minor comment on updating the docstring

@@ -1148,6 +1148,15 @@ def upsert(
"""
from pyiceberg.table import upsert_util

if join_cols is None:
@kevinjqliu commented on this line:

👍 we should also update the docstring
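The diff above only shows the start of the fallback. A minimal sketch of how `join_cols` could default to the schema's identifier fields when omitted (the `Field`/`Schema` classes here are simplified stand-ins, not pyiceberg's actual `NestedField`/`Schema`):

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for pyiceberg's schema classes; the real upsert()
# works against pyiceberg.schema.Schema.
@dataclass
class Field:
    field_id: int
    name: str

@dataclass
class Schema:
    fields: list
    identifier_field_ids: list = field(default_factory=list)

def default_join_cols(schema: Schema, join_cols=None):
    """Fall back to the schema's identifier fields when join_cols is omitted."""
    if join_cols is not None:
        return join_cols
    # Identifier fields are stored as field IDs; resolve them to column names.
    by_id = {f.field_id: f.name for f in schema.fields}
    cols = [by_id[fid] for fid in schema.identifier_field_ids]
    if not cols:
        raise ValueError("Must pass join_cols or define identifier fields on the table")
    return cols

schema = Schema([Field(1, "city"), Field(2, "inhabitants")], identifier_field_ids=[1])
print(default_join_cols(schema))  # ['city']
```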

@ananthdurai:

join_cols seems focused on the primary key. How do we specify the partition column to enable partition pruning?

@kevinjqliu (Contributor) left a comment:

Actually, this doesn't respect the uniqueness of the identifier_field_ids columns.

For example,

def test_upsert_with_identifier_fields(catalog: Catalog) -> None:
    identifier = "default.test_upsert_with_identifier_fields"
    _drop_table(catalog, identifier)

    schema = Schema(
        NestedField(1, "city", StringType(), required=True),
        NestedField(2, "inhabitants", IntegerType(), required=True),
        # Mark City as the identifier field, also known as the primary-key
        identifier_field_ids=[1],
    )

    tbl = catalog.create_table(identifier, schema=schema)

    arrow_schema = pa.schema(
        [
            pa.field("city", pa.string(), nullable=False),
            pa.field("inhabitants", pa.int32(), nullable=False),
        ]
    )

    # Write some data
    df = pa.Table.from_pylist(
        [
            {"city": "Amsterdam", "inhabitants": 921402},
            {"city": "San Francisco", "inhabitants": 808988},
            {"city": "Drachten", "inhabitants": 45019},
            {"city": "Paris", "inhabitants": 2103000},
        ],
        schema=arrow_schema,
    )
    tbl.append(df)

    df = pa.Table.from_pylist(
        [
            {"city": "Paris", "inhabitants": 921402},
        ],
        schema=arrow_schema,
    )
    upd = tbl.upsert(df, join_cols=["inhabitants"], when_not_matched_insert_all=True)

    print(tbl.scan().to_pandas())

@kevinjqliu (Contributor):

> join_cols seems focused on the primary key. How do we specify the partition column to enable partition pruning?

@ananthdurai the partition columns are part of the Iceberg table definition, and the upsert function is called on an Iceberg table. Under the hood, upsert does an overwrite and/or an append; both of those operations are aware of Iceberg's partitioning and handle partition pruning as part of Iceberg's partitioned-write feature.
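As a plain-Python illustration of that point (a toy model, not pyiceberg's actual implementation): with hidden partitioning, an upsert-style overwrite only touches the partitions that the incoming rows map to, so the user never has to specify a partition column.

```python
# Toy model of partition pruning during an upsert-style overwrite.
# Data files are grouped by a partition value derived from each row;
# only partitions that appear in the incoming batch are rewritten.
# (Illustrative only; pyiceberg derives partitions from the table's spec.)

def partition_key(row):
    # Stand-in for a partition transform, e.g. identity on "region".
    return row["region"]

def upsert_overwrite(files_by_partition, incoming, key_col):
    touched = {partition_key(r) for r in incoming}
    for part in touched:  # untouched partitions are pruned entirely
        existing = files_by_partition.get(part, [])
        new_keys = {r[key_col] for r in incoming if partition_key(r) == part}
        kept = [r for r in existing if r[key_col] not in new_keys]
        files_by_partition[part] = kept + [r for r in incoming if partition_key(r) == part]
    return files_by_partition

files = {
    "eu": [{"region": "eu", "city": "Paris", "pop": 2103000}],
    "us": [{"region": "us", "city": "San Francisco", "pop": 808988}],
}
upsert_overwrite(files, [{"region": "eu", "city": "Paris", "pop": 2102650}], "city")
# Only the "eu" partition is rewritten; "us" is never read.
```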

@Fokko (Contributor, Author) commented Feb 16, 2025

@kevinjqliu Yes, that is an issue, but we don't respect this for any of the operations (append, etc.). Enforcing it would make the operations expensive, so we could leave this up to the user. Two more opinionated approaches are:

  • Don't allow join_cols if the table has identifier fields.
  • Remove the join_cols column.

I think it would be nice to promote Iceberg-specific features like the identifier fields, but the above might be too opinionated. Would love to hear what others think.

@ananthdurai Kevin already provided an excellent answer. If you want to learn more, I would recommend reading the docs on hidden partition pruning.

@mattmartin14 (Contributor):

> @kevinjqliu Yes, that is an issue, but we don't respect this for any of the operations (append, etc). …

@Fokko,

I honestly didn't even know about the Iceberg-specific identifier fields until you recently mentioned them; I can't imagine many have. I see situations where teams have already built a ton of Iceberg tables, and it would be easier and more explicit for the user if join_cols is an option they can call out. Otherwise, users who do not know the internal schema of the table and see the code for the first time with no join_cols specified will probably be puzzled and wonder, "how is this thing doing this correctly?"

I'd personally leave the join_cols as an optional way for users to use upsert.

@mattmartin14 (Contributor):

> Actually, this doesn't respect the uniqueness of the identifier_field_ids columns. (Test example quoted in full above; elided here.)

@kevinjqliu,

Not to sound blunt, but the example above seems odd, TBH. If I understand correctly, inhabitants is analogous to the population count for a city. Thus, the join column should be city, not inhabitants: city identifies the unique record, and inhabitants is just an attribute of that record.

@kevinjqliu (Contributor):

> Yes, that is an issue, but we don't respect this for any of the operations (append, etc). Doing this would make the operations expensive so we could leave this up to the user.

You're right, this is an issue for all the write operations; we don't take identifier_field_ids into account when writing. I'll raise a separate issue to track this. For now, I'm OK with leaving this up to the user/external engine. When done correctly, the write operations will respect the uniqueness.
To quote the spec,

> uniqueness of rows by this identifier is not guaranteed or required by Iceberg and it is the responsibility of processing engines or data providers to enforce.

As a follow-up, we can add a uniqueness check to upsert when identifier_field_ids is set, similar to checking for duplicates. I see this issue as a potential footgun, so it's better to verify the uniqueness and prevent data-correctness problems.
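The check being suggested could look something like this (a sketch in plain Python over row dicts; pyiceberg's actual implementation would operate on Arrow tables, and `assert_unique` is a hypothetical helper name):

```python
from collections import Counter

def assert_unique(rows, identifier_cols):
    """Raise if any combination of identifier-column values occurs more than once."""
    counts = Counter(tuple(row[c] for c in identifier_cols) for row in rows)
    dupes = [key for key, n in counts.items() if n > 1]
    if dupes:
        raise ValueError(f"Identifier fields are not unique for: {dupes}")

rows = [
    {"city": "Paris", "inhabitants": 2103000},
    {"city": "Paris", "inhabitants": 921402},  # same city, different attribute
]
assert_unique(rows, ["city", "inhabitants"])  # passes: composite key is unique
# assert_unique(rows, ["city"]) would raise ValueError: "Paris" appears twice
```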

@kevinjqliu (Contributor):

> Not to sound blunt but the example above seems odd TBH

@mattmartin14 it is an odd example! I had a feeling this could break the uniqueness constraint, so I crafted an example to show it. It's not something users will normally write, but it does show a data-correctness issue. This can become a problem when interacting with the table again, since it is assumed that the identifier_field_ids provide uniqueness guarantees.

@kevinjqliu (Contributor) left a comment:

I'm OK with this approach, shifting the responsibility to the user to provide the uniqueness guarantee when using identifier_field_ids.

I raised an issue to figure out the path forward for all write paths when identifier_field_ids is set (#1666), and perhaps a uniqueness check for upsert when identifier_field_ids is set (#1667).

@Fokko (Contributor, Author) commented Feb 16, 2025

> I honestly didn't even know about the iceberg specific identifier fields until you had recently mentioned it. I can't imagine many have. I see situations where teams have already built a ton of iceberg tables and it would be easier and more explicit for the user to understand if join_cols is an option they can call out.

Yes, this is also my concern. Keep in mind that the identifier-field-ids reference the columns by ID, so if you rename a column, nothing breaks :)
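A quick sketch of why renames are safe: identifier-field-ids store field IDs rather than names, so resolving them after a rename still finds the right column (`ToySchema` is a toy model, not pyiceberg's actual `Schema`):

```python
class ToySchema:
    """Toy model: identifier fields are stored as field IDs, not names."""

    def __init__(self, fields, identifier_field_ids):
        self.fields = dict(fields)          # field_id -> column name
        self.identifier_field_ids = list(identifier_field_ids)

    def rename(self, field_id, new_name):
        self.fields[field_id] = new_name    # IDs are stable across renames

    def identifier_names(self):
        return [self.fields[fid] for fid in self.identifier_field_ids]

schema = ToySchema({1: "city", 2: "inhabitants"}, identifier_field_ids=[1])
schema.rename(1, "city_name")
print(schema.identifier_names())  # ['city_name'] -- the identifier survives the rename
```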

Thanks for raising the issues @kevinjqliu. I've provided a way to cover this in #1667 (comment), LMKWYT.

@kevinjqliu merged commit 300b840 into apache:main on Feb 16, 2025
8 checks passed
@kevinjqliu (Contributor):

LGTM! Thanks @Fokko and thanks @mattmartin14 for the review
