Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: Remove BoundPartitionSpec #771

Merged
merged 8 commits into from
Dec 15, 2024

Conversation

c-thiel
Copy link
Collaborator

@c-thiel c-thiel commented Dec 9, 2024

Fixes #694 (kind of). This is based on what @Fokko and I discussed this morning.

Let me try to summarize our requirements:

  1. In the REST spec and in V1 Table spec, field_id is optional -> we need an UnboundPartitionSpec type. We shouldn't use it anywhere in TableMetadata as field_id is required for V2 tables.
  2. We are unsure if we need a spec type that holds the schema it was bound to. If a PartitionSpec is part of TableMetadata, then the answer is no, as we do have both side by side in the same struct. If PartitionSpec is not part of TableMetadata, then we might need it. Without the schema being part of the spec struct, we wouldn't be able to implement column name mapping as java does: https://github.com/apache/iceberg/blob/70d87f1750627b14b3b25a0216a97db86a786992/core/src/main/java/org/apache/iceberg/TableMetadata.java#L772-L791

My original commit introducing three structs (Bound, Unbound, Schemaless) assumed we need 2). However @Fokko and I couldn't find a good use case. Hence, he is taking the discussion now to Java and see how it goes. If the answer of that discussion is that we don't need 2, then we can merge this PR.

By not storing the schema as part of the partition spec we also loose an easy way to access partition_type of the metadata - it becomes fallible now. This is why I extended TableMetadata with default_partition_type. We need to compute the type anyway to ensure the compatibility of current schema and default spec on load - so we might as well store it.

I would suggest to already review this PR, so that we can click either the "Close" or the "Merge" button when we get Feedback from @Fokko.

CC @Fokko, @Xuanwo, @liurenjie1024

@c-thiel c-thiel changed the title feat!: Remove BoundPartitionSpec feat!: Remove BoundPartitionSpec (WIP) Dec 9, 2024
@c-thiel c-thiel requested a review from Fokko December 9, 2024 16:34
pub(crate) default_spec: BoundPartitionSpecRef,
pub(crate) default_spec: PartitionSpecRef,
/// Partition type of the default partition spec.
pub(crate) default_partition_type: StructType,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to think about this one, but I think I like it 👍

Fokko
Fokko previously approved these changes Dec 10, 2024
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is great @c-thiel. It feels like we're simplifying the code quite a bit. This brings it much closer to the PyIceberg approach, as @liurenjie1024 already mentioned. In PyIceberg we even go a step further, by not validating the current-partition-spec, which there are arguments in both directions: You want to fail quickly, probably if you try to drop a field that's part of the current partition spec, it would fail right away. If you fail later on, you could still load the partition metadata, and rollback to a valid schema.

@Fokko Fokko added this to the 0.4.0 Release milestone Dec 10, 2024
@c-thiel c-thiel changed the title feat!: Remove BoundPartitionSpec (WIP) feat!: Remove BoundPartitionSpec Dec 11, 2024
Xuanwo
Xuanwo previously approved these changes Dec 11, 2024
Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @c-thiel for working on this and also thank you @Fokko for keep watching this!

@Fokko Fokko requested a review from liurenjie1024 December 11, 2024 11:33
@Fokko
Copy link
Contributor

Fokko commented Dec 11, 2024

Gentle ping @liurenjie1024 so we can wrap up the 0.4.0 milestone

liurenjie1024
liurenjie1024 previously approved these changes Dec 14, 2024
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @c-thiel for bringing this, LGTM!

@liurenjie1024
Copy link
Contributor

cc @c-thiel Seems we need to fix conflicts.

@sdd
Copy link
Contributor

sdd commented Dec 14, 2024

Hey all. I'm a bit late to the party on this one, but having taken a first look after not having been closely involved in this, there's something that struck me as being a little confusing at first glance, and probably likely to confuse more people than me as the design discussions in here get stale over time and we're just looking at the code.

Prior to this change, in partition.rs, we have a BoundPartitionSpec and a SchemalessPartitionSpec. It's pretty clear, if not from the names of these structs, then from the struct definitions themselves, what the difference is between the two.

With this change though, we'll have UnboundPartitionSpec and PartitionSpec, which both have exactly the same shape - the distinction between these is not apparent at first glance for someone with fresh eyes or potentially even to future us after having not thought about this code for a while.

Can we add some detail to the docstring for the definition of each that distinguishes each from the other?

@c-thiel c-thiel dismissed stale reviews from liurenjie1024 and Xuanwo via d5053e7 December 15, 2024 10:39
@c-thiel
Copy link
Collaborator Author

c-thiel commented Dec 15, 2024

@liurenjie1024 conflicts resolved. Had to slightly change a function signature:
https://github.com/apache/iceberg-rust/pull/771/files#diff-8389535350ef7cddc266dfd18d589a978643da0334c23e16646e62e8d6a0892eR216-R219

@sdd thanks for the review! I fixed the typo and added a few comments.

/// Partition spec that defines how to produce a tuple of partition values from a record.
///
/// A [`PartitionSpec`] is originally obtained by binding an [`UnboundPartitionSpec`] to a schema and is
/// only guaranteed to be valid for that schema. The main difference between [`PartitionSpec`] and
/// [`UnboundPartitionSpec`] is that the former has field ids assigned,
/// while field ids are optional for [`UnboundPartitionSpec`].

Let me know what you think! Feel free to extend if if you think more is needed.

I like my previous names also more (SchemalessPartitionSpec). But as Python and Java both call it PartitionSpec, it might be better to follow.

@sdd
Copy link
Contributor

sdd commented Dec 15, 2024

Thanks @c-thiel, those comments are great, looks good to me :-)

@Xuanwo
Copy link
Member

Xuanwo commented Dec 15, 2024

Thank you @c-thiel for working on this, and thank @Fokko, @liurenjie1024 and @sdd for the review. We have waited for this for so long, let's move!

@Xuanwo Xuanwo merged commit 821f8dd into apache:main Dec 15, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Why shouldn't we return an UnboundPartitionSpec instead?
5 participants