-
Notifications
You must be signed in to change notification settings - Fork 189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: Remove BoundPartitionSpec
#771
Conversation
BoundPartitionSpec
BoundPartitionSpec
(WIP)
pub(crate) default_spec: BoundPartitionSpecRef, | ||
pub(crate) default_spec: PartitionSpecRef, | ||
/// Partition type of the default partition spec. | ||
pub(crate) default_partition_type: StructType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to think about this one, but I think I like it 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is great @c-thiel. It feels like we're simplifying the code quite a bit. This brings it much closer to the PyIceberg approach, as @liurenjie1024 already mentioned. In PyIceberg we even go a step further, by not validating the current-partition-spec
, which there are arguments in both directions: You want to fail quickly, probably if you try to drop a field that's part of the current partition spec, it would fail right away. If you fail later on, you could still load the partition metadata, and rollback to a valid schema.
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
BoundPartitionSpec
(WIP)BoundPartitionSpec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gentle ping @liurenjie1024 so we can wrap up the 0.4.0 milestone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @c-thiel for bringing this, LGTM!
cc @c-thiel Seems we need to fix conflicts. |
Hey all. I'm a bit late to the party on this one, but having taken a first look after not having been closely involved in this, there's something that struck me as being a little confusing at first glance, and probably likely to confuse more people than me as the design discussions in here get stale over time and we're just looking at the code. Prior to this change, in With this change though, we'll have Can we add some detail to the docstring for the definition of each that distinguishes each from the other? |
@liurenjie1024 conflicts resolved. Had to slightly change a function signature: @sdd thanks for the review! I fixed the typo and added a few comments. /// Partition spec that defines how to produce a tuple of partition values from a record.
///
/// A [`PartitionSpec`] is originally obtained by binding an [`UnboundPartitionSpec`] to a schema and is
/// only guaranteed to be valid for that schema. The main difference between [`PartitionSpec`] and
/// [`UnboundPartitionSpec`] is that the former has field ids assigned,
/// while field ids are optional for [`UnboundPartitionSpec`].
Let me know what you think! Feel free to extend if if you think more is needed. I like my previous names also more (SchemalessPartitionSpec). But as Python and Java both call it |
Thanks @c-thiel, those comments are great, looks good to me :-) |
Thank you @c-thiel for working on this, and thank @Fokko, @liurenjie1024 and @sdd for the review. We have waited for this for so long, let's move! |
Fixes #694 (kind of). This is based on what @Fokko and I discussed this morning.
Let me try to summarize our requirements:
field_id
is optional -> we need anUnboundPartitionSpec
type. We shouldn't use it anywhere inTableMetadata
asfield_id
is required for V2 tables.PartitionSpec
is part ofTableMetadata
, then the answer is no, as we do have both side by side in the same struct. IfPartitionSpec
is not part ofTableMetadata
, then we might need it. Without the schema being part of the spec struct, we wouldn't be able to implement column name mapping as java does: https://github.com/apache/iceberg/blob/70d87f1750627b14b3b25a0216a97db86a786992/core/src/main/java/org/apache/iceberg/TableMetadata.java#L772-L791My original commit introducing three structs (Bound, Unbound, Schemaless) assumed we need 2). However @Fokko and I couldn't find a good use case. Hence, he is taking the discussion now to Java and see how it goes. If the answer of that discussion is that we don't need 2, then we can merge this PR.
By not storing the schema as part of the partition spec we also loose an easy way to access
partition_type
of the metadata - it becomes fallible now. This is why I extendedTableMetadata
withdefault_partition_type
. We need to compute the type anyway to ensure the compatibility of current schema and default spec on load - so we might as well store it.I would suggest to already review this PR, so that we can click either the "Close" or the "Merge" button when we get Feedback from @Fokko.
CC @Fokko, @Xuanwo, @liurenjie1024