-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Reassign field ids for schema #615
feat: Reassign field ids for schema #615
Conversation
crates/iceberg/src/spec/schema.rs
Outdated
/// If not provided, it will start from 0. | ||
/// | ||
/// All specified aliases and identifier fields will be updated to the new field-ids. | ||
pub fn with_reassigned_field_ids(mut self, start_from: Option<i32>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please elaborate on the circumstances under which the user wants to set this? I used to think users always need to start with 0
.
crates/iceberg/src/spec/schema.rs
Outdated
/// | ||
/// All specified aliases and identifier fields will be updated to the new field-ids. | ||
pub fn with_reassigned_field_ids(mut self, start_from: Option<i32>) -> Self { | ||
self.reassign_field_ids_from = Some(start_from.unwrap_or(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we always set Some(v)
here, how about just using i32
for reassign_field_ids_from
with default value 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically you would pass in last-column-id
from the table metadata. In the case of a CREATE TABLE
this will be zero, in the case of a REPLACE TABLE
it will be last-column-id
where all columns get a new ID and you ensure they are not used before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Xuanwo only doubt with the i32 is that people might input 1 as something is required - and it's shorter than Default::default().
The rust tests are quite a good example that this can happen: We have many tests where field numbering starts from 1, so I had to re-write them to be compatible with the new builder.
I am OK with the i32 as well, just wanted to bring up that a None is probably more verbose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think in this case we shoud enforce user to pass an u32
rather an Option<i32>
, since if this method is called, we always reassign field ids.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think this method is currently only used by TableMetadata
builder? Could we make this method pub(crate)
for now to avoid exposing to user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both good points, fixed in the latest commit.
Its dead_code
for now
cc @liurenjie1024 would you like to take a look too? |
Thanks for pinging me, I'll take a review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @c-thiel for this pr!
crates/iceberg/src/spec/schema.rs
Outdated
/// | ||
/// All specified aliases and identifier fields will be updated to the new field-ids. | ||
pub fn with_reassigned_field_ids(mut self, start_from: Option<i32>) -> Self { | ||
self.reassign_field_ids_from = Some(start_from.unwrap_or(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think in this case we shoud enforce user to pass an u32
rather an Option<i32>
, since if this method is called, we always reassign field ids.
crates/iceberg/src/spec/schema.rs
Outdated
/// | ||
/// All specified aliases and identifier fields will be updated to the new field-ids. | ||
pub fn with_reassigned_field_ids(mut self, start_from: Option<i32>) -> Self { | ||
self.reassign_field_ids_from = Some(start_from.unwrap_or(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think this method is currently only used by TableMetadata
builder? Could we make this method pub(crate)
for now to avoid exposing to user?
crates/iceberg/src/spec/schema.rs
Outdated
let outer_fields = fields | ||
.into_iter() | ||
.map(|field| { | ||
self.old_to_new_id.insert(field.id, self.next_field_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check exisistence of field.id
here since we don't trust user provide field ids, and it may duplicate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the check. It should never trigger currently, because I switched the build method order as you suggested.
Nevertheless the check is good if Reassignment is ever used outside of the builder in the future.
crates/iceberg/src/spec/schema.rs
Outdated
@@ -105,7 +116,16 @@ impl SchemaBuilder { | |||
} | |||
|
|||
/// Builds the schema. | |||
pub fn build(self) -> Result<Schema> { | |||
pub fn build(mut self) -> Result<Schema> { | |||
if let Some(start_from) = self.reassign_field_ids_from { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking that maybe we should put this part in the end of this method? The build process has some check, for example identifier field type, uniqueness of field name. Though current approach is correct, the error reporting maybe confusing. For example when some field doesn't pass identifier type check, current approach reports reassigned id, rather original id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - I didn't think about the error messages.
It's unfortunately not super straightforward to switch the order, as for example index_by_id needs a full rebuild as both keys and values changes. I think we have two options to achieve the swap:
- Use
id_reassigner.apply_*
for all other fields that have been build (i.e.id_to_field
) or - Re-build those fields using the same code - i.e. the build option.
I opted for 2.
While it is slightly less performant because we have to visit the structure again, it is much less error prone as we simply re-use existing and well-tested code.
Let me know what you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also opted for 2, which is much cleaner and easier to maintain. About performance, I don't think this is a concern for now as it's not supposed to be called frequently. Current codes are working as opt 1, are you planning to refactoring it in this pr to opt 2? BTW currently opt 1 also looks good to me, so it's up to you to do it in this pr or future pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liurenjie1024 I refactored it in my latest commit, however we cannot get rid of the two functions:
apply_to_aliases
and apply_to_identifier_fields
. The builder itself is standard now.
@@ -2237,4 +2377,168 @@ table { | |||
let schema = table_schema_simple().0; | |||
assert_eq!(3, schema.highest_field_id()); | |||
} | |||
|
|||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the tests!
@liurenjie1024 I addressed all of your comments. Regarding your uniqueness of id comment: I noticed that this wasn't enforced previously, so I fixed it and added a test for it as well. This is currently making a few tests crash, where key-ids & value-ids of maps are re-used elsewhere in the schema. I am checking how java handles this and will fix the tests accordingly. |
Ok, java does not allow this. It's easy to test with the @liurenjie1024 ready for another round :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @c-thiel for this great pr!
crates/iceberg/src/spec/schema.rs
Outdated
@@ -86,6 +87,16 @@ impl SchemaBuilder { | |||
self | |||
} | |||
|
|||
/// Reassign all field-ids (nested) on build. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Reassign all field-ids (nested) on build. | |
/// Reassign all field-ids (including nested) on build. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in latest commit
crates/iceberg/src/spec/schema.rs
Outdated
@@ -105,7 +116,16 @@ impl SchemaBuilder { | |||
} | |||
|
|||
/// Builds the schema. | |||
pub fn build(self) -> Result<Schema> { | |||
pub fn build(mut self) -> Result<Schema> { | |||
if let Some(start_from) = self.reassign_field_ids_from { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also opted for 2, which is much cleaner and easier to maintain. About performance, I don't think this is a concern for now as it's not supposed to be called frequently. Current codes are working as opt 1, are you planning to refactoring it in this pr to opt 2? BTW currently opt 1 also looks good to me, so it's up to you to do it in this pr or future pr.
@liurenjie1024 last two minor changes added - I think we are good to merge @liurenjie1024 @Xuanwo |
Thanks @c-thiel for this pr! |
* Reassign field ids for schema * Address comments * Schema ensure unique field ids * Fix tests with duplicate nested field ids * Use Schema::builder() for reassigned ids * Better docs
Required for
TableMetadataBuilder
(#587).When new TableMetadata is created, field ids should be reassign to ensure that field numbering is normalized and started from 0.