-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Implement serialize/deserialize for datatypes #6
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @JanKaul
pub fn get_name(&self, name: &str) -> Option<&StructField> { | ||
self.fields.iter().find(|field| field.name == name) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Python and Java we also have convenient formatting of the struct. I think in rust this is the Display
trait. A struct gets formatted as struct<string, int, long, double>
based on the fields. Maybe also something to consider for rust
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've included it in this PR. How is the formatting for maps and lists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Map: map<string, double>
List: list<string>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add this in a separate PR? This way we can get this one in.
/// Time of day without date or timezone. | ||
Time, | ||
/// Timestamp without timezone | ||
Timestamp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark uses Timestamp
to represent timestamp with timezone, TimestampNtz
to represent timestamp without timezone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to approach in this pr since it's same as iceberg's spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also prefer to stick to the iceberg spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the negation rather confusing, on top of that, we should encourage people to use timestamps without timezones =)
/// Time of day without date or timezone. | ||
Time, | ||
/// Timestamp without timezone | ||
Timestamp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to approach in this pr since it's same as iceberg's spec.
src/spec/datatypes.rs
Outdated
#[serde(rename = "struct", tag = "type")] | ||
pub struct StructType { | ||
/// Struct fields | ||
pub fields: Vec<StructField>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about making this field private here? The get/get_name
has a linear searching algorithm, it maybe inefficient for large structs with hundreds of fields. We may need a hashmap to index name/id to fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can create a lookup table to speed up the field access. Should I include it in this PR or should I create a new PR for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I include it in this PR or should I create a new PR for it?
I think it's not a block of merging this PR. We can add a new issue to track this instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's not a block of merging this PR. We can add a new issue to track this instead.
I'm ok with putting lookup table in later PR, just suggest to make fields
private for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lookup table for "id" is probably most important. Would you also create a lookup table for "name"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, accessing by name is also quite frequently in query optimizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Python we have both, and we cache them. A Python example:
def test_schema_index_by_name_visitor(table_schema_nested: Schema) -> None:
"""Test index_by_name visitor function"""
table_schema_nested = schema.Schema(
NestedField(field_id=1, name="foo", field_type=StringType(), required=False),
NestedField(field_id=2, name="bar", field_type=IntegerType(), required=True),
NestedField(field_id=3, name="baz", field_type=BooleanType(), required=False),
NestedField(
field_id=4,
name="qux",
field_type=ListType(element_id=5, element_type=StringType(), element_required=True),
required=True,
),
NestedField(
field_id=6,
name="quux",
field_type=MapType(
key_id=7,
key_type=StringType(),
value_id=8,
value_type=MapType(key_id=9, key_type=StringType(), value_id=10, value_type=IntegerType(), value_required=True),
value_required=True,
),
required=True,
),
NestedField(
field_id=11,
name="location",
field_type=ListType(
element_id=12,
element_type=StructType(
NestedField(field_id=13, name="latitude", field_type=FloatType(), required=False),
NestedField(field_id=14, name="longitude", field_type=FloatType(), required=False),
),
element_required=True,
),
required=True,
),
NestedField(
field_id=15,
name="person",
field_type=StructType(
NestedField(field_id=16, name="name", field_type=StringType(), required=False),
NestedField(field_id=17, name="age", field_type=IntegerType(), required=True),
),
required=False,
),
schema_id=1,
identifier_field_ids=[1],
)
index = schema.index_by_name(table_schema_nested)
assert index == {
"foo": 1,
"bar": 2,
"baz": 3,
"qux": 4,
"qux.element": 5,
"quux": 6,
"quux.key": 7,
"quux.value": 8,
"quux.value.key": 9,
"quux.value.value": 10,
"location": 11,
"location.element": 12,
"location.element.latitude": 13,
"location.element.longitude": 14,
"location.latitude": 13,
"location.longitude": 14,
"person": 15,
"person.name": 16,
"person.age": 17,
}
Let's do a separate PR!
Co-authored-by: Renjie Liu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, let's go!
minor: how about rename PR title to feat: Implement serialize/deserialize for datatypes
?
Thank you @JanKaul for the PR! And @liurenjie1024, @ConeyLiu and @Xuanwo for the review! |
Thank you all for the review and the great input. |
This PR adds the support to serialize/deserialize the iceberg datatypes.