-
Notifications
You must be signed in to change notification settings - Fork 189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Implement Iceberg values #20
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the effort. I think deriving serialization/deserialization for primitive types is ok. But I'm not sure we should implement ser/de for composite types in this way. When ser/de composite types, we should associate more information(such as schema, type) with it to avoid keeping them in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @JanKaul, I left some comments
crates/iceberg/src/spec/values.rs
Outdated
Value::String(_) => Type::Primitive(PrimitiveType::String), | ||
Value::UUID(_) => Type::Primitive(PrimitiveType::Uuid), | ||
Value::Decimal(dec) => Type::Primitive(PrimitiveType::Decimal { | ||
precision: 38, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want to make the precision configurable as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we then might need to store the precision in another field. I don't know how to calculate the precision from the scale and I don't see a way to store it in the rust decimal.
I would suggest two changes to this pr:
|
Co-authored-by: Renjie Liu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM!
Thanks @JanKaul I think we are almost there, just need to update date/time with long/int. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one question about the naming. A lot has happened since I last checked, and I'll leave it up to you to leave it as Literal
or revert it to Value
, it is up to you
PrimitiveLiteral::String(_) => Type::Primitive(PrimitiveType::String), | ||
PrimitiveLiteral::UUID(_) => Type::Primitive(PrimitiveType::Uuid), | ||
PrimitiveLiteral::Decimal(dec) => Type::Primitive(PrimitiveType::Decimal { | ||
precision: 38, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't always 38. It actually depends on the number of bytes that are read. But we can break that out in a separate chore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need to store the precision in an extra field because I don't see a way to get the precision from the rust Decimal
. I'm not sure if this can be somehow calculated depending on the mantissa and the scale.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how we do it in Python:
MAX_PRECISION = tuple(math.floor(math.log10(math.fabs(math.pow(2, 8 * pos - 1) - 1))) for pos in range(24))
REQUIRED_LENGTH = tuple(next(pos for pos in range(24) if p <= MAX_PRECISION[pos]) for p in range(40))
def decimal_required_bytes(precision: int) -> int:
"""Compute the number of bytes required to store a precision.
Args:
precision: The number of digits to store.
Returns:
The number of bytes required to store a decimal with a certain precision.
"""
if precision <= 0 or precision >= 40:
raise ValueError(f"Unsupported precision, outside of (0, 40]: {precision}")
return REQUIRED_LENGTH[precision]
And how we do it in Java: https://github.com/apache/iceberg/blob/f5f543a54ff7460648bb864f4f06a29eb28938b9/api/src/main/java/org/apache/iceberg/types/TypeUtil.java#L679-L715
// under the License. | ||
|
||
/*! | ||
* Value in iceberg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of happened since I reviewed this PR the last time. I'm also okay with calling this a value instead of a literal.
fn avro_bytes_long() { | ||
let bytes = vec![32u8, 0u8, 0u8, 0u8, 0u8, 0u8, 0u8, 0u8]; | ||
|
||
check_avro_bytes_serde( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we testing here? Keep in mind that Avro uses zig-zag encoding for integers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically the upper and lower bounds are stored as avro "bytes" inside the manifest_entry. These bytes conform to the iceberg binary single value serialization. This is a test to check if bytes serialized into avro "bytes" can be deserialized into a literal value.
crates/iceberg/src/spec/values.rs
Outdated
} | ||
} | ||
|
||
fn to_optional_literal(value: Result<Literal, Error>) -> Result<Option<Literal>, Error> { |
This comment was marked as duplicate.
This comment was marked as duplicate.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @JanKaul for working on this, and @liurenjie1024 @Xuanwo @ZENOTME for the reviews 🚀
let reader = apache_avro::Reader::new(&*encoded).unwrap(); | ||
|
||
for record in reader { | ||
let result = apache_avro::from_value::<ByteBuf>(&record.unwrap()).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I can't figure it out what is the difference between result and bytes? 🤔 cc @JanKaul
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! The literal
is supposed to be written to the Avro Writer instead of the bytes
. Like so:
writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap();
I will create a PR to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created a PR to fix it.
This pull request defines the representation of iceberg values. Additionally the serialization/deserialization is implemented.