-
Notifications
You must be signed in to change notification settings - Fork 189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Table metadata #29
Conversation
I will add some more tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Others LGTM, thanks for you hard work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very great work, thanks @JanKaul About the integration test, are you planning to do it in following pr or in this pr?
By integration tests you mean reading and writing an actual metadata.json file? |
I can include it here. Where should I place the files for testing? Should I create a folder at the workspace level? |
I see others projects usually put a 'testdata' folder alongside 'src' folder. That's similar to ''' cc @Xuanwo any other suggestions? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are almost there!
cc @Fokko PTAL |
crates/iceberg/src/spec/schema.rs
Outdated
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)] | ||
#[serde(rename_all = "kebab-case")] | ||
/// Names and types of fields in a table. | ||
pub(crate) struct SchemaV1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SchemaV1
(and SchemaV2
) are internal structs and are not visible to a user of the library. SchemaV1
is just used for serialization/deserialization. We can still do your recommended conversion.
The only publicly visible struct for a schema is the Schema
struct, which has the same representation for v1 and v2 tables.
crates/iceberg/src/spec/schema.rs
Outdated
type Error = Error; | ||
fn try_from(value: SchemaV1) -> Result<Self> { | ||
Schema::builder() | ||
.with_schema_id(value.schema_id.unwrap_or(DEFAULT_SCHEMA_ID)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like it isn't set to null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is called when deserializing a v1 schema into the general Schema
struct. If the v1 schema doesn't have a schema id, we assign a default schema_id on read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @JanKaul
crates/iceberg/src/spec/snapshot.rs
Outdated
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq)] | ||
#[serde(rename_all = "kebab-case")] | ||
/// A snapshot represents the state of a table at some time and is used to access the complete set of data files in the table. | ||
pub(crate) struct SnapshotV2 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I would combine the V1 and V2. The sequence_number
is added later on, and there is some logic to set it afterward:
def _inherit_sequence_number(entry: ManifestEntry, manifest: ManifestFile) -> ManifestEntry:
"""Inherits the sequence numbers.
More information in the spec: https://iceberg.apache.org/spec/#sequence-number-inheritance
Args:
entry: The manifest entry that has null sequence numbers.
manifest: The manifest that has a sequence number.
Returns:
The manifest entry with the sequence numbers set.
"""
# The snapshot_id is required in V1, inherit with V2 when null
if entry.snapshot_id is None:
entry.snapshot_id = manifest.added_snapshot_id
# in v1 tables, the data sequence number is not persisted and can be safely defaulted to 0
# in v2 tables, the data sequence number should be inherited iff the entry status is ADDED
if entry.data_sequence_number is None and (manifest.sequence_number == 0 or entry.status == ManifestEntryStatus.ADDED):
entry.data_sequence_number = manifest.sequence_number
# in v1 tables, the file sequence number is not persisted and can be safely defaulted to 0
# in v2 tables, the file sequence number should be inherited iff the entry status is ADDED
if entry.file_sequence_number is None and (manifest.sequence_number == 0 or entry.status == ManifestEntryStatus.ADDED):
# Only available in V2, always 0 in V1
entry.file_sequence_number = manifest.sequence_number
return entry
This can happen when deserializing the JSON, or later on (like we do in PyIceberg).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parent_snapshot_id: v2.parent_snapshot_id, | ||
sequence_number: v2.sequence_number, | ||
timestamp_ms: v2.timestamp_ms, | ||
manifest_list: match v2.manifest_list { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, in PyIceberg we don't check for the manifests
field. cc @rdblue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
#[test] | ||
fn test_table_data_v1() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend making a very minimal v1 spec, where schema
is present, but schemas
is missing. Same with partition-spec
and partition-specs
missing. And for sort-order
etc.
Left some comments, great work @JanKaul 🚀 |
Hi, @JanKaul I would suggest to add integration tests with json data in following pr. This pr is a little to large for me. |
6a3044a
to
b8fd0ad
Compare
It seems that the
|
Good idea, thank you for your great comments! |
To add some background here about the design philosophy here for reviews not familiar with rust:
We have a discussion about the overall structure design in #2 #3 cc @Fokko Hope this comment can help you understand it better. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @JanKaul and @liurenjie1024 for teaching me on Rust, appreciate it!
This looks good, thanks!
.with_partition_field(PartitionField { | ||
name: "ts_day".to_string(), | ||
transform: Transform::Day, | ||
source_id: 4, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idk what the best place for Rust is to do validation, but in this case, source id 4 does not exist in the current schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, it is probably best to do it during deserialization. We should add it in another PR.
This PR defines all structures necessary to represent Iceberg Table Metadata. The main focus lies on serialization and deserialization from JSON. Some functionality might need to be added later on.