-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support read Manifest List #56
Conversation
22af280
to
3cd5dfe
Compare
cc @liurenjie1024 @Xuanwo @JanKaul @nastra @Fokko PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the effort! Left some comments.
/// manifest. | ||
#[derive(Debug, Clone)] | ||
pub struct ManifestList { | ||
version: FormatVersion, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we serialize the ManifestList, we need to know serialize it to what format version.🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be controlled by table format version, storing another variable for manifest list is dangerous.
/// A list of field summaries for each partition field in the spec. Each | ||
/// field in the list corresponds to a field in the manifest file’s | ||
/// partition spec. | ||
partitions: Option<Vec<FieldSummary>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the Option
? I think empty vec is good enough for None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will cause a case which will change manifest list implicitly
- read partitions: Some(vec![])
- write partitions back: None
Same things happend in key_metadata
Will it cause wierd for user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. But since it's optional, we don't need to write None
in serialized format? We just need to skip serialization.
/// field: 519 | ||
/// | ||
/// Implementation-specific key metadata for encryption | ||
key_metadata: Option<Vec<u8>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above
/// Parse manifest list from bytes. | ||
/// | ||
/// QUESTION: Will we have more than one manifest list in a single file? | ||
pub fn parse( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to follow our current approach:
- ManifestListV1 and ManifestListV2
- ManifestListEnum
- ManifestList <-> ManifsetEnum
You can take schema as an example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concern does use ManifestEnum will have extra performance cost. The generate code will try to serialize ManifestListV2 first and then ManifestListV1. If we pass a version number, this cost can be avoid.🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to iceberg spec, manifest list will be avro file. The read/write process would be similar to read/write manifest file, e.g. we need to define avro schema first. So it's not blindly try v1/v2.
use super::ManifestListEntry; | ||
|
||
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq)] | ||
pub(super) struct ManifestListEntryV1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have similar approach for MainfestList
And I find some place is inconsistent with spec. https://iceberg.apache.org/spec/#manifests:~:text=504-,added_files_count,-int
Actually this field_summary field is not a optional value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this. I have one comment but otherwise looks good to me.
#[derive(Debug, PartialEq, Clone)] | ||
pub enum ManifestContentType { | ||
/// The manifest content is data. | ||
Data = 0, | ||
/// The manifest content is deletes. | ||
Deletes = 1, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you have to distinguish between position deletes = 1 and equality deletes = 2. If we use the serde_repr
crate we could directly serialize/deserialize it as follows:
/// The type of files tracked by the manifest, either data or delete files; Data(0) for all v1 manifests
#[derive(Debug, Serialize_repr, Deserialize_repr, PartialEq, Eq, Clone)]
#[repr(u8)]
pub enum ManifestContentType {
/// Data.
Data = 0,
/// Deletes at position.
PositionDeletes = 1,
/// Delete by equality.
EqualityDeletes = 2,
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you have to distinguish between position deletes = 1 and equality deletes = 2. If we use the serde_repr crate we could directly serialize/deserialize it as follows:
Thanks for the reminder! I have a question:
In spec of data file, it distinguish position deletes and equality delete.
In spec of manifest_list, it only distinguish data and delete.
So is that just a inconsistent in spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java's implementation also only has Data
or Delete
in manifest content type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you have to distinguish between position deletes = 1 and equality deletes = 2. If we use the serde_repr crate we could directly serialize/deserialize it as follows:
Thanks for the reminder! I have a question:
In spec of data file, it distinguish position deletes and equality delete. In spec of manifest_list, it only distinguish data and delete.
So is that just a inconsistent in spec?
Sorry, my bad. I thought it was the same struct as in datafile
. But it looks like they are different. Maybe the information is not required in ManifestList. Forget my comment then.
How about submitting fix to iceberg-docs? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are almost there, thanks!
crates/iceberg/src/avro/mod.rs
Outdated
@@ -18,3 +18,4 @@ | |||
//! Avro related codes. | |||
#[allow(dead_code)] | |||
mod schema; | |||
pub use schema::*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub use schema::*; | |
pub(crate) use schema::*; |
Avro schema is not intended for external users.
crates/iceberg/src/avro/schema.rs
Outdated
@@ -215,7 +215,7 @@ impl SchemaVisitor for SchemaToAvroSchema { | |||
} | |||
|
|||
/// Converting iceberg schema to avro schema. | |||
pub(crate) fn schema_to_avro_schema(name: impl ToString, schema: &Schema) -> Result<AvroSchema> { | |||
pub fn schema_to_avro_schema(name: impl ToString, schema: &Schema) -> Result<AvroSchema> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above.
crates/iceberg/src/avro/schema.rs
Outdated
@@ -454,7 +454,7 @@ impl AvroSchemaVisitor for AvroSchemaToSchema { | |||
} | |||
|
|||
/// Converts avro schema to iceberg schema. | |||
pub(crate) fn avro_schema_to_schema(avro_schema: &AvroSchema) -> Result<Schema> { | |||
pub fn avro_schema_to_schema(avro_schema: &AvroSchema) -> Result<Schema> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above.
#[derive(Debug, PartialEq, Clone)] | ||
pub enum ManifestContentType { | ||
/// The manifest content is data. | ||
Data = 0, | ||
/// The manifest content is deletes. | ||
Deletes = 1, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java's implementation also only has Data
or Delete
in manifest content type.
|
||
#[test] | ||
fn test_parse_manifest_list_v1() { | ||
let path = format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ut only checks deserialization. We should also check deserialization.
)) | ||
}) | ||
}; | ||
const MANIFEST_LENGTH: Lazy<NestedFieldRef> = { | ||
pub static MANIFEST_LENGTH: Lazy<NestedFieldRef> = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we remove const
and use pub static
instead? I think const
is better here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't evaluate(Lazy) the const in runtime. Seems we only can use static in this case.🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to change modifier to pub(crate)
76582cf
to
c5036eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for your effort!
f78e19c
to
a801d0a
Compare
Any other comments? cc @Fokko |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, ASF releases do not allow binary files. Should we generate these files or exclude them from the release?
cc @Fokko for comments as you are likely to be our first release manager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add binary files to the repository, but we should not release them, so if you can exclude them from the artifact, then we're good. In general, we try to avoid adding binary files. In PyIceberg we use FastAvro to generate them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is worth the effort to generate the files. This way we can also generate the whole structure later on (manifest-list, manifest, datafile). Since Iceberg requires absolute paths, it is not easy to generate these files on forehand (looking at the Windows CI 😁)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is worth the effort to generate the files. This way we can also generate the whole structure later on (manifest-list, manifest, datafile). Since Iceberg requires absolute paths, it is not easy to generate these files on forehand (looking at the Windows CI 😁)
+1, We can investigate how to do it in following PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked by #70
LGTM, thank you all for your efforts. |
related issue: #36
For the unifed veiw, it follow https://iceberg.apache.org/spec/#writer-requirements basically.
For write, it may need a writer interface like icelake so we need to implement it after #49 finish.