Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support read Manifest List #56

Merged
merged 5 commits into from
Oct 2, 2023
Merged

Conversation

ZENOTME
Copy link
Contributor

@ZENOTME ZENOTME commented Sep 2, 2023

related issue: #36

For the unifed veiw, it follow https://iceberg.apache.org/spec/#writer-requirements basically.

  • For the field in V1 is optional but require in V2 and spec didn't state the default value , we make it optional in unifed view.
  • But for this kind of field with indication of default value, e.g. content field, we make it require in unifed view.

For write, it may need a writer interface like icelake so we need to implement it after #49 finish.

@ZENOTME ZENOTME marked this pull request as ready for review September 3, 2023 03:45
@ZENOTME
Copy link
Contributor Author

ZENOTME commented Sep 3, 2023

cc @liurenjie1024 @Xuanwo @JanKaul @nastra @Fokko PTAL

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the effort! Left some comments.

crates/iceberg/src/spec/manifest_list.rs Outdated Show resolved Hide resolved
/// manifest.
#[derive(Debug, Clone)]
pub struct ManifestList {
version: FormatVersion,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we serialize the ManifestList, we need to know serialize it to what format version.🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be controlled by table format version, storing another variable for manifest list is dangerous.

/// A list of field summaries for each partition field in the spec. Each
/// field in the list corresponds to a field in the manifest file’s
/// partition spec.
partitions: Option<Vec<FieldSummary>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the Option? I think empty vec is good enough for None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will cause a case which will change manifest list implicitly

  1. read partitions: Some(vec![])
  2. write partitions back: None
    Same things happend in key_metadata

Will it cause wierd for user?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. But since it's optional, we don't need to write None in serialized format? We just need to skip serialization.

/// field: 519
///
/// Implementation-specific key metadata for encryption
key_metadata: Option<Vec<u8>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above

/// Parse manifest list from bytes.
///
/// QUESTION: Will we have more than one manifest list in a single file?
pub fn parse(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to follow our current approach:

  1. ManifestListV1 and ManifestListV2
  2. ManifestListEnum
  3. ManifestList <-> ManifsetEnum

You can take schema as an example

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concern does use ManifestEnum will have extra performance cost. The generate code will try to serialize ManifestListV2 first and then ManifestListV1. If we pass a version number, this cost can be avoid.🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to iceberg spec, manifest list will be avro file. The read/write process would be similar to read/write manifest file, e.g. we need to define avro schema first. So it's not blindly try v1/v2.

use super::ManifestListEntry;

#[derive(Debug, Serialize, Deserialize, PartialEq, Eq)]
pub(super) struct ManifestListEntryV1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have similar approach for MainfestList

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Sep 7, 2023

And I find some place is inconsistent with spec.

https://iceberg.apache.org/spec/#manifests:~:text=504-,added_files_count,-int
In partice, this field in avro is added_data_files_count same thing exist in: existing_files_count, deleted_files_count

Optional fields, array elements, and map values must be wrapped in an Avro union with null. This is the only union type allowed in Iceberg data files.

manifest_list:
   partitions: `list<508: field_summary>`

Actually this field_summary field is not a optional value.

Copy link
Collaborator

@JanKaul JanKaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this. I have one comment but otherwise looks good to me.

Comment on lines +414 to +426
#[derive(Debug, PartialEq, Clone)]
pub enum ManifestContentType {
/// The manifest content is data.
Data = 0,
/// The manifest content is deletes.
Deletes = 1,
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have to distinguish between position deletes = 1 and equality deletes = 2. If we use the serde_repr crate we could directly serialize/deserialize it as follows:

/// The type of files tracked by the manifest, either data or delete files; Data(0) for all v1 manifests
#[derive(Debug, Serialize_repr, Deserialize_repr, PartialEq, Eq, Clone)]
#[repr(u8)]
pub enum ManifestContentType {
    /// Data.
    Data = 0,
    /// Deletes at position.
    PositionDeletes = 1,
    /// Delete by equality.
    EqualityDeletes = 2,
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have to distinguish between position deletes = 1 and equality deletes = 2. If we use the serde_repr crate we could directly serialize/deserialize it as follows:

Thanks for the reminder! I have a question:

In spec of data file, it distinguish position deletes and equality delete.
In spec of manifest_list, it only distinguish data and delete.

So is that just a inconsistent in spec?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java's implementation also only has Data or Delete in manifest content type.

Copy link
Collaborator

@JanKaul JanKaul Sep 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have to distinguish between position deletes = 1 and equality deletes = 2. If we use the serde_repr crate we could directly serialize/deserialize it as follows:

Thanks for the reminder! I have a question:

In spec of data file, it distinguish position deletes and equality delete. In spec of manifest_list, it only distinguish data and delete.

So is that just a inconsistent in spec?

Sorry, my bad. I thought it was the same struct as in datafile. But it looks like they are different. Maybe the information is not required in ManifestList. Forget my comment then.

@liurenjie1024
Copy link
Contributor

And I find some place is inconsistent with spec.

https://iceberg.apache.org/spec/#manifests:~:text=504-,added_files_count,-int In partice, this field in avro is added_data_files_count same thing exist in: existing_files_count, deleted_files_count

Optional fields, array elements, and map values must be wrapped in an Avro union with null. This is the only union type allowed in Iceberg data files.

manifest_list:
   partitions: `list<508: field_summary>`

Actually this field_summary field is not a optional value.

How about submitting fix to iceberg-docs?

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are almost there, thanks!

@@ -18,3 +18,4 @@
//! Avro related codes.
#[allow(dead_code)]
mod schema;
pub use schema::*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub use schema::*;
pub(crate) use schema::*;

Avro schema is not intended for external users.

@@ -215,7 +215,7 @@ impl SchemaVisitor for SchemaToAvroSchema {
}

/// Converting iceberg schema to avro schema.
pub(crate) fn schema_to_avro_schema(name: impl ToString, schema: &Schema) -> Result<AvroSchema> {
pub fn schema_to_avro_schema(name: impl ToString, schema: &Schema) -> Result<AvroSchema> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above.

@@ -454,7 +454,7 @@ impl AvroSchemaVisitor for AvroSchemaToSchema {
}

/// Converts avro schema to iceberg schema.
pub(crate) fn avro_schema_to_schema(avro_schema: &AvroSchema) -> Result<Schema> {
pub fn avro_schema_to_schema(avro_schema: &AvroSchema) -> Result<Schema> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above.

Comment on lines +414 to +426
#[derive(Debug, PartialEq, Clone)]
pub enum ManifestContentType {
/// The manifest content is data.
Data = 0,
/// The manifest content is deletes.
Deletes = 1,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java's implementation also only has Data or Delete in manifest content type.


#[test]
fn test_parse_manifest_list_v1() {
let path = format!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ut only checks deserialization. We should also check deserialization.

))
})
};
const MANIFEST_LENGTH: Lazy<NestedFieldRef> = {
pub static MANIFEST_LENGTH: Lazy<NestedFieldRef> = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we remove const and use pub static instead? I think const is better here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't evaluate(Lazy) the const in runtime. Seems we only can use static in this case.🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to change modifier to pub(crate)

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for your effort!

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Sep 22, 2023

Any other comments? cc @Fokko

crates/iceberg/src/avro/mod.rs Show resolved Hide resolved
crates/iceberg/src/spec/manifest_list.rs Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, ASF releases do not allow binary files. Should we generate these files or exclude them from the release?

cc @Fokko for comments as you are likely to be our first release manager.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add binary files to the repository, but we should not release them, so if you can exclude them from the artifact, then we're good. In general, we try to avoid adding binary files. In PyIceberg we use FastAvro to generate them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is worth the effort to generate the files. This way we can also generate the whole structure later on (manifest-list, manifest, datafile). Since Iceberg requires absolute paths, it is not easy to generate these files on forehand (looking at the Windows CI 😁)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is worth the effort to generate the files. This way we can also generate the whole structure later on (manifest-list, manifest, datafile). Since Iceberg requires absolute paths, it is not easy to generate these files on forehand (looking at the Windows CI 😁)

+1, We can investigate how to do it in following PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked by #70

@liurenjie1024
Copy link
Contributor

CC @Xuanwo @JanKaul Any comments?

@JanKaul
Copy link
Collaborator

JanKaul commented Oct 2, 2023

LGTM, thank you all for your efforts.

@Fokko Fokko merged commit 2585a2f into apache:main Oct 2, 2023
@Fokko
Copy link
Contributor

Fokko commented Oct 2, 2023

Thanks @ZENOTME for working on this! And @liurenjie102, @Xuanwo and @JanKaul for the review 🙌

@ZENOTME ZENOTME deleted the manifest_list branch October 7, 2023 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants