Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API, Core: Add content offset and size to DeleteFile #11446

Merged
merged 3 commits into from
Nov 4, 2024

Conversation

aokolnychyi
Copy link
Contributor

@aokolnychyi aokolnychyi commented Nov 2, 2024

This PR adds content offset and size to DeleteFile for DVs.

Scan, commit, and snapshot metrics will be updated in a subsequent PR. This work is part of #11122.

@@ -55,7 +56,7 @@ default ChangelogOperation operation() {

@Override
default long sizeBytes() {
return length() + deletes().stream().mapToLong(ContentFile::fileSizeInBytes).sum();
return length() + ContentFileUtil.contentSizeInBytes(deletes());
Copy link
Contributor Author

@aokolnychyi aokolnychyi Nov 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is in api. I initially moved ContentFileUtil (which holds relevant logic) to api, which required moving more classes. Specifically, I am talking about Partitioning and MetadataColumns. I am thinking about alternatives. I could add a brand new ScanTaskUtil in api, but it is a bit awkward to have both utils.

Any thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a fairly simple method that doesn't need to require extensive refactoring. Why not add a DeleteFileUtil or something with this logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can. We will have ContentFileUtil and DeleteFileUtil. A bit awkward but not a big deal.

}
}

public static long contentSizeInBytes(Iterable<? extends ContentFile<?>> files) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not using streams for performance reasons.

@manuzhang
Copy link
Collaborator

Thanks, Anton. I think it will be valuable to link this and previous PRs to the proposal issue, #11122.

@aokolnychyi
Copy link
Contributor Author

@manuzhang, good call. Linked.

@@ -79,7 +80,7 @@ void updateFlushResult(WriteResult result) {
Arrays.stream(result.deleteFiles())
.forEach(
deleteFile -> {
deleteFilesSizeHistogram.update(deleteFile.fileSizeInBytes());
deleteFilesSizeHistogram.update(ContentFileUtil.contentSizeInBytes(deleteFile));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nastra, I cherry-picked all Flink changes across all versions, unlike the prototype PR you reviewed.

Preconditions.checkArgument(contentOffset != null, "Content offset is required for DV");
Preconditions.checkArgument(contentSizeInBytes != null, "Content size is required for DV");
} else {
Preconditions.checkArgument(contentOffset == null, "Content offset is only for DV");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these descriptive enough? Trying to stay on one line but have meaningful error messages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's quite difficult to stay on a single line while also producing a good/descriptive error msg. I would favor a better error msg, since that is what people later will run into and will read.

What about Content offset must only be set for DVs here? In other places we typically use Invalid xyz: ..., so this error msg could also be Invalid content offset: must be null for DVs and in the msgs further above it could be Invalid content offset: must be non-null for DVs

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one comment around error messaging

@aokolnychyi aokolnychyi merged commit ec269ee into apache:main Nov 4, 2024
50 checks passed
@aokolnychyi
Copy link
Contributor Author

Thanks for reviewing, @nastra @rdblue!

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants