-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core, Spark: Exclude non live content file in RewriteTablePathUtil #12006
Conversation
Yes, thanks for fixing the issue (found by our internal usage). I wonder, because the deleted entry may be important for CDC (to mark that this file at some point existed), is another possibility to skip the deleted file for copy? cc @flyrain for thoughts as well |
Sounds good, I updated the PR to keep the deleted entry but exclude deleted content files from copy plan |
...k/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java
Outdated
Show resolved
Hide resolved
...k/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteTablePathSparkAction.java
Outdated
Show resolved
Hide resolved
CDC is fine as it won't work anyway if the data file has been removed. +1 to keep the entry in the manifest file as this tool isn't supposed to change it. |
3c2d4c5
to
ebf5409
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dramaticlly this looks better, a few nits. Sorry for delay
core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java
Outdated
Show resolved
Hide resolved
...k/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java
Outdated
Show resolved
Hide resolved
...k/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java
Outdated
Show resolved
Hide resolved
...k/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java
Outdated
Show resolved
Hide resolved
Merged, thanks @dramaticlly ! |
Thank you @szehon-ho for reviewing my change! |
Instead of scanning all entries in data/manifest for identifying list of content files to copy, scan only the live one. This is essential to prevent rewrite table path to carry the files already expired as part of snapshot expiration in the source table.
Existing logic fetch both added/existing/deleted entry from manifest to collect list of content files to be copied and rely on reducer for deduplicate based on file name.
However we want to avoid the scenario where the given content file with only deleted status in older manifest, as snapshot expiration might already removed the snapshot which reference the given content file, and deleted as part of snapshot expiration.
With some concrete examlpe to help with explanation,
the expiration of first snapshot
8729031490038117099
, will remove d2.parquet on disk,second snapshot
6024975807438659167
might still have data manifest entry of deleted (status=2) for d2.parquet.However it's not desired to include d2.parquet as part of files for path rewrite.
CC @szehon-ho