Spark: Read DVs when reading from .position_deletes table #11657

nastra · 2024-11-26T14:03:58Z

This is part of #11122 and has been extracted from #11545.

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/PositionDeletesRowReader.java

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterable.java

singhpk234 · 2024-11-28T17:15:14Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterable.java

+
+  @Override
+  public CloseableIterator<InternalRow> iterator() {
+    PuffinReader reader = builder.build();


[optional] This might be too much, but can we have one reader per DV file? For this use case specifically, we will eventually have to read all the blobs in the DV file.

Do you have a particular use case in mind? This isn't something that is currently needed when reading the PositionDeletesTable.

I'd argue we rarely need to read the entire DV file, as not all DVs may still be valid.

I agree, and it makes sense not to go this route. I was mostly thinking of the case where we need to read more than one blob in a Puffin DV file; in that case it might be better to reuse the reader.
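To make the trade-off in this thread concrete, here is a minimal, self-contained sketch that compares how many readers would be opened with one reader per blob versus one reader per Puffin file. `ReaderReuseSketch`, `BlobRef`, and the method names are hypothetical and exist only for illustration; they are not Iceberg APIs, and nothing here reflects how the PR actually builds its `PuffinReader`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these types exist in Iceberg.
final class ReaderReuseSketch {

  /** Hypothetical pointer to a single DV blob inside a Puffin file. */
  static final class BlobRef {
    final String puffinPath;
    final long offset;

    BlobRef(String puffinPath, long offset) {
      this.puffinPath = puffinPath;
      this.offset = offset;
    }
  }

  /** Groups blob references by file so a caller could open one reader per file. */
  static Map<String, List<BlobRef>> groupByFile(List<BlobRef> blobs) {
    Map<String, List<BlobRef>> byFile = new LinkedHashMap<>();
    for (BlobRef blob : blobs) {
      byFile.computeIfAbsent(blob.puffinPath, path -> new ArrayList<>()).add(blob);
    }
    return byFile;
  }

  /** Number of readers opened when every blob gets its own reader. */
  static int readersPerBlob(List<BlobRef> blobs) {
    return blobs.size();
  }

  /** Number of readers opened when blobs in the same file share a reader. */
  static int readersPerFile(List<BlobRef> blobs) {
    return groupByFile(blobs).size();
  }
}
```

Sharing a reader only pays off when several blobs from the same file are actually read. As noted above, not all DVs in a file may still be valid, so reading a single blob per task can be the common case.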

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterable.java

aokolnychyi · 2024-12-05T23:49:22Z

spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestPositionDeletesReader.java

+import org.junit.jupiter.api.io.TempDir;
+
+@ExtendWith(ParameterizedTestExtension.class)
+public class TestPositionDeletesReader extends TestBase {


Can we also test reading DVs end-to-end by querying the position_deletes metadata table in Spark? I think we can commit DVs from Java and then read them from Spark, as we don't have DVs in Spark right now.

They are being tested end-to-end in #11545. I really just wanted to extract these pieces here to make reviewing easier. Alternatively, I could close this PR and keep it as part of #11545, where all of this is fully tested end-to-end through Spark.

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/PositionDeletesRowReader.java

aokolnychyi · 2024-12-05T23:59:24Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterable.java

+
+  @Override
+  public CloseableIterator<InternalRow> iterator() {
+    PuffinReader reader = builder.build();


I'd argue we rarely need to read the entire DV file, as not all DVs may still be valid.

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/PositionDeletesRowReader.java

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterable.java

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterator.java

aokolnychyi · 2024-12-13T19:36:00Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterator.java

+      }
+
+      this.row = new GenericInternalRow(rowValues.toArray());
+    } else if (null != deletedPositionIndex) {


Question: do we actually need null != deletedPositionIndex? I think either it is the first invocation and we need to initialize the row, or we need to update the position. Shouldn't this fail if the index is still null, to indicate that something went wrong?

Having deletedPositionIndex be null is still a valid case IMO. This happens when a user doesn't project the pos column.
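The point about the null check can be illustrated with a simplified, self-contained sketch of a row-reusing iterator. It is not the `DVIterator` from this PR: `DvRowSketch`, the plain `Object[]` row, and the string column names are assumptions made for illustration, standing in for Spark's `InternalRow` and Iceberg's projection schema.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for a DV row iterator; hypothetical, not the Iceberg implementation.
final class DvRowSketch implements Iterator<Object[]> {
  private final Iterator<Long> positions;
  private final List<String> projectedColumns;
  private final String dataFilePath;
  private Object[] row;                 // built once, then reused by every call to next()
  private Integer deletedPositionIndex; // stays null when "pos" is not projected

  DvRowSketch(List<Long> positions, List<String> projectedColumns, String dataFilePath) {
    this.positions = positions.iterator();
    this.projectedColumns = projectedColumns;
    this.dataFilePath = dataFilePath;
  }

  @Override
  public boolean hasNext() {
    return positions.hasNext();
  }

  @Override
  public Object[] next() {
    long position = positions.next();
    if (row == null) {
      // first invocation: build the row and remember where the position lives, if projected
      List<Object> values = new ArrayList<>();
      for (String column : projectedColumns) {
        if ("file_path".equals(column)) {
          values.add(dataFilePath);
        } else if ("pos".equals(column)) {
          values.add(position);
          deletedPositionIndex = values.size() - 1;
        } else {
          values.add(null); // other columns are left empty in this sketch
        }
      }
      row = values.toArray();
    } else if (deletedPositionIndex != null) {
      // later invocations: only the position changes, and only if it was projected
      row[deletedPositionIndex] = position;
    }
    return row;
  }
}
```

With a projection of only `file_path`, the index is never set, so later calls return the same row unchanged. Throwing on a null index would reject that valid projection.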

github-actions bot added the spark label Nov 26, 2024

nastra requested review from aokolnychyi and amogh-jahagirdar November 26, 2024 14:09

nastra commented Nov 26, 2024

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/PositionDeletesRowReader.java (outdated)

aokolnychyi reviewed Nov 27, 2024

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterable.java (outdated)

aokolnychyi reviewed Nov 27, 2024

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterable.java (outdated)

nastra force-pushed the dv-iterable-for-position-deletes branch 2 times, most recently from b347427 to cd35ea5 Compare November 27, 2024 13:29

nastra requested a review from aokolnychyi November 27, 2024 13:32

nastra force-pushed the dv-iterable-for-position-deletes branch 3 times, most recently from b79a7da to 2512b5f Compare November 27, 2024 15:18

singhpk234 reviewed Nov 28, 2024

nastra force-pushed the dv-iterable-for-position-deletes branch from 9a47998 to 3e1bafe Compare November 29, 2024 15:20

nastra commented Dec 2, 2024

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterable.java (outdated)

aokolnychyi reviewed Dec 6, 2024

nastra force-pushed the dv-iterable-for-position-deletes branch 3 times, most recently from 710febe to de6e0da Compare December 6, 2024 17:29

nastra closed this Dec 6, 2024

nastra reopened this Dec 6, 2024

aokolnychyi reviewed Dec 7, 2024

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterator.java (outdated)

aokolnychyi reviewed Dec 7, 2024

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterator.java (outdated)

nastra force-pushed the dv-iterable-for-position-deletes branch 3 times, most recently from 9774d18 to d633591 Compare December 9, 2024 13:32

aokolnychyi approved these changes Dec 13, 2024

nastra added 3 commits December 16, 2024 07:40

Spark: Read DVs when reading from .position_deletes table (504f2ba)

improvements (02d5cdf)

review feedback (7a69d55)

review feedback (25eb30a)

nastra force-pushed the dv-iterable-for-position-deletes branch from d633591 to 25eb30a Compare December 16, 2024 06:43

nastra merged commit 2a5b089 into apache:main Dec 16, 2024
31 checks passed

nastra deleted the dv-iterable-for-position-deletes branch December 16, 2024 07:50
