Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark: Read DVs when reading from .position_deletes table #11657

Merged
merged 4 commits into from
Dec 16, 2024

Conversation

nastra
Copy link
Contributor

@nastra nastra commented Nov 26, 2024

this is part of #11122 and has been extracted from #11545

@nastra nastra force-pushed the dv-iterable-for-position-deletes branch 2 times, most recently from b347427 to cd35ea5 Compare November 27, 2024 13:29
@nastra nastra requested a review from aokolnychyi November 27, 2024 13:32
@nastra nastra force-pushed the dv-iterable-for-position-deletes branch 3 times, most recently from b79a7da to 2512b5f Compare November 27, 2024 15:18

@Override
public CloseableIterator<InternalRow> iterator() {
PuffinReader reader = builder.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[optional] might be too much, but can we have one reader per DV file ? considering specifically for this use case we will have to read all the blobs in the DV file eventually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have a particular use case in mind as this isn't something that is being needed currently when reading the PositionDeletesTable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue we rarely need to read the entire DV file as not all DVs may be still valid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, and makes sense to not go this route, was mostly coming from the case that we need to read more than 1 blob in a puffin DV file in that case it might be better to reuse the reader.

@nastra nastra force-pushed the dv-iterable-for-position-deletes branch from 9a47998 to 3e1bafe Compare November 29, 2024 15:20
import org.junit.jupiter.api.io.TempDir;

@ExtendWith(ParameterizedTestExtension.class)
public class TestPositionDeletesReader extends TestBase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also test reading DVs end-to-end by quering the position_deletes metadata table in Spark? I think we can commit DVs from Java and then read it from Spark, as we don't have DVs in Spark right now?

Copy link
Contributor Author

@nastra nastra Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are being tested end-to-end in #11545. I really just wanted to extract these pieces here to make reviewing easier. Alternatively, I could close this PR and we'll just have it as part of #11545 where all of this is fully tested end-to-end through Spark


@Override
public CloseableIterator<InternalRow> iterator() {
PuffinReader reader = builder.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue we rarely need to read the entire DV file as not all DVs may be still valid.

@nastra nastra force-pushed the dv-iterable-for-position-deletes branch 3 times, most recently from 710febe to de6e0da Compare December 6, 2024 17:29
@nastra nastra closed this Dec 6, 2024
@nastra nastra reopened this Dec 6, 2024
@nastra nastra force-pushed the dv-iterable-for-position-deletes branch 3 times, most recently from 9774d18 to d633591 Compare December 9, 2024 13:32
}

this.row = new GenericInternalRow(rowValues.toArray());
} else if (null != deletedPositionIndex) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Do we actually need null != deletedPositionIndex? I think it is the first invocation and we need to initialize the row or we need to update the position. Shouldn't this fail if the index is still null to indicate something went wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a case where deletedPositionIndex is null is still a valid case IMO. This would be true if a user doesn't project the pos column

@nastra nastra force-pushed the dv-iterable-for-position-deletes branch from d633591 to 25eb30a Compare December 16, 2024 06:43
@nastra nastra merged commit 2a5b089 into apache:main Dec 16, 2024
31 checks passed
@nastra nastra deleted the dv-iterable-for-position-deletes branch December 16, 2024 07:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants