Only increment metrics for data pages #4285

Merged (1 commit) on May 26, 2023

Conversation

tustvold
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

Split out from #4280

Previously, SerializedPageWriter contained logic to count only data pages in the num_values of the reported PageWriteSpec. This logic was not replicated in TestPageWriter (used solely for testing), which led to a behaviour mismatch between the two writers. Lifting the logic into GenericColumnWriter eliminates this inconsistency, but requires some test changes.
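
As a rough illustration of the idea in the rationale above, here is a minimal sketch of counting num_values for data pages only at the column-writer level, so that every page writer implementation is treated the same way. The types below (PageType, PageWriteSpec, ColumnMetrics) and the functions update_for_page and example_metrics are simplified, hypothetical stand-ins written for this sketch; they are not the parquet crate's actual definitions or the code in this PR.

```rust
// Hypothetical, simplified types for illustration only.
// They do not mirror the real definitions in the parquet crate.

#[derive(Clone, Copy, Debug, PartialEq)]
enum PageType {
    DataPage,
    DictionaryPage,
}

/// What a page writer reports back after writing a single page.
/// In this sketch the page writer reports raw numbers and applies
/// no page-type filtering of its own.
#[derive(Debug)]
struct PageWriteSpec {
    page_type: PageType,
    num_values: u64,
    bytes_written: u64,
}

/// Running totals kept by the (hypothetical) column writer.
#[derive(Debug, Default)]
struct ColumnMetrics {
    total_num_values: u64,
    total_bytes_written: u64,
}

impl ColumnMetrics {
    /// Fold one page's spec into the column totals.
    /// Bytes are counted for every page; values are counted
    /// for data pages only.
    fn update_for_page(&mut self, spec: &PageWriteSpec) {
        self.total_bytes_written += spec.bytes_written;
        if spec.page_type == PageType::DataPage {
            self.total_num_values += spec.num_values;
        }
    }
}

/// Example: one dictionary page followed by two data pages.
fn example_metrics() -> ColumnMetrics {
    let pages = [
        PageWriteSpec { page_type: PageType::DictionaryPage, num_values: 10, bytes_written: 40 },
        PageWriteSpec { page_type: PageType::DataPage, num_values: 100, bytes_written: 400 },
        PageWriteSpec { page_type: PageType::DataPage, num_values: 50, bytes_written: 200 },
    ];

    let mut metrics = ColumnMetrics::default();
    for spec in &pages {
        metrics.update_for_page(spec);
    }
    metrics
}
```

Because the filtering lives in one place, a production writer and a test-only writer that both return unfiltered specs would yield the same totals, which is the kind of mismatch this change is meant to remove.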

What changes are included in this PR?

Are there any user-facing changes?

@tustvold tustvold added the development-process Related to development process of arrow-rs label May 25, 2023
@github-actions github-actions bot added the parquet Changes to the parquet crate label May 25, 2023
Contributor

@alamb alamb left a comment

This looks good to me (and I spent time tracing through the code), but I am not an expert in this area.

I wonder if @sunchao has any time to take a look at this change, to see if it looks ok from his perspective.

@@ -765,10 +764,7 @@ impl<'a, W: Write> PageWriter for SerializedPageWriter<'a, W> {
spec.compressed_size = compressed_size + header_size;
spec.offset = start_pos;
spec.bytes_written = self.sink.bytes_written() as u64 - start_pos;
// Number of values is incremented for data pages only

So what I am seeing is that this was basically a workaround for the fact that num_values was not updated for data pages in update_metrics_for_page, which has now been fixed. If so, that makes sense to me.

@alamb
Contributor

alamb commented May 26, 2023

Thank you for pulling this out @tustvold -- I think it will make understanding #4280 much easier

@tustvold tustvold merged commit 6959b4b into apache:master May 26, 2023
alamb pushed a commit to alamb/arrow-rs that referenced this pull request May 30, 2023