Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not concatenate identical dictionaries #1219

Merged
merged 2 commits into from
Jan 24, 2022

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #504

Rationale for this change

See ticket

What changes are included in this PR?

Adds ArrayData::ptr_eq and then uses this to elide dictionary concatenation in MutableArrayData

Are there any user-facing changes?

This changes the behaviour of concat, to no longer concatenate the dictionaries if they are the same. Not sure if this counts

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 21, 2022
@jhorstmann
Copy link
Contributor

Looks good to me. I initially thought ptr_eq for dictionary values could get away with only comparing the data buffer, but in theory the same buffer could be combined with different offsets or validity. So really need to compare all buffers and child data as int this PR.

@@ -1155,6 +1155,41 @@ impl ArrayData {
Ok(())
})
}

/// Returns true if this `ArrayData` is equal to `other`, using pointer comparisons
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really nice -- thank you @tustvold


/// Returns true if this `ArrayData` is equal to `other`, using pointer comparisons
/// to determine buffer equality. This is cheaper than `PartialEq::eq` but may
/// return false negatives
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in what case would this return a false negative (to the "are these two pointers the same" question)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may say two things are not equal when logically they are?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the wording

assert!(
array_copy.data().child_data()[0].ptr_eq(&combined.data().child_data()[0])
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add a test of concatenating three dictionaries -- where 2 use the same dictionary and one is a different dictionary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@codecov-commenter
Copy link

Codecov Report

Merging #1219 (c9417ed) into master (fcd37ee) will increase coverage by 0.00%.
The diff coverage is 87.23%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1219   +/-   ##
=======================================
  Coverage   82.70%   82.71%           
=======================================
  Files         175      175           
  Lines       51711    51787   +76     
=======================================
+ Hits        42769    42837   +68     
- Misses       8942     8950    +8     
Impacted Files Coverage Δ
arrow/src/array/transform/mod.rs 85.41% <76.00%> (-0.15%) ⬇️
arrow/src/array/data.rs 82.48% <87.23%> (+0.20%) ⬆️
arrow/src/compute/kernels/concat.rs 100.00% <100.00%> (ø)
parquet/src/encodings/encoding.rs 93.52% <0.00%> (-0.20%) ⬇️
arrow/src/csv/reader.rs 88.12% <0.00%> (+0.01%) ⬆️
arrow/src/datatypes/field.rs 54.10% <0.00%> (+0.30%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fcd37ee...c9417ed. Read the comment docs.


/// Returns true if this `ArrayData` is equal to `other`, using pointer comparisons
/// to determine buffer equality. This is cheaper than `PartialEq::eq` but may
/// return false when the arrays are logically equal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit 90de247 into apache:master Jan 24, 2022
@alamb alamb added enhancement Any new improvement worthy of a entry in the changelog performance and removed enhancement Any new improvement worthy of a entry in the changelog labels Feb 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not copy dictionary values when they are the same in concat
4 participants