-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementation string concat #1720
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1720 +/- ##
==========================================
+ Coverage 83.32% 83.34% +0.02%
==========================================
Files 196 197 +1
Lines 55961 56031 +70
==========================================
+ Hits 46627 46698 +71
+ Misses 9334 9333 -1
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good start, thank you, I've left a few comments. Let me know if anything is unclear
arrow/src/compute/kernels/concat.rs
Outdated
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> { | |||
Ok(make_array(mutable.freeze())) | |||
} | |||
|
|||
// Elementwise concatenation of StringArrays | |||
pub fn string_concat<Offset: OffsetSizeTrait>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this should be in a new string module, as opposed to concat, as I could see it causing confusion that this performs element-wise concatenation, when the concat kernel above concatenates arrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps @alamb has some thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @tustvold to create a new compute kernel. Also, list, binary, fixed size list and fix size binary can all be supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also agree that a new string (or utf8
module) would be good here.
There are a bunch of other potential candidates in datafusion to move into arrow-rs (kudos to @ovr for implementing many of them initially): https://github.com/apache/arrow-datafusion/blob/master/datafusion/physical-expr/src/string_expressions.rs
arrow/src/compute/kernels/concat.rs
Outdated
None | ||
} | ||
}) | ||
.collect::<GenericStringArray<Offset>>()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be significantly faster to do something like (not at all tested)
// TODO: Handle non-zero offset in source ArrayData
if left.len() != right.len() {
return Err(...)
}
let nulls = match (left.data().null_bitmap(), right.data.null_bitmap()) {
(Some(left), Some(right)) = Some((left & right)?)
(Some(left), None) => Some(left),
(None, Some(right)) => Some(right),
(None, None) => None,
};
let left_offsets = left.value_offsets();
let right_offsets = right.value_offsets();
let left_values = left.value_data().as_slice();
let right_values = right.value_data().as_slice();
let left_iter = left_offsets.windows(2);
let right_iter = right_offsets.windows(2);
let mut output_offsets = BufferBuilder::<Offset>::new(left_offsets.len());
let mut output_values = BufferBuilder::<u8>::new(left_data.len() + right_data.len());
output_offsets.append(0);
for (left, right) in left_iter.zip(right_iter) {
output_values.append_slice(&left_values[left[0]..left[1]]);
output_values.append_slice(&right_values[right[0]..right[1]]);)
output_offsets.append(output_values.len()); // With checked cast
}
let mut builder = ArrayDataBuilder::new(Offset::get_data_type)
.len(left.len())
.add_buffer(output_offsets.finish())
.add_buffer(output_values.finish());
if let Some(nulls) = nulls {
builder = builder.null_bit_buffer(nulls);
}
// SAFETY - offsets valid by construction
unsafe {builder.build_unchecked()}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could file a follow-on issue to add benchmarks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple thoughts, because I spent some time digging into this:
- @alamb's example in Implementation string concat #1720 suggests that "valid_str" +
None
=>None
, where both implementations in this PR go for "valid_str" +None
=> "valid_str". As a user, I'd prefer the later, but I thought I'd make a note of it. If the former is chosen, thencompute::util
hascombine_option_bitmap()
which works nicely for the null handling - How do we handle cases when concatenating two arrays results in the offsets overflowing? There should probably be a test case for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"valid_str" + None => None
This is the standard SQL semantic and it is how most other arrow kernels I know of behave, thus I think it would be best for consistency
@tustvold Thank you for the review and guidance! I'll add more test cases and support for non-zero offset later. |
arrow/src/compute/kernels/concat.rs
Outdated
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> { | |||
Ok(make_array(mutable.freeze())) | |||
} | |||
|
|||
// Elementwise concatenation of StringArrays | |||
pub fn string_concat<Offset: OffsetSizeTrait>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @tustvold to create a new compute kernel. Also, list, binary, fixed size list and fix size binary can all be supported.
arrow/src/compute/kernels/concat.rs
Outdated
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> { | |||
Ok(make_array(mutable.freeze())) | |||
} | |||
|
|||
// Elementwise concatenation of StringArrays | |||
pub fn string_concat<Offset: OffsetSizeTrait>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should support concatenate arbitrary number of arrays. We could file a follow-on issue to track it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HaoYang670 would you like to file the follow on issues, or shall I?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will.
arrow/src/compute/kernels/concat.rs
Outdated
None | ||
} | ||
}) | ||
.collect::<GenericStringArray<Offset>>()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could file a follow-on issue to add benchmarks.
@Ismail-Maj Really a good start! Thank you for contributing. |
Looking good 😄 I think the major thing this now needs to handle is sliced inputs, and tests. |
Only tests left to do (and move the code from |
I personally suggest completing this PR and getting it merged first, and then add other DataTypes as follow on tickets (you might find others interested in working on them as well, if you don't have time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, assuming CI passes, nice work 👍
Will leave open for a bit longer in case anyone else wants to review, otherwise I'll likely merge this afternoon (BST).
@Ismail-Maj could you possibly merge the latest master to bring in #1743 |
@tustvold it should be good now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really nice to me -- thank you @Ismail-Maj ❤️
) | ||
.unwrap(); | ||
|
||
let expected = [None, Some("foofar"), Some("barfaz")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Which issue does this PR close?
Closes #1540 .
Rationale for this change
What changes are included in this PR?
Add an implementation for string concat and a single test.
Are there any user-facing changes?