-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DictionaryArray
support in eq_dyn
kernel
#1263
Conversation
/// Perform `left == right` operation on two `DictionaryArray`s. | ||
/// Only when two arrays are of the same type the comparison will happen otherwise it will err | ||
/// with a casting error. | ||
pub fn eq_dict<K, T>(left: &DictionaryArray<K>, right: &DictionaryArray<K>) -> Result<BooleanArray> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't find any eq
implementation for DictionaryArray
. Not sure if this is correct approach. cc @alamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this in the right direction? As there is no existing eq
implementation for DictionaryArray
, I open this as wip work to get some feedback if any. Thanks.
Codecov Report
@@ Coverage Diff @@
## master #1263 +/- ##
========================================
Coverage 83.04% 83.05%
========================================
Files 180 180
Lines 52424 52844 +420
========================================
+ Hits 43537 43891 +354
- Misses 8887 8953 +66
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, @alamb might have some clever ideas, but I suspect without substantial changes to the rest of the file (well beyond the scope of this PR), I'm not sure how to avoid this change from being very verbose...
/// Perform `left == right` operation on two `DictionaryArray`s. | ||
/// Only when two arrays are of the same type the comparison will happen otherwise it will err | ||
/// with a casting error. | ||
pub fn eq_dict<K, T>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely a bonus feature, but you might want to make this generic in the comparison operation, perhaps look at simd_compare_op
for inspiration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Make it generic and rename to cmp_dict
.
K: ArrowNumericType, | ||
T: ArrowNumericType, | ||
{ | ||
assert_eq!(left.keys().len(), right.keys().len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the other implementations this typically returns an error
if $left.len() != $right.len() {
return Err(ArrowError::ComputeError(
"Cannot perform comparison operation on arrays of different length"
.to_string(),
));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to an error.
@@ -2030,12 +2030,112 @@ macro_rules! typed_compares { | |||
}}; | |||
} | |||
|
|||
macro_rules! typed_dict_cmp { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's entirely personal preference, and this file does tend to use macros for this stuff, but personally I find generics easier to interpret, debug and maintain. Just putting the idea out there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
macros are used pretty heavily in this file, so it may be better to follow along with that pattern
} | ||
} | ||
|
||
Ok(BooleanArray::from(result)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is likely to be significantly faster to use MutableBuffer::from_trusted_len_iter_bool(iter)
here, where iter
is the body of the for loop. This not only avoids incrementally growing a Vec
, but also allocating space for Vec<bool>
and then copying it into a Buffer
- instead it will construct the packed bitmask directly
(DataType::Int8, DataType::Int8) => { | ||
let left = as_dictionary_array::<Int8Type>($LEFT); | ||
let right = as_dictionary_array::<Int8Type>($RIGHT); | ||
typed_dict_cmp!(left, right, $OP_PRIM, Int8Type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do wonder if there might be some way to reuse typed_compares
here somehow, with the notion of the source type somehow encoded in OP_PRIM, etc... but it may be more effort than it is worth
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @viirya -- this is looking quite cool.
for idx in 0..left.keys().len() { | ||
unsafe { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could avoid some amount of this unsafe-ness and Vec manipulation with something like this (untested)
let result: BooleanArray = left.keys().iter()
.zip(right.keys.iter())
.map(|(left_key, right_key)| {
let left_key = left_key.to_usize().expect("..");
let right_key = right_key.to_usize().expect("..");
...
left_value == right_value
})
.collect()
Though given this pattern, perhaps we could (in a separate PR) implement an iterator for DictionaryArray, so this code could devolve into
let result: BooleanArray = left.iter()
.zip(right.iter())
.map(|(left_value, right_value)| {
left_value == right_value
})
.collect()
🤔 then I wonder how many of the existing macros would "just work" 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to suggested approach. I can work on an iterator for DictionaryArray in a separate PR later.
@@ -2030,12 +2030,112 @@ macro_rules! typed_compares { | |||
}}; | |||
} | |||
|
|||
macro_rules! typed_dict_cmp { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
macros are used pretty heavily in this file, so it may be better to follow along with that pattern
Thanks for the review comments! I will address them and finish soon. |
let left = as_dictionary_array::<UInt64Type>($LEFT); | ||
let right = as_dictionary_array::<UInt64Type>($RIGHT); | ||
typed_dict_cmp!(left, right, $OP_PRIM, UInt64Type) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems we don't need implement for other types? ArrowDictionaryKeyType
is only implemented for Int8Type...UInt64Type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I believe that is correct -- the only supported key types are Int8Type...UInt64Type
(at least in arrow-rs, I am not sure about the arrow spec in general).
(DataType::UInt64, DataType::UInt64) => { | ||
cmp_dict::<$KT, UInt64Type, _>($LEFT, $RIGHT, $OP) | ||
} | ||
(DataType::Utf8, DataType::Utf8) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is also useful to include Float32
and Float64
-- but we an also do that as a follow on PR if someone is looking for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, I forgot Float32
and Float64
. Going to add them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @viirya this is looking very nice.
let left = as_dictionary_array::<UInt64Type>($LEFT); | ||
let right = as_dictionary_array::<UInt64Type>($RIGHT); | ||
typed_dict_cmp!(left, right, $OP_PRIM, UInt64Type) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I believe that is correct -- the only supported key types are Int8Type...UInt64Type
(at least in arrow-rs, I am not sure about the arrow spec in general).
} | ||
|
||
/// Helper function to perform boolean lambda function on values from two dictionary arrays, this | ||
/// version does not attempt to use SIMD. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// version does not attempt to use SIMD. | |
/// version does not attempt to use SIMD explicitly (though the compiler may auto vectorize) |
@@ -2030,6 +2030,271 @@ macro_rules! typed_compares { | |||
}}; | |||
} | |||
|
|||
macro_rules! typed_dict_cmp { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
macro_rules! typed_dict_cmp { | |
/// Applies $OP to $LEFT and $RIGHT which are two dictionaries which have (the same) key type $KT | |
macro_rules! typed_dict_cmp { |
let left_key = left_k.to_usize().expect("Dictionary index not usize"); | ||
let right_key = | ||
right_k.to_usize().expect("Dictionary index not usize"); | ||
unsafe { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
} | ||
|
||
/// Perform given operation on two `DictionaryArray`s. | ||
/// Only when two arrays are of the same type the comparison will happen otherwise it will err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Only when two arrays are of the same type the comparison will happen otherwise it will err | |
/// Returns an error if the two arrays have different value type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed as suggested
@@ -2045,7 +2310,12 @@ macro_rules! typed_compares { | |||
/// assert_eq!(BooleanArray::from(vec![Some(true), None, Some(false)]), result); | |||
/// ``` | |||
pub fn eq_dyn(left: &dyn Array, right: &dyn Array) -> Result<BooleanArray> { | |||
typed_compares!(left, right, eq_bool, eq, eq_utf8, eq_binary) | |||
match left.data_type() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is wonderful
|
||
#[test] | ||
fn test_eq_dyn_dictionary_i8_array() { | ||
let key_type = DataType::Int8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for these tests. It clearly is quite messy to create DictionaryArrays :(
I have some ideas of how to make it less of a mess and will file a follow on ticket / PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#1300 as a proposed PR to make it better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, I found it is a bit painful to create dictionary array. thanks for the proposal!
@alamb Thanks! Added |
Looks good. I also find generics to be nicer, mostly because of IDE support, but those would be tricky here since there is no generic trait containing the I was wondering, what does the arrow spec say about dictionary values being null? Probably no one uses that combination, but in theory, a valid key could map to a null value. |
https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout says explicitly
|
From the link @alamb provided
i.e. any nulls on the values array are simply ignored, and only the null mask of the indices array (keys in arrow-rs) are considered. |
I agree that this is my interpretation as well. Maybe it is worth a PR to clarify the format docs to be super explicit on this point? |
DictionaryArray
support in eq_dyn
kernel
Which issue does this PR close?
This is part of #1201.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?