Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DictionaryArray support in eq_dyn kernel #1263

Merged
merged 11 commits into from
Feb 13, 2022
Merged

Conversation

viirya
Copy link
Member

@viirya viirya commented Feb 2, 2022

Which issue does this PR close?

This is part of #1201.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@viirya viirya marked this pull request as draft February 2, 2022 18:14
@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 2, 2022
/// Perform `left == right` operation on two `DictionaryArray`s.
/// Only when two arrays are of the same type the comparison will happen otherwise it will err
/// with a casting error.
pub fn eq_dict<K, T>(left: &DictionaryArray<K>, right: &DictionaryArray<K>) -> Result<BooleanArray>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find any eq implementation for DictionaryArray. Not sure if this is correct approach. cc @alamb

Copy link
Member Author

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this in the right direction? As there is no existing eq implementation for DictionaryArray, I open this as wip work to get some feedback if any. Thanks.

@codecov-commenter
Copy link

codecov-commenter commented Feb 2, 2022

Codecov Report

Merging #1263 (d5ffe59) into master (35e16be) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master    #1263    +/-   ##
========================================
  Coverage   83.04%   83.05%            
========================================
  Files         180      180            
  Lines       52424    52844   +420     
========================================
+ Hits        43537    43891   +354     
- Misses       8887     8953    +66     
Impacted Files Coverage Δ
arrow/src/compute/kernels/comparison.rs 92.29% <100.00%> (+0.58%) ⬆️
arrow/src/compute/kernels/filter.rs 84.77% <0.00%> (-7.73%) ⬇️
arrow/src/util/bench_util.rs 95.23% <0.00%> (ø)
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (ø)
arrow/src/array/transform/mod.rs 84.64% <0.00%> (+0.12%) ⬆️
arrow/src/array/data.rs 83.30% <0.00%> (+0.17%) ⬆️
arrow/src/util/bit_chunk_iterator.rs 93.83% <0.00%> (+0.35%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 35e16be...d5ffe59. Read the comment docs.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, @alamb might have some clever ideas, but I suspect without substantial changes to the rest of the file (well beyond the scope of this PR), I'm not sure how to avoid this change from being very verbose...

/// Perform `left == right` operation on two `DictionaryArray`s.
/// Only when two arrays are of the same type the comparison will happen otherwise it will err
/// with a casting error.
pub fn eq_dict<K, T>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely a bonus feature, but you might want to make this generic in the comparison operation, perhaps look at simd_compare_op for inspiration

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Make it generic and rename to cmp_dict.

K: ArrowNumericType,
T: ArrowNumericType,
{
assert_eq!(left.keys().len(), right.keys().len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the other implementations this typically returns an error

if $left.len() != $right.len() {
            return Err(ArrowError::ComputeError(
                "Cannot perform comparison operation on arrays of different length"
                    .to_string(),
            ));
        }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to an error.

@@ -2030,12 +2030,112 @@ macro_rules! typed_compares {
}};
}

macro_rules! typed_dict_cmp {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's entirely personal preference, and this file does tend to use macros for this stuff, but personally I find generics easier to interpret, debug and maintain. Just putting the idea out there

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

macros are used pretty heavily in this file, so it may be better to follow along with that pattern

}
}

Ok(BooleanArray::from(result))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is likely to be significantly faster to use MutableBuffer::from_trusted_len_iter_bool(iter) here, where iter is the body of the for loop. This not only avoids incrementally growing a Vec, but also allocating space for Vec<bool> and then copying it into a Buffer - instead it will construct the packed bitmask directly

(DataType::Int8, DataType::Int8) => {
let left = as_dictionary_array::<Int8Type>($LEFT);
let right = as_dictionary_array::<Int8Type>($RIGHT);
typed_dict_cmp!(left, right, $OP_PRIM, Int8Type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder if there might be some way to reuse typed_compares here somehow, with the notion of the source type somehow encoded in OP_PRIM, etc... but it may be more effort than it is worth

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viirya -- this is looking quite cool.

Comment on lines 2106 to 2107
for idx in 0..left.keys().len() {
unsafe {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could avoid some amount of this unsafe-ness and Vec manipulation with something like this (untested)

let result: BooleanArray = left.keys().iter()
  .zip(right.keys.iter())
  .map(|(left_key, right_key)| {
    let left_key = left_key.to_usize().expect("..");
    let right_key = right_key.to_usize().expect("..");
    ...
    left_value == right_value
})
.collect()

Though given this pattern, perhaps we could (in a separate PR) implement an iterator for DictionaryArray, so this code could devolve into

let result: BooleanArray = left.iter()
  .zip(right.iter())
  .map(|(left_value, right_value)| {
    left_value == right_value
})
.collect()

🤔 then I wonder how many of the existing macros would "just work" 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to suggested approach. I can work on an iterator for DictionaryArray in a separate PR later.

@@ -2030,12 +2030,112 @@ macro_rules! typed_compares {
}};
}

macro_rules! typed_dict_cmp {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

macros are used pretty heavily in this file, so it may be better to follow along with that pattern

@viirya
Copy link
Member Author

viirya commented Feb 4, 2022

Thanks for the review comments! I will address them and finish soon.

let left = as_dictionary_array::<UInt64Type>($LEFT);
let right = as_dictionary_array::<UInt64Type>($RIGHT);
typed_dict_cmp!(left, right, $OP_PRIM, UInt64Type)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we don't need implement for other types? ArrowDictionaryKeyType is only implemented for Int8Type...UInt64Type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe that is correct -- the only supported key types are Int8Type...UInt64Type (at least in arrow-rs, I am not sure about the arrow spec in general).

@viirya viirya marked this pull request as ready for review February 9, 2022 01:40
@viirya viirya changed the title wip. Implement DictionaryArray support in eq_dyn Implement DictionaryArray support in eq_dyn Feb 9, 2022
@viirya
Copy link
Member Author

viirya commented Feb 9, 2022

@alamb @tustvold This is ready for review. Please take a look again. Thanks.

(DataType::UInt64, DataType::UInt64) => {
cmp_dict::<$KT, UInt64Type, _>($LEFT, $RIGHT, $OP)
}
(DataType::Utf8, DataType::Utf8) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it is also useful to include Float32 and Float64 -- but we an also do that as a follow on PR if someone is looking for it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, I forgot Float32 and Float64. Going to add them.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viirya this is looking very nice.

cc @tustvold @matthewmturner

let left = as_dictionary_array::<UInt64Type>($LEFT);
let right = as_dictionary_array::<UInt64Type>($RIGHT);
typed_dict_cmp!(left, right, $OP_PRIM, UInt64Type)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe that is correct -- the only supported key types are Int8Type...UInt64Type (at least in arrow-rs, I am not sure about the arrow spec in general).

}

/// Helper function to perform boolean lambda function on values from two dictionary arrays, this
/// version does not attempt to use SIMD.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// version does not attempt to use SIMD.
/// version does not attempt to use SIMD explicitly (though the compiler may auto vectorize)

@@ -2030,6 +2030,271 @@ macro_rules! typed_compares {
}};
}

macro_rules! typed_dict_cmp {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
macro_rules! typed_dict_cmp {
/// Applies $OP to $LEFT and $RIGHT which are two dictionaries which have (the same) key type $KT
macro_rules! typed_dict_cmp {

let left_key = left_k.to_usize().expect("Dictionary index not usize");
let right_key =
right_k.to_usize().expect("Dictionary index not usize");
unsafe {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

/// Perform given operation on two `DictionaryArray`s.
/// Only when two arrays are of the same type the comparison will happen otherwise it will err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Only when two arrays are of the same type the comparison will happen otherwise it will err
/// Returns an error if the two arrays have different value type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed as suggested

@@ -2045,7 +2310,12 @@ macro_rules! typed_compares {
/// assert_eq!(BooleanArray::from(vec![Some(true), None, Some(false)]), result);
/// ```
pub fn eq_dyn(left: &dyn Array, right: &dyn Array) -> Result<BooleanArray> {
typed_compares!(left, right, eq_bool, eq, eq_utf8, eq_binary)
match left.data_type() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is wonderful


#[test]
fn test_eq_dyn_dictionary_i8_array() {
let key_type = DataType::Int8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these tests. It clearly is quite messy to create DictionaryArrays :(

I have some ideas of how to make it less of a mess and will file a follow on ticket / PRs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1300 as a proposed PR to make it better

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, I found it is a bit painful to create dictionary array. thanks for the proposal!

@viirya
Copy link
Member Author

viirya commented Feb 10, 2022

@alamb Thanks! Added Float32 and Float64 and also changed the comments as suggested.

@jhorstmann
Copy link
Contributor

Looks good. I also find generics to be nicer, mostly because of IDE support, but those would be tricky here since there is no generic trait containing the value/value_unchecked methods.

I was wondering, what does the arrow spec say about dictionary values being null? Probably no one uses that combination, but in theory, a valid key could map to a null value.

@alamb
Copy link
Contributor

alamb commented Feb 13, 2022

I was wondering, what does the arrow spec say about dictionary values being null? Probably no one uses that combination, but in theory, a valid key could map to a null value.

https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout

says explicitly

Note that a dictionary is permitted to contain duplicate values or nulls:

@alamb alamb merged commit 6ee1fda into apache:master Feb 13, 2022
@tustvold
Copy link
Contributor

tustvold commented Feb 13, 2022

From the link @alamb provided

The null count of such arrays is dictated only by the validity bitmap of its indices, irrespective of any null values in the dictionary.

i.e. any nulls on the values array are simply ignored, and only the null mask of the indices array (keys in arrow-rs) are considered.

@alamb
Copy link
Contributor

alamb commented Feb 13, 2022

i.e. any nulls on the values array are simply ignored, and only the null mask of the indices array (keys in arrow-rs) are considered.

I agree that this is my interpretation as well. Maybe it is worth a PR to clarify the format docs to be super explicit on this point?

@alamb alamb changed the title Implement DictionaryArray support in eq_dyn Add DictionaryArray support in eq_dyn kernel Feb 16, 2022
@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Feb 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants