-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support List for Array aggregate order and distinct #9234
Conversation
020f62b
to
4ee8a08
Compare
Some(9), | ||
])]), | ||
)); | ||
// Duplicate l1 and l3 in the input array and check that it is deduped in the output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems more correct to me. The previous flattened result is incorrect.
datafusion/common/src/scalar.rs
Outdated
/// | ||
/// assert_eq!(scalar_vec, expected); | ||
/// ``` | ||
pub fn convert_first_level_array_to_scalar_vec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may not need the current version of convert_array_to_scalar_vec
# Apply array_sort to have determinisitic result, higher dimension nested array also works but not for array sort, | ||
# so they are covered in `datafusion/physical-expr/src/aggregate/array_agg_distinct.rs` | ||
query ?? | ||
select array_sort(c1), array_sort(c2) from ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than sorting, if there is some setting or random seed that comes out the deterministic result, it would be great
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we would need to use a structure for deduplicating that preserved input order
datafusion/common/src/scalar.rs
Outdated
vec![scalar] | ||
} | ||
}; | ||
let nested_array = array.as_list::<i32>().value(index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main fix is to avoid recursively flattening, just convert each row directly.
085df61
to
af16049
Compare
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
af16049
to
bf815b7
Compare
This is ready for review. Only rebase if needed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jayzhan211 -- this looks like a good improvement to me.
I had some suggestions on how to improve it even more, but I don't think they are required
/// ], | ||
/// ]; | ||
/// | ||
/// assert_eq!(scalar_vec, expected); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you fro these comments -- I suggest we improve the docs even more by:
- Split the examples up (in this case add a second
# Example of using array_into_list_array
heading and a second doc example - Add some comments explaining what each line is doing. For you it is likely obvious, but for a causal reader,. understanding what
array_into_list_array
is doing or why this is different from the previous example is likely not as easy
let array = &states[0]; | ||
|
||
// Unwrap outer ListArray then do update batch | ||
let inner_array = array.as_list::<i32>().value(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we know there is only a single row in the states array? Can we please add an assert here to make sure we don't accidentally lose data?
Something like assert_eq!(array.len(), 1)
?
panic!("Expected ScalarValue::List, got {:?}", arr) | ||
} | ||
}; | ||
// arrow::compute::sort can't sort nested ListArray directly, so we compare the scalar values pair-wise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about filing a ticket upstream in arrow-rs to properly support sorting nested lists?
It seems like it would be valuable eventually and could likely be done much more efficiently than using ScalarValue
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, let me file the ticket. But, I think it is not so helpful for this test (values are little). Maybe support it if there is somewhere that needs sorting many nested lists.
Actually, we can just support array_sort
for nested array, this also need upstream fix
# Apply array_sort to have determinisitic result, higher dimension nested array also works but not for array sort, | ||
# so they are covered in `datafusion/physical-expr/src/aggregate/array_agg_distinct.rs` | ||
query ?? | ||
select array_sort(c1), array_sort(c2) from ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we would need to use a structure for deduplicating that preserved input order
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jayzhan211 🙏
🎉 |
)" This reverts commit fc84a63.
)" This reverts commit fc84a63.
Which issue does this PR close?
Closes #8512.
There are value expressions and ordering expressions in array aggregate order, this PR supports List type for value expression.
List type for Ordering expressions is not included.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
I came out with another test example with the same error, I didn't directly test the example given from the issue.
Are there any user-facing changes?