Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support List for Array aggregate order and distinct #9234

Merged
merged 10 commits into from
Feb 20, 2024

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Feb 15, 2024

Which issue does this PR close?

Closes #8512.

There are value expressions and ordering expressions in array aggregate order, this PR supports List type for value expression.
List type for Ordering expressions is not included.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

I came out with another test example with the same error, I didn't directly test the example given from the issue.

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Feb 15, 2024
@jayzhan211 jayzhan211 marked this pull request as ready for review February 15, 2024 04:01
@jayzhan211 jayzhan211 marked this pull request as draft February 15, 2024 05:41
@jayzhan211 jayzhan211 changed the title Support List for Array aggregate order Support List for Array aggregate order and distinct Feb 15, 2024
@jayzhan211 jayzhan211 changed the title Support List for Array aggregate order and distinct Support List for Array aggregate order Feb 15, 2024
Some(9),
])]),
));
// Duplicate l1 and l3 in the input array and check that it is deduped in the output.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems more correct to me. The previous flattened result is incorrect.

///
/// assert_eq!(scalar_vec, expected);
/// ```
pub fn convert_first_level_array_to_scalar_vec(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may not need the current version of convert_array_to_scalar_vec

# Apply array_sort to have determinisitic result, higher dimension nested array also works but not for array sort,
# so they are covered in `datafusion/physical-expr/src/aggregate/array_agg_distinct.rs`
query ??
select array_sort(c1), array_sort(c2) from (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than sorting, if there is some setting or random seed that comes out the deterministic result, it would be great

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we would need to use a structure for deduplicating that preserved input order

@jayzhan211 jayzhan211 changed the title Support List for Array aggregate order Support List for Array aggregate order and distinct Feb 15, 2024
@github-actions github-actions bot added the core Core DataFusion crate label Feb 15, 2024
vec![scalar]
}
};
let nested_array = array.as_list::<i32>().value(index);
Copy link
Contributor Author

@jayzhan211 jayzhan211 Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main fix is to avoid recursively flattening, just convert each row directly.

@jayzhan211 jayzhan211 marked this pull request as ready for review February 15, 2024 09:49
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@jayzhan211
Copy link
Contributor Author

This is ready for review. Only rebase if needed

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 -- this looks like a good improvement to me.

I had some suggestions on how to improve it even more, but I don't think they are required

/// ],
/// ];
///
/// assert_eq!(scalar_vec, expected);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you fro these comments -- I suggest we improve the docs even more by:

  1. Split the examples up (in this case add a second # Example of using array_into_list_array heading and a second doc example
  2. Add some comments explaining what each line is doing. For you it is likely obvious, but for a causal reader,. understanding what array_into_list_array is doing or why this is different from the previous example is likely not as easy

let array = &states[0];

// Unwrap outer ListArray then do update batch
let inner_array = array.as_list::<i32>().value(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we know there is only a single row in the states array? Can we please add an assert here to make sure we don't accidentally lose data?

Something like assert_eq!(array.len(), 1)?

panic!("Expected ScalarValue::List, got {:?}", arr)
}
};
// arrow::compute::sort can't sort nested ListArray directly, so we compare the scalar values pair-wise.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about filing a ticket upstream in arrow-rs to properly support sorting nested lists?

It seems like it would be valuable eventually and could likely be done much more efficiently than using ScalarValue

This comment was marked as outdated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, let me file the ticket. But, I think it is not so helpful for this test (values are little). Maybe support it if there is somewhere that needs sorting many nested lists.

Actually, we can just support array_sort for nested array, this also need upstream fix

# Apply array_sort to have determinisitic result, higher dimension nested array also works but not for array sort,
# so they are covered in `datafusion/physical-expr/src/aggregate/array_agg_distinct.rs`
query ??
select array_sort(c1), array_sort(c2) from (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we would need to use a structure for deduplicating that preserved input order

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@jayzhan211 jayzhan211 requested a review from alamb February 17, 2024 00:11
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 🙏

@jayzhan211 jayzhan211 merged commit fc84a63 into apache:main Feb 20, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 21, 2024

🎉

erratic-pattern added a commit to erratic-pattern/arrow-datafusion that referenced this pull request Mar 14, 2024
erratic-pattern added a commit to erratic-pattern/arrow-datafusion that referenced this pull request Mar 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ARRAY_AGG of column of type list ORDER BY column of type non-list
2 participants