Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: inconsistent scalar types in DistinctArrayAggAccumulator state #7385

Merged
merged 3 commits into from
Aug 24, 2023

Conversation

korowa
Copy link
Contributor

@korowa korowa commented Aug 23, 2023

Which issue does this PR close?

Closes #6743.

Rationale for this change

Currently DistinctArrayAggAccumulator stores its state as a set of scalar values and returns it as a single scalar list. The problem is that while merging state, accumulator, instead of adding list elements, simply adds the whole list to the state as a single value, which causes errors while executing evaluate on accumulator with merged state.

What changes are included in this PR?

merge_state now unpacks scalar list input and updates state with its elements.

Are these changes tested?

  • tests for merging DistinctArrayAggAccumulators have been added,
  • check for successful execution of array_agg(distinct col) has been added to sqllogictests

Are there any user-facing changes?

array_agg(distinct) should not fail in case query execution requires merging accumulator states (e. g. aggregation over multiple partitions)

@github-actions github-actions bot added physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Aug 23, 2023
@@ -1280,6 +1280,10 @@ NULL NULL 781 7.81 125 -117 100
# ----
# [4, 2, 3, 5, 1]

# additional count(1) forces array_agg_distinct instead of array_agg over aggregated by c2 data
statement ok
SELECT array_agg(distinct c2), count(1) FROM aggregate_test_100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you verify the output as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we don't have array_sort / list_sort function yet, and output of array_agg(distinct) is non-deterministic, so absence of failures is the best option of SQL-level check (as I can see it)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about unnesting and then using order by?

Copy link
Contributor Author

@korowa korowa Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if there exists SQL-level unnest 🤔 -- it seems to be provided only by dataframe API (also PR and issue for this still are open).

But yeah, I can manually unnest it anyway, thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done using cross-join and CTE for array indices

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes sense to me -- thank you @korowa

(0..array.len()).try_for_each(|i| {
let scalar = ScalarValue::try_from_array(array, i)?;
if let ScalarValue::List(Some(values), _) = scalar {
self.values.extend(values);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 extend should do the deduplication. 👍

# ----
# [4, 2, 3, 5, 1]
query III
WITH indices AS (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can make an input that has at least one non distinct value too -- not just 1,2,3,4,5 but maybe repeat the value 1 a few times:

  SELECT 1 AS idx UNION ALL
  SELECT 1 AS idx UNION ALL
  SELECT 1 AS idx UNION ALL
  SELECT 2 AS idx UNION ALL
...

Copy link
Contributor Author

@korowa korowa Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actual aggregation input is aggregate_test_100.c2 -- in the subquery. These indices are required only to check query output -- I was't able to find a better way to compare randomly-ordered array with expected one -- more details are in the thread above

@alamb alamb merged commit ea9144e into apache:main Aug 24, 2023
@alamb
Copy link
Contributor

alamb commented Aug 24, 2023

Thanks @korowa and @metesynnada

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Trouble getting fancy with ARRAY_AGG (DISTINCT ARRAY_AGG)
4 participants