Review need for explicit distinct aggregate function implementations #10159

Closed
Jefffrey opened this issue Apr 21, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@Jefffrey

Is your feature request related to a problem or challenge?

When raising #10158 to close some old tickets, I noticed places in the code base that state distinct aggregations are not supported when in fact they are:

(AggregateFunction::Median, true) => {
return not_impl_err!("MEDIAN(DISTINCT) aggregations are not available");
}

I guess this just means there isn't an explicit function implementation for it, since the plan applies the distinct first and then the aggregation, like so:

> explain select median(distinct "1") from '/home/jeffrey/Downloads/data.csv';
+---------------+------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                   |
+---------------+------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: MEDIAN(alias1) AS MEDIAN(DISTINCT /home/jeffrey/Downloads/data.csv.1)                                      |
|               |   Aggregate: groupBy=[[]], aggr=[[MEDIAN(alias1)]]                                                                     |
|               |     Aggregate: groupBy=[[/home/jeffrey/Downloads/data.csv.1 AS alias1]], aggr=[[]]                                     |
|               |       TableScan: /home/jeffrey/Downloads/data.csv projection=[1]                                                       |
| physical_plan | ProjectionExec: expr=[MEDIAN(alias1)@0 as MEDIAN(DISTINCT /home/jeffrey/Downloads/data.csv.1)]                         |
|               |   AggregateExec: mode=Final, gby=[], aggr=[MEDIAN(alias1)]                                                             |
|               |     CoalescePartitionsExec                                                                                             |
|               |       AggregateExec: mode=Partial, gby=[], aggr=[MEDIAN(alias1)]                                                       |
|               |         AggregateExec: mode=FinalPartitioned, gby=[alias1@0 as alias1], aggr=[]                                        |
|               |           CoalesceBatchesExec: target_batch_size=8192                                                                  |
|               |             RepartitionExec: partitioning=Hash([alias1@0], 12), input_partitions=12                                    |
|               |               AggregateExec: mode=Partial, gby=[1@0 as alias1], aggr=[]                                                |
|               |                 RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1                                  |
|               |                   CsvExec: file_groups={1 group: [[home/jeffrey/Downloads/data.csv]]}, projection=[1], has_header=true |
|               |                                                                                                                        |
+---------------+------------------------------------------------------------------------------------------------------------------------+
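The shape of that plan can be mimicked outside DataFusion as two separate steps: a group-by with no aggregates (which deduplicates), followed by a plain median over the result. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, values are `i64`, and the median of an even number of values is taken as the mean of the two middle ones, which may differ from what DataFusion actually returns.

```rust
use std::collections::BTreeSet;

/// Step 1 (hypothetical helper): the "Aggregate: groupBy=[[col]], aggr=[[]]" stage.
/// Grouping with no aggregates keeps one row per distinct value.
/// A BTreeSet is used so the output is also sorted.
fn distinct_values(input: &[i64]) -> Vec<i64> {
    let set: BTreeSet<i64> = input.iter().copied().collect();
    set.into_iter().collect()
}

/// Median of an already-sorted slice; None for an empty input.
/// Assumption: an even count averages the two middle values.
fn median_sorted(sorted: &[i64]) -> Option<f64> {
    let n = sorted.len();
    if n == 0 {
        return None;
    }
    if n % 2 == 1 {
        Some(sorted[n / 2] as f64)
    } else {
        Some((sorted[n / 2 - 1] as f64 + sorted[n / 2] as f64) / 2.0)
    }
}

/// Step 2 (hypothetical helper): the "Aggregate: aggr=[[MEDIAN(alias1)]]" stage,
/// applied to the output of step 1.
fn median_distinct_two_step(input: &[i64]) -> Option<f64> {
    let distinct = distinct_values(input);
    median_sorted(&distinct)
}
```

With this shape, the median stage never needs to know about DISTINCT at all, since duplicates are gone before it runs.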

Describe the solution you'd like

Review whether there's a need for an explicit implementation of a distinct aggregation function (e.g. distinct_median) instead of relying on separate distinct -> median steps in the plan. Would an explicit implementation make a more efficient distinct median possible?
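For comparison, an explicit distinct median could fold the deduplication into the accumulator state itself. The following is a minimal sketch of that idea, not DataFusion's `Accumulator` trait: the struct and method names are hypothetical, values are `i64`, and an even count is assumed to average the two middle values. Whether this would actually be more efficient than the two-step plan is exactly the open question here.

```rust
use std::collections::HashSet;

/// Hypothetical accumulator that deduplicates while accumulating.
/// Not a DataFusion type; names and signatures are illustrative only.
#[derive(Default)]
struct DistinctMedianAccumulator {
    /// Every distinct value seen so far.
    seen: HashSet<i64>,
}

impl DistinctMedianAccumulator {
    /// Fold a batch of input values into the state, dropping duplicates.
    fn update_batch(&mut self, values: &[i64]) {
        self.seen.extend(values.iter().copied());
    }

    /// Combine the state of another partition's accumulator into this one.
    fn merge(&mut self, other: DistinctMedianAccumulator) {
        self.seen.extend(other.seen);
    }

    /// Sort the distinct values and pick the median; None if nothing was seen.
    fn evaluate(&self) -> Option<f64> {
        if self.seen.is_empty() {
            return None;
        }
        let mut values: Vec<i64> = self.seen.iter().copied().collect();
        values.sort_unstable();
        let mid = values.len() / 2;
        if values.len() % 2 == 1 {
            Some(values[mid] as f64)
        } else {
            Some((values[mid - 1] as f64 + values[mid] as f64) / 2.0)
        }
    }
}
```

Because the state is just a set, an accumulator like this could sit next to a non-distinct aggregate in the same aggregation stage, at the cost of holding all distinct values in memory until `evaluate` runs.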

Describe alternatives you've considered

If we decide not to implement an explicit function for distinct aggregates, update the above code so that it raises a plan or internal error rather than a NotImplemented error, for clarity, and have the error message indicate that planning should have split the aggregation up.
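A rough sketch of what that alternative could look like is below. Everything in it is a hypothetical stand-in (`SketchError`, `AggKind`, `create_accumulator`) rather than DataFusion's real error type or function signature; it only shows the distinct arm reporting an internal error whose message points at planning.

```rust
use std::fmt;

/// Hypothetical stand-in for an engine error type; not DataFusion's enum.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum SketchError {
    NotImplemented(String),
    Internal(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::NotImplemented(msg) => {
                write!(f, "This feature is not implemented: {msg}")
            }
            SketchError::Internal(msg) => write!(f, "Internal error: {msg}"),
        }
    }
}

/// Hypothetical aggregate function kinds, reduced to the one case discussed.
#[derive(Debug, Clone, Copy)]
enum AggKind {
    Median,
}

/// Hypothetical factory: returns the name of the accumulator it would build.
fn create_accumulator(kind: AggKind, distinct: bool) -> Result<&'static str, SketchError> {
    match (kind, distinct) {
        (AggKind::Median, false) => Ok("median"),
        // Reaching this arm means the planner did not split DISTINCT into its
        // own aggregation step, so report it as an internal error.
        (AggKind::Median, true) => Err(SketchError::Internal(
            "MEDIAN(DISTINCT) should have been split into distinct -> median during planning"
                .to_string(),
        )),
    }
}
```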

Additional context

No response

@Jefffrey Jefffrey added the enhancement New feature or request label Apr 21, 2024
Jefffrey commented Apr 21, 2024

> select median(distinct "1"), median("1") from '/home/jeffrey/Downloads/data.csv';
This feature is not implemented: MEDIAN(DISTINCT) aggregations are not available

Looks like this is where the error can pop up, which might be a little confusing given that median(distinct "1") by itself is fine. I guess it will need to be implemented then.

@Jefffrey

Looks like this was covered by #2406

Closing

@Jefffrey Jefffrey closed this as not planned Apr 21, 2024