Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update explain plan to show when topk operator is used #7750

Closed
alamb opened this issue Oct 5, 2023 · 5 comments · Fixed by #7826
Closed

Update explain plan to show when topk operator is used #7750

alamb opened this issue Oct 5, 2023 · 5 comments · Fixed by #7826
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@alamb
Copy link
Contributor

alamb commented Oct 5, 2023

Is your feature request related to a problem or challenge?

After #7721 a SortExec with a limit will use a special TopK operator

However, I don't think it is clear from the EXPLAIN plan that this will be used (you have to know that sort with a limit is specially optimized to not actually sort).

Using DataFusion CLI and the dataset described here: #7721 (review), the explain plan looks like

❯ explain select trace_id from 'traces.parquet' order by time desc limit 10;
+---------------+-------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                              |
+---------------+-------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: traces.parquet.trace_id                                                                               |
|               |   Limit: skip=0, fetch=10                                                                                         |
|               |     Sort: traces.parquet.time DESC NULLS FIRST, fetch=10                                                          |
|               |       Projection: traces.parquet.trace_id, traces.parquet.time                                                    |
|               |         TableScan: traces.parquet projection=[time, trace_id]                                                     |
| physical_plan | ProjectionExec: expr=[trace_id@0 as trace_id]                                                                     |
|               |   GlobalLimitExec: skip=0, fetch=10                                                                               |
|               |     SortExec: fetch=10, expr=[time@1 DESC]         <----- ****  this uses the TopK operator |                                                               
|               |       ProjectionExec: expr=[trace_id@1 as trace_id, time@0 as time]                                               |
|               |         ParquetExec: file_groups={1 group: [[traces.parquet]]}, projection=[time, trace_id] |
|               |                                                                                                                   |
+---------------+-------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.004 seconds.

Describe the solution you'd like

I would like to make it clearer in the explain plan that this new operator is used. Perhaps something like

SortExec: TopK(fetch=10), expr=[time@1 DESC]         <----- **** This is updated |                                                               

See comment in https://github.com/apache/arrow-datafusion/pull/7721/files#r1343969678 for exactly where this code that controls the output is

Describe alternatives you've considered

We can leave the explain plans alone, but I think that is confusing

Additional context

No response

@alamb alamb added the enhancement New feature or request label Oct 5, 2023
@alamb
Copy link
Contributor Author

alamb commented Oct 5, 2023

I think this is an excellent task for new users as the code is quite straightforward and most of this work will be to learn how to run the various tests and update them for the new explain plans.

@alamb alamb added the good first issue Good for newcomers label Oct 5, 2023
Night-Amber3301 added a commit to Night-Amber3301/arrow-datafusion that referenced this issue Oct 6, 2023
solves: apache#7750

Replaced `SortExec: fetch={fetch}, expr=[{}]` with 'SortExec: TopK(fetch={fetch}), expr=[{}]' in [sort.rs](https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-plan/src/sorts/sort.rs) file
@fansehep
Copy link

Let me try it. 😄

@alamb
Copy link
Contributor Author

alamb commented Oct 10, 2023

Note there is already a partial version in #7751

With a good comment here: #7751 (comment)

Maybe you can base your efforts on that PR

@MayurShirodkarOfficial
Copy link

hello !! can i try working on this issue?

@alamb
Copy link
Contributor Author

alamb commented Oct 13, 2023

hello !! can i try working on this issue?

Absolutely -- I think we are just waiting on someone to make a PR that passes CI :)

alamb pushed a commit that referenced this issue Oct 15, 2023
* Updated sort.rs

solves: #7750

Replaced `SortExec: fetch={fetch}, expr=[{}]` with 'SortExec: TopK(fetch={fetch}), expr=[{}]' in [sort.rs](https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-plan/src/sorts/sort.rs) file

* fix: ci

---------

Co-authored-by: Pratibhanu Jarngal <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
3 participants