Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(query): order by adhoc metrics should trigger group by #13434

Merged
merged 19 commits into from
Mar 17, 2021

Conversation

ktmud
Copy link
Member

@ktmud ktmud commented Mar 3, 2021

SUMMARY

Update SQL query generator to properly handle adhoc metrics in order by. The idea is collect adhoc metrics in orderby before deciding whether to apply groupby. Closes #13465 .

Also assume it is a GROUP BY query unless metrics to set to None. Depends on: apache-superset/superset-ui#995 (released in @superset-ui/core v0.17.18)

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Before

orderby-only-before

orderby-only-before-query

After

orderby-only-after

orderby-only-after-query

Correctly generates query for group by and sort by without metrics.

TEST PLAN

Unit tests and a new Cypress test

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration.
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

@villebro
Copy link
Member

villebro commented Mar 5, 2021

Lots of great improvements here (I'll add review comments later). It seems the root problem here is that the selected column isn't being added to the groupby clause, not the aggregate expression being missing from the select. Also, one thing worthwhile considering is not all db engines require having the order by aggregate expression in the selects:

sqlite> select name from birth_names group by name order by sum(num_girls) desc limit 10;
Jennifer
Jessica
Ashley
Sarah
Amanda
Elizabeth
Melissa
Michelle
Kimberly
Stephanie
sqlite>

While I don't see a lot of use cases where omitting the aggregate in the select, it's not strictly always necessary, and could save unnecessary network bandwidth when not needed. So instead of always adding it to the selects, we might want to introduce a parameter in BaseEngineSpec stating whether or not orderby aggregates have to be present in the select.

@ktmud
Copy link
Member Author

ktmud commented Mar 5, 2021

Thanks for the headsup, @villebro !

Another thing I noticed is that it's not always correct to assume "columns" with no "metrics" requires no aggregation. E.g. when table chart selects groupby with no metrics in aggregation mode, the output should be unique combination of the groups instead of raw records. This is mentioned as the second problem in #13228 .

Therefore we can not really deprecate groupby in QueryObject. Or we will need to add at least an is_aggregate flag if groupby is to be cleaned up.

@villebro
Copy link
Member

villebro commented Mar 5, 2021

Thanks for the headsup, @villebro !

Another thing I noticed is that it's not always correct to assume "columns" with no "metrics" requires no aggregation. E.g. when table chart selects groupby with no metrics in aggregation mode, the output should be unique combination of the groups instead of raw records. This is mentioned as the second problem in #13228 .

Therefore we can not really deprecate groupby in QueryObject. Or we will need to add at least an is_aggregate flag is groupby is to be cleaned up.

Yes, I've been planning on that - I was thinking along the lines of an aggregation_type, which could be none, distinct and groupby, where any aggregate expression would force groupby, but in the absence of aggregates, we would default to none, but would provide the option to do distinct.

@ktmud
Copy link
Member Author

ktmud commented Mar 5, 2021

I'm trying to imagine what would be a useful case in exposing SELECT DISTINCT abc FROM tbl in the API and why would the user (or client query builder) choose one way or another---since it generally produces the same results as group by (SELECT abc FROM tbl GROUP BY abc).

Note that in some data engines (e.g. Presto), there are performance implications in using DISTINCT vs GROUP BY (and GROUP BY is normally faster).

I'm OK with either way, but would hope we can keep the interface simple and avoid the possibility of getting the same results with different query configs.

@ktmud
Copy link
Member Author

ktmud commented Mar 5, 2021

Lots of great improvements here (I'll add review comments later). It seems the root problem here is that the selected column isn't being added to the groupby clause, not the aggregate expression being missing from the select. Also, one thing worthwhile considering is not all db engines require having the order by aggregate expression in the selects:

sqlite> select name from birth_names group by name order by sum(num_girls) desc limit 10;
Jennifer
Jessica
Ashley
Sarah
Amanda
Elizabeth
Melissa
Michelle
Kimberly
Stephanie
sqlite>

While I don't see a lot of use cases where omitting the aggregate in the select, it's not strictly always necessary, and could save unnecessary network bandwidth when not needed. So instead of always adding it to the selects, we might want to introduce a parameter in BaseEngineSpec stating whether or not orderby aggregates have to be present in the select.

I was actually contemplating sending the orderby only columns back to client (at least in the Data panel), just so users can have confidence the order by was correctly applied. Currently I'm removing the orderby columns with labels_expected, but they could also be removed in query_actions.py, or be handled by each chart itself. Either way, it seems useful to have the engine always return the sort by field while it's not clear how much the bandwidth saving matters, so I'm inclined to not add this EngineSpec parameter just to keep things simple...

@villebro
Copy link
Member

villebro commented Mar 5, 2021

I was actually contemplating sending the orderby only columns back to client (at least in the Data panel), just so users can have confidence the order by was correctly applied. Currently I'm removing the orderby columns with labels_expected, but they could also be removed in query_actions.py, or be handled by each chart itself. Either way, it seems useful to have the engine always return the sort by field while it's not clear how much the bandwidth saving matters, so I'm inclined to not add this EngineSpec parameter just to keep things simple...

I'm not super passionate about adding distinct support, so if there's any risk of it causing added complexity or getting in the way I'm ok with just supporting group bys. Same goes for adding vs removing order bys from the select - I'm fine adding them in the select, and agree that it's good to have them around for validation purposes (the real optimizations are probably elsewhere anyways, like using binary protocols, streaming etc). But in general I'm in the camp of "if a feature can easily be added, why not", and tend to prefer to avoid adding limitations or forcing features if there are no clear motivations for doing so.

@ktmud ktmud changed the title fix(query): properly select adhoc metrics in orderby fix(query): order by adhoc metrics should trigger group by Mar 8, 2021
@ktmud ktmud marked this pull request as ready for review March 8, 2021 07:13
@ktmud
Copy link
Member Author

ktmud commented Mar 8, 2021

I just removed the logic of adding order by only metrics to select expressions and it still works at least for Postgres. We can decide if we want to expose these hidden columns later when there are clearer actual feature requests.

@codecov
Copy link

codecov bot commented Mar 8, 2021

Codecov Report

Merging #13434 (7eea891) into master (98a26e7) will decrease coverage by 4.78%.
The diff coverage is 84.76%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #13434      +/-   ##
==========================================
- Coverage   77.41%   72.63%   -4.79%     
==========================================
  Files         918      918              
  Lines       46673    46674       +1     
  Branches     5720     5720              
==========================================
- Hits        36132    33901    -2231     
- Misses      10405    12557    +2152     
- Partials      136      216      +80     
Flag Coverage Δ
cypress ?
hive ?
javascript 63.16% <0.00%> (-0.01%) ⬇️
mysql 80.55% <89.00%> (+<0.01%) ⬆️
postgres ?
presto 80.29% <87.00%> (+<0.01%) ⬆️
python 80.80% <89.00%> (-0.31%) ⬇️
sqlite 80.22% <89.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...tend/src/explore/components/DisplayQueryButton.jsx 43.75% <0.00%> (-32.06%) ⬇️
superset/viz.py 56.07% <0.00%> (-0.04%) ⬇️
superset/common/query_actions.py 92.85% <75.00%> (-2.53%) ⬇️
superset/connectors/sqla/models.py 90.50% <89.06%> (-0.13%) ⬇️
superset/common/query_context.py 82.06% <100.00%> (ø)
superset/common/query_object.py 90.47% <100.00%> (-0.07%) ⬇️
superset/connectors/base/models.py 90.81% <100.00%> (+0.03%) ⬆️
superset/db_engine_specs/base.py 86.19% <100.00%> (+0.03%) ⬆️
superset/db_engine_specs/pinot.py 95.12% <100.00%> (+0.12%) ⬆️
superset/models/core.py 89.37% <100.00%> (ø)
... and 228 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 98a26e7...7eea891. Read the comment docs.

Copy link
Member Author

@ktmud ktmud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think there is still a way to fully deprecate groupby. We just assume group by queries whenever metrics is not None. An empty metrics array should also trigger group by.

This could be a somewhat dangerous change and would require more testing.

if item_key not in seen:
seen.add(item_key)
result.append(item)
return result
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not used by this PR anymore, but I'm keeping it here for convenience.

else metric["label"] # type: ignore
for metric in metrics
self.metrics = metrics and [
x
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use x in case some may wonder whether metric is also from the parent context (e.g. an __init__ argument).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd personally prefer metric_ here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always prefer x in list comprehensions and lambda functions because it's simpler and reads more "local".

@ktmud ktmud force-pushed the sql-generator branch 6 times, most recently from 8608c6e to 332ba4d Compare March 11, 2021 02:40
@ktmud
Copy link
Member Author

ktmud commented Mar 11, 2021

@ktmud ktmud force-pushed the sql-generator branch 2 times, most recently from 2753c8d to 4bb6545 Compare March 12, 2021 17:13
@ktmud ktmud merged commit bd1d6ac into apache:master Mar 17, 2021
@ktmud ktmud deleted the sql-generator branch March 29, 2021 16:09
@ktmud ktmud restored the sql-generator branch March 29, 2021 16:09
@ktmud ktmud deleted the sql-generator branch March 29, 2021 16:09
@ktmud ktmud restored the sql-generator branch March 29, 2021 16:09
@ktmud ktmud deleted the sql-generator branch March 29, 2021 16:09
allanco91 pushed a commit to allanco91/superset that referenced this pull request May 21, 2021
)

* fix(query): properly select adhoc metrics in orderby

* Throw error when sql is empty

* Allow `metrics` to be None

* Always use alias in orderby for metrics

* Bump table chart version and migrate histogram to typescript

* Fix Histogram without groupby

* Fix Presto birth names test

* Raw records mode should not aggregate
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.2.0 labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels explore:control Related to the controls panel of Explore size/L 🚢 1.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Explore][Table] Adding sort by without metrics causing unexpected error
5 participants