fix(query): order by adhoc metrics should trigger group by #13434

ktmud · 2021-03-03T08:38:49Z

SUMMARY

Update the SQL query generator to properly handle adhoc metrics in order by. The idea is to collect adhoc metrics in orderby before deciding whether to apply groupby. Closes #13465.

Also assume it is a GROUP BY query unless metrics is set to None. Depends on: apache-superset/superset-ui#995 (released in @superset-ui/core v0.17.18)

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Before

After

Correctly generates query for group by and sort by without metrics.

TEST PLAN

Unit tests and a new Cypress test

ADDITIONAL INFORMATION

villebro · 2021-03-05T07:16:04Z

Lots of great improvements here (I'll add review comments later). It seems the root problem here is that the selected column isn't being added to the groupby clause, not that the aggregate expression is missing from the select. Also, one thing worth considering is that not all db engines require the order by aggregate expression to be in the select:

sqlite> select name from birth_names group by name order by sum(num_girls) desc limit 10;
Jennifer
Jessica
Ashley
Sarah
Amanda
Elizabeth
Melissa
Michelle
Kimberly
Stephanie
sqlite>

While I don't see a lot of use cases for omitting the aggregate from the select, including it is not always strictly necessary, and leaving it out could save network bandwidth. So instead of always adding it to the select, we might want to introduce a parameter in BaseEngineSpec stating whether or not orderby aggregates have to be present in the select.
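
The SQLite behaviour shown above can be reproduced with Python's built-in sqlite3 module. This is a minimal sketch that assumes a tiny made-up stand-in for birth_names (the rows and counts are invented for illustration, not taken from the real dataset). It only shows that SQLite accepts an aggregate in ORDER BY that is absent from the SELECT list.

```python
import sqlite3

# Toy stand-in for the birth_names table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE birth_names (name TEXT, num_girls INTEGER)")
conn.executemany(
    "INSERT INTO birth_names VALUES (?, ?)",
    [("Ann", 5), ("Bea", 20), ("Ann", 30), ("Cat", 1)],
)

# The aggregate appears only in ORDER BY, not in the SELECT list.
rows = conn.execute(
    "SELECT name FROM birth_names GROUP BY name ORDER BY SUM(num_girls) DESC"
).fetchall()
names = [row[0] for row in rows]
print(names)
```

Other engines may reject or plan this differently, which is the reason a per-engine setting is floated here.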

ktmud · 2021-03-05T18:34:48Z

Thanks for the heads-up, @villebro!

Another thing I noticed is that it's not always correct to assume "columns" with no "metrics" requires no aggregation. E.g. when a table chart selects groupby with no metrics in aggregation mode, the output should be the unique combinations of the groups instead of raw records. This is mentioned as the second problem in #13228.

Therefore we cannot really deprecate groupby in QueryObject. Or we will need to add at least an is_aggregate flag if groupby is to be cleaned up.

villebro · 2021-03-05T18:59:46Z

> Thanks for the heads-up, @villebro!
>
> Another thing I noticed is that it's not always correct to assume "columns" with no "metrics" requires no aggregation. E.g. when a table chart selects groupby with no metrics in aggregation mode, the output should be the unique combinations of the groups instead of raw records. This is mentioned as the second problem in #13228.
>
> Therefore we cannot really deprecate groupby in QueryObject. Or we will need to add at least an is_aggregate flag if groupby is to be cleaned up.

Yes, I've been planning on that. I was thinking along the lines of an aggregation_type, which could be none, distinct, or groupby. Any aggregate expression would force groupby; in the absence of aggregates we would default to none, but would provide the option to do distinct.
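
The aggregation_type idea could be sketched roughly as follows. This is only an illustration of the rule described above: the enum, the resolve_aggregation_type helper, and its parameters are hypothetical names and are not part of Superset.

```python
from enum import Enum
from typing import Optional, Sequence


class AggregationType(str, Enum):
    # Hypothetical values, following the names floated in the comment above.
    NONE = "none"
    DISTINCT = "distinct"
    GROUPBY = "groupby"


def resolve_aggregation_type(
    metrics: Optional[Sequence[str]],
    requested: Optional[AggregationType] = None,
) -> AggregationType:
    """Pick the effective aggregation type for a query (illustrative only)."""
    if metrics:
        # Any aggregate expression forces GROUP BY, whatever was requested.
        return AggregationType.GROUPBY
    if requested is None:
        # Without aggregates, default to no aggregation at all.
        return AggregationType.NONE
    # Otherwise honour the caller's choice, e.g. DISTINCT.
    return requested
```

For example, resolve_aggregation_type([], AggregationType.DISTINCT) yields DISTINCT, while any non-empty metrics list yields GROUPBY.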

ktmud · 2021-03-05T19:46:26Z

I'm trying to imagine a useful case for exposing SELECT DISTINCT abc FROM tbl in the API, and why the user (or client query builder) would choose one way or the other, since it generally produces the same results as group by (SELECT abc FROM tbl GROUP BY abc).

Note that in some data engines (e.g. Presto), there are performance implications in using DISTINCT vs GROUP BY (and GROUP BY is normally faster).

I'm OK with either way, but would hope we can keep the interface simple and avoid the possibility of getting the same results with different query configs.
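
To illustrate the equivalence mentioned above, here is a small sketch using Python's built-in sqlite3 module and an invented table tbl with a single column abc (the data is made up). It compares the raw rows, SELECT DISTINCT, and GROUP BY; it says nothing about performance on Presto or any other engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (abc TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?)",
    [("x",), ("y",), ("x",), ("z",), ("y",)],
)

raw = conn.execute("SELECT abc FROM tbl").fetchall()
distinct = conn.execute("SELECT DISTINCT abc FROM tbl").fetchall()
grouped = conn.execute("SELECT abc FROM tbl GROUP BY abc").fetchall()

# Row order is not guaranteed without ORDER BY, so compare as sets.
same_rows = set(distinct) == set(grouped)
print(len(raw), len(distinct), len(grouped), same_rows)
```

Both aggregated forms return the unique values, whereas the raw query returns every record, which is the distinction between aggregation mode and raw records mode discussed earlier.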

ktmud · 2021-03-05T20:44:27Z

> Lots of great improvements here (I'll add review comments later). It seems the root problem here is that the selected column isn't being added to the groupby clause, not that the aggregate expression is missing from the select. Also, one thing worth considering is that not all db engines require the order by aggregate expression to be in the select:
>
> sqlite> select name from birth_names group by name order by sum(num_girls) desc limit 10;
> Jennifer
> Jessica
> Ashley
> Sarah
> Amanda
> Elizabeth
> Melissa
> Michelle
> Kimberly
> Stephanie
> sqlite>
>
> While I don't see a lot of use cases for omitting the aggregate from the select, including it is not always strictly necessary, and leaving it out could save network bandwidth. So instead of always adding it to the select, we might want to introduce a parameter in BaseEngineSpec stating whether or not orderby aggregates have to be present in the select.

I was actually contemplating sending the orderby only columns back to client (at least in the Data panel), just so users can have confidence the order by was correctly applied. Currently I'm removing the orderby columns with labels_expected, but they could also be removed in query_actions.py, or be handled by each chart itself. Either way, it seems useful to have the engine always return the sort by field while it's not clear how much the bandwidth saving matters, so I'm inclined to not add this EngineSpec parameter just to keep things simple...
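
As a rough illustration of dropping orderby-only columns before results go back to the client, the sketch below filters each row down to a list of expected labels. Only the name labels_expected comes from the comment above; the helper, the row shape (a list of dicts), and the sample values are hypothetical and do not reflect Superset's actual implementation.

```python
from typing import Any, Dict, List


def keep_expected_labels(
    rows: List[Dict[str, Any]], labels_expected: List[str]
) -> List[Dict[str, Any]]:
    """Return rows containing only the columns named in labels_expected."""
    return [
        {label: row[label] for label in labels_expected if label in row}
        for row in rows
    ]


# Invented sample: "sum_girls" was selected only to support ORDER BY.
rows = [{"name": "Ann", "sum_girls": 35}, {"name": "Bea", "sum_girls": 20}]
visible = keep_expected_labels(rows, ["name"])
print(visible)
```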

villebro · 2021-03-05T21:46:59Z

> I was actually contemplating sending the orderby only columns back to client (at least in the Data panel), just so users can have confidence the order by was correctly applied. Currently I'm removing the orderby columns with labels_expected, but they could also be removed in query_actions.py, or be handled by each chart itself. Either way, it seems useful to have the engine always return the sort by field while it's not clear how much the bandwidth saving matters, so I'm inclined to not add this EngineSpec parameter just to keep things simple...

I'm not super passionate about adding distinct support, so if there's any risk of it causing added complexity or getting in the way I'm ok with just supporting group bys. Same goes for adding vs removing order bys from the select - I'm fine adding them in the select, and agree that it's good to have them around for validation purposes (the real optimizations are probably elsewhere anyways, like using binary protocols, streaming etc). But in general I'm in the camp of "if a feature can easily be added, why not", and tend to prefer to avoid adding limitations or forcing features if there are no clear motivations for doing so.

ktmud · 2021-03-08T07:14:56Z

I just removed the logic that adds order-by-only metrics to the select expressions, and it still works, at least for Postgres. We can decide whether to expose these hidden columns later, when there are clearer feature requests.

codecov · 2021-03-08T07:45:39Z

Codecov Report

Merging #13434 (7eea891) into master (98a26e7) will decrease coverage by 4.78%.
The diff coverage is 84.76%.

@@            Coverage Diff             @@
##           master   #13434      +/-   ##
==========================================
- Coverage   77.41%   72.63%   -4.79%     
==========================================
  Files         918      918              
  Lines       46673    46674       +1     
  Branches     5720     5720              
==========================================
- Hits        36132    33901    -2231     
- Misses      10405    12557    +2152     
- Partials      136      216      +80

Flag	Coverage Δ
cypress	`?`
hive	`?`
javascript	`63.16% <0.00%> (-0.01%)`	⬇️
mysql	`80.55% <89.00%> (+<0.01%)`	⬆️
postgres	`?`
presto	`80.29% <87.00%> (+<0.01%)`	⬆️
python	`80.80% <89.00%> (-0.31%)`	⬇️
sqlite	`80.22% <89.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
...tend/src/explore/components/DisplayQueryButton.jsx	`43.75% <0.00%> (-32.06%)`	⬇️
superset/viz.py	`56.07% <0.00%> (-0.04%)`	⬇️
superset/common/query_actions.py	`92.85% <75.00%> (-2.53%)`	⬇️
superset/connectors/sqla/models.py	`90.50% <89.06%> (-0.13%)`	⬇️
superset/common/query_context.py	`82.06% <100.00%> (ø)`
superset/common/query_object.py	`90.47% <100.00%> (-0.07%)`	⬇️
superset/connectors/base/models.py	`90.81% <100.00%> (+0.03%)`	⬆️
superset/db_engine_specs/base.py	`86.19% <100.00%> (+0.03%)`	⬆️
superset/db_engine_specs/pinot.py	`95.12% <100.00%> (+0.12%)`	⬆️
superset/models/core.py	`89.37% <100.00%> (ø)`
... and 228 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 98a26e7...7eea891.

ktmud

So I think there is still a way to fully deprecate groupby. We just assume group by queries whenever metrics is not None. An empty metrics array should also trigger group by.

This could be a somewhat dangerous change and would require more testing.
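
The rule proposed above can be stated in a few lines. This is a sketch only; is_group_by_query is a hypothetical helper name, not a function in Superset.

```python
from typing import List, Optional


def is_group_by_query(metrics: Optional[List[str]]) -> bool:
    """Aggregate unless metrics is explicitly None (illustrative rule only)."""
    # An empty list still means "aggregation mode, just without metrics".
    return metrics is not None


print(is_group_by_query(None))     # raw records mode
print(is_group_by_query([]))       # group by with no metrics
print(is_group_by_query(["cnt"]))  # group by with metrics
```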

ktmud · 2021-03-08T08:02:59Z

superset/utils/core.py

+        if item_key not in seen:
+            seen.add(item_key)
+            result.append(item)
+    return result


This function is not used by this PR anymore, but I'm keeping it here for convenience.
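
The diff above shows only the tail of the helper. A self-contained sketch of an order-preserving de-duplication function of this shape might look like the following; the name dedupe_by_key and its signature are assumptions, since the full function is not visible in this excerpt.

```python
from typing import Callable, Hashable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def dedupe_by_key(
    items: Iterable[T], key: Optional[Callable[[T], Hashable]] = None
) -> List[T]:
    """Remove duplicates while keeping the first occurrence of each item."""
    if key is None:
        # dict preserves insertion order, so first occurrences are kept.
        return list(dict.fromkeys(items))
    seen = set()
    result = []
    for item in items:
        item_key = key(item)
        if item_key not in seen:
            seen.add(item_key)
            result.append(item)
    return result
```

Passing a key function allows unhashable items such as metric dicts to be de-duplicated, for instance by their label.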

ktmud · 2021-03-08T08:13:14Z

superset/common/query_object.py

-            else metric["label"]  # type: ignore
-            for metric in metrics
+        self.metrics = metrics and [
+            x


Use x in case some may wonder whether metric is also from the parent context (e.g. an __init__ argument).

nit: I'd personally prefer metric_ here

I always prefer x in list comprehensions and lambda functions because it's simpler and reads more "local".
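
For readers unfamiliar with the `metrics and [...]` idiom in the diff above, here is a minimal sketch of how it behaves. The normalization rule inside the comprehension (replacing a dict by its "label") is an assumption made for illustration, because the diff shows only the first lines of the expression; normalize_metrics is a hypothetical name.

```python
from typing import Any, Dict, List, Optional, Union

Metric = Union[str, Dict[str, Any]]


def normalize_metrics(metrics: Optional[List[Metric]]) -> Optional[List[Any]]:
    # `metrics and [...]` short-circuits: None stays None and [] stays [],
    # so callers can still tell "no aggregation" apart from "no metrics".
    return metrics and [
        x if isinstance(x, str) else x["label"]
        for x in metrics
    ]


print(normalize_metrics(None))
print(normalize_metrics([]))
print(normalize_metrics(["count", {"label": "sum_num"}]))
```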

ktmud · 2021-03-11T20:10:45Z

Presto tests seem to be failing because of this error:

[Explore]Ensure ad-hoc metrics are not named the same as an existing column #13426
Invalid plan for ORDER BY expression that refers to output column prestodb/presto#4698

pull-request-size bot added the size/M label Mar 3, 2021

ktmud force-pushed the sql-generator branch from db292dc to 2d778b5 Compare March 4, 2021 23:31

pull-request-size bot added size/L and removed size/M labels Mar 4, 2021

ktmud force-pushed the sql-generator branch from 2d778b5 to 7dbf67c Compare March 5, 2021 04:26

This was referenced Mar 5, 2021

[Explore]SORT BY metric incorrectly re-added to SELECT clause #13423

Closed

fix(explore): make sure sort by metric is not duplicated #13473

Merged

ktmud changed the title ~~fix(query): properly select adhoc metrics in orderby~~ fix(query): order by adhoc metrics should trigger group by Mar 8, 2021

ktmud force-pushed the sql-generator branch from b7b581b to 75b56a4 Compare March 8, 2021 07:10

ktmud marked this pull request as ready for review March 8, 2021 07:13

ktmud mentioned this pull request Mar 8, 2021

fix(core): don't add metrics to query object when in raw records mode apache-superset/superset-ui#995

Merged

ktmud commented Mar 8, 2021

ktmud force-pushed the sql-generator branch 6 times, most recently from 8608c6e to 332ba4d Compare March 11, 2021 02:40

ktmud force-pushed the sql-generator branch 2 times, most recently from 2753c8d to 4bb6545 Compare March 12, 2021 17:13

ktmud mentioned this pull request Mar 12, 2021

chore(explore): bump superset-ui 0.17.19 #13593

Merged

6 tasks

ktmud added 19 commits March 16, 2021 16:58

fix(query): properly select adhoc metrics in orderby

edc9e7f

Fix many cases

960b8fc

Throw error when sql is empty

001d250

Remove orderby only columns from output

01917ef

Small refactor and more test cases

ffc4166

No need to add sort by only metrics

97a0f05

Allow metrics to be None

1abccb1

Fix tests

f2512c8

Fix some typing error

ca42813

Let metrics be None

2eb4aab

Always use alias in orderby for metrics

2e0614a

Bump table chart version and migrate histogram to typescript

b141a7f

Fix Histogram without groupby

f17a006

Fix Presto birth names test

263f3fa

Remove unused props

d189a80

Clean up

937cbe7

Raw records mode should not aggregate

2128a6b

Address PR comments

2cb2a80

Orderby col is not optional

7eea891

ktmud force-pushed the sql-generator branch from 4f2f603 to 7eea891 Compare March 16, 2021 23:58

ktmud merged commit bd1d6ac into apache:master Mar 17, 2021

ktmud deleted the sql-generator branch March 29, 2021 16:09

ktmud restored the sql-generator branch March 29, 2021 16:09

ktmud deleted the sql-generator branch March 29, 2021 16:09

ktmud restored the sql-generator branch March 29, 2021 16:09

ktmud deleted the sql-generator branch March 29, 2021 16:09

filippociceri mentioned this pull request Mar 31, 2021

Charts with spaces in metric names generate Unexpected Errors #13812

Closed

3 tasks

mistercrunch added the 🏷️ bot and 🚢 1.2.0 labels Mar 12, 2024
