-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Invalid plan for ORDER BY expression that refers to output column #4698
Comments
Until now all the subqueries were converted (to some form of join) before applying filtering, aggregation, window or order by clauses, independently on where the subquery was placed. To enable subqueries for queries with aggregation above behaviour had to be changed. Depending on where the subquery is placed, it had to be applied at different processing moment in logical planner. Subquery in filter and group by clause is applied before aggregation. Subquery in having, project and order by clause is applied after aggregation. Hovewer if the same subquery expression is used in project, group by and/or having clause, then subquery have to be applied before aggregation and places where the subquery was used are replaced by the symbol of the subquery application.
Btw. this is not directly related but we ran into this several time when porting SQL from other DB engines to Presto:
It looks like you expect ORDER BY to handle aliases defined in SELECT but WHERE does not even see them. |
Yes, that's per the ANSI spec. The scope of the ORDER BY clause is the columns defined in the SELECT clause, but the scope of the WHERE clause is the columns derived from the FROM clause. If the WHERE clause could refer to columns in the SELECT clause, the behavior of the following query would be undefined: SELECT row_number() OVER () rn
FROM t
WHERE rn > 5 |
That's fair. Obviously you can use a subquery to rename columns and use them later in WHERE. We just observed so many SQL engines (including commercial ones) leaning towards user convenience. I bet they just throw a parsing error when it is unclear/undefined as in your example. |
@martint I would like to fix this as it is related to correlated subqueries work (see https://docs.google.com/document/d/18HN7peS2eR8lZsErqcmnoWyMEPb6p4OQeidH1JP_EkA/edit#heading=h.sjurgitot2kp). |
@kokosing go for it! |
Strange thing. I checked the PSQL and it looks they have the same bug. I have run the following query:
and it returned Did I (we?) miss something, or PSQL has a bug which seems unrealistic to me. |
I am getting lost. I have checked following queries in PSQL:
returns
returns I checked the explain and it looks like when expression in ORDER BY contains only output column alias ( I am not a native English speaker so I could misunderstood the section copied from SQL specification. But from what I understood ORDER BY expression can access columns returned by SELECT list (projection), that would mean that output column should be used all the time for above queries. |
@martint The same thing with MYSQL. So it looks like your query was processed properly (at least according to MYSQL and PSQL), that would mean that is not a bug, that is a feature ;) |
In case I name output column uniquely then:
In regards what can do MYSQL and PGSQL they are consistent to each other. I wonder which path we should take. Should we follow MYSQL or SQL spec? |
@martint ping. Can you please comment what you think is expected behavior for presto in regards of this bug, considering MYSQL and PSQL behavior? |
I'm re-reading that part of the spec. I could've have misinterpreted or missed something. Also, I'm looking at the relevant parts in the "SQL:1999 - Understanding Relational Language Components" book. In the section about ORDER BY (in the context of cursors) it says:
which would seem to imply that names should be resolved against the SELECT list, not the source table. |
Ok. Sounds good to me. However we need to be aware of that Presto will behave differently than MYSQL and PSQL. I am going to fix this on top of #5302 . |
@martint I think I know why PSQL and MYSQL do not support this. I spent on this a week and this exploded in my hands. Here is my WIP code: https://github.com/Teradata/presto/tree/feature_gko_scopes_fix_order_by_bug To fix the bug I wanted ORDER BY to use query output relation. This did not work as ORDER BY can access source relation columns as well. So I added source relation columns (which were not overridden by output relation) as hidden columns to output relation.This change required a lot of changes of the parser and the planner. Then I reached the problem is with aggregation. Currently all aggregation functions are collected from different parts of the query: select list, having, group by, order by. These aggregation can only use source relation columns. With my changes I would need to handle aggregations from ORDER BY separately and when output relation columns is for aggregation raise an exception. This would make below query to fail:
Handling this requires more changes. I am not sure if it is worth doing this as it is only about ORDER BY clause. |
SQL standard specifies that ORDER BY expressions can reference projection outputs. For instance following query works: SELECT x as y FROM (VALUES(1) t(x)) ORDER BY y*-1; This commit adds support for referencing projection outputs from ORDER BY expressions. This fixes prestodb#4698
The following query produces an incorrect plan. It resolves
a
in the ORDER BY clause with respect to the FROM clause, but it should be using the output column from the SELECT clause, instead.This is the resulting plan:
From the SQL spec, section "6.6 ":
The text was updated successfully, but these errors were encountered: