Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid plan for ORDER BY expression that refers to output column #4698

Closed
martint opened this issue Mar 3, 2016 · 13 comments
Closed

Invalid plan for ORDER BY expression that refers to output column #4698

martint opened this issue Mar 3, 2016 · 13 comments
Assignees
Labels
Milestone

Comments

@martint
Copy link
Contributor

martint commented Mar 3, 2016

The following query produces an incorrect plan. It resolves a in the ORDER BY clause with respect to the FROM clause, but it should be using the output column from the SELECT clause, instead.

SELECT 10/a AS a 
FROM (VALUES 1,2,3) t(a) 
ORDER BY 10/a

This is the resulting plan:

presto> explain select 10/a AS a FROM (VALUES 1,2,3) t(a) order by 10/a;
                    Query Plan                    
--------------------------------------------------
 - Output[a] => [expr:bigint]                     
         a := expr                                
     - Sort[expr ASC_NULLS_LAST] => [expr:bigint] 
         - Project => [expr:bigint]               
                 expr := (10 / "field")           
             - Values => [field:bigint]           
                     (1)                          
                     (2)                          
                     (3)

From the SQL spec, section "6.6 ":

i) If IC [identifier chain] is contained in an <order by clause> simply contained in a <query expression> QE, and the <select list> simply contained in QE directly contains a <derived column> DC whose explicit or implicit <column name> is equivalent to IC, then PIC1 [partial identifier chain] is a candidate basis, the scope of PIC1 is QE, and the referent of PIC1 is the column referenced by DC.

@martint martint added the bug label Mar 3, 2016
martint referenced this issue in Teradata/presto Mar 3, 2016
Until now all the subqueries were converted (to some form of join)
before applying filtering, aggregation, window or order by
clauses, independently on where the subquery was placed.

To enable subqueries for queries with aggregation above behaviour had to
be changed. Depending on where the subquery is placed, it had to be
applied at different processing moment in logical planner.

Subquery in filter and group by clause is applied before aggregation.
Subquery in having, project and order by clause is applied after aggregation.

Hovewer if the same subquery expression is used in project, group by
and/or having clause, then subquery have to be applied before
aggregation and places where the subquery was used are replaced by the
symbol of the subquery application.
@kbajda
Copy link

kbajda commented Mar 3, 2016

Btw. this is not directly related but we ran into this several time when porting SQL from other DB engines to Presto:

SELECT a*10 AS b 
FROM (VALUES 1,2,3) t(a) 
WHERE b>15;

Query 20160303_152228_00012_su828 failed: line 1:49: Column '"b"' cannot be resolved
SELECT a*10 AS b FROM (VALUES 1,2,3) t(a) WHERE b>15

It looks like you expect ORDER BY to handle aliases defined in SELECT but WHERE does not even see them.

@martint
Copy link
Contributor Author

martint commented Mar 3, 2016

It looks like you expect ORDER BY to handle aliases defined in SELECT but WHERE does not even see them.

Yes, that's per the ANSI spec. The scope of the ORDER BY clause is the columns defined in the SELECT clause, but the scope of the WHERE clause is the columns derived from the FROM clause.

If the WHERE clause could refer to columns in the SELECT clause, the behavior of the following query would be undefined:

SELECT row_number() OVER () rn
FROM t
WHERE rn > 5

@kbajda
Copy link

kbajda commented Mar 3, 2016

That's fair. Obviously you can use a subquery to rename columns and use them later in WHERE. We just observed so many SQL engines (including commercial ones) leaning towards user convenience. I bet they just throw a parsing error when it is unclear/undefined as in your example.

@kokosing
Copy link
Contributor

kokosing commented Mar 4, 2016

@martint I would like to fix this as it is related to correlated subqueries work (see https://docs.google.com/document/d/18HN7peS2eR8lZsErqcmnoWyMEPb6p4OQeidH1JP_EkA/edit#heading=h.sjurgitot2kp).

@martint
Copy link
Contributor Author

martint commented Mar 15, 2016

@kokosing go for it!

@kokosing
Copy link
Contributor

Strange thing. I checked the PSQL and it looks they have the same bug. I have run the following query:

SELECT a * -1 as a, b 
FROM (VALUES (-1,4), (0, 3), (1, 2)) t(a,b) 
ORDER BY a * -1 
LIMIT 1;

and it returned -1, 2. Which means that multiplication by -1 was applied only once on original column t.a.

Did I (we?) miss something, or PSQL has a bug which seems unrealistic to me.

@kokosing
Copy link
Contributor

I am getting lost.

I have checked following queries in PSQL:

SELECT a * -1 as a, b 
FROM (VALUES (-1,4), (0, 3), (1, 2)) t(a,b) 
ORDER BY a 
LIMIT 1;

returns -1, -2, while query

SELECT a * -1 as a, b 
FROM (VALUES (-1,4), (0, 3), (1, 2)) t(a,b) 
ORDER BY a * 1
LIMIT 1;

returns 1, 4.

I checked the explain and it looks like when expression in ORDER BY contains only output column alias (a) then output column is used, but when a is used in an expression then table column is used. I will check that in MYSQL.

I am not a native English speaker so I could misunderstood the section copied from SQL specification. But from what I understood ORDER BY expression can access columns returned by SELECT list (projection), that would mean that output column should be used all the time for above queries.

@kokosing
Copy link
Contributor

@martint The same thing with MYSQL. So it looks like your query was processed properly (at least according to MYSQL and PSQL), that would mean that is not a bug, that is a feature ;)

@kokosing
Copy link
Contributor

In case I name output column uniquely then:

  • MYSQL works as I would expect. It is using output column reference in all the ORDER BY expressions
  • PGSQL is able to run ORDER BY when there is only a reference to output column. When output column is used in expression (like c * -1) then it raises that column c is unrecognized.

In regards what can do MYSQL and PGSQL they are consistent to each other.

I wonder which path we should take. Should we follow MYSQL or SQL spec?

@kokosing
Copy link
Contributor

@martint ping. Can you please comment what you think is expected behavior for presto in regards of this bug, considering MYSQL and PSQL behavior?

@kokosing kokosing assigned martint and unassigned kokosing May 25, 2016
@martint
Copy link
Contributor Author

martint commented May 25, 2016

I'm re-reading that part of the spec. I could've have misinterpreted or missed something.

Also, I'm looking at the relevant parts in the "SQL:1999 - Understanding Relational Language Components" book. In the section about ORDER BY (in the context of cursors) it says:

[...] if your ORDER BY clauses use expressions -- even if the expressions reference only those columns that otherwise appear in the select list [...]

which would seem to imply that names should be resolved against the SELECT list, not the source table.

@kokosing
Copy link
Contributor

Ok. Sounds good to me. However we need to be aware of that Presto will behave differently than MYSQL and PSQL. I am going to fix this on top of #5302 .

kokosing added a commit to Teradata/presto that referenced this issue Jun 6, 2016
@kokosing
Copy link
Contributor

kokosing commented Jun 6, 2016

@martint I think I know why PSQL and MYSQL do not support this. I spent on this a week and this exploded in my hands. Here is my WIP code: https://github.com/Teradata/presto/tree/feature_gko_scopes_fix_order_by_bug

To fix the bug I wanted ORDER BY to use query output relation. This did not work as ORDER BY can access source relation columns as well. So I added source relation columns (which were not overridden by output relation) as hidden columns to output relation.This change required a lot of changes of the parser and the planner. Then I reached the problem is with aggregation. Currently all aggregation functions are collected from different parts of the query: select list, having, group by, order by. These aggregation can only use source relation columns. With my changes I would need to handle aggregations from ORDER BY separately and when output relation columns is for aggregation raise an exception. This would make below query to fail:

SELECT max(a) AS a FROM t ORDER BY min(a);

Handling this requires more changes. I am not sure if it is worth doing this as it is only about ORDER BY clause.

sopel39 added a commit to Teradata/presto that referenced this issue Dec 12, 2016
SQL standard specifies that ORDER BY expressions can reference
projection outputs. For instance following query works:
SELECT x as y FROM (VALUES(1) t(x)) ORDER BY y*-1;

This commit adds support for referencing projection outputs
from ORDER BY expressions.

This fixes prestodb#4698
@martint martint added this to the 0.161 milestone Dec 12, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants