You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Describe the bug
Background
Select with wildcard over a USING/NATURAL JOIN should deduplicate join columns.
For example:
with t as(select 1 as a, 2 as b) select * from t t1 join t t2 using(a)
This query above should output the column 'a' only once.
LogicalPlan::using_columns() is used to find these join columns and to help exclude duplicated columns when expanding wildcards.
Problem using_columns() works by traversing the plan tree. This manner might be unsafe as it could incorrectly find columns that are not relevant to the current SQL context. This may lead to some output columns being incorrectly excluded.
For example, the result of the query below is different from other databases.
create table t(a int);
insert into t values(1),(2),(3);
select * from (select t.a+2 as a from t join t t2 using(a)) as t2;
To Reproduce
Run query in CLI (compiled from the latest main: apache@722307f)
create table t(a int);
insert into t values(1),(2),(3);
select * from (select t.a+2 as a from t join t t2 using(a)) as t2;
0 row(s) fetched.
Elapsed 0.008 seconds.
++
++
++
3 row(s) fetched.
Elapsed 0.013 seconds.
It outputs no columns.
Expected behavior
In PostgreSQL it does output one column.
psql (16.6 (Ubuntu 16.6-0ubuntu0.24.04.1))
Type "help" for help.
psql=> create table t(a int);
insert into t values(1),(2),(3);
select * from (select t.a+2 as a from t join t t2 using(a)) as t2;
CREATE TABLE
INSERT 0 3
a
3
4
5
(3 rows)
Additional context
No response
To Reproduce
No response
Expected behavior
No response
Additional context
No response
The text was updated successfully, but these errors were encountered:
To address the issue with LogicalPlan::using_columns() incorrectly identifying irrelevant columns, you might consider the following approaches:
Refine Column Identification Logic: Modify the logic within using_columns() to ensure it only considers columns relevant to the current SQL context. This could involve more precise checks on the context in which columns are used, possibly by maintaining additional metadata about the query structure.
Context-Aware Traversal: Implement a traversal method that is aware of the context of each node in the logical plan. This could help in distinguishing between columns that are genuinely part of the join condition and those that are not.
Comparison with Other Databases: Analyze how databases like PostgreSQL handle similar queries. This might provide insights into alternative methods for deduplicating columns in USING or NATURAL JOIN scenarios.
Testing and Validation: Develop comprehensive test cases that cover various join scenarios, including edge cases. This will help ensure that any changes made to the using_columns() function produce the expected results across different query types.
Alternative Data Structures: Consider using alternative data structures that might better capture the relationships between columns in joins, potentially reducing the risk of incorrectly excluding relevant columns.
These approaches aim to make the identification of join columns more robust and aligned with expected SQL behavior, particularly in comparison to other databases like PostgreSQL.
Describe the bug
Describe the bug
Background
Select with wildcard over a USING/NATURAL JOIN should deduplicate join columns.
For example:
with t as(select 1 as a, 2 as b) select * from t t1 join t t2 using(a)
This query above should output the column 'a' only once.
LogicalPlan::using_columns() is used to find these join columns and to help exclude duplicated columns when expanding wildcards.
Problem
using_columns() works by traversing the plan tree. This manner might be unsafe as it could incorrectly find columns that are not relevant to the current SQL context. This may lead to some output columns being incorrectly excluded.
For example, the result of the query below is different from other databases.
create table t(a int);
insert into t values(1),(2),(3);
select * from (select t.a+2 as a from t join t t2 using(a)) as t2;
To Reproduce
Run query in CLI (compiled from the latest main: apache@722307f)
+-------+
| count |
+-------+
| 3 |
+-------+
1 row(s) fetched.
Elapsed 0.015 seconds.
++
++
++
3 row(s) fetched.
Elapsed 0.013 seconds.
It outputs no columns.
Expected behavior
In PostgreSQL it does output one column.
psql (16.6 (Ubuntu 16.6-0ubuntu0.24.04.1))
Type "help" for help.
psql=> create table t(a int);
insert into t values(1),(2),(3);
select * from (select t.a+2 as a from t join t t2 using(a)) as t2;
CREATE TABLE
INSERT 0 3
a
3
4
5
(3 rows)
Additional context
No response
To Reproduce
No response
Expected behavior
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: