-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unexpected results with group by and random() #7876
Comments
After some initial debugging, I found that this issue is likely caused by the optimization rule Before optimization: Filter: r <= Float64(0.2)
Projection: aggregate_test_100.c1, random() AS r
Aggregate: groupBy=[[aggregate_test_100.c1]], aggr=[[]]
Projection: aggregate_test_100.c1
TableScan: aggregate_test_100 After: Projection: aggregate_test_100.c1, random() AS r
Aggregate: groupBy=[[aggregate_test_100.c1]], aggr=[[]]
Projection: aggregate_test_100.c1
Filter: random() <= Float64(0.2)
TableScan: aggregate_test_100, partial_filters=[random() <= Float64(0.2)] |
I found that even if we don't push down non-deterministic predicates, the results will also be wrong, select name, random() as r from test group by name having r > 0.2 after select name, random() as r from test group by name having random() > 0.2 |
🤔 I was thinking about this Would it be ok to push It seems like these two queries would be equivalent Non-pushdown
Pushdown
However, We definitely shouldn't push non deterministic ( |
When I use a subquery in place of the HAVING clause to meet the requirements in the ticket, like below select t.c1, t.r from (select c1, random() as r from aggregate_test_100 group by c1) as t where t.r > 0.8 But I get the wrong result, because the So should we consider prohibiting select name, random() as r from test group by name having r > 0.2 and never push down non deterministic ( select t.c1, t.r from (select c1, random() as r from aggregate_test_100 group by c1) as t where t.r > 0.8 @alamb how do you think, If you don't mind, I will try to fix it. |
I think we can start with the projection plan first because:
We can carefully consider other logical plans afterward. |
This may be a dupe / related to #7976 |
I think we have fixed this issue in subsequent releases, so closing this ticket. Let's reopen / file a new ticket if we find something still is not working |
Describe the bug
I have table
t1
with a column calledfile_path
I want to obtain a list of file_paths where each element is unique and then take a random subset of those columns.
I thought that this could be achieved with the following code.
However in the output of my query I see the following entries which contains a record that should be filtered out.
This is the calculated logical plan
In this case I would expect the filter to occur after the aggregate operation not before.
To Reproduce
No response
Expected behavior
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: