Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect results for join condition against current master branch #4844

Closed
DDtKey opened this issue Jan 7, 2023 · 18 comments
Closed

Incorrect results for join condition against current master branch #4844

DDtKey opened this issue Jan 7, 2023 · 18 comments
Labels
bug Something isn't working

Comments

@DDtKey
Copy link
Contributor

DDtKey commented Jan 7, 2023

Describe the bug
Join condition(on with between) works incorrectly.
Looks like ignored and returned cartesian product.

It used to work for latest stable release (15.0.0 from crates.io)
But I tested it against current master branch, hash 3cc607de4ce6e9e1fd537091e471858c62f58653.

To Reproduce
Steps to reproduce the behavior:
students.csv:

name,mark
Stuart,28
Amina,89
Christen,50
Salma,77
Samantha,21

grades.csv:

grade,min,max
1,0,14
2,15,35
3,36,55
4,56,79
5,80,100
MRE:
use datafusion::prelude::{CsvReadOptions, SessionContext};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let students_path = "../students.csv";
    let grades_path = "../grades.csv";

    // Datafusion execution
    let ctx = SessionContext::new();

    ctx.register_csv("students", students_path, CsvReadOptions::default())
        .await?;

    ctx.register_csv("grades", grades_path, CsvReadOptions::default())
        .await?;

    let data_frame = ctx.sql("SELECT s.*, g.grade FROM students s join grades g on s.mark between g.min and g.max WHERE grade > 2 ORDER BY s.mark DESC").await?;

    data_frame.show().await?;

    Ok(())
}

It will return:

+----------+------+-------+
| name     | mark | grade |
+----------+------+-------+
| Amina    | 89   | 3     |
| Amina    | 89   | 4     |
| Amina    | 89   | 5     |
| Salma    | 77   | 3     |
| Salma    | 77   | 4     |
| Salma    | 77   | 5     |
| Christen | 50   | 3     |
| Christen | 50   | 4     |
| Christen | 50   | 5     |
| Stuart   | 28   | 3     |
| Stuart   | 28   | 4     |
| Stuart   | 28   | 5     |
| Samantha | 21   | 3     |
| Samantha | 21   | 4     |
| Samantha | 21   | 5     |
+----------+------+-------+

Expected behavior
It should be the same as for datafusion = "15.0.0":

+----------+------+-------+
| name     | mark | grade |
+----------+------+-------+
| Amina    | 89   | 5     |
| Salma    | 77   | 4     |
| Christen | 50   | 3     |
+----------+------+-------+
@DDtKey DDtKey added the bug Something isn't working label Jan 7, 2023
@DDtKey DDtKey changed the title Incorrect results for join broken in current master branch Incorrect results for join condition against current master branch Jan 7, 2023
@ozankabak
Copy link
Contributor

Wow, let's make sure to add some tests so regressions like this do not stealthily go through in the future 🤔

@Jefffrey
Copy link
Contributor

Jefffrey commented Jan 8, 2023

Looks to be regression introduced by fddb3d3 (#4562)

On the commit prior to it (2792113), I get this explain plan:

+------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type                                                  | plan                                                                                                                                                                    |
+------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| initial_logical_plan                                       | Sort: s.mark DESC NULLS FIRST                                                                                                                                           |
|                                                            |   Projection: s.name, s.mark, g.grade                                                                                                                                   |
|                                                            |     Filter: g.grade > Int64(2)                                                                                                                                          |
|                                                            |       Filter: s.mark BETWEEN g.min AND g.max                                                                                                                            |
|                                                            |         CrossJoin:                                                                                                                                                      |
|                                                            |           SubqueryAlias: s                                                                                                                                              |
|                                                            |             TableScan: students                                                                                                                                         |
|                                                            |           SubqueryAlias: g                                                                                                                                              |
|                                                            |             TableScan: grades                                                                                                                                           |
| logical_plan after inline_table_scan                       | SAME TEXT AS ABOVE                                                                                                                                                      |
| logical_plan after type_coercion                           | SAME TEXT AS ABOVE                                                                                                                                                      |
| logical_plan after simplify_expressions                    | Sort: s.mark DESC NULLS FIRST                                                                                                                                           |
|                                                            |   Projection: s.name, s.mark, g.grade                                                                                                                                   |
|                                                            |     Filter: g.grade > Int64(2)                                                                                                                                          |
|                                                            |       Filter: s.mark >= g.min AND s.mark <= g.max                                                                                                                       |
|                                                            |         CrossJoin:                                                                                                                                                      |
|                                                            |           SubqueryAlias: s                                                                                                                                              |
|                                                            |             TableScan: students                                                                                                                                         |
|                                                            |           SubqueryAlias: g                                                                                                                                              |
|                                                            |             TableScan: grades                                                                                                                                           |
...
| logical_plan after eliminate_cross_join                    | SAME TEXT AS ABOVE                                                                                                                                                      |
...
| logical_plan after push_down_filter                        | Sort: s.mark DESC NULLS FIRST                                                                                                                                           |
|                                                            |   Projection: s.name, s.mark, g.grade                                                                                                                                   |
|                                                            |     Filter: s.mark >= g.min AND s.mark <= g.max                                                                                                                         |
|                                                            |       CrossJoin:                                                                                                                                                        |
|                                                            |         SubqueryAlias: s                                                                                                                                                |
|                                                            |           TableScan: students                                                                                                                                           |
|                                                            |         SubqueryAlias: g                                                                                                                                                |
|                                                            |           Filter: grades.grade > Int64(2)                                                                                                                               |
|                                                            |             TableScan: grades, partial_filters=[grades.grade > Int64(2)]
...

And on commit fddb3d3 I get this plan instead:

+------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type                                                  | plan                                                                                                                                                       |
+------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| initial_logical_plan                                       | Sort: s.mark DESC NULLS FIRST                                                                                                                              |
|                                                            |   Projection: s.name, s.mark, g.grade                                                                                                                      |
|                                                            |     Filter: g.grade > Int64(2)                                                                                                                             |
|                                                            |       Inner Join:  Filter: s.mark BETWEEN g.min AND g.max                                                                                                  |
|                                                            |         SubqueryAlias: s                                                                                                                                   |
|                                                            |           TableScan: students                                                                                                                              |
|                                                            |         SubqueryAlias: g                                                                                                                                   |
|                                                            |           TableScan: grades                                                                                                                                |
| logical_plan after inline_table_scan                       | SAME TEXT AS ABOVE                                                                                                                                         |
| logical_plan after type_coercion                           | SAME TEXT AS ABOVE                                                                                                                                         |
| logical_plan after simplify_expressions                    | Sort: s.mark DESC NULLS FIRST                                                                                                                              |
|                                                            |   Projection: s.name, s.mark, g.grade                                                                                                                      |
|                                                            |     Filter: g.grade > Int64(2)                                                                                                                             |
|                                                            |       Inner Join:  Filter: s.mark >= g.min AND s.mark <= g.max AS s.mark BETWEEN g.min AND g.max                                                           |
|                                                            |         SubqueryAlias: s                                                                                                                                   |
|                                                            |           TableScan: students                                                                                                                              |
|                                                            |         SubqueryAlias: g                                                                                                                                   |
|                                                            |           TableScan: grades                                                                                                                                |
...
| logical_plan after eliminate_cross_join                    | Sort: s.mark DESC NULLS FIRST                                                                                                                              |
|                                                            |   Projection: s.name, s.mark, g.grade                                                                                                                      |
|                                                            |     Filter: g.grade > Int64(2)                                                                                                                             |
|                                                            |       CrossJoin:                                                                                                                                           |
|                                                            |         SubqueryAlias: s                                                                                                                                   |
|                                                            |           TableScan: students                                                                                                                              |
|                                                            |         SubqueryAlias: g                                                                                                                                   |
|                                                            |           TableScan: grades                                                                                                                                |
...
| logical_plan after push_down_filter                        | Sort: s.mark DESC NULLS FIRST                                                                                                                              |
|                                                            |   Projection: s.name, s.mark, g.grade                                                                                                                      |
|                                                            |     CrossJoin:                                                                                                                                             |
|                                                            |       SubqueryAlias: s                                                                                                                                     |
|                                                            |         TableScan: students                                                                                                                                |
|                                                            |       SubqueryAlias: g                                                                                                                                     |
|                                                            |         Filter: grades.grade > Int64(2)                                                                                                                    |
|                                                            |           TableScan: grades, partial_filters=[grades.grade > Int64(2)]

The actual regression seems to be caused by the SQL planner generating the initial logical plan with an Inner Join instead of a Cross Join, and this propagates down to cause the bug.

However it seems to highlight the actual flaw which is in eliminate_cross_join optimizer rule which converts the plan from:

Sort: s.mark DESC NULLS FIRST
  Projection: s.name, s.mark, g.grade
    Filter: g.grade > Int64(2)
      Inner Join:  Filter: s.mark >= g.min AND s.mark <= g.max AS s.mark BETWEEN g.min AND g.max
        SubqueryAlias: s
          TableScan: students
        SubqueryAlias: g
          TableScan: grades

to

Sort: s.mark DESC NULLS FIRST
  Projection: s.name, s.mark, g.grade
    Filter: g.grade > Int64(2)
      CrossJoin:
        SubqueryAlias: s
          TableScan: students
        SubqueryAlias: g
          TableScan: grades

Where it completely discards the Filter on the Inner Join when converting it to a Cross Join, causing the buggy behaviour

@ygf11
Copy link
Contributor

ygf11 commented Jan 8, 2023

It seems the EliminateCrossJoin rule only collect and consider the equijoin predicates when finding an appropriate right input of join. We can extend equijoin to all join predicate(equijoin and non-equi join), this issue will be fixed I think.

I will submit a pr later if others do not fix it.

@alamb
Copy link
Contributor

alamb commented Jan 8, 2023

cc @liukun4515

@alamb
Copy link
Contributor

alamb commented Jan 8, 2023

Definitely it would be great to get some test coverage for this case.

@ozankabak
Copy link
Contributor

ozankabak commented Jan 9, 2023

IMO it would be a good idea to fix this before releasing, this seems like a regression on fundamental functionality. Happy to help with reviewing the fix PR.

@alamb
Copy link
Contributor

alamb commented Jan 10, 2023

IMO it would be a good idea to fix this before releasing,

I agree -- I have mentioned it on the mailing list vote thread. @ygf11 can you work on a PR for this issue soon?

@ygf11
Copy link
Contributor

ygf11 commented Jan 10, 2023

I agree -- I have mentioned it on the mailing list vote thread. @ygf11 can you work on a PR for this issue soon?

Of course @alamb , I will submit a pr today or tomorrow.

@liukun4515
Copy link
Contributor

It seems the EliminateCrossJoin rule only collect and consider the equijoin predicates when finding an appropriate right input of join. We can extend equijoin to all join predicate(equijoin and non-equi join), this issue will be fixed I think.

I will submit a pr later if others do not fix it.

I think the plan should be the inner join not the cross join

@liukun4515
Copy link
Contributor

eliminate_cross_join

Yes, we can just fix this bug in the eliminate_cross_join rule.
cc @alamb

@jackwener
Copy link
Member

I debug it, it is because find join_key just consider equal instead of consider non-equal

@liukun4515
Copy link
Contributor

I agree -- I have mentioned it on the mailing list vote thread. @ygf11 can you work on a PR for this issue soon?

Of course @alamb , I will submit a pr today or tomorrow.

@ygf11 waiting for your PR

@ygf11
Copy link
Contributor

ygf11 commented Jan 10, 2023

I meet a strange problem when I do the pr #4866.

After I fix the test, the tcph-q11 will panic when run sqllogictests(the plan of q11 is modified), the direct panic message:

thread 'tokio-runtime-worker' panicked at 'partition not used yet', datafusion/core/src/physical_plan/repartition.rs:405:14 ... at tests/sqllogictests/test_files/tpch.slt:456.

I have no idea about it now, need to investigate more.
More info: https://github.com/apache/arrow-datafusion/actions/runs/3884323174/jobs/6626733651

To not block the release, I think maybe we can skip EliminateCrossJoin rule when it meet any inner-join input who has non-empty filter(this is consistent with datafusion 15).

With the patch, the sql of this issue will get the correct result.
For the feature of #4866, some query in datafusion 15 will get wrong result, and if we skip EliminateCrossJoin, datafusion 16(release soon) will get correct result, although the plan is not better optimized.

cc @alamb @liukun4515 @jackwener

@ygf11
Copy link
Contributor

ygf11 commented Jan 10, 2023

I create #4869 to show the patch.

@jackwener
Copy link
Member

jackwener commented Jan 10, 2023

This problem remind me the importance of #4615.

@andygrove
Copy link
Member

Should this issue be closed now that #4869 is merged?

@alamb
Copy link
Contributor

alamb commented Jan 11, 2023

Should this issue be closed now that #4869 is merged?

yes I think we should use this ticket to track the incorrect result. I will file another ticket to track the improvement proposed in #4866

@alamb
Copy link
Contributor

alamb commented Jan 11, 2023

Filed #4877 to track the work to properly support this case. Thank you all for your work to fix this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants