Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL statement (UNION + EXCEPT) causes panic #4837

Closed
DDtKey opened this issue Jan 7, 2023 · 14 comments · Fixed by #4862
Closed

SQL statement (UNION + EXCEPT) causes panic #4837

DDtKey opened this issue Jan 7, 2023 · 14 comments · Fixed by #4862
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@DDtKey
Copy link
Contributor

DDtKey commented Jan 7, 2023

Describe the bug
datafusion 15.0.0 panics for specific SQL-query:

(   
    SELECT * FROM table_1
    EXCEPT
    SELECT * FROM table_2
)  
UNION ALL
(   
    SELECT * FROM table_2
    EXCEPT
    SELECT * FROM table_1
) 

To Reproduce
Steps to reproduce the behavior:

    let table_1_path = "...";
    let table_2_path = "...";

    let ctx = SessionContext::new();

    ctx.register_csv("table_1", table_1_path, CsvReadOptions::default())
        .await?;
    ctx.register_csv("table_2", table_2_path, CsvReadOptions::default())
        .await?;

    let data_frame = ctx.sql("(SELECT * FROM table_1 EXCEPT SELECT * FROM table_2) UNION ALL (SELECT * FROM table_2 EXCEPT SELECT * FROM table_1)").await?;

    data_frame.show().await?;

It will panic:
thread 'main' panicked at 'index out of bounds: the len is 2 but the index is 2' (len & index depends on number of columns in file)

Expected behavior
It should works correctly or return an error, but not to panic.

@DDtKey DDtKey added the bug Something isn't working label Jan 7, 2023
@alamb
Copy link
Contributor

alamb commented Jan 7, 2023

Thank you for the report @DDtKey

@alamb alamb added the good first issue Good for newcomers label Jan 7, 2023
@Jefffrey
Copy link
Contributor

Jefffrey commented Jan 7, 2023

@DDtKey are you able to provide the files you used to cause the bug, or at least give an example of their structure/data? I don't seem to be able to reproduce the panic myself

@DDtKey
Copy link
Contributor Author

DDtKey commented Jan 7, 2023

@DDtKey are you able to provide the files you used to cause the bug, or at least give an example of their structure/data? I don't seem to be able to reproduce the panic myself

It seems to be reproducible for any file structure 🤔

But for example, very simple ones:

table_1:

name
Alex
Bob
Alice

and

table_2:

name
Alex
Bob
John

And at least for datafusion 15 I wasn't able to run such query without panic.

@Jefffrey
Copy link
Contributor

Jefffrey commented Jan 7, 2023

Hmm, still can't reproduce, on latest master 83c1026:

jeffrey:~/Code/arrow-datafusion/datafusion-cli$ cat data1.csv
name
Alex
Bob
Alice
jeffrey:~/Code/arrow-datafusion/datafusion-cli$ cat data2.csv
name
Alex
Bob
John
jeffrey:~/Code/arrow-datafusion/datafusion-cli$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.10s
     Running `/media/jeffrey/1tb_860evo_ssd/.cargo_target_cache/debug/datafusion-cli`
DataFusion CLI v15.0.0
❯ CREATE EXTERNAL TABLE foo1 STORED AS CSV WITH HEADER ROW LOCATION 'data1.csv';
0 rows in set. Query took 0.013 seconds.
❯ CREATE EXTERNAL TABLE foo2 STORED AS CSV WITH HEADER ROW LOCATION 'data2.csv';
0 rows in set. Query took 0.002 seconds.
❯ (select * from foo1 except select * from foo2) union all (select * from foo2 except select * from foo1);
+-------+
| name  |
+-------+
| Alice |
| John  |
+-------+
2 rows in set. Query took 0.012 seconds.
❯ \q
jeffrey:~/Code/arrow-datafusion/datafusion-cli$ git rev-parse --short HEAD
83c10269

Or should I try for different file formats too?

@DDtKey
Copy link
Contributor Author

DDtKey commented Jan 7, 2023

@Jefffrey oh, looks like it isn't reproducible against main branchq(but for last stable release) for me as well. Thank you.

But results isn't correct for me 🤔

data_frame.show() output:

+-------+
| name  |
+-------+
| John  |
| Alice |
+-------+
+-------+
| name  |
+-------+
| Alice |
| John  |
+-------+
+-------+
| name  |
+-------+
| Alice |
| John  |
+-------+

But if I would use

data_frame.write_csv("...").await?;

It will lead to an error:
{ code: 17, kind: AlreadyExists, message: "File exists" }

Any thoughts?

@Jefffrey
Copy link
Contributor

Jefffrey commented Jan 7, 2023

The output seems correct to me from my understanding, unless you're referring to the inconsistent output row ordering?

The error seems straightforward enough, where it doesn't seem to override the existing output file?

@DDtKey
Copy link
Contributor Author

DDtKey commented Jan 7, 2023

The output seems correct to me from my understanding, unless you're referring to the inconsistent output row ordering?

The error seems straightforward enough, where it doesn't seem to override the existing output file?

My bad, sorry. Accidentally I've ran 3 parallel processes, that's the reason of 3 outputs & file already exists error.

@DDtKey
Copy link
Contributor Author

DDtKey commented Jan 7, 2023

So, with bisect I was able to figure out that it was fixed by #4666 (ac2e5d1).

Thanks for help @Jefffrey!
Shouldn't such case be a part of tests?

@DDtKey
Copy link
Contributor Author

DDtKey commented Jan 7, 2023

After some tests against current master branch I was able to discover a new bug(broken behavior), see new issue: #4844

@Jefffrey
Copy link
Contributor

Jefffrey commented Jan 7, 2023

Shouldn't such case be a part of tests?

That would be a good idea to have more test coverage, though I confess I'm not entirely certain where such a test would be located, maybe in the sqllogictests?

Relevant epic #4460

@alamb
Copy link
Contributor

alamb commented Jan 8, 2023

Thanks @DDtKey -- indeed if you have a test that fails before #4666 it would be great to add it to the test suite

@alamb
Copy link
Contributor

alamb commented Jan 8, 2023

cc @ygf11

@ygf11
Copy link
Contributor

ygf11 commented Jan 9, 2023

I find the bug is in union type coercion(#3513), and the bug still exists.
We can reproduce it in the master branch:

❯ create table table_2(name text, id INT) as  values('Alex',1);
0 rows in set. Query took 0.002 seconds.
❯ create table table_1(name text, id TINYINT) as  values('Alex',1);
0 rows in set. Query took 0.002 seconds.
❯ (
    SELECT * FROM table_1
    EXCEPT
    SELECT * FROM table_2
)
UNION ALL
(
    SELECT * FROM table_2
    EXCEPT
    SELECT * FROM table_1
);
SchemaError(FieldNotFound { field: Column { relation: Some("table_2"), name: "id" }, valid_fields: Some([Column { relation: Some("table_1"), name: "name" }, Column { relation: Some("table_1"), name: "id" }]) })

For union operation, we need ensure each data type of left and right should be same.
It is done in:
https://github.com/apache/arrow-datafusion/blob/71b9baecd0a3c881f96e9994d922f3c1b3d61854/datafusion/expr/src/expr_rewriter.rs#L523-L527

But it uses plan.expressions() to get the type coercion units of the input, which I think is not correct, because it maybe return other expressions, like join will return its predicates not the fields of the schema.

To fix this issue, I think we can abandon plan.expressions(), and use the input schema to enumerate the fields, finally create the new plan with Projection::try_new_with_schema.

@liukun4515
Copy link
Contributor

I find the bug is in union type coercion(#3513), and the bug still exists. We can reproduce it in the master branch:

❯ create table table_2(name text, id INT) as  values('Alex',1);
0 rows in set. Query took 0.002 seconds.
❯ create table table_1(name text, id TINYINT) as  values('Alex',1);
0 rows in set. Query took 0.002 seconds.
❯ (
    SELECT * FROM table_1
    EXCEPT
    SELECT * FROM table_2
)
UNION ALL
(
    SELECT * FROM table_2
    EXCEPT
    SELECT * FROM table_1
);
SchemaError(FieldNotFound { field: Column { relation: Some("table_2"), name: "id" }, valid_fields: Some([Column { relation: Some("table_1"), name: "name" }, Column { relation: Some("table_1"), name: "id" }]) })

For union operation, we need ensure each data type of left and right should be same. It is done in:

https://github.com/apache/arrow-datafusion/blob/71b9baecd0a3c881f96e9994d922f3c1b3d61854/datafusion/expr/src/expr_rewriter.rs#L523-L527

But it uses plan.expressions() to get the type coercion units of the input, which I think is not correct, because it maybe return other expressions, like join will return its predicates not the fields of the schema.

To fix this issue, I think we can abandon plan.expressions(), and use the input schema to enumerate the fields, finally create the new plan with Projection::try_new_with_schema.

😭, I think this is the only solution to resolve union plan for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants