query result empty when a struct field name and a regular field name is same #8456

Closed
anlihust opened this issue Dec 7, 2023 · 3 comments · Fixed by #8848
Labels
bug Something isn't working

Comments

@anlihust

anlihust commented Dec 7, 2023

Describe the bug

When a field inside a struct has the same name as a regular (top-level) field, and the struct is declared before the regular field, using that field name in a where clause returns wrong results.

To Reproduce

Construct a parquet file and write 2 records as follows:

    +---------------------+----+--------+
    | struct              | id | name   |
    +---------------------+----+--------+
    | {id: 1, name: aaa1} | 1  | test01 |
    | {id: 2, name: aaa2} | 2  | test02 |
    +---------------------+----+--------+

Executing the SQL select * from base_table where name='test01' returns an empty result.
A demo that reproduces the bug is here:
https://github.com/anlihust/datafusion_demo/blob/main/src/main.rs

Expected behavior

Executing the SQL select * from base_table where name='test01' should return the test01 record:

    +---------------------+----+--------+
    | struct              | id | name   |
    +---------------------+----+--------+
    | {id: 1, name: aaa1} | 1  | test01 |
    +---------------------+----+--------+

Additional context

No response

@anlihust anlihust added the bug Something isn't working label Dec 7, 2023
@alamb
Contributor

alamb commented Dec 7, 2023

Thank you for the report @anlihust (and the reproducer).
This sounds similar to #8335, which we fixed recently -- maybe additional code somewhere incorrectly maps column names to parquet columns.

In particular, parquet_column needs to be used to find the correct column index in the file:


    /// Lookups up the parquet column by name
    ///
    /// Returns the parquet column index and the corresponding arrow field
    pub(crate) fn parquet_column<'a>(
        parquet_schema: &SchemaDescriptor,
        arrow_schema: &'a Schema,
        name: &str,
    ) -> Option<(usize, &'a FieldRef)> {
        let (root_idx, field) = arrow_schema.fields.find(name)?;
        if field.data_type().is_nested() {
            // Nested fields are not supported and require non-trivial logic
            // to correctly walk the parquet schema accounting for the
            // logical type rules - <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md>
            //
            // For example a ListArray could correspond to anything from 1 to 3 levels
            // in the parquet schema
            return None;
        }

        // This could be made more efficient (#TBD)
        let parquet_idx = (0..parquet_schema.columns().len())
            .find(|x| parquet_schema.get_column_root_idx(*x) == root_idx)?;
        Some((parquet_idx, field))
    }
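
To make the index mapping concrete for the table in this report, here is a minimal std-only sketch. It does not use the parquet or arrow crates: `Leaf`, `example_leaves`, `leaf_by_last_name` and `leaf_by_root` are hypothetical names invented for illustration. It assumes parquet flattens the struct into leaf columns `struct.id`, `struct.name`, `id`, `name`, and that a name-only lookup matches on the last path component, which is one plausible way the mapping could go wrong.

```rust
/// Hypothetical stand-in for one flattened parquet leaf column:
/// its path parts and the index of the top-level field it belongs to.
struct Leaf {
    path: Vec<&'static str>,
    root_idx: usize,
}

/// Leaf columns for the schema in this report: struct{id, name}, id, name.
fn example_leaves() -> Vec<Leaf> {
    vec![
        Leaf { path: vec!["struct", "id"], root_idx: 0 },
        Leaf { path: vec!["struct", "name"], root_idx: 0 },
        Leaf { path: vec!["id"], root_idx: 1 },
        Leaf { path: vec!["name"], root_idx: 2 },
    ]
}

/// Name-only lookup (assumed failure mode): returns the first leaf whose
/// last path component equals `name`, so a struct member can win.
fn leaf_by_last_name(leaves: &[Leaf], name: &str) -> Option<usize> {
    leaves
        .iter()
        .position(|l| l.path.last().map_or(false, |last| *last == name))
}

/// Root-based lookup in the spirit of parquet_column: resolve the top-level
/// field by name first, then find the first leaf with that root index.
fn leaf_by_root(top_level: &[&str], leaves: &[Leaf], name: &str) -> Option<usize> {
    let root_idx = top_level.iter().position(|f| *f == name)?;
    leaves.iter().position(|l| l.root_idx == root_idx)
}
```

With the top-level fields `["struct", "id", "name"]`, the name-only lookup for `name` lands on leaf 1 (`struct.name`), while the root-based lookup lands on leaf 3 (the regular `name` column).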

@manoj-inukolunu
Contributor

manoj-inukolunu commented Jan 13, 2024

Hello @alamb, I attempted a fix for this. row_group_metadata.columns() returns the fields inside structs as top-level columns, but with an additional path component. So the predicate can end up being applied to either the struct field or the actual top-level column. I have added a filter that only considers top-level columns rather than the ones inside structs.
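
The effect of such a filter can be sketched with a small std-only model. This is not the code from #8848: `ChunkMeta`, `find_any`, `find_top_level` and `may_contain` are hypothetical names, and the sketch assumes the empty result comes from min/max pruning being evaluated against the statistics of `struct.name` instead of `name`.

```rust
/// Hypothetical model of one column chunk's metadata within a row group.
struct ChunkMeta {
    path: Vec<&'static str>,
    min: &'static str,
    max: &'static str,
}

/// Column chunks for the two records in this report.
fn example_row_group() -> Vec<ChunkMeta> {
    vec![
        ChunkMeta { path: vec!["struct", "id"], min: "1", max: "2" },
        ChunkMeta { path: vec!["struct", "name"], min: "aaa1", max: "aaa2" },
        ChunkMeta { path: vec!["id"], min: "1", max: "2" },
        ChunkMeta { path: vec!["name"], min: "test01", max: "test02" },
    ]
}

/// Unfiltered lookup: first chunk whose last path part matches `name`.
fn find_any<'a>(chunks: &'a [ChunkMeta], name: &str) -> Option<&'a ChunkMeta> {
    chunks
        .iter()
        .find(|c| c.path.last().map_or(false, |p| *p == name))
}

/// Filtered lookup: only chunks with a single-part path (top-level columns).
fn find_top_level<'a>(chunks: &'a [ChunkMeta], name: &str) -> Option<&'a ChunkMeta> {
    chunks.iter().find(|c| c.path.len() == 1 && c.path[0] == name)
}

/// Whether `col = value` could match any row, judging by min/max alone.
fn may_contain(chunk: &ChunkMeta, value: &str) -> bool {
    chunk.min <= value && value <= chunk.max
}
```

In this model, the unfiltered lookup for `name` picks the `struct.name` chunk, whose range `aaa1..aaa2` excludes `test01`, so the row group would be skipped. The filtered lookup picks the top-level `name` chunk and keeps the row group.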

@alamb
Contributor

alamb commented Jan 14, 2024

Thanks @manoj-inukolunu -- I will check out #8848 shortly

3 participants