-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Find the correct fields when using page filter on struct
fields in parquet
#8848
Conversation
@@ -280,7 +280,7 @@ fn find_column_index( | |||
.columns() | |||
.iter() | |||
.enumerate() | |||
.find(|(_idx, c)| c.column_descr().name() == column.name()) | |||
.find(|(_idx, c)| c.column_descr().name() == column.name() && c.column_descr().path().parts().len() == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps you can use the parquet_column
function
Which is how the row group metadata fetches the relevant metadata
https://github.com/apache/arrow-datafusion/blob/eb81ea299aa7e121bbe244e7e1ab56513d4ef800/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L252-L255
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb , I will check it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update this PR @alamb . Please check. Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @manoj-inukolunu thank you for this contribution
There appears to be a CI failure as well
https://github.com/apache/arrow-datafusion/actions/runs/7510755702/job/20489197785?pr=8848
struct
fields in parquet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @manoj-inukolunu -- this looks great
I have some style suggestions in this proposal which targets your branch: manoj-inukolunu#1
However they aren't required -- if you would prefer I think we could just update the docstrings for find_column_name
instead
Thanks again 🙏
let mut parquet_col = None; | ||
if name.is_some() { | ||
parquet_col = | ||
parquet_column(parquet_schema, arrow_schema, name.unwrap().as_str()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a more rust ideomatic way of doing this pattern that avoids the unwrap and mut is something like
let mut parquet_col = None; | |
if name.is_some() { | |
parquet_col = | |
parquet_column(parquet_schema, arrow_schema, name.unwrap().as_str()); | |
} | |
let parquet_col = name | |
.and_then(|name| { | |
parquet_column(parquet_schema, arrow_schema, name.as_str()); | |
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aah thank you , i was searching for something like this. I will add this to my toolbox.
@@ -2136,4 +2144,65 @@ mod tests { | |||
let execution_props = ExecutionProps::new(); | |||
create_physical_expr(expr, &df_schema, &execution_props).unwrap() | |||
} | |||
|
|||
#[tokio::test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified that this test fails without the code in this PR 👍
thread 'datasource::physical_plan::parquet::tests::test_struct_filter_parquet' panicked at datafusion/core/src/datasource/physical_plan/parquet/mod.rs:2156:9:
assertion `left == right` failed
left: 0
right: 1
stack backtrace:
datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs
Outdated
Show resolved
Hide resolved
Alamb/improve filter
I took the liberty of merging this branch with main and running |
Thanks again @manoj-inukolunu |
Which issue does this PR close?
Closes #8456
Rationale for this change
What changes are included in this PR?
The issue is in this line
c.column_descr().name() == column.name()
infind_column_index
datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs
. When struct fields have the same name as the column names that condition is returning the struct field instead of the column field.Are these changes tested?
Yes I have added a test which is taken from the reproduction code.
Are there any user-facing changes?
No