-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-11964: [Rust][DataFusion] Extend constant folding and parquet filtering support #9704
ARROW-11964: [Rust][DataFusion] Extend constant folding and parquet filtering support #9704
Conversation
_ => { | ||
debug!("Filter expression not supported in ParquetExec {:?}", expr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for adding some debug!
logs, I have been thinking how predicate push-down into parquet could be displayed into explain
output for easier diagnostics, but lately I have been preoccupied with rethinking of reading of arrow arrays from parquet files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is very useful! I found that the filter pushdown doesn't filter out anything on the TPC-H benchmark (even when adding some extra support), as the data is distributed so evenly, so every row group contains almost the complete range of values.
I think @andygrove was thinking about some more general support for keeping statistics like this, but some debug logging is something easy to add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very interested in your progress with Parquet to Arrow! I was hoping this PR would do something for the tpch benchmark, but seems the filter pushdown is not very useful in this case :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, filter push-down into parquet works best when the filter is over sorted data, if the data is randomly distributed across files it can't do much for performance as you have found
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Dandandan regarding progress on reading arrow arrays from parquet files, I have been able to make some progress and hope to be able to update the jira issue with my latest findings in the next day or two
@Dandandan I have mixed feelings about this change - I get the performance improvement side, but in my opinion it makes the explain output less readable by replacing the string to date cast with a date: |
The |
@@ -188,6 +188,97 @@ impl<'a> ExprRewriter for ConstantRewriter<'a> { | |||
right, | |||
}, | |||
}, | |||
Operator::Plus => match (left.as_ref(), right.as_ref()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
(Some(l), Some(r)) => { | ||
Expr::Literal(ScalarValue::Int32(Some(l + r))) | ||
} | ||
_ => Expr::Literal(ScalarValue::Int64(None)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth adding some kind of macro to simplify the maintenance of this section? I'm not sure if this is intentionally an Int64 or whether this is meant to be an Int32, and I wonder if we could have a macro that reduces these maths ops to just the ScalarValue variant and an operator (slight awkwardness: can't pass operators to macros).
The Apache Arrow Rust community is moving the Rust implementation into its own dedicated github repositories arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository Please see the mailing-list thread for more details We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs. |
@Dandandan , is this already moved to arrow-rs? |
Not yet, I will add some issues based on the PR and split this thing in two, I think this should be the way to go:
|
#10096 has removed the arrow implementation from this repository (it now resides in https://github.com/apache/arrow-rs and https://github.com/apache/arrow-datafusion) in the hopes of streamlining the development process Please re-target this PR (let us know if you need help doing so) to one/both of the new repositories. Thank you for understanding and helping to make arrow-rs and datafusion better |
This PR enables to use some more constant folding and adds support for
Between
in parquet filtering.Additionally, it adds some more debug logging.
Before:
After:
Todo: address/fix some tests