Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Statistics, introduce precision estimates (Exact, Inexact, Absent) #7793

Merged
merged 69 commits into from
Oct 17, 2023
Merged

Refactor Statistics, introduce precision estimates (Exact, Inexact, Absent) #7793

merged 69 commits into from
Oct 17, 2023

Conversation

berkaysynnada
Copy link
Contributor

@berkaysynnada berkaysynnada commented Oct 11, 2023

Which issue does this PR close?

Closes #7553.

Rationale for this change

This PR is a comprehensive refactor of the statistics() method of execs. The main goal is to enhance DataFusion by introducing an enumeration called "Precision" and revealing its benefit by inserting it into Statistics parameters.

This enumeration categorizes the level of information accuracy throughout the code. We're using it to categorize Statistics parameters more precisely and consistently across the codebase. Regarding this, a few algorithms have been enhanced as we have more detailed information now.

The starting point of this update is based on the problem in the issue that will be closed. As mentioned in the issue, statistics function is causing an inefficiency in FilterExec, and the error propagation is broken. That is fixed, and the cases in which errors can arise are restricted and clearly defined.

What changes are included in this PR?

  1. statistics() method of exec's now returns a Result.
  2. The option values of statistics parameters have been simplified, and the exactness of each field is now tracked separately.
  3. Some optimizations on aggregations have been modified. Each aggregate now takes into consideration the corresponding statistics' exactness information, which advances capabilities of optimizations.
  4. Statistical calculations such as join output statistics are handled more wisely.

Are these changes tested?

Yes. Existing tests are adapted to Precision. New tests are added for the new functions.

Are there any user-facing changes?

@crepererum
Copy link
Contributor

I like the idea of "precision", however I have a few concerns regarding the current implementation:

  • inexact: What does "inexact" provide if you don't have ANY bounds whatsoever? I think this only makes sense if you can specify if the "inexact" guess is a at least also an upper/lower bound of the possible values. E.g. "row counts" can be an inexact upper bound if you have a filter but you just pass through the counts of the infiltered data. So I propose
  • constraints: If you have min+max value, I think you rarely want a totally random guess but they probably should be ends of a range and constraint each other. Also see point above.

With that, I propose to change the Precision enum to:

  • Exact: speaks for itself, might be rare for floats though
  • UpperBound: The actual value is <= this given bound.
  • LowerBound: The actual value is >= this given bound.
  • Absent: speaks for itself

For min/max value this probably leads to min=LowerBound(...) and max=UpperBound(...) in many cases which also nicely fits into the interval analysis that also exists in the DF code base already.

A more drastic (but maybe better?) alternative is to keep Inexact but instead of filling it with T, fill it with Interval. This provides you bounds and we already have a bunch of code around that type.

@ozankabak
Copy link
Contributor

ozankabak commented Oct 16, 2023

Hi @crepererum, thanks for taking a look. We actually had similar thoughts and did consider whether we can use Intervals in a similar fashion. We ended up using intervals whenever we can, but there are cases where something like Precision in its current form remains necessary. Here is what we do as of this PR:

  • Whenever we can elicit lower/upper bounds for a quantity, we use Interval. No need for another type -- a singleton interval already denotes exact information and we can represent the absence of any information with a universal interval (an interval containing all values) too.
  • There are cases where we only get an inexact/approximate point estimate, and we have no information on bounds. In these cases, a type like Precision is necessary. There are various ways this can arise, but one good example is as follows: Let's say I have two zero-mean quantities A and B. I know the value of A exactly (say 10), and I don't have any information on B. When using intervals, the sum A + B ends up in the no information state, which loses information. In such cases, having a value of Inexact(10) works better for many purposes.

So basically Precision::Inexact is an arbitrarily wide Interval whose midpoint/mean value information is carried over throughout calculations, and we only use this construct in contexts where lower/upper bounds are unavailable.

@crepererum
Copy link
Contributor

So in which case is an "arbitrarily wide Interval" better than "absent"?

@ozankabak
Copy link
Contributor

Maybe an example involving selectivity will help: We often end up with an approximate point-wise estimate of selectivity that we get from input/output rows, but we have no rigorous bounds on it. The "inexact" mechanism comes into play when we use such a selectivity in any downstream calculations.

1) get_statistics_with_limit() is simplified
2) AnalysisContext { boundaries } does not need to be an Option, it is removed.
3) Aggregation statistics() has a special handling for 0 and 1 row cases.
@crepererum
Copy link
Contributor

Which selectivity estimator provides an inexact point instead of a lower/upper bound?

@berkaysynnada
Copy link
Contributor Author

Which selectivity estimator provides an inexact point instead of a lower/upper bound?

I can give an example of how FilterExec::statistics() currently works on a table having one column.

Let's assume that the number of rows from input statistics() is exactly 1000, and the range of the column values is [1,100]. The predicate is column_a>90.

After the analysis, the new range of the column would become [91,100], which estimates the selectivity as 0.1 (intervals are assumed to be uniform, they cannot be treated any differently yet). https://github.com/apache/arrow-datafusion/blob/fa2bb6c4e80d80e3d1d26ce85f7c21232f036dd5/datafusion/physical-expr/src/analysis.rs#L218-L223

Now, when we update the number of rows, we use that selectivity value and find the new row count as 1000*0.1=100.
https://github.com/apache/arrow-datafusion/blob/fa2bb6c4e80d80e3d1d26ce85f7c21232f036dd5/datafusion/physical-plan/src/filter.rs#L233-L235
These lines become:

https://github.com/synnada-ai/arrow-datafusion/blob/23c564b093f8a60b95bbf9f774a1afe9b2908bc7/datafusion/physical-plan/src/filter.rs#L221-L224

Though the input row count is exact, we cannot take that exactness information further. At that point, how can we define a range for 100 rows? That is why we needed such Inexact implementation.

@crepererum
Copy link
Contributor

@berkaysynnada thank you for the nice example. W/o introducing even more complex statistics into DF, I don't see how to really express that estimate and I think the "inexact" precision proposed in this PR is a reasonable solution until we find something better ✔️

@alamb alamb changed the title Refactor Statistics Refactor Statistics, introduce precision estimates (Exact, Inexact, Absent) Oct 16, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you again @berkaysynnada -- this is really nice work, both codewise as well as implementation-wise.

Thanks also to everyone who helped review the code

datafusion/common/src/stats.rs Show resolved Hide resolved
datafusion/core/src/datasource/statistics.rs Outdated Show resolved Hide resolved
})
}
_ => {
// When the input row count is 0 or 1, we can adopt that statistic keeping its reliability.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems related / may conflict with #7832 from @mustafasrepo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have discussed that with @mustafasrepo, and his solution is temporary. This change should serve as the final version after merging the PRs.

datafusion/physical-plan/src/filter.rs Show resolved Hide resolved
datafusion/core/src/datasource/statistics.rs Outdated Show resolved Hide resolved
datafusion/physical-plan/src/lib.rs Outdated Show resolved Hide resolved
@ozankabak
Copy link
Contributor

Great, thanks everyone for the reviews! We will wait for a little bit more and merge this tomorrow if all still looks good

@mustafasrepo mustafasrepo merged commit c9330bc into apache:main Oct 17, 2023
@alamb
Copy link
Contributor

alamb commented Oct 17, 2023

Epic!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize FilterExec::statistics / don't ignore errors
6 participants