-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Queries that lost statistics or their statistics become inexact #8099
Comments
Update: From recent PR that adds column statistics into the explain, statistics of columns are absent. |
I used the change in this PR #8112 to see stats of columns.
❯ set datafusion.explain.show_statistics = true;
0 rows in set. Query took 0.009 seconds.
❯ set datafusion.execution.collect_statistics = true;
0 rows in set. Query took 0.001 seconds.
❯
create table t1(state string, city string, min_temp float, area int, time timestamp) as values
('MA', 'Boston', 70.4, 1, 50),
('MA', 'Bedford', 71.59, 2, 150),
('CA', 'SF', 79.0, 1, 300),
('MA', 'Boston', 75.4, 3, 250),
('MA', 'Andover', 69.5, 4, 250),
('MA', 'Bedford', 78.2, 2, 150),
('MA', 'Boston', 65.0, 2, 250),
('CA', 'SJ', 78.4, 1, 300),
('MA', 'Reading', 53.0, 4, 250),
('CA', 'SJ', 75.4, 5, 350);
0 rows in set. Query took 0.013 seconds.
-- `Exact` table-level stats. Columns stats only include `Null` stats and it is `Exact`
❯ explain select * from t1;
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | TableScan: t1 projection=[state, city, min_temp, area, time] |
| physical_plan | MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] |
| | |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.004 seconds.
-- SortExec keeps stats the same
❯ explain select * from t1 order by time DESC;
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Sort: t1.time DESC NULLS FIRST |
| | TableScan: t1 projection=[state, city, min_temp, area, time] |
| physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] |
| | MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] |
| | |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.008 seconds.
-- Filtering on timestamps loses stats
❯ explain select * from t1 where time > to_timestamp(350) order by time DESC;
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Sort: t1.time DESC NULLS FIRST |
| | Filter: t1.time > TimestampNanosecond(350000000000, None) |
| | TableScan: t1 projection=[state, city, min_temp, area, time] |
| physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]] |
| | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]] |
| | FilterExec: time@4 > 350000000000, statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]] |
| | MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] |
| | |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.006 seconds.
-- Filtering on string loses stats
❯ explain select * from t1 where city = 'Boston' order by time DESC;
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Sort: t1.time DESC NULLS FIRST |
| | Filter: t1.city = Utf8("Boston") |
| | TableScan: t1 projection=[state, city, min_temp, area, time] |
| physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]] |
| | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]] |
| | FilterExec: city@1 = Boston, statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]] |
| | MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] |
| | |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.006 seconds.
-- filtering on float makes stats **incorrect** for Column stats `Exact` (on non-filter columns) and `Inexact` (on filter columns)
❯ explain select * from t1 where min_temp > 10.0 order by time DESC;
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Sort: t1.time DESC NULLS FIRST |
| | Filter: CAST(t1.min_temp AS Float64) > Float64(10) |
| | TableScan: t1 projection=[state, city, min_temp, area, time] |
| physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Inexact(Float32(10.000001)) Max=Inexact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Exact(Int32(NULL)) Max=Exact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]] |
| | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Inexact(Float32(10.000001)) Max=Inexact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Exact(Int32(NULL)) Max=Exact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]] |
| | FilterExec: CAST(min_temp@2 AS Float64) > 10, statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Inexact(Float32(10.000001)) Max=Inexact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Exact(Int32(NULL)) Max=Exact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]] |
| | MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] |
| | |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.009 seconds.
-- filtering on integer makes stats **incorrect** for Column stats `Exact` (on non-filter columns) and `Inexact` (on filter columns)
❯ explain select * from t1 where area > 10 order by time DESC;
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Sort: t1.time DESC NULLS FIRST |
| | Filter: t1.area > Int32(10) |
| | TableScan: t1 projection=[state, city, min_temp, area, time] |
| physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Exact(Float32(NULL)) Max=Exact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Inexact(Int32(11)) Max=Inexact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]] |
| | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Exact(Float32(NULL)) Max=Exact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Inexact(Int32(11)) Max=Inexact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]] |
| | FilterExec: area@3 > 10, statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Exact(Float32(NULL)) Max=Exact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Inexact(Int32(11)) Max=Inexact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]] |
| | MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] |
| | |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.009 seconds.
--- Aggregate makes `row_count` stats `Inexact` and loses all other stats
❯ explain select city, min(min_temp) from t1 group by city;
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[t1.city]], aggr=[[MIN(t1.min_temp)]] |
| | TableScan: t1 projection=[city, min_temp] |
| physical_plan | AggregateExec: mode=FinalPartitioned, gby=[city@0 as city], aggr=[MIN(t1.min_temp)], statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]] |
| | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]] |
| | RepartitionExec: partitioning=Hash([city@0], 10), input_partitions=10, statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]] |
| | RepartitionExec: partitioning=RoundRobinBatch(10), input_partitions=1, statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]] |
| | AggregateExec: mode=Partial, gby=[city@0 as city], aggr=[MIN(t1.min_temp)], statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]] |
| | MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0))]] |
| | |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.007 seconds. |
I think figuring out how to adequately capture the results of Filter is a key feature that we can / should improve over the next few weeks. |
@alamb do you have a thought on this, do you think I an can try to do it or it requires too much background knowledge? |
Hi @edmondop -- I would recommend not trying this one as I believe we need to fix a few more things in the underlying representation of Statistics before this will become feasible |
Is your feature request related to a problem or challenge?
These are examples and use cases for the design of #8078
InfluxDB IOx uses
statistics Min and Max
to optimize some queries and we found, for many queries, their statistics are either lost (becomeabsent
) or becomeinexact
while being propagated upwards. Our request is to keep the statisticsconservative
that cover the exact bound rather then going it or making itinexact
Note that the reproducers below do not show Stats Min and Max so I use the Stats Row instead. Open #8110 to add Stats Minuend Max in the explain
Table for the reproducer
stats of timestamp filter is lost
stats of string filter is lost
stats of integer filter becomes inexact
stats of float filter becomes inexact
stats of filter IN is lost for any data type & subquery
stats of join is lost
stats of aggregation becomes inexact
Describe the solution you'd like
Have another enum value
conservative
keep the stats.A few examples: Row statistics of
. Filter : same value as non-filter
. Join: Cartesian product of 2 inputs
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: