Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Queries that lost statistics or their statistics become inexact #8099

Open
NGA-TRAN opened this issue Nov 8, 2023 · 6 comments
Open

Queries that lost statistics or their statistics become inexact #8099

NGA-TRAN opened this issue Nov 8, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@NGA-TRAN
Copy link
Contributor

NGA-TRAN commented Nov 8, 2023

Is your feature request related to a problem or challenge?

These are examples and use cases for the design of #8078

InfluxDB IOx uses statistics Min and Max to optimize some queries and we found, for many queries, their statistics are either lost (become absent) or become inexact while being propagated upwards. Our request is to keep the statistics conservative that cover the exact bound rather then going it or making it inexact

Note that the reproducers below do not show Stats Min and Max so I use the Stats Row instead. Open #8110 to add Stats Minuend Max in the explain

Table for the reproducer

set datafusion.explain.show_statistics = true;


create table t1(state string, city string, min_temp float, area int, time timestamp) as values 
    ('MA', 'Boston', 70.4, 1, 50),
    ('MA', 'Bedford', 71.59, 2, 150),
    ('CA', 'SF', 79.0, 1, 300),
    ('MA', 'Boston', 75.4, 3, 250),
    ('MA', 'Andover', 69.5, 4, 250),
    ('MA', 'Bedford', 78.2, 2, 150),
    ('MA', 'Boston', 65.0, 2, 250),
    ('CA', 'SJ', 78.4, 1, 300),
    ('MA', 'Reading', 53.0, 4, 250),
    ('CA', 'SJ', 75.4, 5, 350);

stats of timestamp filter is lost

explain select * from t1 where time <= to_timestamp(350);
+---------------+---------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                              |
+---------------+---------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: t1.time <= TimestampNanosecond(350000000000, None)                                        |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time]                                    |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent]               |
|               |   FilterExec: time@4 <= 350000000000, statistics=[Rows=Absent, Bytes=Absent]                      |
|               |     MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960)] |
|               |                                                                                                   |
+---------------+---------------------------------------------------------------------------------------------------+

stats of string filter is lost

explain select * from t1 where state = 'MA';
+---------------+---------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                              |
+---------------+---------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: t1.state = Utf8("MA")                                                                     |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time]                                    |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent]               |
|               |   FilterExec: state@0 = MA, statistics=[Rows=Absent, Bytes=Absent]                                |
|               |     MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960)] |
|               |                                                                                                   |
+---------------+---------------------------------------------------------------------------------------------------+

stats of integer filter becomes inexact

explain select * from t1 where area <= 100;
+---------------+---------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                              |
+---------------+---------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: t1.area <= Int32(100)                                                                     |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time]                                    |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Inexact(10), Bytes=Inexact(2960)]   |
|               |   FilterExec: area@3 <= 100, statistics=[Rows=Inexact(10), Bytes=Inexact(2960)]                   |
|               |     MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960)] |
|               |                                                                                                   |
+---------------+---------------------------------------------------------------------------------------------------+

stats of float filter becomes inexact

explain select * from t1 where min_temp = 10.0;
+---------------+---------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                              |
+---------------+---------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: t1.area <= Int32(100)                                                                     |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time]                                    |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Inexact(10), Bytes=Inexact(2960)]   |
|               |   FilterExec: area@3 <= 100, statistics=[Rows=Inexact(10), Bytes=Inexact(2960)]                   |
|               |     MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960)] |
|               |                                                                                                   |
+---------------+---------------------------------------------------------------------------------------------------+

stats of filter IN is lost for any data type & subquery

explain select * from t1 where area in (1, 2);
+---------------+---------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                              |
+---------------+---------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: t1.area = Int32(1) OR t1.area = Int32(2)                                                  |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time]                                    |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent]               |
|               |   FilterExec: area@3 = 1 OR area@3 = 2, statistics=[Rows=Absent, Bytes=Absent]                    |
|               |     MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960)] |
|               |                                                                                                   |
+---------------+---------------------------------------------------------------------------------------------------+

explain select * from t1 where city in ('Boston', 'Reading');
+---------------+---------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                              |
+---------------+---------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: t1.city = Utf8("Boston") OR t1.city = Utf8("Reading")                                     |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time]                                    |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent]               |
|               |   FilterExec: city@1 = Boston OR city@1 = Reading, statistics=[Rows=Absent, Bytes=Absent]         |
|               |     MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960)] |
|               |                                                                                                   |
+---------------+---------------------------------------------------------------------------------------------------+

explain select * from t1 where city in (select city from t1);
+---------------+----------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                       |
+---------------+----------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | LeftSemi Join: t1.city = __correlated_sq_1.city                                                                            |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time]                                                             |
|               |   SubqueryAlias: __correlated_sq_1                                                                                         |
|               |     TableScan: t1 projection=[city]                                                                                        |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent]                                        |
|               |   HashJoinExec: mode=Partitioned, join_type=LeftSemi, on=[(city@1, city@0)], statistics=[Rows=Absent, Bytes=Absent]        |
|               |     CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Exact(10), Bytes=Exact(2960)]                            |
|               |       RepartitionExec: partitioning=Hash([city@1], 10), input_partitions=1, statistics=[Rows=Exact(10), Bytes=Exact(2960)] |
|               |         MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960)]                      |
|               |     CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Exact(10), Bytes=Exact(2960)]                            |
|               |       RepartitionExec: partitioning=Hash([city@0], 10), input_partitions=1, statistics=[Rows=Exact(10), Bytes=Exact(2960)] |
|               |         MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960)]                      |
|               |                                                                                                                            |
+---------------+----------------------------------------------------------------------------------------------------------------------------+

stats of join is lost

explain select * from t1, t1 as t2 where t1.city = t2.city;
+---------------+----------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                       |
+---------------+----------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Inner Join: t1.city = t2.city                                                                                              |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time]                                                             |
|               |   SubqueryAlias: t2                                                                                                        |
|               |     TableScan: t1 projection=[state, city, min_temp, area, time]                                                           |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent]                                        |
|               |   HashJoinExec: mode=Partitioned, join_type=Inner, on=[(city@1, city@1)], statistics=[Rows=Absent, Bytes=Absent]           |
|               |     CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Exact(10), Bytes=Exact(2960)]                            |
|               |       RepartitionExec: partitioning=Hash([city@1], 10), input_partitions=1, statistics=[Rows=Exact(10), Bytes=Exact(2960)] |
|               |         MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960)]                      |
|               |     CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Exact(10), Bytes=Exact(2960)]                            |
|               |       RepartitionExec: partitioning=Hash([city@1], 10), input_partitions=1, statistics=[Rows=Exact(10), Bytes=Exact(2960)] |
|               |         MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960)]                      |
|               |                                                                                                                            |
+---------------+----------------------------------------------------------------------------------------------------------------------------+

stats of aggregation becomes inexact

explain select city, max(min_temp) as max_min_temp from t1 group by city order by max_min_temp DESC limit 5;
+---------------+------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                     |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Limit: skip=0, fetch=5                                                                                                                   |
|               |   Sort: max_min_temp DESC NULLS FIRST, fetch=5                                                                                           |
|               |     Projection: t1.city, MAX(t1.min_temp) AS max_min_temp                                                                                |
|               |       Aggregate: groupBy=[[t1.city]], aggr=[[MAX(t1.min_temp)]]                                                                          |
|               |         TableScan: t1 projection=[city, min_temp]                                                                                        |
| physical_plan | GlobalLimitExec: skip=0, fetch=5, statistics=[Rows=Inexact(5), Bytes=Absent]                                                             |
|               |   SortPreservingMergeExec: [max_min_temp@1 DESC], fetch=5, statistics=[Rows=Inexact(10), Bytes=Absent]                                   |
|               |     SortExec: TopK(fetch=5), expr=[max_min_temp@1 DESC], statistics=[Rows=Inexact(10), Bytes=Absent]                                     |
|               |       ProjectionExec: expr=[city@0 as city, MAX(t1.min_temp)@1 as max_min_temp], statistics=[Rows=Inexact(10), Bytes=Absent]             |
|               |         AggregateExec: mode=FinalPartitioned, gby=[city@0 as city], aggr=[MAX(t1.min_temp)], statistics=[Rows=Inexact(10), Bytes=Absent] |
|               |           CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Inexact(10), Bytes=Absent]                                       |
|               |             RepartitionExec: partitioning=Hash([city@0], 10), input_partitions=10, statistics=[Rows=Inexact(10), Bytes=Absent]           |
|               |               RepartitionExec: partitioning=RoundRobinBatch(10), input_partitions=1, statistics=[Rows=Inexact(10), Bytes=Absent]         |
|               |                 AggregateExec: mode=Partial, gby=[city@0 as city], aggr=[MAX(t1.min_temp)], statistics=[Rows=Inexact(10), Bytes=Absent]  |
|               |                   MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960)]                          |
|               |                                                                                                                                          |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------+

Describe the solution you'd like

Have another enum value conservative keep the stats.

A few examples: Row statistics of
. Filter : same value as non-filter
. Join: Cartesian product of 2 inputs

Describe alternatives you've considered

No response

Additional context

No response

@NGA-TRAN NGA-TRAN added the enhancement New feature or request label Nov 8, 2023
@NGA-TRAN
Copy link
Contributor Author

NGA-TRAN commented Nov 8, 2023

@alamb

@NGA-TRAN
Copy link
Contributor Author

NGA-TRAN commented Nov 9, 2023

Update: From recent PR that adds column statistics into the explain, statistics of columns are absent.

@NGA-TRAN
Copy link
Contributor Author

NGA-TRAN commented Nov 10, 2023

I used the change in this PR #8112 to see stats of columns.
Summary:

  1. Lowest scan step: column only has stats Null. No stats on min and max.
  2. SortExec keeps stats as its input
  3. FilterExec on string and timestamp lose stats on everything including table-level stats
  4. FilterExec on integer and float make: (1) table-level stats become inexact, (2) column stats of columns in the filter start having Inexact min and max stats, (3) column stats of other columns start having Exact but wrong stats
  5. AggregateExec make: (1) row count become Inexact, (2) lost stats for bytes and columns
set datafusion.explain.show_statistics = true;
0 rows in set. Query took 0.009 seconds.

❯ set datafusion.execution.collect_statistics = true;
0 rows in set. Query took 0.001 seconds.

❯ 
create table t1(state string, city string, min_temp float, area int, time timestamp) as values 
    ('MA', 'Boston', 70.4, 1, 50),
    ('MA', 'Bedford', 71.59, 2, 150),
    ('CA', 'SF', 79.0, 1, 300),
    ('MA', 'Boston', 75.4, 3, 250),
    ('MA', 'Andover', 69.5, 4, 250),
    ('MA', 'Bedford', 78.2, 2, 150),
    ('MA', 'Boston', 65.0, 2, 250),
    ('CA', 'SJ', 78.4, 1, 300),
    ('MA', 'Reading', 53.0, 4, 250),
    ('CA', 'SJ', 75.4, 5, 350);
0 rows in set. Query took 0.013 seconds.

-- `Exact` table-level stats. Columns stats only include `Null` stats and it is `Exact`
❯ explain select * from t1;
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | TableScan: t1 projection=[state, city, min_temp, area, time]                                                                                                                                                             |
| physical_plan | MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] |
|               |                                                                                                                                                                                                                          |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.004 seconds.

-- SortExec keeps stats the same
❯ explain select * from t1 order by time DESC;
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                       |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: t1.time DESC NULLS FIRST                                                                                                                                                                                             |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time]                                                                                                                                                             |
| physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]]                    |
|               |   MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] |
|               |                                                                                                                                                                                                                            |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.008 seconds.

-- Filtering on timestamps loses stats
❯ explain select * from t1 where time > to_timestamp(350) order by time DESC;
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                           |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: t1.time DESC NULLS FIRST                                                                                                                                                                                                 |
|               |   Filter: t1.time > TimestampNanosecond(350000000000, None)                                                                                                                                                                    |
|               |     TableScan: t1 projection=[state, city, min_temp, area, time]                                                                                                                                                               |
| physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]]                                                                                                      |
|               |   CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]]                                                                                     |
|               |     FilterExec: time@4 > 350000000000, statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]]                                                                                             |
|               |       MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] |
|               |                                                                                                                                                                                                                                |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.006 seconds.

-- Filtering on string loses stats
❯ explain select * from t1 where city = 'Boston' order by time DESC;
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                           |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: t1.time DESC NULLS FIRST                                                                                                                                                                                                 |
|               |   Filter: t1.city = Utf8("Boston")                                                                                                                                                                                             |
|               |     TableScan: t1 projection=[state, city, min_temp, area, time]                                                                                                                                                               |
| physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]]                                                                                                      |
|               |   CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]]                                                                                     |
|               |     FilterExec: city@1 = Boston, statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]]                                                                                                   |
|               |       MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] |
|               |                                                                                                                                                                                                                                |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.006 seconds.

-- filtering on float makes stats **incorrect**  for Column stats `Exact` (on non-filter columns) and `Inexact` (on filter columns)
❯ explain select * from t1 where min_temp > 10.0 order by time DESC;
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: t1.time DESC NULLS FIRST                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|               |   Filter: CAST(t1.min_temp AS Float64) > Float64(10)                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|               |     TableScan: t1 projection=[state, city, min_temp, area, time]                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Inexact(Float32(10.000001)) Max=Inexact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Exact(Int32(NULL)) Max=Exact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]]                     |
|               |   CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Inexact(Float32(10.000001)) Max=Inexact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Exact(Int32(NULL)) Max=Exact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]]    |
|               |     FilterExec: CAST(min_temp@2 AS Float64) > 10, statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Inexact(Float32(10.000001)) Max=Inexact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Exact(Int32(NULL)) Max=Exact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]] |
|               |       MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]]                                                                                                                                                                                                                                                                                                   |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.009 seconds.

-- filtering on integer makes stats **incorrect**  for Column stats `Exact` (on non-filter columns) and `Inexact` (on filter columns)
❯ explain select * from t1 where area > 10 order by time DESC;
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: t1.time DESC NULLS FIRST                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|               |   Filter: t1.area > Int32(10)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|               |     TableScan: t1 projection=[state, city, min_temp, area, time]                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Exact(Float32(NULL)) Max=Exact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Inexact(Int32(11)) Max=Inexact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]]                  |
|               |   CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Exact(Float32(NULL)) Max=Exact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Inexact(Int32(11)) Max=Inexact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]] |
|               |     FilterExec: area@3 > 10, statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Exact(Float32(NULL)) Max=Exact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Inexact(Int32(11)) Max=Inexact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]]                   |
|               |       MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]]                                                                                                                                                                                                                                                                                         |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.009 seconds.

--- Aggregate makes `row_count` stats `Inexact` and loses all other stats
❯  explain select city, min(min_temp) from t1 group by city;
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                       |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Aggregate: groupBy=[[t1.city]], aggr=[[MIN(t1.min_temp)]]                                                                                                  |
|               |   TableScan: t1 projection=[city, min_temp]                                                                                                                |
| physical_plan | AggregateExec: mode=FinalPartitioned, gby=[city@0 as city], aggr=[MIN(t1.min_temp)], statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]]    |
|               |   CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]]                                          |
|               |     RepartitionExec: partitioning=Hash([city@0], 10), input_partitions=10, statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]]              |
|               |       RepartitionExec: partitioning=RoundRobinBatch(10), input_partitions=1, statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]]            |
|               |         AggregateExec: mode=Partial, gby=[city@0 as city], aggr=[MIN(t1.min_temp)], statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]]     |
|               |           MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0))]] |
|               |                                                                                                                                                            |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.007 seconds.

@alamb
Copy link
Contributor

alamb commented Nov 10, 2023

I think figuring out how to adequately capture the results of Filter is a key feature that we can / should improve over the next few weeks.

@edmondop
Copy link
Contributor

@alamb do you have a thought on this, do you think I an can try to do it or it requires too much background knowledge?

@alamb
Copy link
Contributor

alamb commented Nov 28, 2023

Hi @edmondop -- I would recommend not trying this one as I believe we need to fix a few more things in the underlying representation of Statistics before this will become feasible

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants