Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the inappropriate heuristic method to estimate the equal condition selectivity #18461

Closed
qw4990 opened this issue Jul 9, 2020 · 3 comments · Fixed by #18543
Closed

Fix the inappropriate heuristic method to estimate the equal condition selectivity #18461

qw4990 opened this issue Jul 9, 2020 · 3 comments · Fixed by #18543
Assignees
Labels
epic/access-path sig/planner SIG: Planner type/enhancement The issue or PR belongs to an enhancement.

Comments

@qw4990
Copy link
Contributor

qw4990 commented Jul 9, 2020

Development Task

Now in the planner, if the estimated value is out of range, which means we cannot find it in the CMSketch, a heuristic method is used, which is selectivity = 1 / NDV * (ModifyRows / TotalRows).
Compared with Oracle's method selectivity = 1 / NDV, our method may result in an inappropriate low selectivity, especially when there are a few rows modified, for example, if one row is modified, it would be 1 / NDV * (1 / TotalRows).

It's better to use 1 / NDV or 1 / NDV * (1 - ModifyRows / TotalRows).

@qw4990 qw4990 added type/enhancement The issue or PR belongs to an enhancement. sig/planner SIG: Planner labels Jul 9, 2020
@qw4990 qw4990 self-assigned this Jul 9, 2020
@qw4990
Copy link
Contributor Author

qw4990 commented Jul 13, 2020

Here is PG's method in this case: sel = (1 - nullFrac - mcvSumFrac) / othersNDV.
Compared with simply setting the sel to 1 / NDV, minus mcvSumFrac from it can avoid wrong selectivities caused by small NDV.
Since TiDB doesn't maintain an MCV list, and for simplicity, we can use the magic number outOfRangeBetweenRate which already exists in TiDB:

if ndv < outOfRangeBetweenRage { ndv = outOfRangeBetweenRage }
sel := 1 / ndv

@lmduean
Copy link

lmduean commented Oct 26, 2023

Here is PG's method in this case: sel = (1 - nullFrac - mcvSumFrac) / othersNDV. Compared with simply setting the sel to 1 / NDV, minus mcvSumFrac from it can avoid wrong selectivities caused by small NDV. Since TiDB doesn't maintain an MCV list, and for simplicity, we can use the magic number outOfRangeBetweenRate which already exists in TiDB:

if ndv < outOfRangeBetweenRage { ndv = outOfRangeBetweenRage }
sel := 1 / ndv

Why don't we directly use 1/NDV, just like Oracle does?
use this doesn't make sense, for example:

//disable auto analyze
create table t1(type int, key type_idx(type));
insert into t1 values (1); // repeat 1024
analyze table t1;
insert into t1 values (2); // repeat 1024

explain select count(*) from t1 where type=2;
+-----------------------------+---------+-----------+--------------------------------+---------------------------------+
| id | estRows | task | access object | operator info |
+-----------------------------+---------+-----------+--------------------------------+---------------------------------+
| StreamAgg_17 | 1.00 | root | | funcs:count(Column#5)->Column#3 |
| └─IndexReader_18 | 1.00 | root | | index:StreamAgg_9 |
| └─StreamAgg_9 | 1.00 | cop[tikv] | | funcs:count(1)->Column#5 |
| └─IndexRangeScan_16 | 20.48 | cop[tikv] | table:t1, index:type_idx(type) | range:[2,2], keep order:false |
+-----------------------------+---------+-----------+--------------------------------+---------------------------------+
4 rows in set (0.00 sec)

without this adjust:
+-----------------------------+---------+-----------+--------------------------------+---------------------------------+
| id | estRows | task | access object | operator info |
+-----------------------------+---------+-----------+--------------------------------+---------------------------------+
| StreamAgg_17 | 1.00 | root | | funcs:count(Column#5)->Column#3 |
| └─IndexReader_18 | 1.00 | root | | index:StreamAgg_9 |
| └─StreamAgg_9 | 1.00 | cop[tikv] | | funcs:count(1)->Column#5 |
| └─IndexRangeScan_16 | 2048.00 | cop[tikv] | table:t1, index:type_idx(type) | range:[2,2], keep order:false |
+-----------------------------+---------+-----------+--------------------------------+---------------------------------+
4 rows in set (0.00 sec)

I think 1/NDV is better
@qw4990
/cc @qw4990

@lmduean
Copy link

lmduean commented Oct 31, 2023

/assign @qw4990

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic/access-path sig/planner SIG: Planner type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
3 participants