Add functional dependency between columns from extended statistics #49
Hi, I'm not sure what you'd actually expect from this feature, as creating extended statistics shouldn't be a huge problem, even on a busy production server. Do you want to reproduce what `CREATE STATISTICS` does (so with the requirement to then take some samples to compute the statistics) but without the ShareUpdateExclusiveLock and/or to be able to do it on a read-only standby, or to simulate specific fake FD statistics for some values by specifying what the extended statistic rows should be? Maybe you'd instead want to detect which FDs are missing, and for that pg_qualstats can help you: since version 2 it reports the selectivity estimation error.
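For reference, something along these lines can surface quals whose estimates are consistently far off (a minimal sketch; the column names follow pg_qualstats 2.x and the threshold is arbitrary, so adjust both to the version you actually have installed):

```sql
-- Quals whose selectivity estimate is, on average, far from reality.
-- Column names as in pg_qualstats 2.x; the threshold is only illustrative.
SELECT lrelid::regclass AS rel,
       lattnum          AS attnum,
       occurences,
       mean_err_estimate_ratio,
       mean_err_estimate_num
FROM pg_qualstats
WHERE mean_err_estimate_ratio > 10
ORDER BY mean_err_estimate_ratio DESC
LIMIT 20;
```

Columns whose combined quals are repeatedly mis-estimated are natural candidates for `CREATE STATISTICS`.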
Thx for pointing me to pg_qualstats, I'll have a look at it. As for the expectations for this feature: can you please help explain, @ankane, since you suggested in ankane/dexter#32 (comment) that Dexter could use this feature?
Hey @rjuju, it'd be the first case (reproduce what `CREATE STATISTICS` does).
One thing still bothers me. For extended statistics investigation, I'm not sure that having the `EXPLAIN` without `ANALYZE` is really that useful. Unlike hypothetical indexes (where you want to know if the index is used and how), here you probably want to see how far off the estimates are, with and without various extended statistics, compared to the real execution, right? That's clearly doable in hypopg, but it's a radical change in its heuristics.
Hey @rjuju, thanks for the response! I was thinking about it from the perspective of only recommending them if they significantly improved the estimated query plan (rather than improving row estimates). Here's an example from this post.

```sql
CREATE TABLE pgqs AS SELECT i%2 val1, (i+1)%2 val2 FROM generate_series(1, 50000) i;
ANALYZE pgqs;
EXPLAIN SELECT * FROM pgqs t1 JOIN pgqs t2 ON t1.val1 = t2.val1 WHERE t1.val1 = 0 AND t1.val2 = 0 order by t1.val2;
CREATE STATISTICS pgqs_stats ON val1, val2 FROM pgqs;
ANALYZE pgqs;
EXPLAIN SELECT * FROM pgqs t1 JOIN pgqs t2 ON t1.val1 = t2.val1 WHERE t1.val1 = 0 AND t1.val2 = 0 order by t1.val2;
```

In this case, the plan goes from:

```
Nested Loop  (cost=0.00..3884194.00 rows=310587500 width=20)
  ->  Seq Scan on pgqs t2  (cost=0.00..847.00 rows=24847 width=8)
        Filter: (val1 = 0)
  ->  Materialize  (cost=0.00..1034.50 rows=12500 width=8)
        ->  Seq Scan on pgqs t1  (cost=0.00..972.00 rows=12500 width=8)
              Filter: ((val1 = 0) AND (val2 = 0))
```

to:

```
Nested Loop  (cost=0.00..2069.52 rows=25052 width=20)
  ->  Seq Scan on pgqs t1  (cost=0.00..972.00 rows=1 width=8)
        Filter: ((val1 = 0) AND (val2 = 0))
  ->  Seq Scan on pgqs t2  (cost=0.00..847.00 rows=25052 width=8)
        Filter: (val1 = 0)
```

So extended statistics would be recommended. However, I haven't used extended statistics in a production environment, so I may be missing something. Also, if injecting hypothetical statistics into the query planner is significantly different than hypothetical indexes, it's probably not worth the trouble.
My point was that you maybe can't naively assume that "25052 rows has to be correct while 310587500 has to be wrong, and the only reason it changed was extended statistics". You don't know the real selectivity, and you may also be comparing outdated statistics with ones you freshly gathered when creating the hypothetical statistics.
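For the example above, the real selectivity only becomes visible once the query is actually executed, for instance with `EXPLAIN (ANALYZE)`:

```sql
-- Compare estimated vs. actual row counts for the example table above.
-- Unlike plain EXPLAIN, this really executes the query.
EXPLAIN (ANALYZE)
SELECT *
FROM pgqs t1
JOIN pgqs t2 ON t1.val1 = t2.val1
WHERE t1.val1 = 0 AND t1.val2 = 0
ORDER BY t1.val2;
```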
Makes sense. I wouldn't expect it to be correct all of the time (same with hypothetical indexes).
So I looked a little bit at the extended statistics code, and while the definition of the extended statistics seems pluggable (you only need to fill `RelOptInfo->statlist`), the data part unfortunately doesn't seem to be pluggable. See e.g. `mcv.c`.
I admittedly don't know the HypoPG and PG planner internals (`RelOptInfo`?) well enough, so excuse my newbie questions. (1) Why is "the data part unfortunately doesn't seem to be pluggable" a show stopper? Because HypoPG can't "fake" it, and because that would mean refactoring the PG mcv.c code? And (2): any planner should consider as much statistics as it gets, and with CREATE STATISTICS it now gets more, such as multivariate statistics (dependencies, ndistinct and mcv) [1] [2]. Why wouldn't "hypothetical statistics info" help simulate this, similar to "hypothetical indexes"?
[1] https://www.citusdata.com/blog/2018/03/06/postgres-planner-and-its-usage-of-statistics/
Oh sorry, I should have given more details. For the record, `RelOptInfo` is a C data structure that holds a lot of metadata about the relations being used during planning. For instance, it contains the list of indexes (as a list of `IndexOptInfo`, which similarly describes an index) that third-party code can manipulate to add hypothetical indexes, or even to prevent the planner from considering some index.
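That manipulation is what the existing SQL-level functions build on. A quick illustration with the `pgqs` table from earlier (the index definition is just an example, and the exact plan output will vary):

```sql
-- Register a hypothetical index; only planner metadata is created,
-- nothing is built on disk.
SELECT * FROM hypopg_create_index('CREATE INDEX ON pgqs (val1, val2)');

-- A plain EXPLAIN can now show the hypothetical index being chosen.
-- EXPLAIN (ANALYZE) will not use it, since the query really executes.
EXPLAIN SELECT * FROM pgqs WHERE val1 = 0 AND val2 = 0;

-- Drop all hypothetical indexes for this session.
SELECT hypopg_reset();
```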
Thx. Do you think it's much work to implement this feature (e.g. for a good C++ student programmer who is uninitiated in PG internals)?
The change in postgres should be trivial (more or less: add the hook declaration and call the hook rather than the built-in function when it's not NULL; there are multiple other hook implementations to take inspiration from), but asking on pgsql-hackers first would be a good idea, to make sure that this is wanted and that there are no caveats.
Yes, will do. And because I think Automated Database Query Optimization is a very hot topic, I dared to involve @s-hironobu sensei and @laurenz in the discussion.
I somewhat agree with @rjuju. Extended statistics are great, but if you don't know the actual values, you cannot tell if they improve an estimate. Extended statistics can work both ways. Here is a little example:
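A minimal sketch of the kind of setup meant here (the table and column names below are illustrative, not the ones from the original example; `b` is made fully dependent on `a`):

```sql
-- Two perfectly correlated columns: b is always equal to a.
CREATE TABLE dep_demo AS
SELECT i % 1000 AS a, i % 1000 AS b
FROM generate_series(1, 100000) AS i;
ANALYZE dep_demo;

-- Query 1: a combination that exists (actually 100 rows).
-- Without extended statistics the planner multiplies the per-column
-- selectivities and estimates roughly 1 row: far too low.
EXPLAIN (ANALYZE) SELECT * FROM dep_demo WHERE a = 42 AND b = 42;

-- Query 2: a combination that does not exist (actually 0 rows).
-- The same multiplication again gives roughly 1 row: very close to reality.
EXPLAIN (ANALYZE) SELECT * FROM dep_demo WHERE a = 42 AND b = 43;
```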
The first estimate is off, the second is very good. Now let's calculate extended statistics and try again:
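Continuing the sketch, with dependency statistics in place (names again illustrative):

```sql
-- Teach the planner about the functional dependency between a and b.
CREATE STATISTICS dep_demo_stats (dependencies) ON a, b FROM dep_demo;
ANALYZE dep_demo;

-- Query 1 again: the estimate jumps to about 100 rows, matching reality.
EXPLAIN (ANALYZE) SELECT * FROM dep_demo WHERE a = 42 AND b = 42;

-- Query 2 again: the dependency is applied here too, so the planner
-- also estimates about 100 rows, although the query returns 0 rows.
EXPLAIN (ANALYZE) SELECT * FROM dep_demo WHERE a = 42 AND b = 43;
```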
Now the first estimate is spot on, but the second is off. Since bad row count estimates will lead to bad cost estimates further down the road, a reduced …
Sorry to ask, but why is the second query plan in the second example off? Isn't the actual "time=53.904..53.904" much closer to "actual time=22.586..48.996", given "Execution Time: 53.933 ms"? And more basically: IMHO, if extended statistics are not contributing to better row count estimates in practical situations, then either the statistics are outdated (which means a DB admin can do something about it), or the current PG planner is failing to take e.g. knowledge of functional dependencies into account, isn't it?
The plan is off because it estimates 100 rows, when really there are 0. As I said, misestimates in row count may lead to misestimates in execution costs later on (that "later on" is not part of my example). I am just trying to show that extended statistics can lead to worse estimates, so without seeing the real values it is hard to say whether extended statistics improve a query plan or not. The dependency adjustment made with extended statistics relies on the queries using matching values; in this example, extended statistics rely on people always searching for existing combinations of values.
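On PostgreSQL 12 or later the computed dependency degrees can be inspected, which shows why the adjustment is applied whether or not the searched combination exists (again using the illustrative `dep_demo` names from the sketch above):

```sql
-- Show the functional-dependency degrees the planner will use.
-- A degree close to 1 means one column almost fully determines the other.
SELECT statistics_name, dependencies
FROM pg_stats_ext
WHERE statistics_name = 'dep_demo_stats';
```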
So, are you saying: because there's a chance plans get worse without perfect statistics, let's forget about hypothetical statistics?
(Shameless plug: IMHO this is going off-topic. Let's take the chance to discuss this at my next online seminar :-) https://www.swisspug.org/wiki/index.php/Agenda#Seminar:_On_Pure_Synthetic_Data_Generation_for_Databases )
I don't think this is going off-topic. @rjuju said:

> creating extended statistics shouldn't be a huge problem, even on a busy production server
Extended statistics are incredibly cool, but from a plain `EXPLAIN` alone you cannot tell whether they actually improve the estimates.
The whole raison d'être of HypoPG is to create an object (index) hypothetically (because creating a real index takes a long time, uses disk space and requires a lock that blocks concurrent data modifications).
Extended statistics fall into a different category: they are easy to create, don't use a lot of disk space and need no strong lock.
Now I understand! Thx for your patience.
Agreed. Thanks all for the feedback!
This is a feature suggestion along the lines of functions like `hypopg_create_statistics`.

Optimization assumes that columns are independent, so a query plan can be improved not only by an additional index, but also by letting the optimizer know about functional dependencies (FD) between columns. Real-world datasets often have FDs between columns, because of denormalization or simply because the data is like that. PostgreSQL offers CREATE STATISTICS for FDs.

Expected behaviour: add `hypopg_create_statistics` in order to simulate the existence of an FD.

Example: looking e.g. at this test database from the ATP tennis tour, there is at least one FD, namely between `tourney_id` and `tourney_name`. So this is what's usually done:
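A minimal sketch of the conventional approach and of the suggested hypothetical counterpart (the table name `atp_matches` is only a placeholder for the matches table of that dataset, and `hypopg_create_statistics` does not exist yet):

```sql
-- What's usually done today: real extended statistics plus ANALYZE.
CREATE STATISTICS atp_fd_stats (dependencies)
  ON tourney_id, tourney_name FROM atp_matches;
ANALYZE atp_matches;

-- The suggested hypothetical counterpart (not implemented, shown only to
-- illustrate the requested behaviour):
-- SELECT hypopg_create_statistics(
--   'CREATE STATISTICS atp_fd_stats (dependencies) ON tourney_id, tourney_name FROM atp_matches');
```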
BTW: CREATE STATISTICS currently also supports `ndistinct`, which enables n-distinct statistics, and `mcv`, which enables most-common-values lists.