Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggest to create an extended statistics entry about a functional dependency between columns #7

Closed
sfkeller opened this issue May 29, 2020 · 5 comments

Comments

@sfkeller
Copy link

After some discussion over here ankane/dexter#32 I'v following suggestion:

Optimization assumes that columns are independent, so a query plan can go wrong if there are functional dependencies (FD) between columns which he doesn't know. Real world datasets often have FDs between columns because of denormalization or just because the data is like this.

Actually, finding FDs for all columns could become a rather hard problem to solve. But hopefully this challenge becomes feasible if applied only to columns as part of given slow queries.

Expected behaviour:

PostgreSQL offers CREATE STATISTICS in order to inform the planner about FDs. So, looking e.g. at the freely available database "ATP tennis tour", and given a slow query involving tourney_id, tourney_name, pg_plan_advsr could make a specific suggestion like this:

create statistics atp_matches_2019_tourney_id_tourney_name (dependencies) 
	on tourney_id, tourney_name 
	from atp_matches_2019;
@yamatattsu
Copy link
Member

Hi Stefan,

Thanks for your suggestion. :-D
I knew Extended statistics are useful for creating a good query plan, but I couldn't get enough time to implement it in my extension. However, as you might know, there is a more appropriate extension to get extended statistics than mine. It is pg_qualstats by Julien (@rjuju).
I gave a joint talk about pg_qualstats and pg_plan_advsr with him on PGCon 2020 [1], and also I discuss this topic as you suggested after the event. Probably, He will implement an Extended statistics advising feature in his extension. :-)

[1] https://www.pgcon.org/events/pgcon_2020/schedule/session/134-building-automatic-adviser-and-performance-tuning-tools-in-postgresql/

Stay safe!
Tatsuro Yamada

@sfkeller
Copy link
Author

sfkeller commented Jun 2, 2020

Thank you for this good news!

I always wondered about users complaining about planners having no clue - while they leave Extended Statistics unused. I've learned something from Julien and you (and also from @laurenz ).

BTW: So far only option dependencies from Extended Statistics has been mentioned.
=> What about ndistinct and mcv options?

After all, automatic optimization using Extended Statistics finally got a place ;-).
=> Should I open an issue over at https://github.com/powa-team/pg_qualstats/issues ?

@rjuju
Copy link
Contributor

rjuju commented Jun 2, 2020

Thanks @sfkeller !

Adding such something to detect correlated column is already on my todo list, especially since I added the selectivity estimation error in v2, but you can definitely add an issue to make sure it doesn't get forgotten.

@yamatattsu
Copy link
Member

Hi @sfkeller and @rjuju,

I always wondered about users complaining about planners having no clue

I guess almost users don't care about a query plan whether it is efficient or not. When they faced performance trouble, they investigated and realized it. From the perspective of query performance, it would be better to check a query plan on performance tests before production.
However, it depends on the culture or custom of a developing system.

What about ndistinct and mcv options?

Hmm... I didn't think them deeply. However, as you might know, you can create ndistinct and mcv of extended statistics at the same time.
The default option of create extended statistics syntax includes dependencies, ndistinct, and mcv (see below). For now, I'm not sure this approach is appropriate or not.

#Before
create statistics atp_matches_2019_tourney_id_tourney_name (dependencies)
on tourney_id, tourney_name
from atp_matches_2019;

#After
create statistics atp_matches_2019_tourney_id_tourney_name
on tourney_id, tourney_name
from atp_matches_2019;

Adding such something to detect correlated column is already on my todo list,

Thanks to @rjuju!

Yours,
Tatsuro Yamada

@yamatattsu
Copy link
Member

Closed. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants