
ppscore interpretation #39

Open
FernandoDoreto opened this issue Sep 17, 2020 · 4 comments

Comments

@FernandoDoreto

Hi @FlorianWetschoreck, @tkrabel , @SuryaThiru

Quick question: how to properly interpret ppscore?

Say you have a dataset of 3,000 rows x 30 columns; you apply pps.matrix() and then sort the values by ppscore. Is there a "rule of thumb" or rational guideline for categorizing ppscore levels?
Something like the following:

  • If ppscore is in the range 0.6 - 1.0, it is strong (so feature X has strong predictive power on Y)
  • If ppscore is in the range 0.4 - 0.6, it is moderate (so feature X has moderate predictive power on Y)
  • If ppscore is lower than 0.4, it is weak (so feature X has weak predictive power on Y)
    Note: the ranges and categories I gave are totally arbitrary

I read this article - https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598 - but couldn't find an answer there.

Thanks a million, Fernando

@FlorianWetschoreck
Collaborator

Hi Fernando, thank you for posting the question.
Many people ask us this, but it is not easy to answer because "it depends".
I will think about it more deeply and then get back to you with an answer.

@FlorianWetschoreck
Collaborator

Hi Fernando,

I gave the question quite some thought, and for now I would like to reply with the following:

First, the technical interpretation of the PPS is:

  • the percentage of the model improvement potential that the feature realizes when the current naive baseline model is compared to a perfect deterministic model.
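That interpretation can be sketched in a few lines. This is a minimal illustration of the normalization idea (model score compared against a naive baseline, rescaled so that 0 means no improvement over the baseline and 1 means a perfect model), not ppscore's actual implementation:

```python
def normalized_pps(model_score: float, baseline_score: float) -> float:
    """Fraction of the gap between the naive baseline and a perfect
    model (score == 1.0) that the feature's model closes.
    Illustrative sketch only, not ppscore's internal code."""
    if baseline_score >= 1.0:
        return 0.0  # baseline is already perfect; no improvement potential left
    return max(0.0, (model_score - baseline_score) / (1.0 - baseline_score))

# A model reaching a score of 0.9 over a naive baseline of 0.6
# closes about 75% of the remaining improvement potential.
print(normalized_pps(0.9, 0.6))  # ≈ 0.75
```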

The interpretation depends on the context:
In general, it is hard to define specific levels and give an interpretation for them without knowing the context. For example, if many columns have a PPS of 0.3, then a PPS of 0.2 might actually not be that good. However, when no column has a PPS > 0.01, then a PPS of 0.1 might be very good - especially when trying to predict something that is hard to predict, like stock prices.

Nevertheless, there are some levels that are often helpful during everyday life:

  • PPS == 0 means that there is no predictive power
  • PPS < 0.2 often means that there is some relevant predictive power but it is weak
  • PPS > 0.2 often means that there is strong predictive power
  • PPS > 0.8 often means that there is a deterministic relationship in the data, for example y = 3*x or there is some underlying if...else... logic

Given those levels, it is often important to check the PPS of multiple columns and then base your interpretation on the comparison.
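If you want to automate a first read of a pps.matrix() output, the levels above can be turned into a small helper. The labels and cutoffs are the heuristics from this discussion, not anything ppscore itself defines:

```python
def pps_level(pps: float) -> str:
    """Map a PPS value to the rough levels suggested above.
    These cutoffs are heuristics from this thread, not part of ppscore."""
    if pps == 0:
        return "no predictive power"
    if pps > 0.8:
        return "likely deterministic relationship"
    if pps > 0.2:
        return "strong predictive power"
    return "weak predictive power"

for value in (0.0, 0.1, 0.5, 0.95):
    print(f"PPS {value:.2f}: {pps_level(value)}")
```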

What do you think about this explanation? Do you have some specific scenarios, use cases or questions that you want the PPS to answer?

@FernandoDoreto
Author

Hi @FlorianWetschoreck, thanks for your attention and commitment. I liked your point about "interpreting in context"; it makes sense. The ranges you described also generate insights for me, so I can code a way to automate a ppsThreshold (keep reading to understand what I mean).

I'm using ppscore specifically as part of an approach to detect relationships (linear, non-linear, quadratic, trigonometric, log, exponential, etc.) among variables. I see that this needs a faceted approach, since relationships are asymmetric and may have different shapes (linear, non-linear, etc.). I am considering combining ppscore, Spearman correlation, and MIC.

  • (1) I apply pps.matrix() to my dataset. Yes, that can have a high computing cost. Then I pick a certain "ppsThreshold" and query the matrix: model_score != 1 and ppscore > ppsThreshold. That leaves me with variable pairs that have relevant predictive power.

  • (2) Then I calculate the Spearman correlation on these pairs, revealing whether there is a monotonic relationship. This typically has a low computing cost.

  • (3) Finally, I calculate MIC for the unique combinations from these variable pairs. This has a high computing cost, which is why it is important to reduce the feature space with ppscore first. For me, MIC indicates the relationship strength.
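Steps (1) and (2) can be sketched as below. To keep the sketch self-contained, the `matrix` frame is built by hand as a stand-in for the pps.matrix() output (its columns `x`, `y`, `ppscore`, `model_score` follow the query in step (1)); in real use it would come from `ppscore.matrix(df)`. Step (3), MIC, is left out because it needs an extra dependency such as minepy:

```python
import pandas as pd
from scipy.stats import spearmanr

# Toy data: b is a monotone function of a.
df = pd.DataFrame({
    "a": range(100),
    "b": [i ** 2 for i in range(100)],
    "c": list(range(50)) * 2,
})

# Hand-built stand-in for pps.matrix(df); in real use, call ppscore.matrix(df).
matrix = pd.DataFrame({
    "x": ["a", "a", "b", "c"],
    "y": ["b", "c", "a", "a"],
    "ppscore": [0.95, 0.05, 0.90, 0.10],
    "model_score": [0.97, 0.50, 0.93, 0.40],
})

PPS_THRESHOLD = 0.2  # the hypothetical "ppsThreshold" from the post

# (1) keep pairs with relevant predictive power, dropping self-predictions
pairs = matrix.query("model_score != 1 and ppscore > @PPS_THRESHOLD")

# (2) Spearman correlation on the surviving pairs reveals monotonicity
for _, row in pairs.iterrows():
    rho, _p = spearmanr(df[row["x"]], df[row["y"]])
    print(f"{row['x']} -> {row['y']}: pps={row['ppscore']:.2f}, spearman={rho:.2f}")
```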

So ultimately I would be able to conclude something like:

Variable A has strong predictive power on Variable B, but Variable B doesn't have predictive power on Variable A.
Variable A has a strong and positive relationship with Variable B

  • Conclusion rationale:

strong predictive power: "strong" is provided by the ppscore
strong and positive relationship: "strong" is provided by the MIC level and "positive" by the Spearman correlation

Based on your personal and professional experience, does this "relationship detection" approach make sense to you? Are you aware of any (preferably open-source) Python package that does this?

Regards, Fernando

@FlorianWetschoreck
Collaborator

Hi Fernando,
what is the surrounding use case you are working on, and why do you need to estimate both the relationship strength and the shape?
More information in this regard might inform the solution approach.
Also, I am surprised by the general notion of "positive", because it only applies to a handful of relationship types - but it might be valid in your context.

About your solution approach:

  • please review whether you want to apply the Spearman correlation in parallel to the ppscore matrix, because the ppscore is not that good at detecting linear relationships and those scores might fall below your threshold
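One way to act on that caveat: instead of gating everything behind the ppscore threshold, keep a pair for further analysis (e.g. MIC) if either signal clears its own threshold. The thresholds below are illustrative, not recommendations from ppscore:

```python
def keep_pair(pps: float, spearman_rho: float,
              pps_threshold: float = 0.2, rho_threshold: float = 0.7) -> bool:
    """Keep a variable pair for further analysis if either the PPS or the
    absolute Spearman correlation clears its (illustrative) threshold."""
    return pps > pps_threshold or abs(spearman_rho) > rho_threshold

# A noisy linear relationship can score low on PPS while Spearman stays high:
print(keep_pair(pps=0.12, spearman_rho=0.85))  # True: rescued by Spearman
print(keep_pair(pps=0.05, spearman_rho=0.10))  # False: weak on both signals
```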
