
ppscore interpretation #39

Open
FernandoDoreto opened this issue Sep 17, 2020 · 4 comments

Comments

@FernandoDoreto

Hi @FlorianWetschoreck, @tkrabel , @SuryaThiru

Quick question: how to properly interpret ppscore?

Say you have a dataset of 3,000 rows x 30 columns; you apply pps.matrix() and then sort the values by ppscore. Is there a "rule of thumb" or rational guideline for categorizing ppscore levels?
Something like the following:

  • If ppscore is in the range 0.6 - 1.0, it is strong (so feature X has strong predictive power on Y)
  • If ppscore is in the range 0.4 - 0.6, it is moderate (so feature X has moderate predictive power on Y)
  • If ppscore is lower than 0.4, it is weak (so feature X has weak predictive power on Y)
    Note: the ranges and categories I gave are totally arbitrary

I read this article - https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598 - but couldn't find an answer there.

Thanks a million, Fernando

@FlorianWetschoreck
Collaborator

Hi Fernando, thank you for posting the question.
Many people ask us this, but it is not easy to answer because "it depends".
I will think about it more deeply and then get back to you with an answer.

@FlorianWetschoreck
Collaborator

Hi Fernando,

I gave the question quite some thought, and for now I would like to reply with the following:

First, the technical interpretation of the PPS is:

  • the percentage of the model improvement potential that the feature realizes when the current naive baseline model is compared to a perfect deterministic model.
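That interpretation can be sketched in a few lines. This is a minimal illustration of the normalization idea (model score compared against a naive baseline, rescaled so that 0 means no improvement over the baseline and 1 means a perfect model), not ppscore's actual implementation:

```python
def normalized_pps(model_score: float, baseline_score: float) -> float:
    """Fraction of the gap between the naive baseline and a perfect
    model (score == 1.0) that the feature's model closes.
    Illustrative sketch only, not ppscore's internal code."""
    if baseline_score >= 1.0:
        return 0.0  # baseline is already perfect; no improvement potential left
    return max(0.0, (model_score - baseline_score) / (1.0 - baseline_score))

# A model reaching a score of 0.9 over a naive baseline of 0.6
# closes about 75% of the remaining improvement potential.
print(normalized_pps(0.9, 0.6))  # ≈ 0.75
```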

The interpretation depends on the context:
In general, it is hard to define specific levels and give an interpretation for them without knowing the context. For example, if many columns have a PPS of 0.3, then a PPS of 0.2 might actually not be that good. However, when no column has a PPS > 0.01, then a PPS of 0.1 might be very good - especially when trying to predict something that is hard to predict, like stock prices.

Nevertheless, there are some levels that are often helpful during everyday life:

  • PPS == 0 means that there is no predictive power
  • PPS < 0.2 often means that there is some relevant predictive power but it is weak
  • PPS > 0.2 often means that there is strong predictive power
  • PPS > 0.8 often means that there is a deterministic relationship in the data, for example y = 3*x or there is some underlying if...else... logic

Given those levels, it is often important to check the PPS of multiple columns and then base your interpretation on the comparison.
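If you want to automate a first read of a pps.matrix() output, the levels above can be turned into a small helper. The labels and cutoffs are the heuristics from this discussion, not anything ppscore itself defines:

```python
def pps_level(pps: float) -> str:
    """Map a PPS value to the rough levels suggested above.
    These cutoffs are heuristics from this thread, not part of ppscore."""
    if pps == 0:
        return "no predictive power"
    if pps > 0.8:
        return "likely deterministic relationship"
    if pps > 0.2:
        return "strong predictive power"
    return "weak predictive power"

for value in (0.0, 0.1, 0.5, 0.95):
    print(f"PPS {value:.2f}: {pps_level(value)}")
```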

What do you think about this explanation? Do you have some specific scenarios, use cases or questions that you want the PPS to answer?

@FernandoDoreto
Author

Hi @FlorianWetschoreck, thanks for your attention and commitment. I liked your point about "interpreting in context"; it makes sense. The ranges you described also generate insights for me, so I can code a way to automate a ppsThreshold (keep reading to understand what I mean).

I'm using ppscore specifically as part of an approach to detect relationships (linear, non-linear, quadratic, trigonometric, log, exponential, etc.) among variables. I see that this needs a faceted approach, since relationships are asymmetric and may have different shapes (linear, non-linear, etc.). I am considering combining ppscore, Spearman correlation, and MIC.

  • (1) I apply pps.matrix() to my dataset. Yes, that can have a high computing cost. Then I pick a certain "ppsThreshold" and query the matrix: model_score != 1 and ppscore > ppsThreshold. That leaves me with variable pairs that have relevant predictive power.

  • (2) Then I calculate the Spearman correlation on these pairs, revealing whether there is a monotonic relationship. This typically has a low computing cost.

  • (3) Finally, I calculate MIC for the unique combinations from these variable pairs. This has a high computing cost, which is why it is important to reduce the feature space with ppscore first. For me, MIC indicates the relationship strength.
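Steps (1) and (2) can be sketched as below. To keep the sketch self-contained, the `matrix` frame is built by hand as a stand-in for the pps.matrix() output (its columns `x`, `y`, `ppscore`, `model_score` follow the query in step (1)); in real use it would come from `ppscore.matrix(df)`. Step (3), MIC, is left out because it needs an extra dependency such as minepy:

```python
import pandas as pd
from scipy.stats import spearmanr

# Toy data: b is a monotone function of a.
df = pd.DataFrame({
    "a": range(100),
    "b": [i ** 2 for i in range(100)],
    "c": list(range(50)) * 2,
})

# Hand-built stand-in for pps.matrix(df); in real use, call ppscore.matrix(df).
matrix = pd.DataFrame({
    "x": ["a", "a", "b", "c"],
    "y": ["b", "c", "a", "a"],
    "ppscore": [0.95, 0.05, 0.90, 0.10],
    "model_score": [0.97, 0.50, 0.93, 0.40],
})

PPS_THRESHOLD = 0.2  # the hypothetical "ppsThreshold" from the post

# (1) keep pairs with relevant predictive power, dropping self-predictions
pairs = matrix.query("model_score != 1 and ppscore > @PPS_THRESHOLD")

# (2) Spearman correlation on the surviving pairs reveals monotonicity
for _, row in pairs.iterrows():
    rho, _p = spearmanr(df[row["x"]], df[row["y"]])
    print(f"{row['x']} -> {row['y']}: pps={row['ppscore']:.2f}, spearman={rho:.2f}")
```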

So ultimately I would be able to conclude something like:

Variable A has strong predictive power on Variable B, but Variable B doesn't have predictive power on Variable A.
Variable A has a strong and positive relationship with Variable B

  • Conclusion rationale:

strong predictive power: "strong" is provided by the ppscore
strong and positive relationship: "strong" is provided by the MIC level and "positive" by the Spearman correlation

Based on your personal and professional experience, does this "relationship detection" approach make sense to you? Are you aware of any (preferably open-source) Python package that does this?

Regards, Fernando

@FlorianWetschoreck
Collaborator

Hi Fernando,
what is the surrounding use case you are working on, and why do you need to estimate both the relationship strength and the shape?
More information in this regard might inform the solution approach.
Also, I am surprised by the general notion of "positive", because it only applies to a handful of relationship types - but it might be valid in your context.

About your solution approach:

  • please review whether you want to apply the Spearman correlation in parallel to the ppscore matrix, because the ppscore is not that good at detecting linear relationships and those scores might fall below your threshold
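One way to act on that caveat: instead of gating everything behind the ppscore threshold, keep a pair for further analysis (e.g. MIC) if either signal clears its own threshold. The thresholds below are illustrative, not recommendations from ppscore:

```python
def keep_pair(pps: float, spearman_rho: float,
              pps_threshold: float = 0.2, rho_threshold: float = 0.7) -> bool:
    """Keep a variable pair for further analysis if either the PPS or the
    absolute Spearman correlation clears its (illustrative) threshold."""
    return pps > pps_threshold or abs(spearman_rho) > rho_threshold

# A noisy linear relationship can score low on PPS while Spearman stays high:
print(keep_pair(pps=0.12, spearman_rho=0.85))  # True: rescued by Spearman
print(keep_pair(pps=0.05, spearman_rho=0.10))  # False: weak on both signals
```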
