Thought on a possible enhancement of the PPS #61

jeroenvermunt · 2022-02-10T10:38:09Z

Currently, the PPS score is already very useful and I regularly use it for feature selection and general insights whenever I encounter a new data set. Recently I had an idea to maybe increase the capabilities of the metric.

As mentioned in the article "RIP correlation. Introducing the Predictive Power Score", when using the PPS one should keep in mind that it only captures direct relations between a single feature and the target, not combinations of input features.

To address this weakness, would it be an idea to give the underlying decision tree 2 variables instead of one? This will take a significantly longer time, but it gives combinations of variables a chance and might also be able to give additional information about the input features.

For example, say I have a target variable 'y' and input features 'x1, x2, x3, x4'. I apply the PPS and find the scores 0.4, 0, 0.4 and 0.6 respectively. Now, as a follow-up, I try all pairwise combinations and discover the following:

If I combine x1 and x2, I get a predictive power score of 0.5. I now know that this combination increases the PPS by 0.5 - 0.4 = 0.1 over the best single feature of the pair.
If I combine x1 and x4, I get a predictive power score of 0.6. The increase is now 0.6 - 0.6 = 0, implying that even though x1 has a PPS of 0.4, I might as well use x4 and drop x1.

This requires a slightly different implementation of the algorithm, and before committing to developing the implementation I was wondering if this train of thought makes any sense. Opinions on such an additional feature?
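The pairwise comparison described above can be sketched roughly as follows. This is not part of ppscore and not its implementation: `multi_feature_pps` and `pair_gains` are hypothetical helpers built on scikit-learn. The sketch assumes a numeric target, a decision tree regressor, 4-fold cross-validation, and a PPS-style normalization of the model's MAE against a naive median predictor. The synthetic data (`y = x1 * x2`) is made up purely to show a case where neither feature is predictive alone but the pair is.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor


def multi_feature_pps(X, y, cv=4, random_state=0):
    """Hypothetical PPS-style score for one or more numeric features.

    Assumption: regression only, scored as 1 - MAE_model / MAE_naive,
    where the naive model always predicts the median of y.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)  # a single feature becomes one column
    y = np.asarray(y, dtype=float)

    model = DecisionTreeRegressor(random_state=random_state)
    pred = cross_val_predict(model, X, y, cv=cv)  # out-of-fold predictions

    mae_model = np.mean(np.abs(y - pred))
    mae_naive = np.mean(np.abs(y - np.median(y)))
    if mae_naive == 0:
        return 0.0  # constant target: nothing to predict
    # Clip at 0 so a model worse than the naive baseline scores 0.
    return max(0.0, 1.0 - mae_model / mae_naive)


def pair_gains(features, y):
    """features: dict of name -> 1-D array.

    Returns the single-feature scores and, for every pair, a tuple of
    (pair score, gain over the better single feature of that pair).
    """
    single = {name: multi_feature_pps(col, y) for name, col in features.items()}
    gains = {}
    for a, b in combinations(features, 2):
        X_pair = np.column_stack([features[a], features[b]])
        pair = multi_feature_pps(X_pair, y)
        gains[(a, b)] = (pair, pair - max(single[a], single[b]))
    return single, gains


# Synthetic illustration: y depends only on the interaction of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 1000)
x2 = rng.uniform(-1, 1, 1000)
x3 = rng.uniform(-1, 1, 1000)  # pure noise
y = x1 * x2

single, gains = pair_gains({"x1": x1, "x2": x2, "x3": x3}, y)
print(single)
print(gains)
```

A gain near 0 for a pair would suggest, as in the x1/x4 case above, that the weaker feature adds nothing on top of the stronger one. Note that the number of pairs grows quadratically with the number of features, which is where the extra runtime comes from.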

FlorianWetschoreck · 2022-02-10T22:30:19Z

Hey Jeroen, thank you for the proposal! You are right that the general calculation of the ppscore scales to 2 and more variables - possibly also using different models than a decision tree.
So, I totally want to encourage you to go ahead in that direction.

Please note though that we will not add that code into this code base here in order to keep it lean and to not increase the maintenance surface.

However, you can create your own library for those (and other) use cases, and if it is well done we can add a link to it in our docs to let other users of the ppscore know.

What do you think?

jeroenvermunt · 2022-02-14T11:36:01Z

Alright, I will keep the idea in mind and might write the code for it in the coming weeks. I'm still uncertain about the added value it brings to the current ppscore and whether the results of such an addition can be easily interpreted.

The motivation for this idea came when my coworkers and I were discussing methods for feature selection, and whether we could think of ways to reduce manual exploration as much as possible, replacing it with various reliable methods that give easy but strong visual interpretations of all the variables of a data set.
