Thought on a possible enhancement of the PPS #61

Open
jeroenvermunt opened this issue Feb 10, 2022 · 2 comments

Comments

@jeroenvermunt

Currently, the PPS score is already very useful and I regularly use it for feature selection and general insights whenever I encounter a new data set. Recently I had an idea to maybe increase the capabilities of the metric.

As mentioned in the article "RIP correlation. Introducing the Predictive Power Score", when using the PPS one should keep in mind that it only captures direct relations between a single feature and the target, not combinations of input features.

To address this weakness, would it be an idea to give the underlying decision tree two variables instead of one? This will take significantly longer, since the number of feature pairs grows quadratically with the number of features, but it gives combinations of variables a chance and might also give additional information about the input features.
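To make the idea concrete, here is a minimal sketch of a PPS-style score that accepts more than one feature column. It is not the ppscore implementation: the function name `multi_feature_pps` is hypothetical, and the scoring rule (cross-validated decision tree MAE, normalized against a naive median predictor and clipped at 0) is my assumption about what a regression-style PPS looks like. It only handles numeric features and a numeric target.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor


def multi_feature_pps(X, y, cv=4, random_state=0):
    """Hypothetical PPS-style score for one or more numeric features.

    Returns 1 - MAE_model / MAE_naive, clipped to [0, 1], where the naive
    model always predicts the median of y (an assumed baseline).
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)  # allow a single feature column
    y = np.asarray(y, dtype=float)

    # Naive baseline: always predict the median of the target.
    mae_naive = np.mean(np.abs(y - np.median(y)))
    if mae_naive == 0:
        return 0.0  # constant target, nothing to predict

    # Out-of-fold predictions from a decision tree on the given columns.
    folds = KFold(n_splits=cv, shuffle=True, random_state=random_state)
    model = DecisionTreeRegressor(random_state=random_state)
    preds = cross_val_predict(model, X, y, cv=folds)
    mae_model = np.mean(np.abs(y - preds))

    return float(max(0.0, 1.0 - mae_model / mae_naive))


# Toy data: y is the XOR of two binary features, so neither feature
# is informative on its own, but the pair determines y exactly.
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=200)
x2 = rng.integers(0, 2, size=200)
y = np.logical_xor(x1, x2).astype(float)

score_x1 = multi_feature_pps(x1, y)
score_x2 = multi_feature_pps(x2, y)
score_pair = multi_feature_pps(np.column_stack([x1, x2]), y)
print(score_x1, score_x2, score_pair)
```

On this toy data the single-feature scores should stay near zero while the pair scores close to 1, which is exactly the kind of relation a one-feature-at-a-time score cannot see.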

For example, suppose I have target variable 'y' and input features 'x1, x2, x3, x4'. I apply the PPS and find the scores 0.4, 0, 0.4 and 0.6 respectively. Now, as a follow-up, I try all combinations and discover the following:

  • If I combine x1 and x2, I get a predictive power score of 0.5. I now know that this combination increases the PPS by 0.5 - 0.4 = 0.1 over the best single feature of the pair (x1).
  • If I combine x1 and x4, I get a predictive power score of 0.6. The increase over the best single feature of the pair (x4) is now 0.6 - 0.6 = 0, implying that even though x1 has a PPS of 0.4, I might as well use x4 and drop x1.
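The arithmetic in these two bullets can be written down as a small helper. This is only a sketch: the name `pair_gains` is hypothetical, and defining the gain as the pair score minus the better of the two single-feature scores is my reading of the example above. The scores are the made-up numbers from the example, not measured results.

```python
def pair_gains(pair_scores, single_scores):
    """Gain of each feature pair over the better of its two single features."""
    gains = {}
    for (a, b), score in pair_scores.items():
        best_single = max(single_scores[a], single_scores[b])
        # Round to hide floating-point noise such as 0.09999999999999998.
        gains[(a, b)] = round(score - best_single, 10)
    return gains


# Illustrative scores taken from the example in this comment.
single_scores = {"x1": 0.4, "x2": 0.0, "x3": 0.4, "x4": 0.6}
pair_scores = {("x1", "x2"): 0.5, ("x1", "x4"): 0.6}

gains = pair_gains(pair_scores, single_scores)
for pair, gain in gains.items():
    print(pair, gain)
```

A gain of 0 for a pair suggests that the weaker feature adds nothing on top of the stronger one, while a positive gain suggests the two features carry complementary information.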

This requires a slightly different implementation of the algorithm, and before committing to developing the implementation I was wondering if this train of thought makes any sense. Opinions on such an additional feature?

@FlorianWetschoreck
Collaborator

Hey Jeroen, thank you for the proposal! You are right that the general calculation of the ppscore scales to two or more variables, possibly also using models other than a decision tree.
So, I totally want to encourage you to go ahead in that direction.

Please note though that we will not add that code into this code base here in order to keep it lean and to not increase the maintenance surface.

However, you can create your own library for those (and other) use cases, and if it is well done we can add a link to our docs in order to let other users of the ppscore know.

What do you think?

@jeroenvermunt
Author

Alright, I will keep the idea in mind and might write the code for it in the coming weeks. I'm still uncertain about the added value it brings to the current ppscore and whether the results of such an addition can be easily interpreted.

The motivation for this idea came when my coworkers and I were discussing methods for feature selection and whether we could think of ways to reduce manual exploration as much as possible. The manual exploration would be replaced by various reliable methods that give easy but strong visual interpretations of all the variables of a data set.
