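Currently, the PPS is already very useful, and I regularly use it for feature selection and general insights whenever I encounter a new data set. Recently I had an idea that might increase the capabilities of the metric.
As mentioned in the article "RIP correlation. Introducing the Predictive Power Score", one should keep in mind when using the PPS that it only captures direct relations, not combinations of input features.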
To address this weakness, would it be an idea to give the underlying decision tree two variables instead of one? This will take significantly longer, since the number of feature pairs grows quadratically with the number of features, but it gives combinations of variables a chance and might also reveal additional information about the input features.
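A rough sketch of what scoring feature pairs could look like. This is not ppscore's own code: it assumes a numeric target, scikit-learn's `DecisionTreeRegressor`, 4-fold cross-validation, and a PPS-style normalisation of the MAE against a median baseline. The helper names `tree_score` and `pairwise_scores` are made up for illustration.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor


def tree_score(X, y, cv=4, random_state=0):
    """PPS-style score (assumed form): 1 - MAE(tree) / MAE(median), clipped at 0."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        # A single feature arrives as a 1-D array; scikit-learn expects 2-D input.
        X = X.reshape(-1, 1)
    tree = DecisionTreeRegressor(random_state=random_state)
    # Out-of-fold predictions, so the score reflects generalisation, not memorisation.
    pred = cross_val_predict(tree, X, y, cv=cv)
    mae_model = np.mean(np.abs(y - pred))
    mae_naive = np.mean(np.abs(y - np.median(y)))
    if mae_naive == 0:
        # Constant target: nothing to predict.
        return 0.0
    return max(0.0, 1.0 - mae_model / mae_naive)


def pairwise_scores(features, y):
    """Score every unordered pair of columns in a {name: values} mapping."""
    return {
        (a, b): tree_score(np.column_stack([features[a], features[b]]), y)
        for a, b in combinations(features, 2)
    }


if __name__ == "__main__":
    # Toy data: y depends on x1 and x2 only jointly (XOR), so neither
    # feature is informative on its own.
    rng = np.random.default_rng(0)
    x1 = rng.integers(0, 2, 200)
    x2 = rng.integers(0, 2, 200)
    y = x1 ^ x2
    print("x1 alone:", tree_score(x1, y))
    print("x2 alone:", tree_score(x2, y))
    print("pairs:", pairwise_scores({"x1": x1, "x2": x2}, y))
```

The XOR toy data is the extreme case of the weakness: each feature alone should score near zero, while the pair should score high. For n features there are n(n-1)/2 pairs, which is where the extra runtime comes from.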
For example, suppose I have the target variable 'y' and the input features 'x1, x2, x3, x4'. I apply the PPS and find the scores 0.4, 0, 0.4 and 0.6 respectively. As a follow-up, I try all combinations and discover the following:
If I combine x1 and x2, I get a predictive power score of 0.5. I now know that this combination increases the PPS by 0.5 - 0.4 = 0.1 over the best single feature of the pair.
If I combine x1 and x4, I get a predictive power score of 0.6. The increase is now 0.6 - 0.6 = 0, implying that even though x1 has a PPS of 0.4, I might as well use x4 and drop x1.
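The arithmetic in this example can be written down as a small helper. This is only a sketch: the function name `pair_gains` is hypothetical, and it assumes the gain of a pair is measured against the better of its two single-feature scores, which is how both cases above are computed. Only the two pair scores given in the example are included.

```python
# Single-feature scores from the example.
single_scores = {"x1": 0.4, "x2": 0.0, "x3": 0.4, "x4": 0.6}

# Pair scores from the example; the other combinations are not given.
pair_scores = {("x1", "x2"): 0.5, ("x1", "x4"): 0.6}


def pair_gains(pair_scores, single_scores):
    """Gain of each pair over the better of its two single-feature scores."""
    gains = {}
    for (a, b), score in pair_scores.items():
        best_single = max(single_scores[a], single_scores[b])
        # Round to hide floating-point noise such as 0.09999999999999998.
        gains[(a, b)] = round(score - best_single, 10)
    return gains


gains = pair_gains(pair_scores, single_scores)
for pair, gain in gains.items():
    print(pair, f"{gain:.2f}")
# ('x1', 'x2') 0.10
# ('x1', 'x4') 0.00
```

A gain of zero, as for x1 and x4, suggests the weaker feature of the pair adds nothing beyond the stronger one.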
This requires a slightly different implementation of the algorithm, and before committing to developing the implementation I was wondering if this train of thought makes any sense. Opinions on such an additional feature?
Hey Jeroen, thank you for the proposal! You are right that the general calculation of the ppscore extends to two or more variables, possibly also using models other than a decision tree.
So, I totally want to encourage you to go ahead in that direction.
Please note though that we will not add that code into this code base here in order to keep it lean and to not increase the maintenance surface.
However, you can create your own library for those (and other) use cases, and if it is well done we can add a link in our docs to let other users of the ppscore know.
Alright, I will keep the idea in mind and might write the code for it in the coming weeks. I'm still uncertain about the added value it brings to the current ppscore and whether the results of such an addition can be easily interpreted.
The motivation for this idea came when my coworkers and I were discussing methods for feature selection and whether we could think of ways to reduce manual exploration as much as possible, replacing it with various reliable methods that offer easy but strong visual interpretations of all the variables of a data set.