Detection of linear patterns and decoupling of concerns #37

Open
FlorianWetschoreck opened this issue Aug 31, 2020 · 0 comments
When the PPS is applied to linear relationships with the same error but different slopes, the score varies a lot, e.g. from 0.1 to 0.7 depending on the slope.

This might not be the behaviour we intuitively expect, and normalizing the target does not help.
The reason for this is that the ppscore compares the model's error to the error of a naive baseline. If the slope is steep, the baseline makes much larger errors than the model, so the score is high. If the slope is flat, the two errors are nearly the same, so the score is low.
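The effect can be reproduced without the library. The helper below mimics the regression normalization (1 - MAE of the model divided by MAE of a naive median predictor); `pps_like` is a hypothetical name and a simplification (the real ppscore fits a cross-validated decision tree), but even a perfect model shows the slope dependence:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 100_000)
error = rng.uniform(-0.5, 0.5, 100_000)

def pps_like(y, y_pred):
    # PPS-style normalization for regression (a sketch, not the
    # library's implementation): 1 - MAE(model) / MAE(naive median)
    mae_model = np.mean(np.abs(y - y_pred))
    mae_naive = np.mean(np.abs(y - np.median(y)))
    return max(0.0, 1 - mae_model / mae_naive)

for slope in (0.3, 0.5, 1.0):
    y = slope * x + error
    # even the *true* model (predicting slope * x) scores very
    # differently per slope, because the baseline's MAE changes
    print(slope, round(pps_like(y, slope * x), 2))
```

The model's MAE is the same (the noise) in every case; only the baseline's MAE grows with the slope, which is exactly the coupling described above.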

The underlying problem is that the current metric and calculation of the ppscore couples two questions:

  1. Is there a valid pattern? (e.g. statistical significance, or predictive power after cross-validation)
  2. Is the variance of the pattern low compared to the baseline variance?

If either of those two criteria fails or is weak, the ppscore will be low, too.
Only if both hold will the ppscore be high.

The problem with the linear cases is that the pattern is valid BUT its variance is not low, because there is a lot of noise relative to the signal, even when the pattern is statistically significant (a high error-to-signal ratio).
For this scenario (and maybe others, too), we might want to find a calculation that decouples those two concerns.

Some rough code:

import pandas as pd
import numpy as np

import ppscore as pps

df = pd.DataFrame()
df["x"] = np.random.uniform(-2, 2, 1_000_000)
df["error"] = np.random.uniform(-0.5, 0.5, 1_000_000)
df["y"] = df["x"] * df["x"] + df["error"]

# same noise, different slopes -> very different scores
df["0.3_linear_x"] = 0.3 * df["x"] + df["error"]  # 0.11 pps
df["0.5_linear_x"] = 0.5 * df["x"] + df["error"]  # 0.4 pps
df["1_linear_x"] = 1 * df["x"] + df["error"]  # 0.68 pps

# normalizing the target to [0, 1] via +2 and /4 does not change the score
df["1_linear_x_norm"] = (df["1_linear_x"] + 2) / 4  # 0.68 pps, too