When the PPS is applied to linear relationships with the same error but different slopes, the score varies a lot, e.g. from 0.1 to 0.7 depending on the slope.
This might not be the behaviour that we expect intuitively, and normalizing the target does not help.
The reason for this is that the ppscore is based on the ratio of the predictor's error to the baseline's error. If the slope is steep, the baseline makes much larger errors than the predictor, so the score is high; if the slope is flat, the two errors are nearly the same and the score is low.
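This effect can be reproduced without ppscore itself. The sketch below assumes the regression score is roughly `1 - MAE(model) / MAE(median baseline)` (ppscore actually fits a decision tree, so its numbers come out somewhat lower): with a fixed noise level, even a perfect linear fit is left with the noise, while the baseline's error grows with the slope.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 100_000)
error = rng.uniform(-0.5, 0.5, 100_000)

scores = {}
for slope in (0.3, 0.5, 1.0):
    y = slope * x + error
    # even a perfect linear model is still left with the noise
    mae_model = np.mean(np.abs(error))
    # the naive baseline always predicts the median of y
    mae_baseline = np.mean(np.abs(y - np.median(y)))
    scores[slope] = 1 - mae_model / mae_baseline

print(scores)  # the steeper the slope, the higher the score
```

The noise term is identical in all three cases; only the denominator (the baseline error) changes with the slope.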
The underlying problem is that the current metric and calculation of the ppscore couples two questions:
1. Is there a valid pattern? (e.g. statistical significance or predictive power after cross-validation)
2. Is the variance of the pattern low compared to the baseline variance?
If either of those two criteria fails or is weak, the ppscore will be low, too.
Only if both hold will the ppscore be high.
The problem with the linear cases is that the pattern is valid BUT the variance of the pattern is not low because there is a lot of noise - even though the pattern is statistically significant (a high error-to-signal ratio).
For this scenario (and maybe also for others), we might want to find a calculation that decouples those two concerns.
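One hypothetical way to decouple them (a sketch for discussion, not a proposal for the library's API): report the two criteria as two separate numbers, e.g. the significance of a fitted relationship for question 1 and the relative error reduction over the baseline for question 2. The linear fit, the Pearson test, and the 0.01 threshold are all illustrative choices, not what ppscore does internally.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 10_000)
y = 0.3 * x + rng.uniform(-0.5, 0.5, 10_000)

# question 1: is there a valid pattern at all?
r, p_value = stats.pearsonr(x, y)
pattern_is_valid = p_value < 0.01  # hypothetical threshold

# question 2: is the remaining error low compared to the baseline?
slope, intercept = np.polyfit(x, y, 1)
mae_model = np.mean(np.abs(y - (slope * x + intercept)))
mae_baseline = np.mean(np.abs(y - np.median(y)))
error_reduction = 1 - mae_model / mae_baseline

print(pattern_is_valid, round(error_reduction, 2))
```

For the flat-slope case this would report a clearly valid pattern together with a modest error reduction, instead of collapsing both into one low score.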
Some rough code:

```python
import pandas as pd
import numpy as np
import ppscore as pps

df = pd.DataFrame()
df["x"] = np.random.uniform(-2, 2, 1_000_000)
df["error"] = np.random.uniform(-0.5, 0.5, 1_000_000)
df["y"] = df["x"] * df["x"] + df["error"]

# linear relationships with the same error but different slopes
df["0.3_linear_x"] = 0.3 * df["x"] + df["error"]  # 0.11 pps
df["0.5_linear_x"] = 0.5 * df["x"] + df["error"]  # 0.4 pps
df["1_linear_x"] = 1 * df["x"] + df["error"]  # 0.68 pps

# target normalized to [0, 1] via +2 and /4
df["1_linear_x_norm"] = (df["1_linear_x"] + 2) / 4  # 0.68 pps, too
```
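That the last column scores the same as the unnormalized one is expected: an affine transform of the target scales the model error and the baseline error by the same factor, so any score of the form `1 - MAE_model / MAE_baseline` is unchanged. A quick check, using a plain linear fit as a stand-in for ppscore's decision tree:

```python
import numpy as np

def mae_score(x, y):
    # 1 - MAE(linear fit) / MAE(median baseline), a stand-in for the pps
    slope, intercept = np.polyfit(x, y, 1)
    mae_model = np.mean(np.abs(y - (slope * x + intercept)))
    mae_baseline = np.mean(np.abs(y - np.median(y)))
    return 1 - mae_model / mae_baseline

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 100_000)
y = x + rng.uniform(-0.5, 0.5, 100_000)
y_norm = (y + 2) / 4  # same affine normalization as above

print(round(mae_score(x, y), 3), round(mae_score(x, y_norm), 3))
```

So the slope dependence cannot be normalized away on the target side; it is inherent to the ratio itself.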