-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Categorical PSI #1039
Categorical PSI #1039
Conversation
"chi2-test": { | ||
"chi2-statistic": 82 / 35, | ||
"df": 2, | ||
"p-value": 0.3099238764710244, | ||
}, | ||
"psi": 0.0990210257942779, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PSI value for when new category is added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new pd.Series(["y", "maybe", "y", "y", "n", "n", "maybe"])
old df_categorical = pd.Series(["y", "y", "y", "y", "n", "n", "n"])
actual_diff = profile.diff(profile2) | ||
self.assertDictEqual(expected_diff, actual_diff) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
variable clean up -- no functional change
with self.assertWarnsRegex( | ||
RuntimeWarning, | ||
"psi was not calculated due to the differences in categories " | ||
"of the profiles. Differences:\n{'maybe'}", | ||
): | ||
test_profile_diff = profile.diff(profile2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing since this is un-needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we now handle keys not being equal in the the preprocessing function for categorical PSI
# chi2-statistic = sum((observed-expected)^2/expected for each category in each column) | ||
# df = categories - 1 | ||
# psi = (% of records based on Sample (A) - % of records Sample (B)) * ln(A/ B) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing un-needed
64a9412
to
bd9f11c
Compare
( | ||
self_cat_count, | ||
other_cat_count, | ||
) = self._preprocess_for_categorical_psi_calculation( | ||
self_cat_count=cat_count1, | ||
other_cat_count=cat_count2, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
formatting is pre-commit
"chi2-test": { | ||
"chi2-statistic": 2.1, | ||
"df": 2, | ||
"p-value": 0.3499377491111554, | ||
}, | ||
"psi": 0.009815252971365292, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PSI value for ColumnStatsProfileCompiler that is identifying the test data as categorical... therefore needs a PSI value for expected_diff
percent_self = self_cat_count[iter_key] / self.sample_size | ||
percent_other = other_cat_count[iter_key] / other_profile.sample_size | ||
if (percent_other == 0) or (percent_self == 0): | ||
total_psi += 0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this total_psi += 0.0
isn't really needed but added for readability and clarity on what is occuring
@@ -500,12 +500,13 @@ def test_column_stats_profile_compiler_stats_diff(self): | |||
"categories": [["1"], ["9"], ["10"]], | |||
"gini_impurity": 0.06944444444444448, | |||
"unalikeability": 0.16666666666666663, | |||
"categorical_count": {"9": -1, "1": [1, None], "10": [None, 1]}, | |||
"categorical_count": {"9": -1, "1": 1, "10": -1}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually I found the justification for this change and it lies in profiler_utils.find_diff_of_dicts
... on L577, we do not hit that any more due to both cat_count
dictionaries having all the categories between both profiles
@@ -720,21 +721,17 @@ def test_categorical_diff(self): | |||
"categories": [[], ["y", "n"], ["maybe"]], | |||
"gini_impurity": -0.16326530612244894, | |||
"unalikeability": -0.19047619047619047, | |||
"categorical_count": {"y": 1, "n": 1, "maybe": [None, 2]}, | |||
"categorical_count": {"y": 1, "n": 1, "maybe": -2}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
justification for this change lies in profiler_utils.find_diff_of_dicts ... on L577, we do not hit that any more due to both cat_count dictionaries having all the categories between both profiles
def _preprocess_for_categorical_psi_calculation( | ||
self, self_cat_count, other_cat_count | ||
): | ||
super_set_categories = set(self_cat_count.keys()) | set(other_cat_count.keys()) | ||
if (super_set_categories != self_cat_count.keys()) or ( | ||
super_set_categories != other_cat_count.keys() | ||
): | ||
logger.info( | ||
f"""PSI data pre-processing found that categories between | ||
the profiles were not equal. Both profiles not contain | ||
the following categories {super_set_categories}.""" | ||
) | ||
|
||
for iter_key in super_set_categories: | ||
for iter_dictionary in [self_cat_count, other_cat_count]: | ||
try: | ||
iter_dictionary[iter_key] = iter_dictionary[iter_key] | ||
except KeyError: | ||
iter_dictionary[iter_key] = 0 | ||
return self_cat_count, other_cat_count | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
main fix is here to ensure that each cat_count has the same keys even if some are zero.... this is to ensure the PSI is calculated when new categories are added or old categories are removed over time
from .. import dp_logging | ||
from . import profiler_utils | ||
from .base_column_profilers import BaseColumnProfiler | ||
from .profiler_options import CategoricalOptions | ||
|
||
logger = dp_logging.get_child_logger(__name__) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding a logger for awareness of preprocessing changes to categories
Co-authored-by: Michael Davis <[email protected]>
Co-authored-by: Michael Davis <[email protected]>
* Categorical PSI (#1039) * fix bug * reformatting pre-commit * clean up and remove try/except * pre-commit fix * typo fix * update version tag
Enable PSI for categorical -- even as categories shift over time