Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Categorical PSI #1039

Merged
merged 7 commits into from
Sep 25, 2023
Merged

Categorical PSI #1039

merged 7 commits into from
Sep 25, 2023

Conversation

taylorfturner
Copy link
Contributor

@taylorfturner taylorfturner commented Sep 23, 2023

Enable PSI for categorical -- even as categories shift over time

@taylorfturner taylorfturner added Bug Something isn't working High Priority Dramatic improvement, inaccurate calculation(s) or bug / feature making the library unusable labels Sep 23, 2023
@taylorfturner taylorfturner self-assigned this Sep 23, 2023
"chi2-test": {
"chi2-statistic": 82 / 35,
"df": 2,
"p-value": 0.3099238764710244,
},
"psi": 0.0990210257942779,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PSI value for when new category is added

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new pd.Series(["y", "maybe", "y", "y", "n", "n", "maybe"])
old df_categorical = pd.Series(["y", "y", "y", "y", "n", "n", "n"])

Comment on lines +733 to +734
actual_diff = profile.diff(profile2)
self.assertDictEqual(expected_diff, actual_diff)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

variable clean up -- no functional change

Comment on lines -731 to -736
with self.assertWarnsRegex(
RuntimeWarning,
"psi was not calculated due to the differences in categories "
"of the profiles. Differences:\n{'maybe'}",
):
test_profile_diff = profile.diff(profile2)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing since this is un-needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we now handle keys not being equal in the the preprocessing function for categorical PSI

Comment on lines -773 to -775
# chi2-statistic = sum((observed-expected)^2/expected for each category in each column)
# df = categories - 1
# psi = (% of records based on Sample (A) - % of records Sample (B)) * ln(A/ B)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing un-needed

Comment on lines +308 to +314
(
self_cat_count,
other_cat_count,
) = self._preprocess_for_categorical_psi_calculation(
self_cat_count=cat_count1,
other_cat_count=cat_count2,
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting is pre-commit

"chi2-test": {
"chi2-statistic": 2.1,
"df": 2,
"p-value": 0.3499377491111554,
},
"psi": 0.009815252971365292,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PSI value for ColumnStatsProfileCompiler that is identifying the test data as categorical... therefore needs a PSI value for expected_diff

percent_self = self_cat_count[iter_key] / self.sample_size
percent_other = other_cat_count[iter_key] / other_profile.sample_size
if (percent_other == 0) or (percent_self == 0):
total_psi += 0.0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this total_psi += 0.0 isn't really needed but added for readability and clarity on what is occuring

@taylorfturner taylorfturner changed the title PSI Categorical PSI Sep 23, 2023
@@ -500,12 +500,13 @@ def test_column_stats_profile_compiler_stats_diff(self):
"categories": [["1"], ["9"], ["10"]],
"gini_impurity": 0.06944444444444448,
"unalikeability": 0.16666666666666663,
"categorical_count": {"9": -1, "1": [1, None], "10": [None, 1]},
"categorical_count": {"9": -1, "1": 1, "10": -1},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ksneab7 @micdavis I'm uncomfortable with this changing in the expected_diff but the more I look at it, it appears to be correct, but again changing expected values like this in the diff is uncomfortable... so if you do research something more in-depth take an extra look at this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I found the justification for this change and it lies in profiler_utils.find_diff_of_dicts ... on L577, we do not hit that any more due to both cat_count dictionaries having all the categories between both profiles

@@ -720,21 +721,17 @@ def test_categorical_diff(self):
"categories": [[], ["y", "n"], ["maybe"]],
"gini_impurity": -0.16326530612244894,
"unalikeability": -0.19047619047619047,
"categorical_count": {"y": 1, "n": 1, "maybe": [None, 2]},
"categorical_count": {"y": 1, "n": 1, "maybe": -2},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

justification for this change lies in profiler_utils.find_diff_of_dicts ... on L577, we do not hit that any more due to both cat_count dictionaries having all the categories between both profiles

Comment on lines 441 to 461
def _preprocess_for_categorical_psi_calculation(
self, self_cat_count, other_cat_count
):
super_set_categories = set(self_cat_count.keys()) | set(other_cat_count.keys())
if (super_set_categories != self_cat_count.keys()) or (
super_set_categories != other_cat_count.keys()
):
logger.info(
f"""PSI data pre-processing found that categories between
the profiles were not equal. Both profiles not contain
the following categories {super_set_categories}."""
)

for iter_key in super_set_categories:
for iter_dictionary in [self_cat_count, other_cat_count]:
try:
iter_dictionary[iter_key] = iter_dictionary[iter_key]
except KeyError:
iter_dictionary[iter_key] = 0
return self_cat_count, other_cat_count

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main fix is here to ensure that each cat_count has the same keys even if some are zero.... this is to ensure the PSI is calculated when new categories are added or old categories are removed over time

Comment on lines +12 to +18
from .. import dp_logging
from . import profiler_utils
from .base_column_profilers import BaseColumnProfiler
from .profiler_options import CategoricalOptions

logger = dp_logging.get_child_logger(__name__)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding a logger for awareness of preprocessing changes to categories

ksneab7
ksneab7 previously approved these changes Sep 23, 2023
@ksneab7 ksneab7 enabled auto-merge (squash) September 23, 2023 15:15
@ksneab7 ksneab7 merged commit 96a8876 into capitalone:dev Sep 25, 2023
micdavis pushed a commit that referenced this pull request Sep 25, 2023
* Categorical PSI  (#1039)

* fix bug

* reformatting pre-commit

* clean up and remove try/except

* pre-commit fix

* typo fix

* update version tag
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working High Priority Dramatic improvement, inaccurate calculation(s) or bug / feature making the library unusable
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants