Categorical PSI #1039

taylorfturner · 2023-09-23T05:05:15Z

Enable PSI for categorical -- even as categories shift over time

dataprofiler/profilers/categorical_column_profile.py

taylorfturner · 2023-09-23T05:09:27Z

dataprofiler/tests/profilers/test_categorical_column_profile.py

                "chi2-test": {
                    "chi2-statistic": 82 / 35,
                    "df": 2,
                    "p-value": 0.3099238764710244,
                },
+                "psi": 0.0990210257942779,


PSI value for the case where a new category is added:

new: pd.Series(["y", "maybe", "y", "y", "n", "n", "maybe"])
old: df_categorical = pd.Series(["y", "y", "y", "y", "n", "n", "n"])
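
For anyone checking the expected value by hand, here is a minimal standalone sketch that reproduces it from the two series above. It assumes the non-zero terms follow the formula in the test comment, (percent_self - percent_other) * ln(percent_self / percent_other), that categories with a zero share in either profile contribute nothing, and that sample_size equals the length of each series. The function name categorical_psi is hypothetical and is not part of dataprofiler.

```python
import math
from collections import Counter

old = ["y", "y", "y", "y", "n", "n", "n"]          # data behind profile (self)
new = ["y", "maybe", "y", "y", "n", "n", "maybe"]  # data behind profile2 (other)


def categorical_psi(self_values, other_values):
    """Hypothetical helper: PSI summed over the union of categories."""
    self_counts = Counter(self_values)
    other_counts = Counter(other_values)
    total_psi = 0.0
    for category in set(self_counts) | set(other_counts):
        # Counter returns 0 for a category it has never seen.
        percent_self = self_counts[category] / len(self_values)
        percent_other = other_counts[category] / len(other_values)
        if percent_self == 0 or percent_other == 0:
            continue  # assumption: a zero share adds nothing to the total
        total_psi += (percent_self - percent_other) * math.log(
            percent_self / percent_other
        )
    return total_psi


psi = categorical_psi(old, new)
```

Only "y" and "n" contribute: (1/7) * ln(4/3) + (1/7) * ln(3/2) = ln(2) / 7, which agrees with the expected 0.0990210257942779 up to floating-point rounding.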

taylorfturner · 2023-09-23T05:09:50Z

dataprofiler/tests/profilers/test_categorical_column_profile.py

+        actual_diff = profile.diff(profile2)
+        self.assertDictEqual(expected_diff, actual_diff)


variable clean up -- no functional change

taylorfturner · 2023-09-23T05:10:06Z

dataprofiler/tests/profilers/test_categorical_column_profile.py

-        with self.assertWarnsRegex(
-            RuntimeWarning,
-            "psi was not calculated due to the differences in categories "
-            "of the profiles. Differences:\n{'maybe'}",
-        ):
-            test_profile_diff = profile.diff(profile2)


Removing this since it is no longer needed: the preprocessing function for categorical PSI now handles the case where the profiles' category keys are not equal.

taylorfturner · 2023-09-23T05:10:16Z

dataprofiler/tests/profilers/test_categorical_column_profile.py

-        # chi2-statistic = sum((observed-expected)^2/expected for each category in each column)
-        # df = categories - 1
-        # psi = (% of records based on Sample (A) - % of records  Sample (B)) * ln(A/ B)


Removing these comments; they are no longer needed.

dataprofiler/tests/profilers/test_column_profile_compilers.py

taylorfturner · 2023-09-23T13:21:27Z

dataprofiler/profilers/categorical_column_profile.py

+            (
+                self_cat_count,
+                other_cat_count,
+            ) = self._preprocess_for_categorical_psi_calculation(
+                self_cat_count=cat_count1,
+                other_cat_count=cat_count2,
+            )


This formatting comes from pre-commit.

dataprofiler/profilers/categorical_column_profile.py

taylorfturner · 2023-09-23T13:34:43Z

dataprofiler/tests/profilers/test_column_profile_compilers.py

                "chi2-test": {
                    "chi2-statistic": 2.1,
                    "df": 2,
                    "p-value": 0.3499377491111554,
                },
+                "psi": 0.009815252971365292,


ColumnStatsProfileCompiler identifies the test data as categorical, so expected_diff needs a PSI value.

taylorfturner · 2023-09-23T13:42:59Z

dataprofiler/profilers/categorical_column_profile.py

+                percent_self = self_cat_count[iter_key] / self.sample_size
+                percent_other = other_cat_count[iter_key] / other_profile.sample_size
+                if (percent_other == 0) or (percent_self == 0):
+                    total_psi += 0.0


This total_psi += 0.0 isn't strictly needed; I added it for readability, to make clear what is occurring.
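
To illustrate why the zero check matters, here is a small sketch, not the code in this PR. It assumes each term is computed as (percent_self - percent_other) * ln(percent_self / percent_other); psi_term and unguarded_psi_term are hypothetical names used only for this example.

```python
import math


def psi_term(percent_self, percent_other):
    """One category's PSI contribution; 0.0 if either share is zero."""
    if percent_self == 0 or percent_other == 0:
        return 0.0
    return (percent_self - percent_other) * math.log(percent_self / percent_other)


def unguarded_psi_term(percent_self, percent_other):
    """Same term without the zero check, shown only to expose the failure."""
    return (percent_self - percent_other) * math.log(percent_self / percent_other)


# "maybe" is absent from the old profile (share 0) and has share 2/7 in the new one.
guarded = psi_term(0.0, 2 / 7)

try:
    unguarded_psi_term(0.0, 2 / 7)
    failure = None
except (ValueError, ZeroDivisionError) as err:
    # math.log(0.0) raises ValueError; a zero percent_other would instead
    # raise ZeroDivisionError before the logarithm is evaluated.
    failure = type(err).__name__
```

Written as an early return or a continue, the guard has the same effect as the explicit += 0.0: the category is simply left out of the sum.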

taylorfturner · 2023-09-23T13:54:48Z

dataprofiler/tests/profilers/test_column_profile_compilers.py

@@ -500,12 +500,13 @@ def test_column_stats_profile_compiler_stats_diff(self):
                "categories": [["1"], ["9"], ["10"]],
                "gini_impurity": 0.06944444444444448,
                "unalikeability": 0.16666666666666663,
-                "categorical_count": {"9": -1, "1": [1, None], "10": [None, 1]},
+                "categorical_count": {"9": -1, "1": 1, "10": -1},


@ksneab7 @micdavis I'm uncomfortable with this change to expected_diff. The more I look at it, the more it appears to be correct, but changing expected values in the diff like this is still uncomfortable, so if you research anything in more depth, please take an extra look at this.

Actually, I found the justification for this change: it lies in profiler_utils.find_diff_of_dicts. We no longer hit L577, because both cat_count dictionaries now contain every category from both profiles.

taylorfturner · 2023-09-23T14:09:02Z

dataprofiler/tests/profilers/test_categorical_column_profile.py

@@ -720,21 +721,17 @@ def test_categorical_diff(self):
                "categories": [[], ["y", "n"], ["maybe"]],
                "gini_impurity": -0.16326530612244894,
                "unalikeability": -0.19047619047619047,
-                "categorical_count": {"y": 1, "n": 1, "maybe": [None, 2]},
+                "categorical_count": {"y": 1, "n": 1, "maybe": -2},


The justification for this change lies in profiler_utils.find_diff_of_dicts: we no longer hit L577, because both cat_count dictionaries now contain every category from both profiles.
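
A rough sketch of the effect, using the counts from test_categorical_diff. The diff_of_counts function below is a simplified, hypothetical stand-in and not the actual profiler_utils.find_diff_of_dicts; it only assumes that shared keys are subtracted and that a key present on one side only is reported as [value, None] or [None, value].

```python
def diff_of_counts(counts1, counts2):
    """Hypothetical stand-in for a dict diff (not the dataprofiler function)."""
    diff = {}
    for key, value in counts1.items():
        if key in counts2:
            diff[key] = value - counts2[key]  # shared key: plain difference
        else:
            diff[key] = [value, None]         # only in the first dict
    for key, value in counts2.items():
        if key not in counts1:
            diff[key] = [None, value]         # only in the second dict
    return diff


old_counts = {"y": 4, "n": 3}
new_counts = {"y": 3, "maybe": 2, "n": 2}

# Before this PR: "maybe" exists only in the second dict.
before = diff_of_counts(old_counts, new_counts)

# After this PR: preprocessing pads the missing category with 0 first.
padded_old = {**old_counts, "maybe": 0}
after = diff_of_counts(padded_old, new_counts)
```

Under those assumptions, before is {"y": 1, "n": 1, "maybe": [None, 2]} and after is {"y": 1, "n": 1, "maybe": -2}, which lines up with the old and new expected values in the hunk above.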

taylorfturner · 2023-09-23T14:32:50Z

dataprofiler/profilers/categorical_column_profile.py

+    def _preprocess_for_categorical_psi_calculation(
+        self, self_cat_count, other_cat_count
+    ):
+        super_set_categories = set(self_cat_count.keys()) | set(other_cat_count.keys())
+        if (super_set_categories != self_cat_count.keys()) or (
+            super_set_categories != other_cat_count.keys()
+        ):
+            logger.info(
+                f"""PSI data pre-processing found that categories between
+                    the profiles were not equal. Both profiles now contain
+                    the following categories {super_set_categories}."""
+            )
+
+        for iter_key in super_set_categories:
+            for iter_dictionary in [self_cat_count, other_cat_count]:
+                try:
+                    iter_dictionary[iter_key] = iter_dictionary[iter_key]
+                except KeyError:
+                    iter_dictionary[iter_key] = 0
+        return self_cat_count, other_cat_count
+


The main fix is here: ensure that each cat_count has the same keys, even if some of the counts are zero. This way the PSI is still calculated when new categories are added or old categories are removed over time.
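
As an illustration of the same idea without the try/except, here is one possible sketch of the key alignment. It is not the merged implementation; align_category_counts is a hypothetical name, and unlike the method in the hunk above it returns new dictionaries instead of filling in the inputs.

```python
def align_category_counts(self_cat_count, other_cat_count):
    """Hypothetical helper: give both count dicts the same set of keys."""
    all_categories = set(self_cat_count) | set(other_cat_count)
    # dict.get supplies 0 for any category a profile has never seen.
    aligned_self = {key: self_cat_count.get(key, 0) for key in all_categories}
    aligned_other = {key: other_cat_count.get(key, 0) for key in all_categories}
    return aligned_self, aligned_other


aligned_self, aligned_other = align_category_counts(
    {"y": 4, "n": 3},
    {"y": 3, "maybe": 2, "n": 2},
)
```

Here aligned_self gains "maybe": 0 and aligned_other is unchanged. Returning copies keeps the callers' dictionaries untouched, so any caller that wants the padded counts (for the diff, for example) has to use the returned values.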

taylorfturner · 2023-09-23T14:35:12Z

dataprofiler/profilers/categorical_column_profile.py

+from .. import dp_logging
 from . import profiler_utils
 from .base_column_profilers import BaseColumnProfiler
 from .profiler_options import CategoricalOptions

+logger = dp_logging.get_child_logger(__name__)
+


Adding a logger so that preprocessing changes to the categories are surfaced to the user.

dataprofiler/profilers/categorical_column_profile.py

Co-authored-by: Michael Davis <[email protected]>

* Categorical PSI (#1039)
* fix bug
* reformatting pre-commit
* clean up and remove try/except
* pre-commit fix
* typo fix
* update version tag

taylorfturner added the Bug and High Priority labels Sep 23, 2023

taylorfturner self-assigned this Sep 23, 2023

taylorfturner requested review from ksneab7, micdavis and tyfarnan as code owners September 23, 2023 05:05