Fix bug with null replication metrics #702

tonywu315 · 2022-10-26T14:41:12Z

When null_replication_metrics is enabled and any column is entirely null, the following code raises an error:

import dataprofiler as dp
import pandas as pd
import re

profiler_options = dp.ProfilerOptions()

NO_FLAG = 0
profiler_options.set({
    '*.null_values': {
        "": NO_FLAG,
        "nan": re.IGNORECASE,
        "none": re.IGNORECASE,
        "null": re.IGNORECASE,
        "  *": NO_FLAG,
        "--*": NO_FLAG,
        "__*": NO_FLAG,
        "9"*7: NO_FLAG,
    }, 
    "*.null_replication_metrics.is_enabled": True,
    "data_labeler.is_enabled": False,
    "multiprocess.is_enabled": False,
})
profiler = dp.Profiler(pd.DataFrame([[9999999,9], [9999999,9]]), options=profiler_options)

JGSweets · 2022-10-26T15:55:26Z

dataprofiler/profilers/profile_builder.py

@@ -2244,7 +2244,10 @@ def _update_null_replication_metrics(self, clean_samples: Dict) -> None:
        ]._profiles[get_data_type(profile)]

        total_row_sum = np.asarray(
-            [get_data_type_profiler(profile).sum for profile in self._profile]
+            [
+                get_data_type_profiler(profile).sum if get_data_type(profile) else 0


I think np.nan might be more appropriate.

0 implies a value is real rather than all null
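A minimal sketch of the distinction being made here, in plain Python rather than the profiler's own code. The helper name `column_sum_or_nan` is hypothetical, and `math.nan` stands in for `np.nan` (both are the same float NaN value). It only illustrates why NaN, unlike 0, keeps an all-null column distinguishable from a column whose real values sum to zero.

```python
import math


def column_sum_or_nan(values):
    """Sum the non-null entries; return NaN when every entry is null.

    `None` marks a null entry here. This is an illustrative stand-in,
    not how dataprofiler represents nulls internally.
    """
    present = [v for v in values if v is not None]
    return sum(present) if present else math.nan


all_null = [None, None]
real_zeros = [0, 0]

print(column_sum_or_nan(all_null))    # nan
print(column_sum_or_nan(real_zeros))  # 0
```

With 0 as the placeholder, both columns above would report the same sum, so downstream code could not tell "no data" from "data that sums to zero".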

JGSweets · 2022-10-26T15:56:30Z

dataprofiler/profilers/profile_builder.py

+            if true_count == 0:
+                mean_not_null = sum_not_null
+            else:
+                mean_not_null = sum_not_null / true_count


is this necessary?

When a column is all null, true_count is 0, which creates a divide by 0 error.
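The guard under discussion can be sketched as a small standalone function. This is an assumption-laden illustration, not the code in profile_builder.py: the function wrapper is hypothetical, only the names `sum_not_null`, `true_count`, and `mean_not_null` come from the diff, and it returns NaN for the empty case, following the later commit titled "Use nan for sum and mean with no values" rather than the first version shown above.

```python
import math


def mean_not_null(sum_not_null, true_count):
    """Mean of the non-null values, or NaN when there are none.

    Dividing by a zero count is avoided explicitly so that an
    all-null column yields NaN instead of a division error.
    """
    if true_count == 0:
        return math.nan
    return sum_not_null / true_count


print(mean_not_null(18, 2))   # 9.0
print(mean_not_null(0.0, 0))  # nan
```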

JGSweets · 2022-10-26T15:56:41Z

dataprofiler/profilers/profile_builder.py

@@ -2244,7 +2244,10 @@ def _update_null_replication_metrics(self, clean_samples: Dict) -> None:
        ]._profiles[get_data_type(profile)]

        total_row_sum = np.asarray(
-            [get_data_type_profiler(profile).sum for profile in self._profile]


need to add a test for this.

The test can be as simple as running the example code you have in the description and ensuring the rep matrix is what we expect

JGSweets · 2022-10-26T15:57:24Z

profiler = dp.Profiler(pd.DataFrame([[9999999,9], [9999999,9]]), options=profiler_options)

This is the erroring line I meant to put.

taylorfturner · 2022-10-28T20:56:46Z

@tonywu315 this needs an update branch. Will review once updated.

dataprofiler/tests/profilers/test_profile_builder.py

taylorfturner

just two comments around simplifying the test suite that is added in this PR

dataprofiler/tests/profilers/test_profile_builder.py

taylorfturner

LGTM

dataprofiler/profilers/profile_builder.py

taylorfturner

LGTM

dismissing... tests may actually fail

taylorfturner · 2022-11-02T12:29:55Z

@tonywu315 update branch FYI

taylorfturner

code was changed back to a prior paradigm... can we keep that same paradigm but just update first assignment?

dataprofiler/profilers/profile_builder.py

taylorfturner

LGTM

dataprofiler/profilers/profile_builder.py

dataprofiler/tests/profilers/test_profile_builder.py

taylorfturner · 2022-11-03T18:00:10Z

dataprofiler/tests/profilers/test_profile_builder.py

@@ -2048,6 +2048,39 @@ def test_null_replication_metrics_calculation(self):
        np.testing.assert_array_almost_equal([17 / 8, 48 / 8], column["class_mean"][0])
        np.testing.assert_array_almost_equal([12 / 2, 6 / 2], column["class_mean"][1])

+        # Test with all null in a column
+        data_3 = pd.DataFrame([[9999999, 9], [9999999, 9]])


in the future, might be worthwhile to create a dataset with more edge cases
data_3 = pd.DataFrame([[9999999, 9999999, 123], [9999999, 123, 123]])
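To see which edge cases the suggested dataset would cover, here is a rough standalone sketch that classifies each column by its fraction of null cells. It does not call dataprofiler. The helpers `is_null` and `null_fraction_per_column` are hypothetical, the pattern table is a subset of the `null_values` option from the description, and matching each pattern against the whole cell with `re.fullmatch` is an assumption about how null values are detected.

```python
import re

# Subset of the null_values option from the PR description: pattern -> regex flags.
NULL_PATTERNS = {"": 0, "nan": re.IGNORECASE, "9" * 7: 0}


def is_null(value):
    """Treat a cell as null when its string form fully matches any pattern."""
    text = str(value)
    return any(
        re.fullmatch(pattern, text, flags) is not None
        for pattern, flags in NULL_PATTERNS.items()
    )


def null_fraction_per_column(rows):
    """Return, for each column, the fraction of cells considered null."""
    columns = list(zip(*rows))
    return [sum(is_null(v) for v in col) / len(col) for col in columns]


data_3 = [[9999999, 9999999, 123], [9999999, 123, 123]]
print(null_fraction_per_column(data_3))  # [1.0, 0.5, 0.0]
```

Under these assumptions the suggested rows give one all-null column, one partially null column, and one column with no nulls, which is a broader mix than the all-null-plus-real pair used in the added test.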

Fix bug

f689710

tonywu315 requested review from JGSweets, ksneab7, taylorfturner, micdavis and tyfarnan as code owners October 26, 2022 14:41

tonywu315 changed the title ~~Fix bug~~ Fix bug with null replication metrics Oct 26, 2022

taylorfturner assigned tonywu315 Oct 26, 2022

taylorfturner added the Bug and Medium Priority labels Oct 26, 2022

JGSweets reviewed Oct 26, 2022

tonywu315 and others added 3 commits October 26, 2022 12:09

Change 0 to np.nan

e04ed3c

Merge branch 'main' into null_replication_matrix

b840181

Add tests

0346fa3

Merge branch 'main' into null_replication_matrix

20a55ce

taylorfturner reviewed Oct 31, 2022

dataprofiler/tests/profilers/test_profile_builder.py (Outdated)

Change test

af6eb98

taylorfturner reviewed Nov 1, 2022

dataprofiler/tests/profilers/test_profile_builder.py (Outdated)

dataprofiler/tests/profilers/test_profile_builder.py (Outdated)

Simplify tests

c271e39

taylorfturner previously approved these changes Nov 1, 2022

taylorfturner reviewed Nov 1, 2022

dataprofiler/profilers/profile_builder.py (Outdated)

taylorfturner requested a review from JGSweets November 1, 2022 18:06

Change mean_not_null to nan

cf725b2

tonywu315 dismissed taylorfturner’s stale review via cf725b2 November 1, 2022 18:34

taylorfturner suggested changes Nov 1, 2022

dataprofiler/profilers/profile_builder.py (Outdated)

Fix

66d5a1c

taylorfturner previously approved these changes Nov 1, 2022

Use nan for sum and mean with no values

0075816

taylorfturner previously approved these changes Nov 2, 2022

taylorfturner enabled auto-merge (squash) November 2, 2022 00:35

taylorfturner reviewed Nov 2, 2022

dataprofiler/profilers/profile_builder.py (Outdated)

dataprofiler/profilers/profile_builder.py (Outdated)

tonywu315 and others added 2 commits November 2, 2022 09:11

Merge branch 'main' into null_replication_matrix

0a9fa2c

Reorder code

b1ac019

auto-merge was automatically disabled November 2, 2022 13:20
Head branch was pushed to by a user without write access

tonywu315 dismissed taylorfturner’s stale review via b1ac019 November 2, 2022 13:20

taylorfturner approved these changes Nov 2, 2022

taylorfturner enabled auto-merge (squash) November 2, 2022 13:23

taylorfturner reviewed Nov 2, 2022

dataprofiler/profilers/profile_builder.py

dataprofiler/profilers/profile_builder.py

Merge branch 'main' into null_replication_matrix

e00b24a

taylorfturner reviewed Nov 3, 2022

dataprofiler/profilers/profile_builder.py

taylorfturner reviewed Nov 3, 2022

dataprofiler/profilers/profile_builder.py

taylorfturner reviewed Nov 3, 2022

dataprofiler/tests/profilers/test_profile_builder.py

taylorfturner approved these changes Nov 3, 2022

JGSweets approved these changes Nov 3, 2022

taylorfturner merged commit d4b5860 into capitalone:main Nov 3, 2022

taylorfturner reviewed Nov 3, 2022

tonywu315 deleted the null_replication_matrix branch November 3, 2022 18:04
