Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug with null replication metrics #702

Merged
merged 13 commits into from
Nov 3, 2022

Conversation

tonywu315
Copy link
Contributor

When null_replication_metrics is enabled and any columns is all null, this errors:

import dataprofiler as dp
import pandas as pd
import re

profiler_options = dp.ProfilerOptions()

NO_FLAG = 0
profiler_options.set({
    '*.null_values': {
        "": NO_FLAG,
        "nan": re.IGNORECASE,
        "none": re.IGNORECASE,
        "null": re.IGNORECASE,
        "  *": NO_FLAG,
        "--*": NO_FLAG,
        "__*": NO_FLAG,
        "9"*7: NO_FLAG,
    }, 
    "*.null_replication_metrics.is_enabled": True,
    "data_labeler.is_enabled": False,
    "multiprocess.is_enabled": False,
})
profiler = dp.Profiler(pd.DataFrame([[9999999,9], [9999999,9]]), options=profiler_options)

@tonywu315 tonywu315 changed the title Fix bug Fix bug with null replication metrics Oct 26, 2022
@taylorfturner taylorfturner added Bug Something isn't working Medium Priority Significant improvement or bug / feature reducing overall performance labels Oct 26, 2022
@@ -2244,7 +2244,10 @@ def _update_null_replication_metrics(self, clean_samples: Dict) -> None:
]._profiles[get_data_type(profile)]

total_row_sum = np.asarray(
[get_data_type_profiler(profile).sum for profile in self._profile]
[
get_data_type_profiler(profile).sum if get_data_type(profile) else 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think np.nan might be more appropriate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 implies a value is real rather than all null

Comment on lines 2297 to 2300
if true_count == 0:
mean_not_null = sum_not_null
else:
mean_not_null = sum_not_null / true_count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a column is all null, true_count is 0, which creates a divide by 0 error.

@@ -2244,7 +2244,10 @@ def _update_null_replication_metrics(self, clean_samples: Dict) -> None:
]._profiles[get_data_type(profile)]

total_row_sum = np.asarray(
[get_data_type_profiler(profile).sum for profile in self._profile]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to add a test for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test can be as simple as running the example code you have in the description and ensuring the rep matrix is what we expect

@JGSweets
Copy link
Contributor

profiler = dp.Profiler(pd.DataFrame([[9999999,9], [9999999,9]]), options=profiler_options)

Is the erroring I meant to put.

@taylorfturner
Copy link
Contributor

@tonywu315 needs and update branch. Will review one updated

Copy link
Contributor

@taylorfturner taylorfturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just two comments around simplifying the test suite that is added in this PR

dataprofiler/tests/profilers/test_profile_builder.py Outdated Show resolved Hide resolved
dataprofiler/tests/profilers/test_profile_builder.py Outdated Show resolved Hide resolved
taylorfturner
taylorfturner previously approved these changes Nov 1, 2022
Copy link
Contributor

@taylorfturner taylorfturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

taylorfturner
taylorfturner previously approved these changes Nov 1, 2022
Copy link
Contributor

@taylorfturner taylorfturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@taylorfturner taylorfturner dismissed their stale review November 1, 2022 18:39

dismissing... tests may actually fail

taylorfturner
taylorfturner previously approved these changes Nov 2, 2022
@taylorfturner taylorfturner enabled auto-merge (squash) November 2, 2022 00:35
@taylorfturner
Copy link
Contributor

@tonywu315 update branch FYI

Copy link
Contributor

@taylorfturner taylorfturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code was changed back to a prior paradigm... can we keep that same paradigm but just update first assignment?

dataprofiler/profilers/profile_builder.py Outdated Show resolved Hide resolved
dataprofiler/profilers/profile_builder.py Outdated Show resolved Hide resolved
auto-merge was automatically disabled November 2, 2022 13:20

Head branch was pushed to by a user without write access

Copy link
Contributor

@taylorfturner taylorfturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@taylorfturner taylorfturner enabled auto-merge (squash) November 2, 2022 13:23
@taylorfturner taylorfturner merged commit d4b5860 into capitalone:main Nov 3, 2022
@@ -2048,6 +2048,39 @@ def test_null_replication_metrics_calculation(self):
np.testing.assert_array_almost_equal([17 / 8, 48 / 8], column["class_mean"][0])
np.testing.assert_array_almost_equal([12 / 2, 6 / 2], column["class_mean"][1])

# Test with all null in a column
data_3 = pd.DataFrame([[9999999, 9], [9999999, 9]])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the future, might be worthwhile to create a dataset with more edge cases
data_3 = pd.DataFrame([[9999999, 9999999, 123], [9999999, 123, 123]])

@tonywu315 tonywu315 deleted the null_replication_matrix branch November 3, 2022 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working Medium Priority Significant improvement or bug / feature reducing overall performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants