-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug with null replication metrics #702
Fix bug with null replication metrics #702
Conversation
@@ -2244,7 +2244,10 @@ def _update_null_replication_metrics(self, clean_samples: Dict) -> None: | |||
]._profiles[get_data_type(profile)] | |||
|
|||
total_row_sum = np.asarray( | |||
[get_data_type_profiler(profile).sum for profile in self._profile] | |||
[ | |||
get_data_type_profiler(profile).sum if get_data_type(profile) else 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think np.nan might be more appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0 implies a value is real rather than all null
if true_count == 0: | ||
mean_not_null = sum_not_null | ||
else: | ||
mean_not_null = sum_not_null / true_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a column is all null, true_count
is 0, which creates a divide by 0 error.
@@ -2244,7 +2244,10 @@ def _update_null_replication_metrics(self, clean_samples: Dict) -> None: | |||
]._profiles[get_data_type(profile)] | |||
|
|||
total_row_sum = np.asarray( | |||
[get_data_type_profiler(profile).sum for profile in self._profile] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to add a test for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test can be as simple as running the example code you have in the description and ensuring the rep matrix is what we expect
profiler = dp.Profiler(pd.DataFrame([[9999999,9], [9999999,9]]), options=profiler_options) Is the erroring I meant to put. |
@tonywu315 needs and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just two comments around simplifying the test suite that is added in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
dismissing... tests may actually fail
@tonywu315 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code was changed back to a prior paradigm... can we keep that same paradigm but just update first assignment?
Head branch was pushed to by a user without write access
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -2048,6 +2048,39 @@ def test_null_replication_metrics_calculation(self): | |||
np.testing.assert_array_almost_equal([17 / 8, 48 / 8], column["class_mean"][0]) | |||
np.testing.assert_array_almost_equal([12 / 2, 6 / 2], column["class_mean"][1]) | |||
|
|||
# Test with all null in a column | |||
data_3 = pd.DataFrame([[9999999, 9], [9999999, 9]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the future, might be worthwhile to create a dataset with more edge cases
data_3 = pd.DataFrame([[9999999, 9999999, 123], [9999999, 123, 123]])
When
null_replication_metrics
is enabled and any columns is all null, this errors: