Incorrect select noise function #18
Conversation
    weights: Union[list, pd.Series] = None,
    additional_key: Any = None,
    random_seed: int = None,
):
Docstring maybe?
    additional_key: Any = None,
    random_seed: int = None,
):
    if not randomness_stream and (additional_key == None and random_seed == None):
those () should be unnecessary, right?
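For illustration, a minimal sketch of the simplified condition (the surrounding helper and its error handling are assumptions, not the actual code): dropping the parentheses and using the idiomatic is None check reads as

from typing import Any, Optional

def _validate_randomness_args(
    randomness_stream: Any = None,
    additional_key: Any = None,
    random_seed: Optional[int] = None,
) -> None:
    # Hypothetical helper illustrating the comment: `and` chains evaluate left
    # to right, so the grouping parentheses are redundant, and `is None` is the
    # idiomatic identity check for None.
    if not randomness_stream and additional_key is None and random_seed is None:
        raise ValueError(
            "Provide either a randomness_stream or an additional_key/random_seed."
        )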
src/pseudopeople/utilities.py
Outdated
def get_possible_indices_to_noise(column: pd.Series) -> pd.Index:
    idx = column.index[(column != "") & (column != np.NaN)]
seems like ~column.isna() would be better than an inequality test on np.NaN
or column.notna()
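A quick sketch of why the inequality test misbehaves (illustrative values only, not the fixture data):

import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself, so `column != np.nan`
# is True even for NaN entries and missing values slip through the filter.
column = pd.Series(["a", "", np.nan])
print((column != np.nan).tolist())                  # [True, True, True]
print(column.notna().tolist())                      # [True, True, False]
print(((column != "") & column.notna()).tolist())   # [True, False, False]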
tests/integration/test_interface.py
Outdated
def test_noise_decennial_census_with_two_noise_functions(dummy_census_data, tmp_path_factory):
    # todo: Make config tree with 2 function calls
Is this todo still a thing?
No that was just me writing out what to do and shouldn't have made it in here.
src/pseudopeople/utilities.py
Outdated
    return np.take(options, chosen_indices)


def get_possible_indices_to_noise(column: pd.Series) -> pd.Index:
nit: get_non_missing_idx might be more descriptive
+1
tests/integration/test_interface.py
Outdated
common_data = data.loc[common_idx]
common_noised_data = noised_data.loc[common_idx].drop_duplicates()
assert common_data.shape == common_noised_data.shape
assert set(noised_data.columns) == set(data.columns)
Is this what we settled on as being a good-enough integration test (in addition to the new two-function one below)? @rmudambi
Yep
Yes. We edited this code on the call.
tests/integration/test_interface.py
Outdated
"missing_data": {"row_noise_level": 0.01}, | ||
}, | ||
"state": { | ||
"missing_data": {"row_noise_level": 0.01}, |
Have we discussed yet which permutations of functions we need for good-enough coverage? This one is obvious b/c incorrect selection is the second function to be implemented, but what about from here on out? @rmudambi
No we haven't, and we should figure this out very soon
We haven't. This is also something I was wondering about. We could in theory make this more flexible by taking additional args that build the config tree, so we could run several permutations of two functions, which would add a lot more coverage, but I wasn't sure how to do that.
tests/integration/test_interface.py
Outdated
with open(config_path, "w") as file:
    yaml.dump(config_dict, file)

data = pd.read_csv(dummy_census_data)
Lines 58-62 should be in a utility function since every test will need this (my current branch has it implemented as a fixture, but maybe just a function makes more sense since every test will need a different configuration).
The function could take the config_dict and the name of the yaml file you want to save it as, dump the dict to that location, and return the path.
Great idea.
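Something along these lines, as a rough sketch (the name write_config and the exact signature are just placeholders, not the agreed-upon helper):

from pathlib import Path
import yaml

def write_config(config_dict: dict, filename: str, tmp_path_factory) -> Path:
    # Hypothetical test helper: dump the config dict to a yaml file under the
    # pytest temp directory and hand back the path for the test to use.
    config_path = tmp_path_factory.getbasetemp() / filename
    with open(config_path, "w") as file:
        yaml.dump(config_dict, file)
    return config_path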
tests/integration/test_interface.py
Outdated
    noise_types = [k for k in config[col]]
    noise_rates = [
        config[col][noise_type]["row_noise_level"] for noise_type in noise_types
    ]
    expected_noise_rate = 1 - np.prod([1 - x for x in noise_rates])
    assert np.isclose(actual_noise_rate, expected_noise_rate, rtol=0.07)
else:
    assert (common_data[col] == common_noised_data[col]).all()
I think this assertion that columns not in the config are unchanged should stay.
tests/integration/test_interface.py
Outdated
# assert (data.loc[
#     data.index.difference(non_missing_idx), col] == noised_data.loc[
#     noised_data.index.difference(non_missing_idx), col]).all()
old = data.loc[non_missing_idx, col]
nit: don't be ageist - maybe orig_col?
tests/integration/test_interface.py
Outdated
old = data.loc[non_missing_idx, col]
noised_col = noised_data.loc[non_missing_idx, col]
assert len(old) == len(noised_col)
actual_noise_rate = (noised_col != old).sum() / len(noised_col)
(noised_col != old).mean() is more concise
tests/integration/test_interface.py
Outdated
old = data.loc[non_missing_idx, col]
noised_col = noised_data.loc[non_missing_idx, col]
assert len(old) == len(noised_col)
actual_noise_rate = (noised_col != old).sum() / len(noised_col)
noise_types = [k for k in config[col]]
noise_rates = [
    config[col][noise_type]["row_noise_level"] for noise_type in noise_types
]
expected_noise_rate = 1 - np.prod([1 - x for x in noise_rates])
Wasn't it decided that we need to create bespoke noise rates per function instead of relying on this approach? Or with only two columns is this "good enough" as long as the rtol remains below some unknown acceptable threshold? @rmudambi
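For reference, the expected rate above just treats the noise functions as independent: a cell stays un-noised only if every function leaves it alone. A worked check with two functions at 0.01 each (illustrative rates, matching the test config above):

import numpy as np

noise_rates = [0.01, 0.01]
# P(noised at least once) = 1 - P(no function touches the cell)
expected_noise_rate = 1 - np.prod([1 - x for x in noise_rates])
print(expected_noise_rate)  # 0.0199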
tests/unit/test_column_noise.py
Outdated
# Get real expected noise to account for possibility of noising with original value
# Here we have a possibility of choosing any of the 50 states for our categorical series fixture
expected_noise = expected_noise * (1 - 1 / 50)
actual_noise = (noised_data != categorical_series).sum() / len(noised_data)
nit: .mean()
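For context, the correction above only accounts for redraws landing back on the original value. Assuming a uniform draw over the 50 states, as the fixture comment implies (the configured level here is just an example, not the project default):

# 1 in 50 redraws keeps the original value, so the observed rate is slightly lower
configured_noise = 0.25
expected_noise = configured_noise * (1 - 1 / 50)
print(expected_noise)  # 0.245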
# todo: Update when generate_incorrect_selection uses exclusive resampling
# Get real expected noise to account for possibility of noising with original value
# Here we have a possibility of choosing any of the 50 states for our categorical series fixture
expected_noise = expected_noise * (1 - 1 / 50)
I think we might need to implement missingness (i.e., "" due to the generate_missing_data function that will always be run before any other column noising function) to ensure that's being correctly handled. These noise calculations will then need to be updated to account for that, like you did in the integration test.
If I'm understanding this correctly, I like this approach. If our input categorical series has missingness, we are effectively testing that the interaction between missing data and this noise function is correct automatically. We can then have a single test that checks that two arbitrary noise functions run together affect rows independently, rather than needing to test every permutation.
I think I triggered something smarter than I intended. We should probably talk about this b/c I'm not understanding what you're proposing @rmudambi
tests/unit/test_column_noise.py
Outdated

# Check that un-noised values are unchanged
not_noised_idx = noised_data.index[noised_data == categorical_series]
assert (categorical_series[not_noised_idx] == noised_data[not_noised_idx]).all()
Isn't this assertion always true by how you defined not_noised_idx? (I think I did the same thing on a test I previously wrote.)
I'm actually not sure there is a useful test to check that un-noised values are unchanged.
I suppose the assertion below that all noised data is notna is nice, but I don't think there are guarantees that incoming data are notna, so this will break in that case.
Agreed this assert is not actually testing anything.
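A tiny sketch of why it can never fail (toy data, not the fixture):

import pandas as pd

categorical_series = pd.Series(["WA", "ID", "CA"])
noised_data = pd.Series(["WA", "OR", "CA"])
# The index is *defined* by equality between the two series, so comparing them
# on that index is true by construction and asserts nothing about the noising.
not_noised_idx = noised_data.index[noised_data == categorical_series]
assert (categorical_series[not_noised_idx] == noised_data[not_noised_idx]).all()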
tests/integration/test_interface.py
Outdated
    }
)
config_dict = config_tree.to_dict()
config_path = tmp_path_factory.getbasetemp() / "test_multiple_ooise_config.yaml"
Typo: "test_multiple_ooise_config.yaml"
tests/integration/test_interface.py
Outdated
"missing_data": {"row_noise_level": 0.01}, | ||
"incorrect_select": {"row_noise_level": 0.01}, | ||
}, | ||
"duplication": 0.01, |
The omission and duplication config values should be 0 since this test is only testing the interaction between the missing data and incorrect selection functions.
tests/integration/test_interface.py
Outdated
    assert set(noised_data.columns) == set(data.columns)


def test_noise_decennial_census_with_two_noise_functions(dummy_census_data, tmp_path_factory):
I thought we discussed this test being in test_noise_form.py
@rmudambi That is my fault - I suggested he move it into the integration test suite. It feels more like an integration test, no? I don't care strongly
no, it's not an integration test
tests/integration/test_interface.py
Outdated
    assert np.isclose(actual_noise_rate, expected_noise_rate, rtol=0.07)
else:
    assert (common_data[col] == common_noised_data[col]).all()
assert np.isclose(actual_noise_rate, expected_noise_rate, rtol=0.10)
Are you accounting for the fact that the same selection can be redrawn? Is that why this rtol is so high (0.1 is a very high rtol)?
src/pseudopeople/utilities.py
Outdated
@@ -3,6 +3,7 @@

import numpy as np
import pandas as pd
import yaml
Is this used?
src/pseudopeople/utilities.py
Outdated
def get_possible_indices_to_noise(column: pd.Series) -> pd.Index:
    idx = column.index[(column != "") & (column != np.NaN)]
    return idx
def get_to_noise_idx(
I think I prefer get_index_to_noise
rng = np.random.default_rng(seed=randomness_stream.seed)
for idx in to_noise_idx:
for idx in column.index:
Thanks for remembering to update this!
tests/unit/test_column_noise.py
Outdated


def test_generate_missing_data(dummy_dataset):
    # TODO: [MIC-3910] Use custom config (MIC-3866)
I think we specifically are NOT doing this anymore since you're getting the default and then updating below

original_empty_idx = categorical_series.index[categorical_series == ""]
noised_empty_idx = noised_data.index[noised_data == ""]
pd.testing.assert_index_equal(original_empty_idx, noised_empty_idx)
I always forget about this pd.testing assertion!
noised_data = func(column, config, RANDOMNESS0, f"test_{func.__name__}")
noised_data_same_seed = func(column, config, RANDOMNESS0, f"test_{func.__name__}")
noised_data_different_seed = func(column, config, RANDOMNESS1, f"test_{func.__name__}")
noised_data = noise_type(column, config, RANDOMNESS0, f"test_{noise_type.name}")
wait...how is noise_type a callable at this point?
noise_type is a ColumnNoiseType or a noise function. We are carrying along the args here and this is where we actually run the noise function.
If you look in the places it gets called we are choosing which noise functions to run, with what data, and with what config and here is where we actually run it.
This is the function you refactored since every noise function will need to do this.
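For anyone else confused here, a rough sketch of the idea (this is not the project's actual ColumnNoiseType definition, just an illustration of why the instance is callable):

from dataclasses import dataclass
from typing import Any, Callable
import pandas as pd

@dataclass
class ColumnNoiseType:
    name: str
    noise_function: Callable[..., pd.Series]

    def __call__(self, column: pd.Series, config: Any, randomness_stream: Any, additional_key: str) -> pd.Series:
        # Forward the call to the wrapped noise function, so
        # `noise_type(column, config, RANDOMNESS0, ...)` runs the noising.
        return self.noise_function(column, config, randomness_stream, additional_key)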
tests/unit/test_noise_form.py
Outdated


@pytest.fixture(scope="module")
def dummy_data():
    """Create a two-column dummy dataset"""
    random.seed(0)
    num_rows = 1_000_000
    num_rows = 100_000
Did you discuss 100k being acceptable or did you just forget to change it back after testing?
We never discussed that. It doesn't hurt to change it back now that I'm not debugging.
tests/unit/test_noise_form.py
Outdated
    ),
    field,
)
# if isinstance(field, ColumnNoiseType):
delete

# Mock objects for testing

class MockNoiseTypes(NamedTuple):
nice

# Assert columns experience both noise types
assert np.isclose(
    noised_data["fake_column_one"].str.contains("abc123").mean(),
very clever
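Roughly, the trick being praised (a hypothetical mock, not the one in the test file): the mock noise function stamps a marker onto a known fraction of rows, so the realized noise rate can be recovered with a simple string search.

import numpy as np
import pandas as pd

def mock_noise_function(column: pd.Series, noise_level: float = 0.01) -> pd.Series:
    # Append a recognizable marker to ~noise_level of the rows so a test can
    # measure how often the mock fired via `.str.contains("abc123").mean()`.
    rng = np.random.default_rng(42)
    noised = column.astype(str).copy()
    to_noise = rng.random(len(noised)) < noise_level
    noised[to_noise] = noised[to_noise] + "abc123"
    return noised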
Implement incorrect select noise function
Adds generate_incorrect_selection to noise functions.
- Adds CSV containing possible values for incorrect selection by column
- Adds paths module
- Adds noise function and test for generate_incorrect_selection
Testing
- Test suites pass successfully and a decennial census form was generated.
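As a very rough sketch of the shape of the new function (the signature, names, and seeding here are assumptions; the real implementation uses the project's randomness stream and the CSV of per-column options):

import numpy as np
import pandas as pd

def generate_incorrect_selection(
    column: pd.Series,
    row_noise_level: float,
    options: list,
    random_seed: int = 0,
) -> pd.Series:
    # Pick the non-missing rows eligible for noising, select ~row_noise_level
    # of them, and overwrite each with a draw from the column's possible values.
    rng = np.random.default_rng(random_seed)
    noised = column.copy()
    eligible_idx = column.index[(column != "") & column.notna()]
    to_noise_idx = eligible_idx[rng.random(len(eligible_idx)) < row_noise_level]
    noised.loc[to_noise_idx] = rng.choice(options, size=len(to_noise_idx))
    return noised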