
Incorrect select noise function #18

Merged
merged 20 commits into develop from incorrect-select
Mar 30, 2023

Conversation

@albrja (Contributor) commented Mar 24, 2023

Implement incorrect select noise function

Adds generate_incorrect_selection to noise functions.

  • Category: Feature
  • JIRA issue: MIC-3873

- Adds CSV containing possible values for incorrect selection by column
- Adds paths module
- Adds noise function and test for generate_incorrect_selection

Testing

- Test suites pass successfully and the decennial census form was generated.

@albrja changed the base branch from main to develop, March 24, 2023 23:14
weights: Union[list, pd.Series] = None,
additional_key: Any = None,
random_seed: int = None,
):
Contributor:

Docstring maybe?
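Something along these lines, perhaps (a minimal sketch; the function name, leading parameters, and descriptions are guesses from the visible signature, not the PR's actual code):

```python
from typing import Any, Union

import pandas as pd


def random_choice(  # hypothetical name; only the trailing parameters appear in the diff
    options,
    randomness_stream=None,
    weights: Union[list, pd.Series] = None,
    additional_key: Any = None,
    random_seed: int = None,
):
    """Randomly select values from options.

    Parameters
    ----------
    options
        Pool of values to sample from.
    randomness_stream
        Stream providing reproducible randomness; if absent, random_seed is used.
    weights
        Optional sampling weights aligned with options.
    additional_key
        Extra key passed to the randomness system for decorrelation.
    random_seed
        Seed for a numpy generator when no randomness stream is provided.
    """
```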

additional_key: Any = None,
random_seed: int = None,
):
if not randomness_stream and (additional_key == None and random_seed == None):
Contributor:

those () should be unnecessary, right?
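i.e., simply the following (swapping `== None` for the more idiomatic `is None` is an assumed extra cleanup, not part of the reviewer's suggestion):

```python
if not randomness_stream and additional_key is None and random_seed is None:
    ...
```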



def get_possible_indices_to_noise(column: pd.Series) -> pd.Index:
idx = column.index[(column != "") & (column != np.NaN)]
Contributor:

seems like ~column.isna() would be better than an inequality test on np.NaN

Contributor:

or column.notna()
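The reason both reviewers flag this: `column != np.NaN` is always True, because NaN compares unequal to everything, itself included, so the original filter never actually excludes missing values. A minimal sketch of the suggested fix:

```python
import pandas as pd


def get_possible_indices_to_noise(column: pd.Series) -> pd.Index:
    # notna() excludes NaN correctly; (column != np.nan) silently never does
    return column.index[(column != "") & column.notna()]
```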



def test_noise_decennial_census_with_two_noise_functions(dummy_census_data, tmp_path_factory):
# todo: Make config tree with 2 function calls
Contributor:

Is this todo still a thing?

Contributor (Author):

No, that was just me writing out what to do; it shouldn't have made it in here.

return np.take(options, chosen_indices)


def get_possible_indices_to_noise(column: pd.Series) -> pd.Index:
Contributor:

nit: get_non_missing_idx might be more descriptive

Collaborator:

+1

common_data = data.loc[common_idx]
common_noised_data = noised_data.loc[common_idx].drop_duplicates()
assert common_data.shape == common_noised_data.shape
assert set(noised_data.columns) == set(data.columns)
Contributor:

Is this what we settled on as being a good-enough integration test (in addition to the new two-function one below)? @rmudambi

Collaborator:

Yep

Contributor (Author):

Yes. We edited this code on the call.

"missing_data": {"row_noise_level": 0.01},
},
"state": {
"missing_data": {"row_noise_level": 0.01},
Contributor:

Have we discussed yet which permutations of functions we need for good-enough coverage? This one is obvious b/c incorrect selection is the second function to be implemented, but what about from here on out? @rmudambi

Collaborator:

No we haven't, and we should figure this out very soon

Contributor (Author):

We haven't. This is also something I was wondering. We could in theory make this more flexible by taking additional args that build a config tree, so we could run several permutations of two-function tests, which would add a lot more coverage, but I was unsure how to do that.

with open(config_path, "w") as file:
yaml.dump(config_dict, file)

data = pd.read_csv(dummy_census_data)
@stevebachmeier (Contributor) commented Mar 28, 2023:

Lines 58-62 should be in a utility function since every test will need this (my current branch has it implemented as a fixture but maybe just a function makes more sense since every test will need a different configuration).

The function could take the config_dict and the name of the YAML file you want to save it as, dump the dict to that location, and return the path.

Contributor (Author):

Great idea.
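A sketch of the helper being suggested, assuming it only needs to dump the dict and hand back the path (the name is hypothetical):

```python
from pathlib import Path

import yaml


def write_config_yaml(config_dict: dict, output_dir: Path, filename: str) -> Path:
    """Dump config_dict to output_dir/filename and return the resulting path."""
    config_path = output_dir / filename
    with open(config_path, "w") as file:
        yaml.dump(config_dict, file)
    return config_path
```

Each test could then call, e.g., `write_config_yaml(config_dict, tmp_path_factory.getbasetemp(), "my_test_config.yaml")` with its own configuration.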

noise_types = [k for k in config[col]]
noise_rates = [
config[col][noise_type]["row_noise_level"] for noise_type in noise_types
]
expected_noise_rate = 1 - np.prod([1 - x for x in noise_rates])
assert np.isclose(actual_noise_rate, expected_noise_rate, rtol=0.07)
else:
assert (common_data[col] == common_noised_data[col]).all()
Contributor:

I think this assertion that columns not in the config are unchanged should stay.

# assert (data.loc[
# data.index.difference(non_missing_idx), col] == noised_data.loc[
# noised_data.index.difference(non_missing_idx), col]).all()
old = data.loc[non_missing_idx, col]
Contributor:

nit: don't be ageist - maybe orig_col?

old = data.loc[non_missing_idx, col]
noised_col = noised_data.loc[non_missing_idx, col]
assert len(old) == len(noised_col)
actual_noise_rate = (noised_col != old).sum() / len(noised_col)
Contributor:

(noised_col != old).mean() is more concise
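i.e., because the mean of a boolean Series is exactly the fraction of True values:

```python
actual_noise_rate = (noised_col != old).mean()
```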

old = data.loc[non_missing_idx, col]
noised_col = noised_data.loc[non_missing_idx, col]
assert len(old) == len(noised_col)
actual_noise_rate = (noised_col != old).sum() / len(noised_col)
noise_types = [k for k in config[col]]
noise_rates = [
config[col][noise_type]["row_noise_level"] for noise_type in noise_types
]
expected_noise_rate = 1 - np.prod([1 - x for x in noise_rates])
Contributor:

Wasn't it decided that we need to create bespoke noise rates per function instead of relying on this approach? Or with only two columns is this "good enough" as long as the rtol remains below some unknown acceptable threshold? @rmudambi
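For context on what the expression assumes: if each noise function independently leaves a row untouched with probability `1 - p_i`, a row survives all of them with probability `prod(1 - p_i)`, so the expected overall noise rate is `1 - prod(1 - p_i)`. With the two 0.01 levels used here, that is `1 - 0.99 * 0.99 ≈ 0.0199`.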

# Get real expected noise to account for possibility of noising with original value
# Here we have a possibility of choosing any of the 50 states for our categorical series fixture
expected_noise = expected_noise * (1 - 1 / 50)
actual_noise = (noised_data != categorical_series).sum() / len(noised_data)
Contributor:

nit: .mean()
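For the record, the `(1 - 1/50)` factor above reflects resampling with replacement: a row selected for noising redraws uniformly from all 50 states, so with probability 1/50 it lands back on its original value and looks un-noised, leaving an observed rate of roughly `expected_noise * 49/50`.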

# todo: Update when generate_incorrect_selection uses exclusive resampling
# Get real expected noise to account for possibility of noising with original value
# Here we have a possibility of choosing any of the 50 states for our categorical series fixture
expected_noise = expected_noise * (1 - 1 / 50)
Contributor:

I think we might need to implement missingness (i.e., "" due to the generate_missing_data function that will always be run before any other column noising function) to ensure that's being correctly handled. These noise calculations will then need to be updated to account for that, like you did in the integration test.

Collaborator:

If I'm understanding this correctly, I like this approach. If our input categorical series has missingness, we are effectively testing that the interaction between missing data and this noise function is correct automatically. We can then have a single test that checks that two arbitrary noise functions, when run together, affect rows independently, rather than needing to test every permutation.

Contributor:

I think I triggered something smarter than I intended. We should probably talk about this b/c I'm not understanding what you're proposing @rmudambi


# Check that un-noised values are unchanged
not_noised_idx = noised_data.index[noised_data == categorical_series]
assert (categorical_series[not_noised_idx] == noised_data[not_noised_idx]).all()
Contributor:

Isn't this assertion always true by how you defined not_noised_idx? (I think I did the same thing on a test I previously wrote)

I'm actually not sure if there is a useful test to check that un-noised values are unchanged.

I suppose the assertion below that all noised data is notna is nice, but I don't think there are guarantees that incoming data are notna, so this will break in that case.

Collaborator:

Agreed this assert is not actually testing anything.
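One way the check could be made meaningful (purely illustrative; it assumes the test can obtain the index the noise function actually selected, which the current API may not expose):

```python
# Hypothetical: noised_idx is the index the noise function actually chose to noise.
unnoised_idx = categorical_series.index.difference(noised_idx)
assert (categorical_series[unnoised_idx] == noised_data[unnoised_idx]).all()
```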

}
)
config_dict = config_tree.to_dict()
config_path = tmp_path_factory.getbasetemp() / "test_multiple_ooise_config.yaml"
Collaborator:

Typo: "test_multiple_ooise_config.yaml"

"missing_data": {"row_noise_level": 0.01},
"incorrect_select": {"row_noise_level": 0.01},
},
"duplication": 0.01,
Collaborator:

The omission and duplication config values should be 0 since this test is only testing the interaction between the missing data and incorrect selection functions.
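i.e., something like the following (structure inferred from the excerpts; the exact key layout is an assumption):

```python
config_dict = {
    "state": {
        "missing_data": {"row_noise_level": 0.01},
        "incorrect_select": {"row_noise_level": 0.01},
    },
    # Zero these out so only the two column-noise functions under test fire.
    "omission": 0.0,
    "duplication": 0.0,
}
```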

assert set(noised_data.columns) == set(data.columns)


def test_noise_decennial_census_with_two_noise_functions(dummy_census_data, tmp_path_factory):
Collaborator:

I thought we discussed this test being in test_noise_form.py

Contributor:

@rmudambi That is my fault - I suggested he move it into the integration test suite. It feels more like an integration test, no? I don't care strongly

Collaborator:

No, it's not an integration test.

assert np.isclose(actual_noise_rate, expected_noise_rate, rtol=0.07)
else:
assert (common_data[col] == common_noised_data[col]).all()
assert np.isclose(actual_noise_rate, expected_noise_rate, rtol=0.10)
Collaborator:

Are you accounting for the fact that the same selection can be redrawn? Is that why this rtol is so high (0.1 is a very high rtol)?

@@ -3,6 +3,7 @@

import numpy as np
import pandas as pd
import yaml
Collaborator:

Is this used?

def get_possible_indices_to_noise(column: pd.Series) -> pd.Index:
idx = column.index[(column != "") & (column != np.NaN)]
return idx
def get_to_noise_idx(
Collaborator:

I think I prefer get_index_to_noise

rng = np.random.default_rng(seed=randomness_stream.seed)
for idx in to_noise_idx:
for idx in column.index:
Contributor:

Thanks for remembering to update this!



def test_generate_missing_data(dummy_dataset):
# TODO: [MIC-3910] Use custom config (MIC-3866)
@stevebachmeier (Contributor) commented Mar 30, 2023:

I think we specifically are NOT doing this anymore since you're getting the default and then updating below


original_empty_idx = categorical_series.index[categorical_series == ""]
noised_empty_idx = noised_data.index[noised_data == ""]
pd.testing.assert_index_equal(original_empty_idx, noised_empty_idx)
Contributor:

I always forget about this pd.testing assertion!

noised_data = func(column, config, RANDOMNESS0, f"test_{func.__name__}")
noised_data_same_seed = func(column, config, RANDOMNESS0, f"test_{func.__name__}")
noised_data_different_seed = func(column, config, RANDOMNESS1, f"test_{func.__name__}")
noised_data = noise_type(column, config, RANDOMNESS0, f"test_{noise_type.name}")
Contributor:

wait...how is noise_type a callable at this point?

Contributor (Author):

noise_type is a ColumnNoiseType or a noise function. We are carrying along the args here, and this is where we actually run the noise function.

Contributor (Author):

If you look at the places it gets called, we are choosing which noise functions to run, with what data and what config, and here is where we actually run it.

Contributor (Author):

This is the function you refactored since every noise function will need to do this.
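A rough sketch of the pattern being described, assuming `ColumnNoiseType` is a small callable wrapper around the refactored noise logic (everything beyond the `name` attribute is a guess):

```python
from dataclasses import dataclass
from typing import Any, Callable

import pandas as pd


@dataclass
class ColumnNoiseType:
    name: str
    noise_function: Callable[..., pd.Series]

    def __call__(
        self, column: pd.Series, configuration, randomness_stream, additional_key: Any
    ) -> pd.Series:
        # Delegating here is what makes the instance itself callable, so the
        # test can invoke noise_type(column, config, ...) directly.
        return self.noise_function(column, configuration, randomness_stream, additional_key)
```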



@pytest.fixture(scope="module")
def dummy_data():
"""Create a two-column dummy dataset"""
random.seed(0)
num_rows = 1_000_000
num_rows = 100_000
Contributor:

Did you discuss 100k being acceptable, or did you just forget to change it back after testing?

Contributor (Author):

We never discussed that. It doesn't hurt to change it back now that I'm not debugging.

),
field,
)
# if isinstance(field, ColumnNoiseType):
Contributor:

delete


# Mock objects for testing

class MockNoiseTypes(NamedTuple):
Contributor:

nice


# Assert columns experience both noise
assert np.isclose(
noised_data["fake_column_one"].str.contains("abc123").mean(),
Contributor:

very clever

@albrja merged commit 190cae4 into develop, Mar 30, 2023
@albrja deleted the incorrect-select branch, March 30, 2023 01:06