
Incorrect select noise function #18

Merged
merged 20 commits into develop from incorrect-select
Mar 30, 2023

Conversation

@albrja (Contributor) commented Mar 24, 2023

Implement incorrect select noise function

Adds generate_incorrect_selection to noise functions.

  • Category: Feature
  • JIRA issue: MIC-3873

- Adds CSV containing possible values for incorrect selection by column
- Adds paths module
- Adds noise function and test for generate_incorrect_selection

Testing

- Test suites pass successfully and the decennial census form was generated.

@albrja changed the base branch from main to develop, March 24, 2023 23:14
weights: Union[list, pd.Series] = None,
additional_key: Any = None,
random_seed: int = None,
):
Contributor:

Docstring maybe?
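Something along these lines, perhaps (a minimal sketch; the function name, leading parameters, and descriptions are guesses from the visible signature, not the PR's actual code):

```python
from typing import Any, Union

import pandas as pd


def random_choice(  # hypothetical name; only the trailing parameters appear in the diff
    options,
    randomness_stream=None,
    weights: Union[list, pd.Series] = None,
    additional_key: Any = None,
    random_seed: int = None,
):
    """Randomly select values from options.

    Parameters
    ----------
    options
        Pool of values to sample from.
    randomness_stream
        Stream providing reproducible randomness; if absent, random_seed is used.
    weights
        Optional sampling weights aligned with options.
    additional_key
        Extra key passed to the randomness system for decorrelation.
    random_seed
        Seed for a numpy generator when no randomness stream is provided.
    """
```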

additional_key: Any = None,
random_seed: int = None,
):
if not randomness_stream and (additional_key == None and random_seed == None):
Contributor:

those () should be unnecessary, right?
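i.e., simply the following (swapping `== None` for the more idiomatic `is None` is an assumed extra cleanup, not part of the reviewer's suggestion):

```python
if not randomness_stream and additional_key is None and random_seed is None:
    ...
```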



def get_possible_indices_to_noise(column: pd.Series) -> pd.Index:
idx = column.index[(column != "") & (column != np.NaN)]
Contributor:

seems like ~column.isna() would be better than an inequality test on np.NaN

Contributor:

or column.notna()
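The reason both reviewers flag this: `column != np.NaN` is always True, because NaN compares unequal to everything, itself included, so the original filter never actually excludes missing values. A minimal sketch of the suggested fix:

```python
import pandas as pd


def get_possible_indices_to_noise(column: pd.Series) -> pd.Index:
    # notna() excludes NaN correctly; (column != np.nan) silently never does
    return column.index[(column != "") & column.notna()]
```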



def test_noise_decennial_census_with_two_noise_functions(dummy_census_data, tmp_path_factory):
# todo: Make config tree with 2 function calls
Contributor:

Is this todo still a thing?

Contributor (Author):

No, that was just me writing out what to do; it shouldn't have made it in here.

return np.take(options, chosen_indices)


def get_possible_indices_to_noise(column: pd.Series) -> pd.Index:
Contributor:

nit: get_non_missing_idx might be more descriptive

Collaborator:

+1

common_data = data.loc[common_idx]
common_noised_data = noised_data.loc[common_idx].drop_duplicates()
assert common_data.shape == common_noised_data.shape
assert set(noised_data.columns) == set(data.columns)
Contributor:

Is this what we settled on as being a good-enough integration test (in addition to the new two-function one below)? @rmudambi

Collaborator:

Yep

Contributor (Author):

Yes. We edited this code on the call.

"missing_data": {"row_noise_level": 0.01},
},
"state": {
"missing_data": {"row_noise_level": 0.01},
Contributor:

Have we discussed yet which permutations of functions we need for good-enough coverage? This one is obvious b/c incorrect selection is the second function to be implemented, but what about from here on out? @rmudambi

Collaborator:

No we haven't, and we should figure this out very soon

Contributor (Author):

We haven't. This is also something I was wondering. We could in theory make this more flexible by taking additional args that build a config tree, so we could run several permutations of two-function tests, which would add a lot more coverage, but I was unsure how to do that.

with open(config_path, "w") as file:
yaml.dump(config_dict, file)

data = pd.read_csv(dummy_census_data)
@stevebachmeier (Contributor) commented Mar 28, 2023:

Lines 58-62 should be in a utility function since every test will need this (my current branch has it implemented as a fixture but maybe just a function makes more sense since every test will need a different configuration).

The function could take the config_dict and the name of the YAML file you want to save it as, dump the dict to that location, and return the path.

Contributor (Author):

Great idea.
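A sketch of the helper being suggested, assuming it only needs to dump the dict and hand back the path (the name is hypothetical):

```python
from pathlib import Path

import yaml


def write_config_yaml(config_dict: dict, output_dir: Path, filename: str) -> Path:
    """Dump config_dict to output_dir/filename and return the resulting path."""
    config_path = output_dir / filename
    with open(config_path, "w") as file:
        yaml.dump(config_dict, file)
    return config_path
```

Each test could then call, e.g., `write_config_yaml(config_dict, tmp_path_factory.getbasetemp(), "my_test_config.yaml")` with its own configuration.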

noise_types = [k for k in config[col]]
noise_rates = [
config[col][noise_type]["row_noise_level"] for noise_type in noise_types
]
expected_noise_rate = 1 - np.prod([1 - x for x in noise_rates])
assert np.isclose(actual_noise_rate, expected_noise_rate, rtol=0.07)
else:
assert (common_data[col] == common_noised_data[col]).all()
Contributor:

I think this assertion that columns not in the config are unchanged should stay.

# assert (data.loc[
# data.index.difference(non_missing_idx), col] == noised_data.loc[
# noised_data.index.difference(non_missing_idx), col]).all()
old = data.loc[non_missing_idx, col]
Contributor:

nit: don't be ageist - maybe orig_col?

old = data.loc[non_missing_idx, col]
noised_col = noised_data.loc[non_missing_idx, col]
assert len(old) == len(noised_col)
actual_noise_rate = (noised_col != old).sum() / len(noised_col)
Contributor:

(noised_col != old).mean() is more concise
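i.e., because the mean of a boolean Series is exactly the fraction of True values:

```python
actual_noise_rate = (noised_col != old).mean()
```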

old = data.loc[non_missing_idx, col]
noised_col = noised_data.loc[non_missing_idx, col]
assert len(old) == len(noised_col)
actual_noise_rate = (noised_col != old).sum() / len(noised_col)
noise_types = [k for k in config[col]]
noise_rates = [
config[col][noise_type]["row_noise_level"] for noise_type in noise_types
]
expected_noise_rate = 1 - np.prod([1 - x for x in noise_rates])
Contributor:

Wasn't it decided that we need to create bespoke noise rates per function instead of relying on this approach? Or with only two columns is this "good enough" as long as the rtol remains below some unknown acceptable threshold? @rmudambi
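For context on what the expression assumes: if each noise function independently leaves a row untouched with probability `1 - p_i`, a row survives all of them with probability `prod(1 - p_i)`, so the expected overall noise rate is `1 - prod(1 - p_i)`. With the two 0.01 levels used here, that is `1 - 0.99 * 0.99 ≈ 0.0199`.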

# Get real expected noise to account for possibility of noising with original value
# Here we have a possibility of choosing any of the 50 states for our categorical series fixture
expected_noise = expected_noise * (1 - 1 / 50)
actual_noise = (noised_data != categorical_series).sum() / len(noised_data)
Contributor:

nit: .mean()
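For the record, the `(1 - 1/50)` factor above reflects resampling with replacement: a row selected for noising redraws uniformly from all 50 states, so with probability 1/50 it lands back on its original value and looks un-noised, leaving an observed rate of roughly `expected_noise * 49/50`.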

# todo: Update when generate_incorrect_selection uses exclusive resampling
# Get real expected noise to account for possibility of noising with original value
# Here we have a possibility of choosing any of the 50 states for our categorical series fixture
expected_noise = expected_noise * (1 - 1 / 50)
Contributor:

I think we might need to implement missingness (i.e., "" due to the generate_missing_data function that will always be run before any other column noising function) to ensure that's being correctly handled. These noise calculations will then need to be updated to account for that, like you did in the integration test.

Collaborator:

If I'm understanding this correctly, I like this approach. If our input categorical series has missingness, we are effectively testing that the interaction between missing data and this noise function is correct automatically. We can then have a single test that checks that two arbitrary noise functions, when run together, affect rows independently, rather than needing to test every permutation.

Contributor:

I think I triggered something smarter than I intended. We should probably talk about this b/c I'm not understanding what you're proposing @rmudambi


# Check that un-noised values are unchanged
not_noised_idx = noised_data.index[noised_data == categorical_series]
assert (categorical_series[not_noised_idx] == noised_data[not_noised_idx]).all()
Contributor:

Isn't this assertion always true by how you defined not_noised_idx? (I think I did the same thing on a test I previously wrote)

I'm actually not sure if there is a useful test to check that un-noised values are unchanged.

I suppose the assertion below that all noised data is notna is nice, but I don't think there are guarantees that incoming data are notna, so this will break in that case.

Collaborator:

Agreed this assert is not actually testing anything.
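One way the check could be made meaningful (purely illustrative; it assumes the test can obtain the index the noise function actually selected, which the current API may not expose):

```python
# Hypothetical: noised_idx is the index the noise function actually chose to noise.
unnoised_idx = categorical_series.index.difference(noised_idx)
assert (categorical_series[unnoised_idx] == noised_data[unnoised_idx]).all()
```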

}
)
config_dict = config_tree.to_dict()
config_path = tmp_path_factory.getbasetemp() / "test_multiple_ooise_config.yaml"
Collaborator:

Typo: "test_multiple_ooise_config.yaml"

"missing_data": {"row_noise_level": 0.01},
"incorrect_select": {"row_noise_level": 0.01},
},
"duplication": 0.01,
Collaborator:

The omission and duplication config values should be 0 since this test is only testing the interaction between the missing data and incorrect selection functions.
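i.e., something like the following (structure inferred from the excerpts; the exact key layout is an assumption):

```python
config_dict = {
    "state": {
        "missing_data": {"row_noise_level": 0.01},
        "incorrect_select": {"row_noise_level": 0.01},
    },
    # Zero these out so only the two column-noise functions under test fire.
    "omission": 0.0,
    "duplication": 0.0,
}
```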

assert set(noised_data.columns) == set(data.columns)


def test_noise_decennial_census_with_two_noise_functions(dummy_census_data, tmp_path_factory):
Collaborator:

I thought we discussed this test being in test_noise_form.py

Contributor:

@rmudambi That is my fault - I suggested he move it into the integration test suite. It feels more like an integration test, no? I don't care strongly

Collaborator:

No, it's not an integration test.

assert np.isclose(actual_noise_rate, expected_noise_rate, rtol=0.07)
else:
assert (common_data[col] == common_noised_data[col]).all()
assert np.isclose(actual_noise_rate, expected_noise_rate, rtol=0.10)
Collaborator:

Are you accounting for the fact that the same selection can be redrawn? Is that why this rtol is so high (0.1 is a very high rtol)?

@@ -3,6 +3,7 @@

import numpy as np
import pandas as pd
import yaml
Collaborator:

Is this used?

def get_possible_indices_to_noise(column: pd.Series) -> pd.Index:
idx = column.index[(column != "") & (column != np.NaN)]
return idx
def get_to_noise_idx(
Collaborator:

I think I prefer get_index_to_noise

rng = np.random.default_rng(seed=randomness_stream.seed)
for idx in to_noise_idx:
for idx in column.index:
Contributor:

Thanks for remembering to update this!



def test_generate_missing_data(dummy_dataset):
# TODO: [MIC-3910] Use custom config (MIC-3866)
@stevebachmeier (Contributor) commented Mar 30, 2023:

I think we specifically are NOT doing this anymore since you're getting the default and then updating below


original_empty_idx = categorical_series.index[categorical_series == ""]
noised_empty_idx = noised_data.index[noised_data == ""]
pd.testing.assert_index_equal(original_empty_idx, noised_empty_idx)
Contributor:

I always forget about this pd.testing assertion!

noised_data = func(column, config, RANDOMNESS0, f"test_{func.__name__}")
noised_data_same_seed = func(column, config, RANDOMNESS0, f"test_{func.__name__}")
noised_data_different_seed = func(column, config, RANDOMNESS1, f"test_{func.__name__}")
noised_data = noise_type(column, config, RANDOMNESS0, f"test_{noise_type.name}")
Contributor:

wait...how is noise_type a callable at this point?

Contributor (Author):

noise_type is a ColumnNoiseType or a noise function. We are carrying along the args here, and this is where we actually run the noise function.

Contributor (Author):

If you look at the places it gets called, we are choosing which noise functions to run, with what data and what config, and here is where we actually run it.

Contributor (Author):

This is the function you refactored since every noise function will need to do this.
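A rough sketch of the pattern being described, assuming `ColumnNoiseType` is a small callable wrapper around the refactored noise logic (everything beyond the `name` attribute is a guess):

```python
from dataclasses import dataclass
from typing import Any, Callable

import pandas as pd


@dataclass
class ColumnNoiseType:
    name: str
    noise_function: Callable[..., pd.Series]

    def __call__(
        self, column: pd.Series, configuration, randomness_stream, additional_key: Any
    ) -> pd.Series:
        # Delegating here is what makes the instance itself callable, so the
        # test can invoke noise_type(column, config, ...) directly.
        return self.noise_function(column, configuration, randomness_stream, additional_key)
```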



@pytest.fixture(scope="module")
def dummy_data():
"""Create a two-column dummy dataset"""
random.seed(0)
num_rows = 1_000_000
num_rows = 100_000
Contributor:

Did you discuss 100k being acceptable, or did you just forget to change it back after testing?

Contributor (Author):

We never discussed that. It doesn't hurt to change it back now that I'm not debugging.

),
field,
)
# if isinstance(field, ColumnNoiseType):
Contributor:

delete


# Mock objects for testing

class MockNoiseTypes(NamedTuple):
Contributor:

nice


# Assert columns experience both noise
assert np.isclose(
noised_data["fake_column_one"].str.contains("abc123").mean(),
Contributor:

very clever

@albrja merged commit 190cae4 into develop, Mar 30, 2023
@albrja deleted the incorrect-select branch, March 30, 2023 01:06