Copy from household member noise function implementation #191

albrja · 2023-06-14T21:46:51Z

Copy from household member noise function implementation

Implementation and tests for copy from household member noise function

Category: Feature
JIRA issue: MIC-4044

- Implements and tests the copy from household member noise function.
- Adds a default value for the tax W2 dataset so that this noise function is not applied to SSN.

Testing

All tests pass.

zmbc · 2023-06-15T02:34:11Z

src/pseudopeople/noise_entities.py

+    copy_from_household_member: ColumnNoiseType = ColumnNoiseType(
+        "copy_from_household_member",
+        noise_functions.copy_from_household_member,
+        additional_column_getter=column_getters.copy_from_household_member_column_getter,


I haven't reviewed this closely, but I'm surprised not to see a noise_level_scaling_function here.

You were right. I forgot to port that over from the other branch, but it is implemented now.

stevebachmeier · 2023-06-15T17:40:49Z

src/pseudopeople/column_getters.py

Maybe rename this to utilities.py so it's more generic and usable for future utility functions.

stevebachmeier · 2023-06-15T21:10:22Z

src/pseudopeople/schema_entities.py

@@ -147,7 +147,7 @@ class __Columns(NamedTuple):
        "itin",
        (
            NOISE_TYPES.leave_blank,
-            # NOISE_TYPES.copy_from_within_household,
+            NOISE_TYPES.copy_from_household_member,


Doesn't the ITIN implementation still need to be done?

I believe so, but the only dataset that has an ITIN column is the (unimplemented) 1040.

This should not be changed. The 1040 would need to be updated and we would need to implement copying ITIN in PRL.

stevebachmeier · 2023-06-16T16:49:35Z

src/pseudopeople/noise_entities.py

+    copy_from_household_member: ColumnNoiseType = ColumnNoiseType(
+        "copy_from_household_member",
+        noise_functions.copy_from_household_member,
+        additional_column_getter=column_getters.copy_from_household_member_column_getter,


I don't understand how this is working - don't you need to add this to the class ColumnNoiseType attributes alongside additional_parameters, etc?

Edit: never mind, it's in the other PR.

rmudambi · 2023-06-20T16:36:39Z

src/pseudopeople/configuration/entities.py

@@ -8,3 +8,4 @@ class Keys:
    TOKEN_PROBABILITY = "token_probability"
    POSSIBLE_AGE_DIFFERENCES = "possible_age_differences"
    ZIPCODE_DIGIT_PROBABILITIES = "digit_probabilities"
+    NO_NOISE = "no_noise"


Is this supposed to be here?

rmudambi · 2023-06-20T16:38:11Z

src/pseudopeople/configuration/generator.py

+                    Keys.CELL_PROBABILITY: 0.00,
+                }
+            },
+        },


This isn't necessary, since below we are setting the noise level on the SSA dataset to 0 for all noise types.

Yes, this sets no noise for copy_from_household_member on the W2 dataset.

rmudambi · 2023-06-20T16:42:14Z

src/pseudopeople/noise_functions.py

+    """
+
+    copy_values = data[COPY_HOUSEHOLD_MEMBER_COLS[column_name]]
+    column = pd.Series(copy_values, index=data.index, name=column_name)


This looks right, but this seems clearer to me:

    column = (
        data[COPY_HOUSEHOLD_MEMBER_COLS[column_name]]
        .copy()
        .rename(column_name)
    )

Feel free to leave it as is if you disagree.
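
For comparison, here is a minimal sketch of the two spellings on toy data. The `copy_age` column name and the contents of `COPY_HOUSEHOLD_MEMBER_COLS` are assumptions made for illustration, not values taken from this PR.

```python
import pandas as pd

# Hypothetical mapping and data, for illustration only.
COPY_HOUSEHOLD_MEMBER_COLS = {"age": "copy_age"}
column_name = "age"
data = pd.DataFrame(
    {"age": [30.0, 5.0, 62.0], "copy_age": [28.0, 31.0, None]},
    index=[10, 11, 12],
)

# Spelling used in the diff: build a new Series from the copy column.
copy_values = data[COPY_HOUSEHOLD_MEMBER_COLS[column_name]]
via_constructor = pd.Series(copy_values, index=data.index, name=column_name)

# Suggested spelling: copy the column, then rename it.
via_rename = data[COPY_HOUSEHOLD_MEMBER_COLS[column_name]].copy().rename(column_name)

# Both results have the same values, index, and name.
pd.testing.assert_series_equal(via_constructor, via_rename)
```

Both produce a Series named after the target column; the second version makes the copy explicit.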

rmudambi · 2023-06-20T16:46:48Z

src/pseudopeople/noise_scaling.py

+    proportion_eligible = len(eligible_idx) / len(data)
+    if proportion_eligible == 0.0:
+        return 0.0
+    return 1 / proportion_eligible


Do we actually need this? Aren't all ineligible rows filtered out with the call to get_index_to_noise in the column noise type's __call__ method?

This scales the noise proportion just like we did for nicknames. The configured noise level is a proportion of all rows in the dataset, but some rows are ineligible in PRL, so we noise a correspondingly larger share of the eligible rows.

I think it may be confusing because it looks similar to finding the rows that have NAs, as in get_index_to_noise, but here we need to scale the noise level based on those NAs.
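
The arithmetic can be sketched on toy data as follows. The function name `toy_scaling`, the `copy_age` column, the assumption that eligibility means a non-missing copy value, and the empty-data guard are all illustrative; only the final three lines of the function mirror the diff above.

```python
import pandas as pd


def toy_scaling(data: pd.DataFrame, copy_column: str) -> float:
    """Return the factor by which the configured noise level is scaled."""
    if len(data) == 0:
        return 0.0  # guard added for this sketch only
    # Assumption: a row is eligible when it has a value to copy.
    eligible_idx = data.index[data[copy_column].notna().to_numpy()]
    proportion_eligible = len(eligible_idx) / len(data)
    if proportion_eligible == 0.0:
        return 0.0
    return 1 / proportion_eligible


data = pd.DataFrame(
    {"age": [30, 5, 62, 41], "copy_age": [28.0, None, None, 39.0]}
)
scaling = toy_scaling(data, "copy_age")  # 2 of 4 rows are eligible
configured_level = 0.1
effective_level = configured_level * scaling  # applied to eligible rows only
```

With half of the rows eligible, the factor is 2, so a configured level of 0.1 becomes 0.2 among eligible rows, which is still 10% of the full dataset in expectation.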

rmudambi · 2023-06-20T17:48:11Z

src/pseudopeople/schema_entities.py

@@ -147,7 +147,7 @@ class __Columns(NamedTuple):
        "itin",
        (
            NOISE_TYPES.leave_blank,
-            # NOISE_TYPES.copy_from_within_household,
+            NOISE_TYPES.copy_from_household_member,


I believe so, but the only dataset that has an ITIN column is the (unimplemented) 1040.

rmudambi · 2023-06-20T17:54:19Z

tests/unit/test_column_noise.py

+    assert (
+        dummy_dataset.loc[noised_idx, COPY_HOUSEHOLD_MEMBER_COLS["age"]]
+        == noised_data.loc[noised_idx]
+    ).all()


You can pass either a mask or an index in lines 212 and 213, so there is no need to convert the mask to an index.

I've changed the variable name from mask. It technically was a mask because it was a bool series, but it is not the same length as dummy_dataset and noised_data, which is why I need to find the index.
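
A small sketch of the situation described in the reply, using made-up data: the boolean Series covers only some of the rows, so it is converted to index labels before being used with `.loc` on the full-length objects. All names here are hypothetical.

```python
import pandas as pd

# Full-length data, standing in for dummy_dataset / noised_data.
full = pd.Series([10, 20, 30, 40], index=[0, 1, 2, 3])

# A boolean Series that covers only a subset of the rows.
subset_mask = pd.Series([True, False, True], index=[0, 2, 3])

# Convert the partial mask to the labels where it is True...
noised_idx = subset_mask[subset_mask].index

# ...and use those labels to select from the full-length object.
selected = full.loc[noised_idx]
```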

rmudambi · 2023-06-20T17:56:37Z

tests/unit/test_column_noise.py

+    ).mean()
+    is_close_wrapper(actual_noise, expected_noise, 0.02)
+
+    # Noised values should be the same as the copy column


You should also check that all simulants ineligible for noise remain unchanged.
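
One way such a check could look, sketched against a toy stand-in rather than the real noise function. `toy_copy_noise`, the `ssn` and `copy_ssn` columns, and the rule that a missing copy value makes a row ineligible are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def toy_copy_noise(data, column, copy_column, noise_level, seed=0):
    """Toy stand-in: copy values into a random subset of eligible rows."""
    rng = np.random.default_rng(seed)
    eligible = data.index[data[copy_column].notna().to_numpy()]
    to_noise = eligible[rng.random(len(eligible)) < noise_level]
    noised = data[column].copy()
    noised.loc[to_noise] = data.loc[to_noise, copy_column]
    return noised


data = pd.DataFrame(
    {
        "ssn": ["111", "222", "333", "444"],
        "copy_ssn": ["999", None, "888", None],
    }
)
# noise_level=1.0 noises every eligible row, which keeps the check deterministic.
noised = toy_copy_noise(data, "ssn", "copy_ssn", noise_level=1.0)

ineligible = data.index[data["copy_ssn"].isna().to_numpy()]
eligible = data.index.difference(ineligible)

# Rows with nothing to copy must be left exactly as they were.
assert (noised.loc[ineligible] == data.loc[ineligible, "ssn"]).all()
# Noised rows should match the copy column.
assert (noised.loc[eligible] == data.loc[eligible, "copy_ssn"]).all()
```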

rmudambi · 2023-06-20T17:59:31Z

tests/unit/test_column_noise.py

+    if noise == NOISE_TYPES.copy_from_household_member.name:
+        data = data[data_col]
+    else:
+        data = data.squeeze()


I don't think you need this if/else. The if block should work for all noise types, since in the non-copy cases you'll have a DataFrame with 1 column and data[data_col] will get that column.
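
A quick sketch of why column selection covers both cases, using hypothetical frames in place of the test data:

```python
import pandas as pd

data_col = "age"
one_col = pd.DataFrame({"age": [30, 5, 62]})
two_col = pd.DataFrame({"age": [30, 5, 62], "copy_age": [28, 31, 60]})

# For a one-column frame, selecting the column matches squeeze().
pd.testing.assert_series_equal(one_col[data_col], one_col.squeeze())

# With extra columns present, selection still returns a single Series.
selected = two_col[data_col]
assert isinstance(selected, pd.Series)

# squeeze() leaves a multi-column, multi-row frame as a DataFrame.
assert isinstance(two_col.squeeze(), pd.DataFrame)
```

Selecting by name also avoids a `squeeze()` edge case: a frame with one row and one column squeezes to a scalar rather than a Series.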

rmudambi · 2023-06-20T18:02:58Z

tests/unit/test_noise_form.py


    # FIXME: would be better to mock the dataset instead of using census
    noise_dataset(DATASETS.census, dummy_data, dummy_config_noise_numbers, 0)

-    call_order = [x[0] for x in mock.mock_calls if not x[0].startswith("__")]
+    call_order = [x[0] for x in mock.mock_calls if type(x[1][0]) == str]


Add a comment explaining what is going on here.

rmudambi · 2023-06-20T20:05:39Z

tests/unit/test_noise_form.py

@@ -150,6 +150,8 @@ def test_noise_order(mocker, dummy_data, dummy_config_noise_numbers):
    # FIXME: would be better to mock the dataset instead of using census
    noise_dataset(DATASETS.census, dummy_data, dummy_config_noise_numbers, 0)

+    # This is getting the string of each noise type. It is filtering down to string type for each noise type
+    # while not including mock calls that also contain that string.


But why do we care if this is a string? You should communicate these things:

- What is x[0]?
- What is x[1][0]? (the first argument of the call)
- Why do we care if the first argument of the mock call is a string?
- Mention the duplicate calls occurring due to having two methods mocked on the noise type
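
For readers unfamiliar with the structure of `mock_calls`, here is a minimal sketch using `unittest.mock` directly. Each entry behaves like a `(name, args, kwargs)` tuple, so `x[0]` is the dotted name of what was called and `x[1][0]` is the first positional argument. The names `leave_blank`, `use_nickname`, and `helper`, and the arguments passed, are made up for illustration and do not reflect what the real test mocks.

```python
from unittest.mock import MagicMock

parent = MagicMock()

# Calls recorded on the parent, in order.
parent.leave_blank("census")          # first argument is a string
parent.leave_blank.helper({"a": 1})   # first argument is not a string
parent.use_nickname("census")         # first argument is a string

# Every call is recorded, including the one on the child attribute.
all_names = [c[0] for c in parent.mock_calls]

# Keep only the calls whose first positional argument is a string.
call_order = [
    c[0] for c in parent.mock_calls if c[1] and isinstance(c[1][0], str)
]
```

Filtering on the type of the first argument is one way to drop the extra entries; the `c[1] and` guard is added here so that calls with no positional arguments do not raise an `IndexError`.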

albrja added 13 commits June 8, 2023 16:59

Update to entity types, scaling functions, and noise functions to inc…

72744a5

…orporate dfs

Update pytest with wrapper for np.isclose

39b0f68

Fixing usage of dataframes in noise type

a7c557f

Update to tests

83400b6

Removing dict of copy cols

0cc9d5f

Linting

276d444

Updating docstring

e11ef34

Addressing PR comments

28a1acd

Added copy_from_household_member noise function and updated column no…

8807d5c

…ise tests

Adding new sample data

8e43dde

Update other tests

262400b

Update test order for mock handling

d37e35f

Linting

ad5ebc2

albrja requested review from hussain-jafari, mattkappel, ramittal, rmudambi and stevebachmeier as code owners June 14, 2023 21:46

zmbc reviewed Jun 15, 2023

stevebachmeier reviewed Jun 15, 2023

stevebachmeier reviewed Jun 16, 2023

albrja added 2 commits June 16, 2023 11:13

Update test for noise scalling

ce7f92e

Update mock for test

bad1edb

stevebachmeier approved these changes Jun 16, 2023

Base automatically changed from copy-member-df to develop June 16, 2023 21:13

Merge

c8fc847

rmudambi reviewed Jun 20, 2023

Updates for PR review

681d286

rmudambi approved these changes Jun 20, 2023

Update comment for mock calls list

d9fbd80

albrja merged commit 70936c1 into develop Jun 20, 2023

albrja deleted the copy-member-noise-func branch June 20, 2023 22:05
