Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy from household member noise function implementation #191

Merged
merged 18 commits into from
Jun 20, 2023

Conversation

albrja
Copy link
Contributor

@albrja albrja commented Jun 14, 2023

Copy from household member noise function implementation

Implementation and tests for copy from household member noise function

  • Category: Feature
  • JIRA issue: MIC-4044

-implements and test copy from household member noise function
-adds default value for tax w2 dataset to not apply this noise function for SSN.

Testing

All tests pass.

copy_from_household_member: ColumnNoiseType = ColumnNoiseType(
"copy_from_household_member",
noise_functions.copy_from_household_member,
additional_column_getter=column_getters.copy_from_household_member_column_getter,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed this closely, but I'm surprised not to see a noise_level_scaling_function here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were right I forgot to port that over from the other branch but it is implemented now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename this to utilities.py so it's more generic and usaable for future utility functions

@@ -147,7 +147,7 @@ class __Columns(NamedTuple):
"itin",
(
NOISE_TYPES.leave_blank,
# NOISE_TYPES.copy_from_within_household,
NOISE_TYPES.copy_from_household_member,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there still ITIN implementation needed to be done?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe so, but the only dataset that has an ITIN column is the (unimplemented) 1040.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be changed. The 1040 would need to be updated and we would need to implement copying ITIN in PRL.

copy_from_household_member: ColumnNoiseType = ColumnNoiseType(
"copy_from_household_member",
noise_functions.copy_from_household_member,
additional_column_getter=column_getters.copy_from_household_member_column_getter,
Copy link
Contributor

@stevebachmeier stevebachmeier Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how this is working - don't you need to add this to the class ColumnNoiseType attributes alongside additional_parameters, etc?

Edit: nevermind, in the other PR

Base automatically changed from copy-member-df to develop June 16, 2023 21:13
@@ -8,3 +8,4 @@ class Keys:
TOKEN_PROBABILITY = "token_probability"
POSSIBLE_AGE_DIFFERENCES = "possible_age_differences"
ZIPCODE_DIGIT_PROBABILITIES = "digit_probabilities"
NO_NOISE = "no_noise"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to be here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No.

Keys.CELL_PROBABILITY: 0.00,
}
},
},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necessary, since below we are setting the noise level on the SSA dataset to 0 for all noise types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is no noise for copy_from_household_member for W2.

"""

copy_values = data[COPY_HOUSEHOLD_MEMBER_COLS[column_name]]
column = pd.Series(copy_values, index=data.index, name=column_name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks right, but this seems clearer to me:

column = (
    data[COPY_HOUSEHOLD_MEMBER_COLS[column_name]].copy()
    .rename(column_name)
)

Feel free to leave it as is if you disagree.

proportion_eligible = len(eligible_idx) / len(data)
if proportion_eligible == 0.0:
return 0.0
return 1 / proportion_eligible
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need this? Aren't all ineligible rows filtered out with the call to get_index_to_noise in the column noise type's __call__ method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is scaling the noise proportion just like we did for nicknames. Similar to how we have a dataset where we are choosing a certain number of rows for our noise level, we are choosing more of the eligible rows to noise due to those who are ineligible from PRL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe it is confusing because it looks similar to getting the rows that have NAs like in get_index_to_noise but we need to scale it based on NAs.

@@ -147,7 +147,7 @@ class __Columns(NamedTuple):
"itin",
(
NOISE_TYPES.leave_blank,
# NOISE_TYPES.copy_from_within_household,
NOISE_TYPES.copy_from_household_member,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe so, but the only dataset that has an ITIN column is the (unimplemented) 1040.

assert (
dummy_dataset.loc[noised_idx, COPY_HOUSEHOLD_MEMBER_COLS["age"]]
== noised_data.loc[noised_idx]
).all()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can send a mask or an index in lines 212 and 213, so no need to convert the mask to an index..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the variable name from mask. It technically was a mask because it was a bool series but it is not the same length as dummy_dataset and noised_data which is why I need to find the index.

).mean()
is_close_wrapper(actual_noise, expected_noise, 0.02)

# Noised values should be the same as the copy column
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should also check that all simulants ineligible for noise remain unchanged.

if noise == NOISE_TYPES.copy_from_household_member.name:
data = data[data_col]
else:
data = data.squeeze()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this if/else. The if block should work for all noise types, since in the non-copy cases you'll have a DataFrame with 1 column and data[data_col] will get that column.


# FIXME: would be better to mock the dataset instead of using census
noise_dataset(DATASETS.census, dummy_data, dummy_config_noise_numbers, 0)

call_order = [x[0] for x in mock.mock_calls if not x[0].startswith("__")]
call_order = [x[0] for x in mock.mock_calls if type(x[1][0]) == str]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment explaining what is going on here.

@@ -150,6 +150,8 @@ def test_noise_order(mocker, dummy_data, dummy_config_noise_numbers):
# FIXME: would be better to mock the dataset instead of using census
noise_dataset(DATASETS.census, dummy_data, dummy_config_noise_numbers, 0)

# This is getting the string of each noise type. It is filtering down to string type for each noise type
# while not including mock calls that also contain that string.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why do we care if this is a string? You should communicate these things:

  • What is x[0]?
  • What is x[1][0]? (the first argument of the call)
  • Why do we care if the first argument of the mock call is a string?
  • Mention the duplicate calls occurring due to having two methods mocked on the noise type*

@albrja albrja merged commit 70936c1 into develop Jun 20, 2023
@albrja albrja deleted the copy-member-noise-func branch June 20, 2023 22:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants