Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Numeric miswriting #26

Merged
merged 8 commits into from
Apr 4, 2023
Merged

Numeric miswriting #26

merged 8 commits into from
Apr 4, 2023

Conversation

albrja
Copy link
Contributor

@albrja albrja commented Apr 1, 2023

Numeric miswriting noise function

Implementation and tests for numeric miswriting noise function

  • Category: Feature
  • JIRA issue: MIC-3907

-Adds numeric miswriting noise function
-Adds tests for numeric miswriting noise function

Testing

Test suites pass with no failures

assert np.isclose(
expected_noise,
(data[ssn].str[i] != noised_data[ssn].str[i]).mean(),
rtol=0.03,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fails with rtol=0.02 but passes with 0.03. Should I change all of them or is it worth investigating?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine. After you make the adjustment to token level noise try again with rtol=0.02 since you'll get different randomness. If it only passes at 0.03 after that, it's fine.

src/pseudopeople/noise_functions.py Outdated Show resolved Hide resolved
tests/unit/test_column_noise.py Outdated Show resolved Hide resolved
tests/unit/test_column_noise.py Outdated Show resolved Hide resolved
tests/unit/test_column_noise.py Outdated Show resolved Hide resolved
src/pseudopeople/noise_functions.py Outdated Show resolved Hide resolved
tests/unit/test_column_noise.py Outdated Show resolved Hide resolved
tests/unit/test_column_noise.py Outdated Show resolved Hide resolved
tests/unit/test_column_noise.py Outdated Show resolved Hide resolved
assert np.isclose(
expected_noise,
(data[ssn].str[i] != noised_data[ssn].str[i]).mean(),
rtol=0.03,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine. After you make the adjustment to token level noise try again with rtol=0.02 since you'll get different randomness. If it only passes at 0.03 after that, it's fine.

@@ -160,9 +184,15 @@ taxes_w2_and_1099:
mailing_address_street_number:
missing_data:
row_noise_level: 0.01
numeric_miswriting:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have this in my PR but you'd better add it here as well. This config is currently missing mailing_address_po_box for the tax forms.

mailing_address_po_box:
        missing_data:
            row_noise_level: 0.01

@@ -138,6 +159,9 @@ taxes_w2_and_1099:
income:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs have noising for "Income / Wages". Can you clarify that there is no concept of wages? ie everything is income?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

src/pseudopeople/noise_functions.py Outdated Show resolved Hide resolved
is_number = pd.concat(
[same_len_col.str[i].str.isdigit() for i in range(longest_str)], axis=1
)
is_number.columns = list(range(len(is_number.columns)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to set the columns explicitly? They should automatically have those values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was being used because below, we used loc on the datafrom instead of iloc when looping through each column. I have updated to iloc.

tests/unit/test_column_noise.py Outdated Show resolved Hide resolved
for i in range(7): # "Unit 1A"
if i == 4:
assert (data[unit_number].str[i] == noised_data[unit_number].str[i]).all()
assert (noised_data[unit_number].str[i].str.isspace()).all()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I said above you don't need to check that this is still a space since you're already checking that it is unchanged.

src/pseudopeople/noise_functions.py Outdated Show resolved Hide resolved
@@ -183,7 +183,7 @@ def miswrite_numerics(
noised_column = pd.Series("", index=column.index)
digits = []
for i in range(len(is_number.columns)):
digit = np.where(replace.loc[:, i], random_digits[:, i], same_len_col.str[i])
digit = np.where(replace.iloc[:, i], random_digits[:, i], same_len_col.str[i])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this not work with loc? What are the column names?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No because loc is name based.

@@ -200,7 +199,6 @@ def test_miswrite_numerics(string_series):
for i in range(7): # "Unit 1A"
if i == 4:
assert (data[unit_number].str[i] == noised_data[unit_number].str[i]).all()
assert (noised_data[unit_number].str[i].str.isspace()).all()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you've removed the isspace assert, this if block is now the same as the else block, so you can just remove it and have it use the else.

rtol=0.02,
)
assert (noised_data[unit_number].str[i].str.isdigit()).all()
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just a copy of line 200. I think what you need is

if i == 5: # check numbers
else: # check everything else is the same

@albrja albrja merged commit b05d20d into develop Apr 4, 2023
@albrja albrja deleted the numeric-miswriting branch April 4, 2023 00:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants