-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Numeric miswriting #26
Conversation
tests/unit/test_column_noise.py
Outdated
assert np.isclose( | ||
expected_noise, | ||
(data[ssn].str[i] != noised_data[ssn].str[i]).mean(), | ||
rtol=0.03, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fails with rtol=0.02 but passes with 0.03. Should I change all of them or is it worth investigating?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine. After you make the adjustment to token level noise try again with rtol=0.02
since you'll get different randomness. If it only passes at 0.03 after that, it's fine.
tests/unit/test_column_noise.py
Outdated
assert np.isclose( | ||
expected_noise, | ||
(data[ssn].str[i] != noised_data[ssn].str[i]).mean(), | ||
rtol=0.03, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine. After you make the adjustment to token level noise try again with rtol=0.02
since you'll get different randomness. If it only passes at 0.03 after that, it's fine.
@@ -160,9 +184,15 @@ taxes_w2_and_1099: | |||
mailing_address_street_number: | |||
missing_data: | |||
row_noise_level: 0.01 | |||
numeric_miswriting: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have this in my PR but you'd better add it here as well. This config is currently missing mailing_address_po_box
for the tax forms.
mailing_address_po_box:
missing_data:
row_noise_level: 0.01
@@ -138,6 +159,9 @@ taxes_w2_and_1099: | |||
income: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs have noising for "Income / Wages". Can you clarify that there is no concept of wages? ie everything is income?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
src/pseudopeople/noise_functions.py
Outdated
is_number = pd.concat( | ||
[same_len_col.str[i].str.isdigit() for i in range(longest_str)], axis=1 | ||
) | ||
is_number.columns = list(range(len(is_number.columns))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need to set the columns explicitly? They should automatically have those values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was being used because below, we used loc on the datafrom instead of iloc when looping through each column. I have updated to iloc.
tests/unit/test_column_noise.py
Outdated
for i in range(7): # "Unit 1A" | ||
if i == 4: | ||
assert (data[unit_number].str[i] == noised_data[unit_number].str[i]).all() | ||
assert (noised_data[unit_number].str[i].str.isspace()).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like I said above you don't need to check that this is still a space since you're already checking that it is unchanged.
@@ -183,7 +183,7 @@ def miswrite_numerics( | |||
noised_column = pd.Series("", index=column.index) | |||
digits = [] | |||
for i in range(len(is_number.columns)): | |||
digit = np.where(replace.loc[:, i], random_digits[:, i], same_len_col.str[i]) | |||
digit = np.where(replace.iloc[:, i], random_digits[:, i], same_len_col.str[i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this not work with loc
? What are the column names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No because loc is name based.
tests/unit/test_column_noise.py
Outdated
@@ -200,7 +199,6 @@ def test_miswrite_numerics(string_series): | |||
for i in range(7): # "Unit 1A" | |||
if i == 4: | |||
assert (data[unit_number].str[i] == noised_data[unit_number].str[i]).all() | |||
assert (noised_data[unit_number].str[i].str.isspace()).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you've removed the isspace
assert, this if
block is now the same as the else
block, so you can just remove it and have it use the else
.
rtol=0.02, | ||
) | ||
assert (noised_data[unit_number].str[i].str.isdigit()).all() | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just a copy of line 200. I think what you need is
if i == 5: # check numbers
else: # check everything else is the same
Numeric miswriting noise function
Implementation and tests for numeric miswriting noise function
-Adds numeric miswriting noise function
-Adds tests for numeric miswriting noise function
Testing
Test suites pass with no failures