Numeric miswriting #26

albrja · 2023-04-01T00:07:59Z

Numeric miswriting noise function

Implementation and tests for numeric miswriting noise function

Category: Feature
JIRA issue: MIC-3907

- Adds numeric miswriting noise function
- Adds tests for numeric miswriting noise function

Testing

Test suites pass with no failures

albrja · 2023-04-01T00:09:39Z

tests/unit/test_column_noise.py

+            assert np.isclose(
+                expected_noise,
+                (data[ssn].str[i] != noised_data[ssn].str[i]).mean(),
+                rtol=0.03,


This fails with rtol=0.02 but passes with 0.03. Should I change all of them or is it worth investigating?

rmudambi · 2023-04-01T00:53:38Z

tests/unit/test_column_noise.py

+            assert np.isclose(
+                expected_noise,
+                (data[ssn].str[i] != noised_data[ssn].str[i]).mean(),
+                rtol=0.03,


I think it's fine. After you make the adjustment to token level noise try again with rtol=0.02 since you'll get different randomness. If it only passes at 0.03 after that, it's fine.
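To judge whether the gap between rtol=0.02 and rtol=0.03 is worth investigating, it can help to estimate how much the observed noise rate is expected to wobble from sampling alone. The following is a rough sketch, not code from this PR: the noise level, row count, and the `relative_std_error` helper are all hypothetical, and the binomial approximation assumes each row is noised independently.

```python
import numpy as np


def relative_std_error(noise_level: float, n_rows: int) -> float:
    """Relative standard error of an observed proportion (binomial approximation)."""
    return float(np.sqrt((1 - noise_level) / (noise_level * n_rows)))


# Hypothetical values for illustration only; the real test fixtures may differ.
expected_noise = 0.25
n_rows = 10_000

rng = np.random.default_rng(12345)
# Simulate which rows had a given character changed.
changed = rng.random(n_rows) < expected_noise
observed_noise = changed.mean()

rse = relative_std_error(expected_noise, n_rows)
relative_deviation = abs(observed_noise - expected_noise) / expected_noise
print(f"relative standard error:     {rse:.4f}")
print(f"observed relative deviation: {relative_deviation:.4f}")

# np.isclose(a, b, rtol=r) checks |a - b| <= atol + r * |b|.
print("passes at rtol=0.02:", np.isclose(expected_noise, observed_noise, rtol=0.02))
print("passes at rtol=0.03:", np.isclose(expected_noise, observed_noise, rtol=0.03))
```

If the relative standard error is itself close to the tolerance, an occasional failure at rtol=0.02 is what sampling variation alone would produce, which is consistent with the advice to retry after the randomness changes.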

stevebachmeier · 2023-04-03T16:08:50Z

src/pseudopeople/default_configuration.yaml

@@ -160,9 +184,15 @@ taxes_w2_and_1099:
    mailing_address_street_number:
        missing_data:
            row_noise_level: 0.01
+        numeric_miswriting:


I have this in my PR but you'd better add it here as well. This config is currently missing mailing_address_po_box for the tax forms.

    mailing_address_po_box:
        missing_data:
            row_noise_level: 0.01

stevebachmeier · 2023-04-03T16:10:31Z

src/pseudopeople/default_configuration.yaml

@@ -138,6 +159,9 @@ taxes_w2_and_1099:
    income:


The docs have noising for "Income / Wages". Can you clarify that there is no concept of wages, i.e., that everything is income?

rmudambi · 2023-04-03T21:51:32Z

src/pseudopeople/noise_functions.py

+    is_number = pd.concat(
+        [same_len_col.str[i].str.isdigit() for i in range(longest_str)], axis=1
+    )
+    is_number.columns = list(range(len(is_number.columns)))


Why do you need to set the columns explicitly? They should automatically have those values.

This was needed because, further down, we used loc on the DataFrame instead of iloc when looping through each column. I have updated it to use iloc.

rmudambi · 2023-04-03T21:56:10Z

tests/unit/test_column_noise.py

+    for i in range(7):  # "Unit 1A"
+        if i == 4:
+            assert (data[unit_number].str[i] == noised_data[unit_number].str[i]).all()
+            assert (noised_data[unit_number].str[i].str.isspace()).all()


Like I said above, you don't need to check that this is still a space since you're already checking that it is unchanged.

rmudambi · 2023-04-03T23:00:24Z

src/pseudopeople/noise_functions.py

@@ -183,7 +183,7 @@ def miswrite_numerics(
    noised_column = pd.Series("", index=column.index)
    digits = []
    for i in range(len(is_number.columns)):
-        digit = np.where(replace.loc[:, i], random_digits[:, i], same_len_col.str[i])
+        digit = np.where(replace.iloc[:, i], random_digits[:, i], same_len_col.str[i])


Does this not work with loc? What are the column names?

No, because loc is label-based: it looks columns up by name, not by position.
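A small sketch of why the column labels matter here, using made-up data rather than the project's fixtures. When the Series passed to `pd.concat(..., axis=1)` carry a name, the resulting columns take that name instead of 0, 1, 2, so a positional lookup has to go through `iloc`. The Series name `unit_number` and the sample values are assumptions for illustration.

```python
import pandas as pd

# Illustrative data; the Series name stands in for a real column name.
column = pd.Series(["12A", "3B4", "567"], name="unit_number")
longest_str = int(column.str.len().max())

# One boolean Series per character position, concatenated side by side.
is_number = pd.concat(
    [column.str[i].str.isdigit() for i in range(longest_str)], axis=1
)
# Every column inherits the Series name, so the labels are not 0, 1, 2.
print(list(is_number.columns))

# Positional access works regardless of the labels.
second_position = is_number.iloc[:, 1]

# Label-based access fails because there is no column labelled 1.
try:
    is_number.loc[:, 1]
    loc_worked = True
except KeyError:
    loc_worked = False
print("loc[:, 1] worked:", loc_worked)
```

Unnamed Series would get integer column labels from `pd.concat`, in which case `loc` and `iloc` would agree; using `iloc` avoids depending on that.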

rmudambi · 2023-04-03T23:01:45Z

tests/unit/test_column_noise.py

@@ -200,7 +199,6 @@ def test_miswrite_numerics(string_series):
    for i in range(7):  # "Unit 1A"
        if i == 4:
            assert (data[unit_number].str[i] == noised_data[unit_number].str[i]).all()
-            assert (noised_data[unit_number].str[i].str.isspace()).all()


Since you've removed the isspace assert, this if block is now the same as the else block, so you can just remove it and have it use the else.

stevebachmeier · 2023-04-03T23:07:03Z

tests/unit/test_column_noise.py

+                rtol=0.02,
+            )
+            assert (noised_data[unit_number].str[i].str.isdigit()).all()
+        else:


This is just a copy of line 200. I think what you need is:

    if i == 5:
        # check numbers
    else:
        # check everything else is the same
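The suggested structure can be sketched end to end as follows. This is not the PR's implementation: `miswrite_numerics_sketch` is a hypothetical, unvectorized stand-in for the real noise function, and the noise level, row count, and tolerance bounds are assumed values chosen only to make the example self-contained.

```python
import numpy as np
import pandas as pd


def miswrite_numerics_sketch(
    column: pd.Series, token_noise_level: float, seed: int = 0
) -> pd.Series:
    """Hypothetical stand-in: replace each digit with a random digit
    with probability token_noise_level; leave other characters alone."""
    rng = np.random.default_rng(seed)

    def noise_one(value: str) -> str:
        out = []
        for ch in value:
            if ch.isdigit() and rng.random() < token_noise_level:
                # The replacement may equal the original digit (1 in 10 chance).
                out.append(str(rng.integers(0, 10)))
            else:
                out.append(ch)
        return "".join(out)

    return column.map(noise_one)


data = pd.Series(["Unit 1A"] * 10_000)
noised_data = miswrite_numerics_sketch(data, token_noise_level=0.5)

for i in range(7):  # "Unit 1A"
    if i == 5:
        # Check numbers: the position still holds a digit, and some were changed.
        assert noised_data.str[i].str.isdigit().all()
        changed_fraction = (data.str[i] != noised_data.str[i]).mean()
    else:
        # Check everything else is the same.
        assert (data.str[i] == noised_data.str[i]).all()

print(f"fraction of digits changed: {changed_fraction:.3f}")
```

Only index 5 of "Unit 1A" is a digit, so it is the single position that needs the noise-level check; every other position, including the space at index 4, is covered by the equality check.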

albrja added 4 commits March 30, 2023 10:35

Update to incorrect_selection.csv for change in post-processing tax_f…

b8d436f

…orm column

Removing unnecessary helper function due to refactor

9943a07

Implementation and tests for numeric miswriting

d7c6011

Merge branch 'develop' into numeric-miswriting

bcc034c

albrja requested review from hussain-jafari, mattkappel, ramittal, rmudambi and stevebachmeier as code owners April 1, 2023 00:08

albrja commented Apr 1, 2023

rmudambi requested changes Apr 1, 2023

stevebachmeier reviewed Apr 3, 2023

Addressing PR comments

5842700

rmudambi approved these changes Apr 3, 2023

More PR updates

c758a53

rmudambi reviewed Apr 3, 2023

stevebachmeier reviewed Apr 3, 2023

Removing unnecessary comment

dc6a0a7

stevebachmeier approved these changes Apr 4, 2023

Merge branch 'develop' into numeric-miswriting

abbcb89

albrja merged commit b05d20d into develop Apr 4, 2023

albrja deleted the numeric-miswriting branch April 4, 2023 00:30
