Bugfix/trailing spaces #184

albrja · 2023-05-16T17:58:17Z

Bugfix/trailing spaces

Fixes a bug that added trailing spaces to strings.

Category: Bugfix
JIRA issue: MIC-4048

- Updates the miswrite numerics noise function to properly strip whitespace from strings
- Updates tests to incorporate the change

Testing

All tests pass.

albrja · 2023-05-16T18:00:20Z

tests/integration/test_interface.py

@@ -99,7 +99,7 @@ def test_generate_dataset_from_sample_and_source(
        ).mean()
        # we special-case a few sparse columns that have larger differences
        if dataset_name == DATASETS.cps.name and col == COLUMNS.unit_number.name:
-            rtol = 0.21
+            rtol = 0.30


This is caused by the update to miswrite numerics, which is a main source of noise for the unit number column.

Shouldn't the change you made have decreased (or at least not increased) the amount of noise? Why did the rtol need to increase?

Does it somehow make sense that this increased after stripping?

This is how I was thinking about it, but it still confuses me. We noise this column by the same amount with all of our noise types here, but there are now fewer characters to noise, and miswrite numerics does NOT noise empty strings. I would have expected the same thing @rmudambi does. My guess is that this isn't caused by miswrite numerics but by phonetic or OCR noise, where the shorter strings make it more likely for a row to be chosen, as we saw with middle initial.
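For readers following the rtol discussion, here is a rough illustration of what moving from 0.21 to 0.30 allows. It assumes the test compares noise levels with something like `np.isclose`, which checks `|a - b| <= atol + rtol * |b|`; the two noise levels below are made-up numbers, not values from the test suite.

```python
import numpy as np

# Hypothetical values, chosen only to show the effect of the tolerance.
expected_noise = 0.10   # expected noise level for the column
observed_noise = 0.125  # observed level, 25% above expected

# With the old tolerance a 25% relative difference is rejected.
passes_old = bool(np.isclose(observed_noise, expected_noise, rtol=0.21))

# With the new tolerance the same difference is accepted.
passes_new = bool(np.isclose(observed_noise, expected_noise, rtol=0.30))
```

So the change lets the observed noise level drift up to about 30% from the expected level before the test fails, compared with about 21% before.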

stevebachmeier · 2023-05-16T19:57:22Z

src/pseudopeople/noise_functions.py

@@ -356,7 +356,7 @@ def write_wrong_digits(
        digit = pd.Series(digit, index=column.index, name=column.name)
        digits.append(digit)
        noised_column = noised_column + digits[i]
-    noised_column.str.strip()
+    noised_column = noised_column.str.strip()
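
A minimal sketch of why the one-line change matters: pandas string methods return a new Series instead of modifying the original, so the result has to be assigned. The sample values are hypothetical, not data from pseudopeople.

```python
import pandas as pd

# Hypothetical values padded with a trailing space, as the bug produced.
noised_column = pd.Series(["123 ", "45 ", ""])

# Calling .str.strip() on its own builds a stripped copy and discards it.
noised_column.str.strip()
unchanged = noised_column.tolist()

# Rebinding the name keeps the stripped result.
noised_column = noised_column.str.strip()
stripped = noised_column.tolist()
```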


lol whoops.

This is like when I found one of the integration tests was passing b/c we were doing np.isclose()...but we weren't actually asserting anything.

It's just False but don't worry about it XD
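
The same pattern, sketched for the `np.isclose()` case mentioned above: a comparison whose result is never asserted cannot fail a test. The values and the `check_close` helper are hypothetical, not code from the repository.

```python
import numpy as np

# Hypothetical values that clearly differ.
observed, expected = 0.50, 0.10

# This computes False and throws it away, so a test containing only
# this line always passes.
np.isclose(observed, expected, rtol=0.21)


def check_close(observed: float, expected: float, rtol: float) -> None:
    """Fail loudly when the values differ by more than the tolerance."""
    assert np.isclose(observed, expected, rtol=rtol), (observed, expected)
```

Calling `check_close(0.50, 0.10, rtol=0.21)` raises `AssertionError`, which is what the bare call silently skipped.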

stevebachmeier · 2023-05-16T19:58:32Z

tests/unit/test_column_noise.py

@@ -459,6 +459,8 @@ def test_miswrite_numerics(string_series):

    # Check empty strings havent changed
    assert (noised_data[empty_str] == "").all()
+    # Assert string length doesn't change after noising
+    assert (data.str.len() == noised_data.str.len()).all()
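
A self-contained sketch of the length check added above. `replace_digits` is a hypothetical stand-in for the real noise function and the sample data is made up; the point is only that swapping digits one for one should never change a string's length.

```python
import numpy as np
import pandas as pd


def replace_digits(column: pd.Series, seed: int = 0) -> pd.Series:
    """Hypothetical stand-in: swap every digit for a random digit."""
    rng = np.random.default_rng(seed)

    def noise(value: str) -> str:
        return "".join(
            str(rng.integers(0, 10)) if ch.isdigit() else ch for ch in value
        )

    return column.map(noise)


data = pd.Series(["12 Main St", "", "Apt 4B", "90210"])
noised_data = replace_digits(data)

# Empty strings stay empty, and no string gains or loses characters.
empty_str = data == ""
assert (noised_data[empty_str] == "").all()
assert (data.str.len() == noised_data.str.len()).all()
```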


Fixing string bug causing tailing empty space

cb78d3c

albrja requested review from hussain-jafari, mattkappel, ramittal, rmudambi and stevebachmeier as code owners May 16, 2023 17:58

albrja commented May 16, 2023

stevebachmeier reviewed May 16, 2023

stevebachmeier approved these changes May 16, 2023

rmudambi approved these changes May 16, 2023

albrja added 2 commits May 16, 2023 14:29

Changing names in test

32625f0

Merge branch 'develop' into bugfix/trailing-space

ebd4a80

albrja merged commit 5d419ba into develop May 17, 2023

albrja deleted the bugfix/trailing-space branch May 17, 2023 00:56
