Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/trailing spaces #184

Merged
merged 3 commits into from
May 17, 2023
Merged

Bugfix/trailing spaces #184

merged 3 commits into from
May 17, 2023

Conversation

albrja
Copy link
Contributor

@albrja albrja commented May 16, 2023

Bugfix/trailing spaces

Fixes bug adding trailing spaces to strings.

-updates miswrite numerics noise function to properly strip strings of empty characters
-updates tests to incorporate change

Testing

All tests pass.

@@ -99,7 +99,7 @@ def test_generate_dataset_from_sample_and_source(
).mean()
# we special-case a few sparse columns that have larger differences
if dataset_name == DATASETS.cps.name and col == COLUMNS.unit_number.name:
rtol = 0.21
rtol = 0.30
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is caused by the umpdate to miswrite numerics which is a main source of noise for unit number column.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the change you made have decreased (or at least not increased) the amount of noise? Why did the rtol need to increase?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it somehow make sense that this increased after stripping?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the way I was thinking of it but it still is confusing to me - we are noising the same amount for this column with all our noise types here but we now have less characters to noise but in miswrite numerics we DO NOT noise empty strings. I would have thought similar to what @rmudambi expects. My guess is this isn't being caused by miswrite numerics but instead is phonetic or ocr and the shorter strings make it more likely for a row to be chosen as we saw with middle initial.

@@ -356,7 +356,7 @@ def write_wrong_digits(
digit = pd.Series(digit, index=column.index, name=column.name)
digits.append(digit)
noised_column = noised_column + digits[i]
noised_column.str.strip()
noised_column = noised_column.str.strip()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol whoops.

This is like when I found one of the integration tests was passing b/c we were doing np.isclose()...but we weren't actually asserting anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just False but don't worry about it XD

@@ -459,6 +459,8 @@ def test_miswrite_numerics(string_series):

# Check empty strings havent changed
assert (noised_data[empty_str] == "").all()
# Assert string length doesn't change after noising
assert (data.str.len() == noised_data.str.len()).all()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@albrja albrja merged commit 5d419ba into develop May 17, 2023
@albrja albrja deleted the bugfix/trailing-space branch May 17, 2023 00:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants