-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bugfix/trailing spaces #184
Conversation
@@ -99,7 +99,7 @@ def test_generate_dataset_from_sample_and_source( | |||
).mean() | |||
# we special-case a few sparse columns that have larger differences | |||
if dataset_name == DATASETS.cps.name and col == COLUMNS.unit_number.name: | |||
rtol = 0.21 | |||
rtol = 0.30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is caused by the umpdate to miswrite numerics which is a main source of noise for unit number column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the change you made have decreased (or at least not increased) the amount of noise? Why did the rtol need to increase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it somehow make sense that this increased after stripping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the way I was thinking of it but it still is confusing to me - we are noising the same amount for this column with all our noise types here but we now have less characters to noise but in miswrite numerics we DO NOT noise empty strings. I would have thought similar to what @rmudambi expects. My guess is this isn't being caused by miswrite numerics but instead is phonetic or ocr and the shorter strings make it more likely for a row to be chosen as we saw with middle initial.
@@ -356,7 +356,7 @@ def write_wrong_digits( | |||
digit = pd.Series(digit, index=column.index, name=column.name) | |||
digits.append(digit) | |||
noised_column = noised_column + digits[i] | |||
noised_column.str.strip() | |||
noised_column = noised_column.str.strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol whoops.
This is like when I found one of the integration tests was passing b/c we were doing np.isclose()
...but we weren't actually asserting anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just False but don't worry about it XD
@@ -459,6 +459,8 @@ def test_miswrite_numerics(string_series): | |||
|
|||
# Check empty strings havent changed | |||
assert (noised_data[empty_str] == "").all() | |||
# Assert string length doesn't change after noising | |||
assert (data.str.len() == noised_data.str.len()).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
Bugfix/trailing spaces
Fixes bug adding trailing spaces to strings.
-updates miswrite numerics noise function to properly strip strings of empty characters
-updates tests to incorporate change
Testing
All tests pass.