-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Copy member df #190
Copy member df #190
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few comments. If follow them, you should reduce the footprint of this PR significantly. Looks good otherwise though.
src/pseudopeople/entity_types.py
Outdated
@@ -48,51 +48,57 @@ class ColumnNoiseType: | |||
The name is the name of the particular noise type (e.g. use_nickname" or | |||
"make_phonetic_errors"). | |||
|
|||
The noise function takes as input a Series, the ConfigTree object for this | |||
The noise function takes as input a dataframe, the ConfigTree object for this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Capitalize DataFrame
column = column.copy() | ||
if data[column_name].empty: | ||
return data[column_name] | ||
data = data.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you're assigning a new value to the column in the data frame, you shouldn't need to do this copy here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I don't do this I get noise levels of 0 in the tests. I have been unable to figure out why that is the case because your comment makes sense.
src/pseudopeople/noise_functions.py
Outdated
) | ||
|
||
data = data[column_name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this function and in every other one,, if you do column = data[column_name]
rather than reassigning it to data
you don't have to modify as many lines
src/pseudopeople/noise_functions.py
Outdated
""" | ||
|
||
return pd.Series(np.nan, index=column.index) | ||
data = data[column_name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't necessary. You can just do data.index
of a DataFrame
src/pseudopeople/schema_entities.py
Outdated
@@ -1,5 +1,5 @@ | |||
from dataclasses import dataclass, field | |||
from typing import Dict, NamedTuple, Optional, Tuple | |||
from typing import Dict, NamedTuple, Optional, Tuple, Union |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No
"baz3", | ||
"fo1", | ||
"fo2", | ||
"fo3", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why were these changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to test another variation of string formatting for test numerics.
tests/unit/test_column_noise.py
Outdated
|
||
|
||
def np_isclose_wrapper(actual_noise, expected_noise, rtol): | ||
return np.isclose(actual_noise, expected_noise, rtol) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you're always using the same failure message, you should define that in this function so you don't have to duplicate it each time you call this. I was envisioning this function looking like this:
def assert_is_close(actual_noise: float, expected_noise: float, rtol: float) -> None:
assert np_isclose_wrapper(
actual_noise, expected_noise, rtol
), f"Actual noise is {actual_noise} while expected noise was {expected_noise} with a rtol of {rtol}"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
:returns: pd.Series where data has been noised with other values from a list of possibilities | ||
""" | ||
|
||
selection_type = { | ||
"employer_state": "state", | ||
"mailing_address_state": "state", | ||
}.get(str(column.name), column.name) | ||
}.get(str(column_name), column_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: no need to str()
if isinstance(data, pd.Series): | ||
not_empty_idx = data.index[(data != "") & (data.notna())] | ||
if is_column_noise: | ||
missing_idx = data.index[(data.isna().any(axis=1)) | (data.isin([""]).any(axis=1))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but out of curiousity why did you switch to the opposite logic and then take the difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting the missing_idx helped me get this working while debugging. I was having trouble getting and and all to work for the original logic. I also thought I may have to extract this line of code into a helper function for a mock like we did in one of the refactors but that wasn't the case.
Refactor to pass dataframes to column noise types
Refactor that passes dataframes to column noise types
-passes dataframes to column noise types
-updates noise scaling and utils functions
-updates test suites for refactor
Testing
All tests pass successfully.