Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy member df #190

Merged
merged 8 commits into from
Jun 16, 2023
Merged

Copy member df #190

merged 8 commits into from
Jun 16, 2023

Conversation

albrja
Copy link
Contributor

@albrja albrja commented Jun 10, 2023

Refactor to pass dataframes to column noise types

Refactor that passes dataframes to column noise types

  • Category: Refactor
  • JIRA issue: MIC-4048

-passes dataframes to column noise types
-updates noise scaling and utils functions
-updates test suites for refactor

Testing

All tests pass successfully.

Copy link
Collaborator

@rmudambi rmudambi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few comments. If follow them, you should reduce the footprint of this PR significantly. Looks good otherwise though.

@@ -48,51 +48,57 @@ class ColumnNoiseType:
The name is the name of the particular noise type (e.g. use_nickname" or
"make_phonetic_errors").

The noise function takes as input a Series, the ConfigTree object for this
The noise function takes as input a dataframe, the ConfigTree object for this
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Capitalize DataFrame

column = column.copy()
if data[column_name].empty:
return data[column_name]
data = data.copy()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're assigning a new value to the column in the data frame, you shouldn't need to do this copy here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I don't do this I get noise levels of 0 in the tests. I have been unable to figure out why that is the case because your comment makes sense.

)

data = data[column_name]
Copy link
Collaborator

@rmudambi rmudambi Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this function and in every other one,, if you do column = data[column_name] rather than reassigning it to data you don't have to modify as many lines

"""

return pd.Series(np.nan, index=column.index)
data = data[column_name]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necessary. You can just do data.index of a DataFrame

@@ -1,5 +1,5 @@
from dataclasses import dataclass, field
from typing import Dict, NamedTuple, Optional, Tuple
from typing import Dict, NamedTuple, Optional, Tuple, Union
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No

"baz3",
"fo1",
"fo2",
"fo3",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why were these changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to test another variation of string formatting for test numerics.



def np_isclose_wrapper(actual_noise, expected_noise, rtol):
return np.isclose(actual_noise, expected_noise, rtol)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're always using the same failure message, you should define that in this function so you don't have to duplicate it each time you call this. I was envisioning this function looking like this:

def assert_is_close(actual_noise: float, expected_noise: float, rtol: float) -> None:
    assert np_isclose_wrapper(
        actual_noise, expected_noise, rtol
    ), f"Actual noise is {actual_noise} while expected noise was {expected_noise} with a rtol of {rtol}"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

:returns: pd.Series where data has been noised with other values from a list of possibilities
"""

selection_type = {
"employer_state": "state",
"mailing_address_state": "state",
}.get(str(column.name), column.name)
}.get(str(column_name), column_name)
Copy link
Contributor

@stevebachmeier stevebachmeier Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no need to str()

if isinstance(data, pd.Series):
not_empty_idx = data.index[(data != "") & (data.notna())]
if is_column_noise:
missing_idx = data.index[(data.isna().any(axis=1)) | (data.isin([""]).any(axis=1))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but out of curiousity why did you switch to the opposite logic and then take the difference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting the missing_idx helped me get this working while debugging. I was having trouble getting and and all to work for the original logic. I also thought I may have to extract this line of code into a helper function for a mock like we did in one of the refactors but that wasn't the case.

@albrja albrja merged commit b4ae5d8 into develop Jun 16, 2023
@albrja albrja deleted the copy-member-df branch June 16, 2023 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants