Copy member df #190

albrja · 2023-06-10T00:16:49Z

Refactor to pass dataframes to column noise types

Refactor that passes dataframes to column noise types

Category: Refactor
JIRA issue: MIC-4048

- passes dataframes to column noise types
- updates noise scaling and utils functions
- updates test suites for refactor

Testing

All tests pass successfully.

rmudambi

I have a few comments. If you follow them, you should reduce the footprint of this PR significantly. Looks good otherwise, though.

rmudambi · 2023-06-12T23:08:40Z

src/pseudopeople/entity_types.py

@@ -48,51 +48,57 @@ class ColumnNoiseType:
    The name is the name of the particular noise type (e.g. use_nickname" or
    "make_phonetic_errors").

-    The noise function takes as input a Series, the ConfigTree object for this
+    The noise function takes as input a dataframe, the ConfigTree object for this


Nit: Capitalize DataFrame

rmudambi · 2023-06-12T23:53:57Z

src/pseudopeople/entity_types.py

-        column = column.copy()
+        if data[column_name].empty:
+            return data[column_name]
+        data = data.copy()


Since you're assigning a new value to the column in the data frame, you shouldn't need to do this copy here.

albrja · 2023-06-13

If I don't do this, I get noise levels of 0 in the tests. I have been unable to figure out why that is the case, because your comment makes sense.
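One possible explanation, not confirmed anywhere in this thread, is that assigning to a column mutates the caller's DataFrame, so a test that compares the noised column against the "original" frame sees no difference. The sketch below illustrates that effect with two hypothetical functions (noise_in_place and noise_on_copy are made-up names, not pseudopeople code).

```python
import pandas as pd


def noise_in_place(data: pd.DataFrame, column_name: str) -> pd.Series:
    # Hypothetical: assigns to the column without copying, so the caller's frame changes too.
    data[column_name] = data[column_name].str.upper()
    return data[column_name]


def noise_on_copy(data: pd.DataFrame, column_name: str) -> pd.Series:
    # Hypothetical: copies first, so the caller's frame is left untouched.
    data = data.copy()
    data[column_name] = data[column_name].str.upper()
    return data[column_name]


# Without the copy, "original" is modified, so the comparison finds no noise.
original = pd.DataFrame({"name": ["ann", "bob"]})
noised = noise_in_place(original, "name")
in_place_noise_level = float((noised != original["name"]).mean())

# With the copy, "original" keeps its values and the noise is visible.
original = pd.DataFrame({"name": ["ann", "bob"]})
noised = noise_on_copy(original, "name")
copy_noise_level = float((noised != original["name"]).mean())
```

If the tests measure noise by comparing against the frame that was passed in, this would be consistent with the observed noise level of 0 when the copy is removed.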

rmudambi · 2023-06-12T23:59:30Z

src/pseudopeople/noise_functions.py

        )

+    data = data[column_name]


In this function and in every other one, if you do column = data[column_name] rather than reassigning it to data, you don't have to modify as many lines.
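A minimal sketch of that suggestion, using a hypothetical noise function (make_uppercase is an invented name): binding the selected Series to column means the rest of the body can keep referring to column as it did before the refactor, while data still refers to the full DataFrame.

```python
import pandas as pd


def make_uppercase(data: pd.DataFrame, column_name: str) -> pd.Series:
    # Hypothetical noise function, for illustration only.
    # Only this line is new; code below can keep using `column` unchanged.
    column = data[column_name]
    return column.str.upper()


example = pd.DataFrame({"name": ["ann", "bob"], "age": [31, 42]})
result = make_uppercase(example, "name")
```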

rmudambi · 2023-06-13T00:02:42Z

src/pseudopeople/noise_functions.py

    """
-
-    return pd.Series(np.nan, index=column.index)
+    data = data[column_name]


This isn't necessary. You can just use data.index on a DataFrame.
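As a sketch of the point above: a DataFrame and any one of its columns share the same row index, so the all-NaN Series can be built from data.index directly. The function name below is hypothetical and the signature is assumed from the diff.

```python
import numpy as np
import pandas as pd


def blank_out_column(data: pd.DataFrame, column_name: str) -> pd.Series:
    # Hypothetical stand-in for the function in the diff.
    # No need to select data[column_name] first; the frame's index is the same.
    return pd.Series(np.nan, index=data.index)


example = pd.DataFrame({"name": ["ann", "bob"]}, index=[10, 20])
blanked = blank_out_column(example, "name")
```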

rmudambi · 2023-06-13T00:05:36Z

src/pseudopeople/schema_entities.py

@@ -1,5 +1,5 @@
 from dataclasses import dataclass, field
-from typing import Dict, NamedTuple, Optional, Tuple
+from typing import Dict, NamedTuple, Optional, Tuple, Union


Is this needed?

rmudambi · 2023-06-13T00:06:46Z

tests/unit/test_column_noise.py

-        "baz3",
+        "fo1",
+        "fo2",
+        "fo3",


Why were these changed?

albrja · 2023-06-13

Just to test another variation of string formatting for test numerics.

rmudambi · 2023-06-13T00:09:10Z

tests/unit/test_column_noise.py

+
+
+def np_isclose_wrapper(actual_noise, expected_noise, rtol):
+    return np.isclose(actual_noise, expected_noise, rtol)


Since you're always using the same failure message, you should define that in this function so you don't have to duplicate it each time you call this. I was envisioning this function looking like this:

def assert_is_close(actual_noise: float, expected_noise: float, rtol: float) -> None:
    assert np_isclose_wrapper(
        actual_noise, expected_noise, rtol
    ), f"Actual noise is {actual_noise} while expected noise was {expected_noise} with a rtol of {rtol}"
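For reference, here is a self-contained version of the suggested helper together with the wrapper from the diff. It assumes only NumPy; the third positional argument of np.isclose is rtol, so the wrapper passes it through as written.

```python
import numpy as np


def np_isclose_wrapper(actual_noise: float, expected_noise: float, rtol: float) -> bool:
    # Thin wrapper from the diff; rtol is the third positional argument of np.isclose.
    return bool(np.isclose(actual_noise, expected_noise, rtol))


def assert_is_close(actual_noise: float, expected_noise: float, rtol: float) -> None:
    # The failure message is defined once here instead of at every call site.
    assert np_isclose_wrapper(actual_noise, expected_noise, rtol), (
        f"Actual noise is {actual_noise} while expected noise was "
        f"{expected_noise} with a rtol of {rtol}"
    )
```

A test would then call assert_is_close(actual_noise, expected_noise, rtol) directly and get the shared message on failure.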

stevebachmeier · 2023-06-16T20:21:40Z

src/pseudopeople/noise_functions.py

    :returns: pd.Series where data has been noised with other values from a list of possibilities
    """

    selection_type = {
        "employer_state": "state",
        "mailing_address_state": "state",
-    }.get(str(column.name), column.name)
+    }.get(str(column_name), column_name)


nit: no need to str()
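A small sketch of why the str() call can go, assuming column_name is already a plain string once it is passed in as an argument (get_selection_type and SELECTION_OVERRIDES are illustrative names, not from the PR):

```python
# Hypothetical module-level mapping mirroring the dict literal in the diff.
SELECTION_OVERRIDES = {
    "employer_state": "state",
    "mailing_address_state": "state",
}


def get_selection_type(column_name: str) -> str:
    # dict.get falls back to the column name itself when no override exists.
    return SELECTION_OVERRIDES.get(column_name, column_name)
```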

stevebachmeier · 2023-06-16T20:24:36Z

src/pseudopeople/utilities.py

-    if isinstance(data, pd.Series):
-        not_empty_idx = data.index[(data != "") & (data.notna())]
+    if is_column_noise:
+        missing_idx = data.index[(data.isna().any(axis=1)) | (data.isin([""]).any(axis=1))]


Looks good, but out of curiosity, why did you switch to the opposite logic and then take the difference?

albrja · 2023-06-16

Getting the missing_idx helped me get this working while debugging. I was having trouble getting "and" and "all" to work for the original logic. I also thought I might have to extract this line of code into a helper function for a mock, like we did in one of the refactors, but that wasn't the case.
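To make the two approaches concrete, the sketch below computes the missing rows as in the diff, takes the index difference, and compares that with a direct "not empty" formulation. The sample data and the direct version are assumptions for illustration; only the missing_idx expression comes from the PR.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {
        "first": ["x", "", "z", None],
        "second": ["p", "q", np.nan, "s"],
    }
)

# Opposite logic from the diff: a row is missing if any cell is NaN or "".
missing_idx = data.index[(data.isna().any(axis=1)) | (data.isin([""]).any(axis=1))]
not_empty_idx = data.index.difference(missing_idx)

# Assumed direct equivalent: every cell must be non-NaN and non-empty.
direct_idx = data.index[(data.notna().all(axis=1)) & ((data != "").all(axis=1))]
```

Both forms select the same rows here; the any-based version may simply be easier to reason about when debugging.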

albrja added 2 commits June 8, 2023 16:59

Update to entity types, scaling functions, and noise functions to incorporate dfs

72744a5

Update pytest with wrapper for np.isclose

39b0f68

albrja requested review from hussain-jafari, mattkappel, ramittal, rmudambi and stevebachmeier as code owners June 10, 2023 00:16

albrja added 5 commits June 9, 2023 17:52

Fixing usage of dataframes in noise type

a7c557f

Update to tests

83400b6

Removing dict of copy cols

0cc9d5f

Linting

276d444

Updating docstring

e11ef34

rmudambi requested changes Jun 13, 2023

Addressing PR comments

28a1acd

rmudambi approved these changes Jun 16, 2023

stevebachmeier reviewed Jun 16, 2023

stevebachmeier approved these changes Jun 16, 2023

albrja merged commit b4ae5d8 into develop Jun 16, 2023

albrja deleted the copy-member-df branch June 16, 2023 21:13
