
Refactor to add logic for 1040 in interface #217

Merged: 8 commits into develop from mic-4217/refactor-interface-1040 on Jul 12, 2023

Conversation

@albrja (Contributor) commented Jul 7, 2023

Refactor to add logic for 1040 in interface

Adds logic to _generate_dataset in interface.py for the special 1040 case.

  • Category: Refactor
  • JIRA issue: MIC-4217

  • Adds logic in _generate_dataset for the special 1040 use case

Testing

Tested the refactor by setting a breakpoint and confirming the 1040 special use case is reached. All test suites pass.

if isinstance(data_paths, dict):
    suffix = set(x.suffix for item in list(data_paths.values()) for x in item)
else:
    suffix = set(x.suffix for x in data_paths)
Contributor Author (@albrja)

Is there a cleaner way to do this?

Collaborator

Could be cleaner to extract everything from line 55 to line 63 into a method called validate_data_path_suffix()
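
For reference, a minimal sketch of what that extraction could look like; the accepted suffixes and the error raised here are assumptions for illustration, not the project's actual validation logic:

```python
from pathlib import Path
from typing import Dict, List, Union


def validate_data_path_suffix(data_paths: Union[List[Path], Dict[str, List[Path]]]) -> None:
    """Sketch only: raise if the provided files do not share a single supported suffix."""
    if isinstance(data_paths, dict):
        suffixes = {path.suffix for paths in data_paths.values() for path in paths}
    else:
        suffixes = {path.suffix for path in data_paths}
    valid_suffixes = {".parquet", ".hdf"}  # assumed set of supported file types
    if len(suffixes) != 1 or not suffixes <= valid_suffixes:
        raise ValueError(f"Expected a single file type from {valid_suffixes}, found {suffixes}")
```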

# row_noise_types=(
#     NOISE_TYPES.omit_row,
#     # NOISE_TYPES.duplication,
# ),
Contributor Author (@albrja)

This was added for testing but remains commented out since it is not implemented yet.

        continue
    data = _reformat_dates_for_noising(data, dataset)
    data = _coerce_dtypes(data, dataset)
    if dataset.name == DatasetNames.TAXES_1040:
Collaborator

I think it's cleaner to move the logic that differentiates between the 1040 and the other datasets into the method _load_data_from_path.

Contributor Author (@albrja)

I'm not 100% sure I understand what I should do here. If it's the 1040, it calls a different function, whereas all other datasets call _load_data_from_path. Should I do this check inside _load_data_from_path and then call the separate methods from there? Conceptually that would mean _load_data_from_path loads and preps the data for noise_dataset, which I think is pretty nice, but it would be a bit confusing for the additional methods to be called in there rather than in this main runner loop.
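
One way the reviewer's suggestion could look, as a hedged sketch; the helper names load_and_prep_1040_data and load_standard_dataset_file come from this conversation, but the signature and body shown are placeholders rather than the merged implementation:

```python
from pathlib import Path
from typing import List, Tuple, Union

import pandas as pd


def _load_data_from_path(data_path: Union[Path, dict], user_filters: List[Tuple]) -> pd.DataFrame:
    """Sketch: dispatch on the 1040 case here so the main runner loop stays generic."""
    if isinstance(data_path, dict):
        # 1040 shard: a dict mapping each tax dataset name to its filepath.
        # load_and_prep_1040_data is the helper discussed in this PR, defined elsewhere in interface.py.
        return load_and_prep_1040_data(data_path, user_filters)
    # Every other dataset: a single filepath loaded by the standard helper.
    return load_standard_dataset_file(data_path, user_filters)
```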

@@ -102,22 +91,12 @@ def _coerce_dtypes(data: pd.DataFrame, dataset: Dataset):
     return data


-def _load_data_from_path(data_path: Path, user_filters: List[Tuple]):
+def _load_data_from_path(data_path: Union[Path, dict], user_filters: List[Tuple]):
Collaborator

Need to specify the return type as pd.DataFrame.

@@ -407,23 +386,61 @@ def generate_social_security(


def fetch_filepaths(dataset, source):
    # todo: add typing
Collaborator

Remove todo and add typing

data_paths[tax_dataset] = dataset_paths
sorted_dataset_paths = sorted(dataset_paths)
if tax_dataset == DatasetNames.TAXES_1040:
    data_paths = [{} for i in range(len(sorted_dataset_paths))]
Collaborator

If you define this data_paths object outside the loop, you don't need the if statement.
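
A rough, runnable illustration of that suggestion; the input names (tax_datasets, paths_by_dataset) and filenames are placeholders for code not shown in the quoted hunk:

```python
from pathlib import Path

# Placeholder inputs; in interface.py these come from the dataset/source arguments.
tax_datasets = ["tax_w2", "tax_dependents", "tax_1040"]
paths_by_dataset = {
    name: [Path(f"{name}_1.parquet"), Path(f"{name}_2.parquet")] for name in tax_datasets
}

# The reviewer's point: create the per-shard dicts once, outside the loop over tax
# datasets, so no `if tax_dataset == DatasetNames.TAXES_1040` guard is needed inside it.
num_shards = len(sorted(paths_by_dataset[tax_datasets[0]]))
data_paths = [{} for _ in range(num_shards)]

for tax_dataset in tax_datasets:
    for shard_dict, path in zip(data_paths, sorted(paths_by_dataset[tax_dataset])):
        shard_dict[tax_dataset] = path
```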

sorted_dataset_paths = sorted(dataset_paths)
if tax_dataset == DatasetNames.TAXES_1040:
    data_paths = [{} for i in range(len(sorted_dataset_paths))]
    for i in range(len(sorted_dataset_paths)):
Collaborator

Just iterate over data_paths rather than a range you create from it.
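
For illustration, the index-based loop rewritten to iterate directly; the loop body shown is assumed, since the quoted hunk cuts off after the for statement:

```python
# Names (data_paths, sorted_dataset_paths, tax_dataset) are taken from the quoted hunk.

# Before (index-based):
for i in range(len(sorted_dataset_paths)):
    data_paths[i][tax_dataset] = sorted_dataset_paths[i]

# After (iterate over the sequences directly, per the reviewer's comment):
for shard_dict, path in zip(data_paths, sorted_dataset_paths):
    shard_dict[tax_dataset] = path
```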

    return None


def load_file(data_path: Path, user_filters: List[Tuple]) -> pd.DataFrame:
Collaborator

Maybe rename to load_standard_dataset?

    return pd.DataFrame()


def validate_data_path_suffix(data_paths):
Contributor

nit: I'd prefer the name to be something like validate_data_type

    return pd.DataFrame()


def validate_data_path_suffix(data_paths):
Contributor

The return type annotation should be -> None.

user_filters = None
data = pq.read_table(data_path, filters=user_filters).to_pandas()
if isinstance(data_path, dict):
    data = load_and_prep_1040_data(data_path, user_filters)
Contributor

nit: This is a 1040-specific method name, but the if logic only checks that data_path is a dict. Is there an easy way to clear that up a bit?

Contributor Author (@albrja)

This is the main split between the 1040 and the other datasets: here data_path will either be a path or a dict. load_and_prep_1040_data calls load_standard_dataset_file for each filepath in the dict, but the return value of this function needs to be a DataFrame, so this is where we need to format the 1040.
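
A hedged sketch of that flow; how the W-2/1099 and dependents data actually get formatted onto the 1040 rows is project-specific, so a placeholder concat stands in for the real formatting step:

```python
from pathlib import Path
from typing import Dict, List, Tuple

import pandas as pd


def load_and_prep_1040_data(data_path: Dict[str, Path], user_filters: List[Tuple]) -> pd.DataFrame:
    """Sketch: load each tax dataset in the shard, then combine them into one 1040-shaped frame."""
    tax_frames = {
        name: load_standard_dataset_file(path, user_filters)  # helper named in this PR
        for name, path in data_path.items()
    }
    # Placeholder for the real 1040 formatting/merging logic.
    return pd.concat(tax_frames.values(), ignore_index=True)
```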

-def fetch_filepaths(dataset, source):
+def fetch_filepaths(dataset: Dataset, source: Path) -> Union[List, List[dict]]:
+    # returns a list of filepaths for all Datasets except 1040.
+    # 1040 returns a list of dicts where each dict is a shard containing a key for each tax dataset
Contributor

What do you mean by each dict is a "shard"?

Contributor Author (@albrja)

As in the shards for our full-size run, one of the 334 shards. So the list will be [{tax_1040: "file_1", tax_w2: "file_1", tax_dependents: "file_1"}, {tax_1040: "file_2", tax_w2: "file_2", ...}].
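
Roughly, the 1040 return value of fetch_filepaths would then look like this (key names and filenames are illustrative only):

```python
from pathlib import Path

# One dict per shard, keyed by tax dataset; 334 such dicts in the full-size run.
data_paths = [
    {"tax_1040": Path("taxes_1040_1.parquet"), "tax_w2": Path("w2_1.parquet"), "tax_dependents": Path("dependents_1.parquet")},
    {"tax_1040": Path("taxes_1040_2.parquet"), "tax_w2": Path("w2_2.parquet"), "tax_dependents": Path("dependents_2.parquet")},
    # ... one entry per shard
]
```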

@albrja force-pushed the mic-4217/refactor-interface-1040 branch from eeba83c to f8630dd on July 12, 2023 18:07
@albrja force-pushed the mic-4217/refactor-interface-1040 branch from f8630dd to 9f5ed67 on July 12, 2023 18:22
@albrja merged commit 76c1f6a into develop on Jul 12, 2023
@albrja deleted the mic-4217/refactor-interface-1040 branch on July 12, 2023 18:33