
Refactor to add logic for 1040 in interface #217

Merged: 8 commits into develop from mic-4217/refactor-interface-1040 on Jul 12, 2023

Conversation

@albrja (Contributor) commented Jul 7, 2023

Refactor to add logic for 1040 in interface

Adds logic to _generate_dataset in interface.py for the special 1040 case.

  • Category: Refactor
  • JIRA issue: MIC-4217

  • Adds logic in _generate_dataset for the special 1040 use case

Testing

Tested the refactor by setting a breakpoint and confirming the 1040 special use case is reached. All test suites pass.

if isinstance(data_paths, dict):
    suffix = set(x.suffix for item in list(data_paths.values()) for x in item)
else:
    suffix = set(x.suffix for x in data_paths)
Contributor Author (@albrja)

Is there a cleaner way to do this?

Collaborator

Could be cleaner to extract everything from line 55 to line 63 into a method called validate_data_path_suffix()
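
For reference, a minimal sketch of what that extraction could look like; the accepted suffixes and the error raised here are assumptions for illustration, not the project's actual validation logic:

```python
from pathlib import Path
from typing import Dict, List, Union


def validate_data_path_suffix(data_paths: Union[List[Path], Dict[str, List[Path]]]) -> None:
    """Sketch only: raise if the provided files do not share a single supported suffix."""
    if isinstance(data_paths, dict):
        suffixes = {path.suffix for paths in data_paths.values() for path in paths}
    else:
        suffixes = {path.suffix for path in data_paths}
    valid_suffixes = {".parquet", ".hdf"}  # assumed set of supported file types
    if len(suffixes) != 1 or not suffixes <= valid_suffixes:
        raise ValueError(f"Expected a single file type from {valid_suffixes}, found {suffixes}")
```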

# row_noise_types=(
#     NOISE_TYPES.omit_row,
#     # NOISE_TYPES.duplication,
# ),
Contributor Author (@albrja)

This was added for testing but remains commented out since it is not implemented yet.

        continue
    data = _reformat_dates_for_noising(data, dataset)
    data = _coerce_dtypes(data, dataset)
    if dataset.name == DatasetNames.TAXES_1040:
Collaborator

I think it's cleaner to move the logic that differentiates between the 1040 and the other datasets into the method _load_data_from_path.

Contributor Author (@albrja)

I'm not 100% sure I understand what I should do here. If it's the 1040, it calls a different function, whereas all other datasets call _load_data_from_path. Should I do this check inside _load_data_from_path and then call the separate methods from there? Conceptually that would mean _load_data_from_path loads and preps the data for noise_dataset, which I think is pretty nice, but it would be a bit confusing for the additional methods to be called in there rather than in this main runner loop.
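
One way the reviewer's suggestion could look, as a hedged sketch; the helper names load_and_prep_1040_data and load_standard_dataset_file come from this conversation, but the signature and body shown are placeholders rather than the merged implementation:

```python
from pathlib import Path
from typing import List, Tuple, Union

import pandas as pd


def _load_data_from_path(data_path: Union[Path, dict], user_filters: List[Tuple]) -> pd.DataFrame:
    """Sketch: dispatch on the 1040 case here so the main runner loop stays generic."""
    if isinstance(data_path, dict):
        # 1040 shard: a dict mapping each tax dataset name to its filepath.
        # load_and_prep_1040_data is the helper discussed in this PR, defined elsewhere in interface.py.
        return load_and_prep_1040_data(data_path, user_filters)
    # Every other dataset: a single filepath loaded by the standard helper.
    return load_standard_dataset_file(data_path, user_filters)
```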

@@ -102,22 +91,12 @@ def _coerce_dtypes(data: pd.DataFrame, dataset: Dataset):
     return data


-def _load_data_from_path(data_path: Path, user_filters: List[Tuple]):
+def _load_data_from_path(data_path: Union[Path, dict], user_filters: List[Tuple]):
Collaborator

Need to specify the return type as pd.DataFrame.

@@ -407,23 +386,61 @@ def generate_social_security(


def fetch_filepaths(dataset, source):
    # todo: add typing
Collaborator

Remove todo and add typing

data_paths[tax_dataset] = dataset_paths
sorted_dataset_paths = sorted(dataset_paths)
if tax_dataset == DatasetNames.TAXES_1040:
    data_paths = [{} for i in range(len(sorted_dataset_paths))]
Collaborator

If you define this data_paths object outside the loop, you don't need the if statement.
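
A rough, runnable illustration of that suggestion; the input names (tax_datasets, paths_by_dataset) and filenames are placeholders for code not shown in the quoted hunk:

```python
from pathlib import Path

# Placeholder inputs; in interface.py these come from the dataset/source arguments.
tax_datasets = ["tax_w2", "tax_dependents", "tax_1040"]
paths_by_dataset = {
    name: [Path(f"{name}_1.parquet"), Path(f"{name}_2.parquet")] for name in tax_datasets
}

# The reviewer's point: create the per-shard dicts once, outside the loop over tax
# datasets, so no `if tax_dataset == DatasetNames.TAXES_1040` guard is needed inside it.
num_shards = len(sorted(paths_by_dataset[tax_datasets[0]]))
data_paths = [{} for _ in range(num_shards)]

for tax_dataset in tax_datasets:
    for shard_dict, path in zip(data_paths, sorted(paths_by_dataset[tax_dataset])):
        shard_dict[tax_dataset] = path
```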

sorted_dataset_paths = sorted(dataset_paths)
if tax_dataset == DatasetNames.TAXES_1040:
    data_paths = [{} for i in range(len(sorted_dataset_paths))]
    for i in range(len(sorted_dataset_paths)):
Collaborator

Just iterate over data_paths rather than a range you create from it.
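
For illustration, the index-based loop rewritten to iterate directly; the loop body shown is assumed, since the quoted hunk cuts off after the for statement:

```python
# Names (data_paths, sorted_dataset_paths, tax_dataset) are taken from the quoted hunk.

# Before (index-based):
for i in range(len(sorted_dataset_paths)):
    data_paths[i][tax_dataset] = sorted_dataset_paths[i]

# After (iterate over the sequences directly, per the reviewer's comment):
for shard_dict, path in zip(data_paths, sorted_dataset_paths):
    shard_dict[tax_dataset] = path
```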

    return None


def load_file(data_path: Path, user_filters: List[Tuple]) -> pd.DataFrame:
Collaborator

Maybe rename to load_standard_dataset?

    return pd.DataFrame()


def validate_data_path_suffix(data_paths):
Contributor

nit: I'd prefer the name to be something like validate_data_type

    return pd.DataFrame()


def validate_data_path_suffix(data_paths):
Contributor

The return type annotation should be -> None.

user_filters = None
data = pq.read_table(data_path, filters=user_filters).to_pandas()
if isinstance(data_path, dict):
    data = load_and_prep_1040_data(data_path, user_filters)
Contributor

nit: This is a 1040-specific method name, but the if logic only checks that data_path is a dict. Is there an easy way to clear that up a bit?

Contributor Author (@albrja)

This is the main split between the 1040 and the other datasets: here data_path will either be a path or a dict. load_and_prep_1040_data calls load_standard_dataset_file for each filepath in the dict, but the return value of this function needs to be a DataFrame, so this is where we need to format the 1040.
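
A hedged sketch of that flow; how the W-2/1099 and dependents data actually get formatted onto the 1040 rows is project-specific, so a placeholder concat stands in for the real formatting step:

```python
from pathlib import Path
from typing import Dict, List, Tuple

import pandas as pd


def load_and_prep_1040_data(data_path: Dict[str, Path], user_filters: List[Tuple]) -> pd.DataFrame:
    """Sketch: load each tax dataset in the shard, then combine them into one 1040-shaped frame."""
    tax_frames = {
        name: load_standard_dataset_file(path, user_filters)  # helper named in this PR
        for name, path in data_path.items()
    }
    # Placeholder for the real 1040 formatting/merging logic.
    return pd.concat(tax_frames.values(), ignore_index=True)
```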

-def fetch_filepaths(dataset, source):
+def fetch_filepaths(dataset: Dataset, source: Path) -> Union[List, List[dict]]:
+    # returns a list of filepaths for all Datasets except 1040.
+    # 1040 returns a list of dicts where each dict is a shard containing a key for each tax dataset
Contributor

What do you mean by each dict is a "shard"?

Contributor Author (@albrja)

As in the shards for our full-size run, one of the 334 shards. So the list will be [{tax_1040: "file_1", tax_w2: "file_1", tax_dependents: "file_1"}, {tax_1040: "file_2", tax_w2: "file_2", ...}].
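
Roughly, the 1040 return value of fetch_filepaths would then look like this (key names and filenames are illustrative only):

```python
from pathlib import Path

# One dict per shard, keyed by tax dataset; 334 such dicts in the full-size run.
data_paths = [
    {"tax_1040": Path("taxes_1040_1.parquet"), "tax_w2": Path("w2_1.parquet"), "tax_dependents": Path("dependents_1.parquet")},
    {"tax_1040": Path("taxes_1040_2.parquet"), "tax_w2": Path("w2_2.parquet"), "tax_dependents": Path("dependents_2.parquet")},
    # ... one entry per shard
]
```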

@albrja force-pushed the mic-4217/refactor-interface-1040 branch from eeba83c to f8630dd on July 12, 2023 18:07
@albrja force-pushed the mic-4217/refactor-interface-1040 branch from f8630dd to 9f5ed67 on July 12, 2023 18:22
@albrja merged commit 76c1f6a into develop on Jul 12, 2023
@albrja deleted the mic-4217/refactor-interface-1040 branch on July 12, 2023 18:33