Refactor to add logic for 1040 in interface #217
Conversation
src/pseudopeople/interface.py
if isinstance(data_paths, dict):
    suffix = set(x.suffix for item in list(data_paths.values()) for x in item)
else:
    suffix = set(x.suffix for x in data_paths)
Is there a cleaner way to do this?
Could be cleaner to extract everything from line 55 to line 63 into a method called validate_data_path_suffix()
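The suggested extraction might look like the sketch below. The full lines 55–63 aren't shown in this thread, so the `.parquet`-only check and the error message are assumptions:

```python
from pathlib import Path
from typing import Dict, List, Union


def validate_data_path_suffix(
    data_paths: Union[List[Path], Dict[str, List[Path]]]
) -> None:
    """Raise if the data files do not all share one supported suffix."""
    if isinstance(data_paths, dict):
        suffixes = {x.suffix for item in data_paths.values() for x in item}
    else:
        suffixes = {x.suffix for x in data_paths}
    # Assumption: parquet is the only supported format in this sketch.
    if suffixes != {".parquet"}:
        raise ValueError(
            f"Expected '.parquet' files but found suffixes: {sorted(suffixes)}"
        )
```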
# row_noise_types=(
#     NOISE_TYPES.omit_row,
#     # NOISE_TYPES.duplication,
# ),
This was added for testing but remains commented out since it is not implemented yet.
src/pseudopeople/interface.py
    continue
data = _reformat_dates_for_noising(data, dataset)
data = _coerce_dtypes(data, dataset)
if dataset.name == DatasetNames.TAXES_1040:
I think it's cleaner to move the logic that differentiates the 1040 from the other datasets inside the _load_data_from_path method.
I'm not 100% sure I understand what I should do here. If it's the 1040, it calls a different function, whereas all other datasets call _load_data_from_path. Should I do this check inside _load_data_from_path and call the separate methods from there? Conceptually that would mean _load_data_from_path loads and preps the data for noise_dataset, which I think is pretty nice, but it would be a bit confusing for the additional methods to be called in there rather than in this main runner loop.
src/pseudopeople/interface.py
@@ -102,22 +91,12 @@ def _coerce_dtypes(data: pd.DataFrame, dataset: Dataset):
     return data


-def _load_data_from_path(data_path: Path, user_filters: List[Tuple]):
+def _load_data_from_path(data_path: Union[Path, dict], user_filters: List[Tuple]):
Need to specify the return type as pd.DataFrame.
src/pseudopeople/interface.py
@@ -407,23 +386,61 @@ def generate_social_security(


+def fetch_filepaths(dataset, source):
+    # todo: add typing
Remove the todo and add the typing.
src/pseudopeople/interface.py
data_paths[tax_dataset] = dataset_paths
sorted_dataset_paths = sorted(dataset_paths)
if tax_dataset == DatasetNames.TAXES_1040:
    data_paths = [{} for i in range(len(sorted_dataset_paths))]
If you define this data_paths object outside the loop, you don't need the if statement.
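A hypothetical reconstruction of that suggestion; the dataset names, shard counts, and file names here are made up for illustration:

```python
from pathlib import Path

# Illustrative input: per-tax-dataset shard files (names are made up).
tax_dataset_paths = {
    "tax_1040": [Path("1040_2.parquet"), Path("1040_1.parquet")],
    "tax_w2": [Path("w2_1.parquet"), Path("w2_2.parquet")],
}

# Create the per-shard dicts once, before the loop, so no dataset-specific
# `if` guard is needed inside it.
n_shards = len(next(iter(tax_dataset_paths.values())))
data_paths = [{} for _ in range(n_shards)]
for tax_dataset, dataset_paths in tax_dataset_paths.items():
    for i, path in enumerate(sorted(dataset_paths)):
        data_paths[i][tax_dataset] = path
```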
src/pseudopeople/interface.py
sorted_dataset_paths = sorted(dataset_paths)
if tax_dataset == DatasetNames.TAXES_1040:
    data_paths = [{} for i in range(len(sorted_dataset_paths))]
    for i in range(len(sorted_dataset_paths)):
Just iterate over data_paths rather than a range you create from it.
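For instance, zipping the prebuilt shard dicts with their paths removes the index bookkeeping entirely (toy paths, and a single assumed tax dataset, for illustration):

```python
from pathlib import Path

sorted_dataset_paths = sorted([Path("shard_2.parquet"), Path("shard_1.parquet")])
data_paths = [{} for _ in sorted_dataset_paths]  # one dict per shard

# Pair each shard dict with its path directly instead of indexing by range.
for shard, path in zip(data_paths, sorted_dataset_paths):
    shard["tax_w2"] = path
```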
src/pseudopeople/interface.py
    return None


def load_file(data_path: Path, user_filters: List[Tuple]) -> pd.DataFrame:
Maybe rename to load_standard_dataset?
src/pseudopeople/interface.py
    return pd.DataFrame()


def validate_data_path_suffix(data_paths):
nit: I'd prefer the name to be something like validate_data_type
src/pseudopeople/interface.py
    return pd.DataFrame()


def validate_data_path_suffix(data_paths):
Add the return type annotation: -> None.
    user_filters = None
data = pq.read_table(data_path, filters=user_filters).to_pandas()
if isinstance(data_path, dict):
    data = load_and_prep_1040_data(data_path, user_filters)
nit: This is a 1040-specific method name, but the if logic only checks that data_path is a dict. Is there an easy way to clear that up a bit?
This is the main split between the 1040 and the other datasets. Here data_path will be either a path or a dict. load_and_prep_1040_data is going to call load_standard_dataset_file for each filepath in the dict, but the return of this function needs to be a dataframe, and this is where we need to format the 1040.
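That flow could be sketched as below. To keep the sketch self-contained, pandas frames stand in for the files that load_standard_dataset_file would actually read, and the simulant_id join key is an assumption:

```python
from typing import Dict, List, Tuple

import pandas as pd


def load_standard_dataset_file(
    frame: pd.DataFrame, user_filters: List[Tuple]
) -> pd.DataFrame:
    # Stand-in: the real helper reads a parquet file from a path.
    return frame


def load_and_prep_1040_data(
    data_path: Dict[str, pd.DataFrame], user_filters: List[Tuple]
) -> pd.DataFrame:
    # Load every tax dataset in the shard, then join them into the single
    # wide frame the noising step expects. "simulant_id" is an assumed key.
    frames = {
        name: load_standard_dataset_file(f, user_filters)
        for name, f in data_path.items()
    }
    combined = frames["tax_1040"]
    for name, frame in frames.items():
        if name != "tax_1040":
            combined = combined.merge(frame, on="simulant_id", how="left")
    return combined
```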
-def fetch_filepaths(dataset, source):
+def fetch_filepaths(dataset: Dataset, source: Path) -> Union[List, List[dict]]:
+    # returns a list of filepaths for all Datasets except 1040.
+    # 1040 returns a list of dicts where each dict is a shard containing a key for each tax dataset
What do you mean by each dict is a "shard"?
As in the shards for our full-size run - one of the 334 shards. So the list will be [{tax_1040: "file_1", tax_w2: "file_1", tax_dependents: "file_1"}, {tax_1040: "file_2", tax_w2: "file_2", ...}]
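The shape described could be built by a helper like this sketch; the function name, dataset names, and directory layout are illustrative, not pseudopeople's real API:

```python
from pathlib import Path
from typing import Dict, List


def fetch_1040_filepaths(source: Path, n_shards: int) -> List[Dict[str, Path]]:
    # One dict per shard, mapping every tax dataset name to that shard's
    # file. Names and layout are assumptions for this sketch.
    tax_datasets = ["tax_1040", "tax_w2", "tax_dependents"]
    return [
        {name: source / name / f"file_{i}.parquet" for name in tax_datasets}
        for i in range(1, n_shards + 1)
    ]
```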
Force-pushed from eeba83c to f8630dd
Force-pushed from f8630dd to 9f5ed67
Refactor to add logic for 1040 in interface
Adds logic to _generate_dataset in interface.py for the special 1040 case.
Testing
Tested the refactor by setting a breakpoint and reaching the 1040 special case. All test suites pass.