Implement output schema #51

stevebachmeier · 2023-04-11T22:16:49Z

Title: Implement output schema (columns and dtypes)

Description

Category: feature
JIRA issue: MIC-3961

The primary goal of this PR is to have the noising function output a specific
set of columns rather than all columns required for noising. It should also
output those columns in the correct order.
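
As a rough illustration of the column subsetting described above, here is a minimal pandas sketch. The `OUTPUT_COLUMNS` list, the `select_output_columns` helper, and the column names are hypothetical and are not taken from the pseudopeople code base; the actual implementation in this PR may differ.

```python
from typing import Sequence

import pandas as pd

# Hypothetical ordered output schema (illustrative names only).
OUTPUT_COLUMNS = ["first_name", "last_name", "ssn"]


def select_output_columns(data: pd.DataFrame, columns: Sequence[str]) -> pd.DataFrame:
    """Return only the requested columns, in the requested order."""
    missing = [col for col in columns if col not in data.columns]
    if missing:
        raise KeyError(f"Missing output columns: {missing}")
    return data[list(columns)]


# The noised data carries an extra column that was only needed for noising.
noised = pd.DataFrame(
    {
        "ssn": ["123-45-6789"],
        "household_id": [7],
        "last_name": ["Doe"],
        "first_name": ["Jane"],
    }
)
result = select_output_columns(noised, OUTPUT_COLUMNS)
```

Indexing with an explicit list both drops the helper columns and fixes the column order in one step.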

A secondary goal is to start getting a handle on dtypes. This may no longer be
necessary, however, now that we have moved away from CSV files to HDF and
Parquet (both of which store dtype data).

NOTES:

The dtype enforcement happens prior to noising to prevent a runtime
error very late in noising, when in reality users likely won't care about dtypes.
Related: noising basically converts everything to strings anyway.

Is it even worth managing dtypes anymore?
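
To make the "enforce dtypes before noising" idea from the notes concrete, here is a minimal sketch. The `COLUMN_DTYPES` mapping and the `coerce_dtypes` helper are assumptions made for illustration; they are not the names used in this PR.

```python
import pandas as pd

# Hypothetical mapping from column name to expected dtype (illustrative only).
COLUMN_DTYPES = {"age": "int64", "ssn": "object"}


def coerce_dtypes(data: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Cast known columns up front so dtype errors surface before noising."""
    # Only cast columns that are actually present in the data.
    present = {col: dtype for col, dtype in dtypes.items() if col in data.columns}
    return data.astype(present)


raw = pd.DataFrame({"age": ["34", "61"], "ssn": ["123-45-6789", "987-65-4321"]})
coerced = coerce_dtypes(raw, COLUMN_DTYPES)
```

Casting first means a bad value fails immediately instead of after the expensive noising work has already run.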

Testing

pytests pass.
TODO: possibly create a dtype pytest if it's decided we want to
keep that functionality.
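
If the dtype test is kept, it might look roughly like the sketch below. `EXPECTED_DTYPES`, `check_dtype`, and `test_ssn_dtype` are hypothetical names; the lookup by series name mirrors the "base dtype checking on name" approach mentioned in the commit list, but the details here are assumptions.

```python
import pandas as pd

# Hypothetical expected dtypes keyed by column (series) name.
EXPECTED_DTYPES = {"ssn": "object"}


def check_dtype(series: pd.Series) -> None:
    """Assert that a series has the dtype expected for its name."""
    expected = EXPECTED_DTYPES[series.name]
    assert series.dtype == expected, (
        f"{series.name}: expected {expected}, got {series.dtype}"
    )


def test_ssn_dtype():
    # The series must be named so the expected dtype can be looked up.
    data = pd.Series(["123-45-6789"], name="ssn", dtype="object")
    check_dtype(data)
```

pytest would collect `test_ssn_dtype` automatically; the function can also be called directly.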

src/pseudopeople/interface.py

src/pseudopeople/configuration.py

rmudambi · 2023-04-11T22:24:17Z

src/pseudopeople/constants/metadata.py

Out of scope for this PR, but this almost makes incorrect_select_options.csv unnecessary.

src/pseudopeople/interface.py

tests/integration/test_interface.py

src/pseudopeople/noise_functions.py

tests/integration/test_interface.py

rmudambi · 2023-04-12T17:35:32Z

tests/unit/test_column_noise.py

@@ -389,6 +389,9 @@ def test_miswrite_numerics(string_series):
    p_row_noise = config.row_noise_level
    p_token_noise = config.token_noise_level
    data = string_series
+    # Hack: we need to name the series something with the miswrite_numeric noising
+    # function applied to check dtypes.
+    data.name = "ssn"


Why not actually give our dummy_data appropriate column names when we define the fixture?

That's just an old fixture and didn't seem worth doing atm.

Implement output schema

2d06e4a

stevebachmeier requested review from albrja, hussain-jafari, mattkappel, ramittal and rmudambi as code owners April 11, 2023 22:16

isort/black

30d0455

mattkappel reviewed Apr 11, 2023

src/pseudopeople/interface.py

mattkappel approved these changes Apr 11, 2023

rmudambi reviewed Apr 11, 2023

pr requests

484c2b8

stevebachmeier requested a review from rmudambi April 11, 2023 22:57

stevebachmeier commented Apr 12, 2023

tests/integration/test_interface.py (outdated)

albrja approved these changes Apr 12, 2023

better enforcement of dtypes; test dtypes

711a734

stevebachmeier requested review from mattkappel and albrja April 12, 2023 01:44

resolve merge conflicts

b8de02f

rmudambi approved these changes Apr 12, 2023

stevebachmeier added 3 commits April 12, 2023 15:24

refactor dtype coercion; base dtype checking on name and remove metadata.py

518c163

Merge branch 'develop' of github.com:ihmeuw/pseudopeople into feature/sbachmei/MIC-3961-subset-output-columns

a87f734

Merge branch 'develop' of github.com:ihmeuw/pseudopeople into feature/sbachmei/MIC-3961-subset-output-columns

1df0b53

stevebachmeier merged commit 1d0b797 into develop Apr 12, 2023

stevebachmeier deleted the feature/sbachmei/MIC-3961-subset-output-columns branch April 12, 2023 22:33
