Implement output schema #51

Merged

stevebachmeier
Contributor

Title: Implement output schema (columns and dtypes)

Description

  • Category: feature
  • JIRA issue: MIC-3961

The primary goal of this PR is to have the noising function output only the
specified columns, rather than every column required for noising. It should also
output those columns in the correct order.
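
As a rough illustration of the goal above, the following is a minimal sketch using pandas, not the implementation in this PR. The column names, the `OUTPUT_COLUMNS` list, and the `subset_output_columns` helper are all hypothetical.

```python
import pandas as pd

# Hypothetical output schema: column names listed in the desired output order.
OUTPUT_COLUMNS = ["first_name", "last_name", "ssn"]


def subset_output_columns(noised: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return only the schema columns, in schema order (hypothetical helper)."""
    missing = [col for col in columns if col not in noised.columns]
    if missing:
        raise KeyError(f"Missing output columns: {missing}")
    # Selecting with a list both drops extra columns and fixes the order.
    return noised[columns]


noised = pd.DataFrame(
    {
        "ssn": ["123-45-6789"],
        "helper_id": [7],  # assumed to be needed for noising only, not for output
        "last_name": ["Doe"],
        "first_name": ["Jane"],
    }
)
result = subset_output_columns(noised, OUTPUT_COLUMNS)
```

Selecting with a list of names handles both the subsetting and the ordering in one step, which is why the sketch keeps the schema as an ordered list.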

A secondary goal is to start getting a handle on dtypes. This may no longer be
necessary, however, now that we have moved away from CSV files to HDF and
Parquet (both of which store dtype information).

NOTES:

  • The dtype enforcement happens prior to noising to prevent a runtime
    error very late in noising, when in reality users likely won't care about dtypes.
  • Related: noising basically converts everything to strings anyway.
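
The "fail early" idea in the notes above could look something like the sketch below. It assumes enforcement means coercing with `DataFrame.astype` before any noising runs. The `coerce_dtypes` helper and the example schema are hypothetical and not taken from the PR.

```python
import pandas as pd


def coerce_dtypes(data: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Coerce columns to the expected dtypes up front (hypothetical helper).

    Running this before noising surfaces dtype problems immediately instead
    of late in a long noising run.
    """
    try:
        return data.astype(dtypes)
    except (ValueError, TypeError) as err:
        raise ValueError(f"Cannot coerce input to the expected dtypes: {err}") from err


# Hypothetical input with dtypes that differ from the expected schema.
raw = pd.DataFrame({"age": ["34", "61"], "ssn": [123456789, 987654321]})
coerced = coerce_dtypes(raw, {"age": "int64", "ssn": str})
```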

Is it even worth managing dtypes anymore?

Testing

All pytest tests pass.
TODO: possibly add a dtype test if we decide to
keep that functionality.

src/pseudopeople/configuration.py (outdated)
Collaborator

Out of scope for this PR, but this almost makes incorrect_select_options.csv unnecessary.

src/pseudopeople/interface.py (outdated)
@stevebachmeier stevebachmeier requested a review from rmudambi April 11, 2023 22:57
src/pseudopeople/noise_functions.py (outdated)
tests/integration/test_interface.py (outdated)
@@ -389,6 +389,9 @@ def test_miswrite_numerics(string_series):
p_row_noise = config.row_noise_level
p_token_noise = config.token_noise_level
data = string_series
# Hack: we need to name the series something with the miswrite_numeric noising
# function applied to check dtypes.
data.name = "ssn"
Collaborator

Why not actually give our dummy_data appropriate column names when we define the fixture?

Contributor Author

That's just an old fixture and didn't seem worth doing atm.
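
For reference, the suggestion in this thread might look roughly like the sketch below, where the series is named when the test data is built rather than inside the test. The `make_string_series` factory, its values, and its default name are hypothetical; in the test suite it could be wrapped with `@pytest.fixture`.

```python
import pandas as pd


def make_string_series(name: str = "ssn") -> pd.Series:
    """Build dummy string data with a schema column name already set.

    Hypothetical factory: naming the series here means individual tests
    no longer need to assign `data.name` themselves.
    """
    return pd.Series(["123-45-6789", "987-65-4321"], name=name)


series = make_string_series()
```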

@stevebachmeier stevebachmeier merged commit 1d0b797 into develop Apr 12, 2023
@stevebachmeier stevebachmeier deleted the feature/sbachmei/MIC-3961-subset-output-columns branch April 12, 2023 22:33
4 participants