Implement output schema #51
Out of scope for this PR, but this almost makes incorrect_select_options.csv unnecessary.
@@ -389,6 +389,9 @@ def test_miswrite_numerics(string_series):
    p_row_noise = config.row_noise_level
    p_token_noise = config.token_noise_level
    data = string_series
    # Hack: we need to name the series something with the miswrite_numeric noising
    # function applied to check dtypes.
    data.name = "ssn"
Why not actually give our dummy_data appropriate column names when we define the fixture?
That's just an old fixture and didn't seem worth doing atm.
…/sbachmei/MIC-3961-subset-output-columns
Title: Implement output schema (columns and dtypes)
Description
The primary goal of this PR is to have the noising function output only the
specified columns, rather than every column required for noising. The output
columns should also be in the correct order.
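As a rough illustration of the column-subsetting goal (a minimal sketch, not the code in this PR), the idea can be expressed with pandas as below. The names `OUTPUT_COLUMNS` and `subset_output_columns`, and the example column names, are hypothetical.

```python
import pandas as pd

# Hypothetical output schema: names and order are illustrative only.
OUTPUT_COLUMNS = ["first_name", "last_name", "ssn"]


def subset_output_columns(noised: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return only the requested columns, in the requested order."""
    missing = [col for col in columns if col not in noised.columns]
    if missing:
        raise KeyError(f"Missing expected output columns: {missing}")
    return noised.loc[:, columns].copy()


# "age" stands in for a column that is needed for noising but not for output.
noised = pd.DataFrame(
    {
        "ssn": ["123-45-6789"],
        "age": [42],
        "last_name": ["Doe"],
        "first_name": ["Jane"],
    }
)
result = subset_output_columns(noised, OUTPUT_COLUMNS)
```

Selecting with an explicit list handles both the subsetting and the ordering in one step, and failing on missing columns surfaces schema problems immediately.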
A secondary goal is to start getting a handle on dtypes. This may no longer be
necessary, however, now that we have moved away from CSV files to HDF and
Parquet (both of which store dtype information).
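If dtype management is kept, one possible shape for it is a mapping from column name to dtype that is applied to the output. The sketch below assumes a hypothetical `OUTPUT_DTYPES` mapping and `coerce_output_dtypes` helper; neither is taken from this PR.

```python
import pandas as pd

# Hypothetical dtype schema; the columns and dtypes are illustrative only.
OUTPUT_DTYPES = {"age": "int64", "income": "float64"}


def coerce_output_dtypes(df: pd.DataFrame, dtypes: dict[str, str]) -> pd.DataFrame:
    """Cast each column listed in the schema (and present in df) to its dtype."""
    present = {col: dtype for col, dtype in dtypes.items() if col in df.columns}
    return df.astype(present)


# Values read from a CSV-like source often arrive as strings.
raw = pd.DataFrame({"age": ["42", "37"], "income": ["51000", "48250.5"]})
coerced = coerce_output_dtypes(raw, OUTPUT_DTYPES)
```

`DataFrame.astype` raises if a value cannot be converted, so a cast like this is also where a dtype problem would surface.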
NOTES:
Dtype checks can raise an error very late in noising, when in reality users likely won't care about dtypes.
Is it even worth managing dtypes anymore?
Testing
pytests pass.
TODO: possibly create a dtype pytest if it's decided we want to
keep that functionality.
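Should the dtype test be written, it might look something like the sketch below. The `EXPECTED_DTYPES` mapping, the `find_dtype_mismatches` helper, and the example data are all hypothetical stand-ins for whatever the real output schema and fixtures turn out to be.

```python
import pandas as pd

# Hypothetical expected schema; illustrative only.
EXPECTED_DTYPES = {"age": "int64", "income": "float64"}


def find_dtype_mismatches(df: pd.DataFrame, expected: dict[str, str]) -> dict:
    """Map each mismatched column to an (expected, actual) dtype pair."""
    actual = df.dtypes.astype(str).to_dict()
    return {
        col: (exp, actual.get(col))
        for col, exp in expected.items()
        if actual.get(col) != exp
    }


def test_output_dtypes():
    # Stand-in for the noised output; a real test would use a fixture.
    noised = pd.DataFrame({"age": [42, 37], "income": [51000.0, 48250.5]})
    assert find_dtype_mismatches(noised, EXPECTED_DTYPES) == {}
```

Returning the mismatches rather than a bare boolean makes a failing assertion show which columns are off.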