Fix to only transform raw data when requested. #268
When main is invoked with read_raw_data_for_training set to False, common.transform_data was still being called on the raw train and test data. This fix moves the transformation into the block where read_raw_data_for_training is True. The scenario is that the data has already been preprocessed and the user wishes to reuse that preprocessed data.
Thanks for the PR! It's true that transforming the raw data, serializing and writing the results is unnecessary when transform/examples/census_example_common.py Lines 210 to 258 in 2ac89ab
We can even always call
The behavior of the default usage won't change, as
so To address your questions, yes, this would be the second invocation of
I am pretty new to GitHub and distributed development, so apologies if I'm not structuring my questions or suggestions properly. Thanks!
Scenario
When preprocessed data is already available and the user wants to reuse it by passing read_raw_data_for_training=False to main, the flow still called common.transform_data on the raw data. This caused WriteTransformFn to fail because its output artifacts already existed, and it also recomputed statistics and other intermediate results unnecessarily.
Fix
This fix moves the common.transform_data invocation into the branch where the raw data is processed for the first time.
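The control flow described above can be sketched roughly as follows. This is a minimal illustration of the behavior as stated in this PR description, not the actual code in census_example_common.py: transform_data and train_and_evaluate here are hypothetical stand-ins with made-up signatures, and the "refuse to overwrite existing artifacts" check is only an assumption used to mimic the WriteTransformFn failure.

```python
import os
import tempfile

calls = []  # records which stand-in steps ran (illustration only)


def transform_data(train_file, test_file, working_dir):
    """Hypothetical stand-in for common.transform_data.

    Assumption: like the failure described above, it refuses to write
    into a location that already holds artifacts from an earlier run.
    """
    artifact_dir = os.path.join(working_dir, "transform_fn")
    if os.path.exists(artifact_dir):
        raise FileExistsError(artifact_dir)
    os.makedirs(artifact_dir)
    calls.append("transform_data")


def train_and_evaluate(working_dir):
    """Hypothetical stand-in for the training step."""
    calls.append("train_and_evaluate")


def main(train_file, test_file, working_dir, read_raw_data_for_training=True):
    if read_raw_data_for_training:
        # Raw data is processed for the first time: transform it here.
        transform_data(train_file, test_file, working_dir)
    # With the flag set to False, artifacts from a previous run are reused.
    train_and_evaluate(working_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as working_dir:
        main("train.csv", "test.csv", working_dir)  # first run: transforms
        main("train.csv", "test.csv", working_dir,
             read_raw_data_for_training=False)      # second run: reuses
        print(calls)
```

In this sketch, the second call with read_raw_data_for_training=False skips the transform step entirely, so the existing artifacts are left untouched and nothing is recomputed.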