https://engineeringfordatascience.com/posts/ml_repeatable_splitting_using_hashing/
Blog post inspired by the ML Design Patterns book
Reproducibility is critical for robust data science -- after all, it is a science.
But reproducibility in ML can be surprisingly difficult.
The behaviour of your model depends not only on your code, but also on the underlying dataset used to train it.
Therefore, you need to keep tight control over which data points were used to train and test your model to ensure reproducibility.
A fundamental tenet of the ML workflow is splitting your data into training and testing sets. This involves deliberately withholding some data points from the model training in order to evaluate the performance of your model on 'unseen' data.
It is vitally important to be able to reproducibly split your data across different training runs, for a few main reasons:
- So you can use the same test data points to effectively compare the performance of different model candidates
- To control as many variables as possible to help troubleshoot performance issues
- To ensure that you, or your colleagues, can reproduce your model exactly

How you split your data can have a big effect on the perceived performance of your model.
It is important to control and understand the training and test splits when comparing different model candidates and across multiple training runs.
Sklearn train_test_split
Probably the most common way to split your dataset is to use Sklearn's train_test_split function.
Out of the box, the train_test_split function will randomly split your data into a training set and a test set. Each time you run the function you will get a different split for your data. Not ideal for reproducibility.
"Ah!" you say. "I set the random seed so it is reproducible!".
Fair point. Setting random seeds is certainly an excellent idea and goes a long way to improve reproducibility. I would highly recommend setting random seeds for any functions which have non-deterministic outputs.
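As a minimal sketch of this standard approach (the DataFrame and column names here are illustrative, not from the post), a seeded split looks like:

```python
# Seeded train/test split: repeated runs on the SAME data give the same split
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"id": range(10), "feature": range(10)})

# Fixing random_state makes the shuffle deterministic for this dataset
train, test = train_test_split(df, test_size=0.2, random_state=42)
```

Re-running this code on the same DataFrame always yields identical train and test indices.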
However, random seeds might not be enough to ensure reproducibility
In this post I will demonstrate that the train_test_split function is more sensitive than you might think, and explain why using a random seed does not always guarantee reproducibility, particularly if you need to retrain your model in the future.
Sometimes train_test_split() does not guarantee reproducibility -- even if you set the random seed.
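A short sketch of the failure mode (with an illustrative toy dataset): the seed fixes the shuffle for a dataset of a given size, so when new rows arrive between training runs, which of the *original* rows land in the test set can change.

```python
# Why a fixed seed is not enough: growing the dataset reshuffles the split
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"id": range(10)})
_, test_before = train_test_split(df, test_size=0.3, random_state=42)

# The dataset grows between training runs (e.g. new data arrives)
df_grown = pd.DataFrame({"id": range(12)})
_, test_after = train_test_split(df_grown, test_size=0.3, random_state=42)

# Which of the ORIGINAL ids fall in the test set may now be different,
# even though random_state is unchanged
original_ids_before = set(test_before["id"])
original_ids_after = set(test_after["id"]) & set(df["id"])
```

With a row-based split, previously-seen test rows can leak into the new training set, silently inflating evaluation metrics.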
See the notebook in this folder for a demonstration and commentary on using the Farmhash algorithm to split your data in a robust manner.
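The post uses the Farmhash algorithm; as a dependency-free sketch of the same idea, the example below stands in hashlib's MD5 for Farmhash. The key property is that a row's split assignment depends only on a stable ID, never on the dataset's size or ordering.

```python
# Hash-based splitting sketch: MD5 stands in here for Farmhash so the
# example needs no extra dependencies; the technique is identical
import hashlib

def assign_to_test(row_id: str, test_fraction: float = 0.2) -> bool:
    """Deterministically assign a row to the test set based on its ID."""
    digest = hashlib.md5(row_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # map the hash into 100 buckets
    return bucket < test_fraction * 100

ids = [f"row_{i}" for i in range(1000)]
test_ids = [row_id for row_id in ids if assign_to_test(row_id)]
```

Because the assignment is a pure function of the row ID, adding new rows later never changes the assignment of existing rows, and no random seed is needed at all.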
Hashing
- ML Design Pattern: Repeatable sampling (inspiration for this post)
- Hash your data before you create the train-test split
- Farmhash Algorithm Description
- Python Farmhash library
- Different hashing implementation from Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow (also see the notebook)
- Google documentation: Considerations for Hashing
Differences between Python and BigQuery
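One difference worth checking (stated here as an assumption to verify against the notebook): BigQuery's FARM_FINGERPRINT returns a *signed* 64-bit integer, while the Python farmhash library's fingerprint64 returns an *unsigned* one, so comparing hashes across the two requires reinterpreting the unsigned value as two's-complement signed.

```python
# Reinterpret an unsigned 64-bit hash as a signed int64 (two's complement),
# so Python farmhash output can be compared with BigQuery's FARM_FINGERPRINT
def to_signed_int64(unsigned_value: int) -> int:
    """Map [0, 2**64) onto [-2**63, 2**63)."""
    if unsigned_value >= 2**63:
        return unsigned_value - 2**64
    return unsigned_value
```

Without this conversion, the same row can hash into different buckets in Python and in BigQuery whenever the modulo operation sees a negative input on one side.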