https://engineeringfordatascience.com/posts/ml_repeatable_splitting_using_hashing/
Blog post inspired by the ML Design Patterns book
Reproducibility is critical for robust data science -- after all, it is a science.
But reproducibility in ML can be surprisingly difficult.
The behaviour of your model depends not only on your code, but also on the underlying dataset used to train it.
Therefore, you need to keep tight control over which data points were used to train and test your model to ensure reproducibility.
A fundamental tenet of the ML workflow is splitting your data into training and testing sets. This involves deliberately withholding some data points from the model training in order to evaluate the performance of your model on 'unseen' data.
It is vitally important to be able to reproducibly split your data across different training runs, for a few main reasons:
- So you can use the same test data points to effectively compare the performance of different model candidates
- To control as many variables as possible to help troubleshoot performance issues
- To ensure that you, or your colleagues, can reproduce your model exactly

How you split your data can have a big effect on the perceived performance of your model.
It is important to control and understand the training and test splits when comparing different model candidates and across multiple training runs.
Sklearn train_test_split
Probably the most common way to split your dataset is to use Sklearn's train_test_split function.
Out of the box, the train_test_split function will randomly split your data into a training set and a test set. Each time you run the function you will get a different split for your data. Not ideal for reproducibility.
"Ah!" you say. "I set the random seed so it is reproducible!".
Fair point. Setting random seeds is certainly an excellent idea and goes a long way to improve reproducibility. I would highly recommend setting random seeds for any functions which have non-deterministic outputs.
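As a minimal sketch of this standard approach (the DataFrame and column names here are illustrative, not from the post), a seeded split looks like:

```python
# Seeded train/test split: repeated runs on the SAME data give the same split
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"id": range(10), "feature": range(10)})

# Fixing random_state makes the shuffle deterministic for this dataset
train, test = train_test_split(df, test_size=0.2, random_state=42)
```

Re-running this code on the same DataFrame always yields identical train and test indices.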
However, random seeds might not be enough to ensure reproducibility
In this post I will demonstrate that the train_test_split function is more sensitive than you might think, and explain why using a random seed does not always guarantee reproducibility, particularly if you need to retrain your model in the future.
Sometimes train_test_split() does not guarantee reproducibility -- even if you set the random seed.
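A short sketch of the failure mode (with an illustrative toy dataset): the seed fixes the shuffle for a dataset of a given size, so when new rows arrive between training runs, which of the *original* rows land in the test set can change.

```python
# Why a fixed seed is not enough: growing the dataset reshuffles the split
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"id": range(10)})
_, test_before = train_test_split(df, test_size=0.3, random_state=42)

# The dataset grows between training runs (e.g. new data arrives)
df_grown = pd.DataFrame({"id": range(12)})
_, test_after = train_test_split(df_grown, test_size=0.3, random_state=42)

# Which of the ORIGINAL ids fall in the test set may now be different,
# even though random_state is unchanged
original_ids_before = set(test_before["id"])
original_ids_after = set(test_after["id"]) & set(df["id"])
```

With a row-based split, previously-seen test rows can leak into the new training set, silently inflating evaluation metrics.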
See the notebook in this folder for a demonstration and commentary on using the Farmhash algorithm to split your data in a robust manner.
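The post uses the Farmhash algorithm; as a dependency-free sketch of the same idea, the example below stands in hashlib's MD5 for Farmhash. The key property is that a row's split assignment depends only on a stable ID, never on the dataset's size or ordering.

```python
# Hash-based splitting sketch: MD5 stands in here for Farmhash so the
# example needs no extra dependencies; the technique is identical
import hashlib

def assign_to_test(row_id: str, test_fraction: float = 0.2) -> bool:
    """Deterministically assign a row to the test set based on its ID."""
    digest = hashlib.md5(row_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # map the hash into 100 buckets
    return bucket < test_fraction * 100

ids = [f"row_{i}" for i in range(1000)]
test_ids = [row_id for row_id in ids if assign_to_test(row_id)]
```

Because the assignment is a pure function of the row ID, adding new rows later never changes the assignment of existing rows, and no random seed is needed at all.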
Hashing
- ML Design Pattern: Repeatable sampling (inspiration for this post)
- Hash your data before you create the train-test split
- Farmhash Algorithm Description
- Python Farmhash library
- Different hashing implementation from Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow (also see the notebook)
- Google documentation: Considerations for Hashing
Differences between Python and BigQuery
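One difference worth checking (stated here as an assumption to verify against the notebook): BigQuery's FARM_FINGERPRINT returns a *signed* 64-bit integer, while the Python farmhash library's fingerprint64 returns an *unsigned* one, so comparing hashes across the two requires reinterpreting the unsigned value as two's-complement signed.

```python
# Reinterpret an unsigned 64-bit hash as a signed int64 (two's complement),
# so Python farmhash output can be compared with BigQuery's FARM_FINGERPRINT
def to_signed_int64(unsigned_value: int) -> int:
    """Map [0, 2**64) onto [-2**63, 2**63)."""
    if unsigned_value >= 2**63:
        return unsigned_value - 2**64
    return unsigned_value
```

Without this conversion, the same row can hash into different buckets in Python and in BigQuery whenever the modulo operation sees a negative input on one side.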