
ML Design Pattern: Repeatable Splitting

https://engineeringfordatascience.com/posts/ml_repeatable_splitting_using_hashing/

Blog post inspired by the ML Design Patterns book

Reproducible ML: Maybe you shouldn't be using train_test_split

Reproducibility is critical for robust data science -- after all, it is a science.

But reproducibility in ML can be surprisingly difficult.

The behaviour of your model depends not only on your code, but also on the underlying dataset that was used to train it.

Therefore, you need to keep tight control over which data points were used to train and test your model to ensure reproducibility.

A fundamental tenet of the ML workflow is splitting your data into training and testing sets. This involves deliberately withholding some data points from the model training in order to evaluate the performance of your model on 'unseen' data.

It is vitally important to be able to split your data reproducibly across different training runs, for a few main reasons:

  • So you can use the same test data points to effectively compare the performance of different model candidates
  • To control as many variables as possible to help troubleshoot performance issues
  • To ensure that you, or your colleagues, can reproduce your model exactly

How you split your data can have a big effect on the perceived model performance.

It is important to control and understand the training and test splits when comparing different model candidates and across multiple training runs.

Sklearn train_test_split

Probably the most common way to split your dataset is to use Sklearn's train_test_split function.

Out of the box, the train_test_split function will randomly split your data into a training set and a test set. Each time you run the function you will get a different split for your data. Not ideal for reproducibility.
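
For example, here is a minimal sketch of the default behaviour (the toy dataset is purely illustrative):

```python
from sklearn.model_selection import train_test_split

data = list(range(10))

# No random_state: each call shuffles the data differently
_, test_a = train_test_split(data, test_size=0.2)
_, test_b = train_test_split(data, test_size=0.2)

print(test_a, test_b)  # the two test sets will usually differ
```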

"Ah!" you say. "I set the random seed so it is reproducible!".

Fair point. Setting random seeds is certainly an excellent idea and goes a long way towards improving reproducibility. I would highly recommend setting random seeds for any functions which have non-deterministic outputs.
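
For instance, passing random_state to train_test_split makes repeated calls on the same data return identical splits (toy data again, for illustration):

```python
from sklearn.model_selection import train_test_split

data = list(range(10))

# Fixing random_state makes repeated calls return the same split
_, test_a = train_test_split(data, test_size=0.2, random_state=42)
_, test_b = train_test_split(data, test_size=0.2, random_state=42)

assert test_a == test_b  # the same points land in the test set every run
```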

However, random seeds might not be enough to ensure reproducibility.

In this post I will demonstrate that the train_test_split function is more sensitive than you might think, and explain why using a random seed does not always guarantee reproducibility, particularly if you need to retrain your model in the future.

Sometimes train_test_split() does not guarantee reproducibility -- even if you set the random seed.
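
As a minimal sketch of the failure mode, suppose a single new data point arrives between training runs. Even with the same seed, the test set membership changes, because the shuffle depends on the size and order of the data:

```python
from sklearn.model_selection import train_test_split

original = list(range(10))
updated = list(range(11))  # one new data point appended

_, test_orig = train_test_split(original, test_size=0.2, random_state=42)
_, test_new = train_test_split(updated, test_size=0.2, random_state=42)

# The original test points are no longer guaranteed to stay in the test set
print(sorted(test_orig), sorted(test_new))
```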

The Solution: Hashing

See the notebook in this folder for a demonstration and commentary on using the Farmhash algorithm to split your data in a robust manner.
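
To give a flavour of the approach, here is a minimal sketch of a hash-based split, assuming the pyfarmhash package (pip install pyfarmhash); the function name hash_train_test_split and the 10-bucket scheme are illustrative, not the notebook's exact code:

```python
import farmhash  # provided by the pyfarmhash package
import pandas as pd

def hash_train_test_split(df: pd.DataFrame, split_col: str, test_frac: float = 0.2):
    """Split rows into train/test based on a hash of a stable column value."""
    buckets = 10
    # fingerprint64 deterministically maps a string to a 64-bit integer
    hashes = df[split_col].astype(str).map(farmhash.fingerprint64)
    in_test = (hashes % buckets) < int(test_frac * buckets)
    return df[~in_test], df[in_test]
```

Because each row's assignment depends only on the hash of its own value in split_col, the split is unaffected by row order, random seeds, or new rows being added later: the same data point always lands in the same split.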

Resources

  • Hashing
  • Differences between Python and BigQuery