For the purposes of this tutorial, the notebooks have already been created. They all contain dependency information in their metadata, thanks to the jupyterlab-requirements library, which allows for reproducibility.
Dependency management is one of the most important requirements for reproducibility. Having dependencies clearly stated makes notebooks portable, so they can be shared safely with others, reused in other projects, or simply reproduced. If you want to know more about this issue in the data science domain, have a look at this article or this video. If you are interested in a tutorial on managing dependencies, you can have a look at this tutorial.
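To make this concrete, here is a minimal sketch of how you could peek at the dependency metadata stored in a notebook using `nbformat`. The `requirements` metadata key and the notebook path are assumptions for illustration; check your own notebook's metadata to be sure.

```python
import json
import nbformat

# A minimal sketch: read a notebook and print the dependency metadata
# stored by jupyterlab-requirements. The "requirements" metadata key is
# an assumption for illustration purposes.
nb = nbformat.read("notebooks/explore_dataset.ipynb", as_version=4)
requirements = nb.metadata.get("requirements")
if requirements is not None:
    # Depending on the library version, this may be a JSON string or a dict.
    if isinstance(requirements, str):
        requirements = json.loads(requirements)
    print(json.dumps(requirements, indent=2))
else:
    print("No dependency metadata found in this notebook.")
```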
The notebooks that will be used in this tutorial are:
- `explore_dataset.ipynb` for exploring the dataset from the GLUE benchmark on HuggingFace.
- `preprocess_dataset.ipynb` for preprocessing the dataset from the GLUE benchmark on HuggingFace.
- `fine_tune_model.ipynb` for downloading a pre-trained model and fine-tuning it.
NOTE: These notebooks have been derived from HuggingFace material.
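As a rough idea of what the dataset notebooks work with, the sketch below loads a GLUE task from HuggingFace with the `datasets` library. The choice of the `cola` task is an assumption for illustration, not necessarily the task used in the tutorial notebooks.

```python
from datasets import load_dataset

# Illustrative only: load a GLUE task from HuggingFace.
# "cola" is an assumed task for this example; the tutorial notebooks
# define which GLUE task is actually used.
dataset = load_dataset("glue", "cola")
print(dataset)               # DatasetDict with train/validation/test splits
print(dataset["train"][0])   # Inspect the first training example
```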
In this step, we want to create overlays with software stacks for the three notebooks/steps we want to have in the AI pipeline:
- Open one of the Jupyter notebooks in the `notebooks` folder.
- Run `%horus extract --use-overlay --pipfile --pipfile-lock --store-files-path ..` to extract the software stack and runtime environment from the notebook's metadata.
- Repeat this process for each notebook.
At this point, you will have an `overlays` folder with the following structure:
```
overlays/
    download-dataset/
        Pipfile
        Pipfile.lock
    fine-tune-model/
        Pipfile
        Pipfile.lock
    preprocess-dataset/
        Pipfile
        Pipfile.lock
```
As you can see, each notebook/step we created has its own specific software stack.
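If you want to double-check the result, a small sketch like the following (a hypothetical helper, not part of the tutorial) walks the `overlays` folder and confirms that each step ships both files:

```python
from pathlib import Path

# Hypothetical sanity check, not part of the tutorial: confirm each
# overlay directory contains both a Pipfile and a Pipfile.lock.
overlays = Path("overlays")
for step in sorted(p for p in overlays.iterdir() if p.is_dir()):
    for required in ("Pipfile", "Pipfile.lock"):
        status = "ok" if (step / required).is_file() else "MISSING"
        print(f"{step.name}/{required}: {status}")
```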