Adding HuggingFace Tutorial #1040

Merged · merged 9 commits · Jan 17, 2023
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -12,7 +12,7 @@ repos:
rev: v2.2.1
hooks:
- id: codespell
args: ["--skip", "*.html,*.ipynb,dashboard/src/.yarn/**,dashboard/src/yarn.lock,dashboard/build/**,dashboard/src/src/__tests__/**", "--ignore-words-list=hist,wont"]
args: ["--skip", "*.html,*.ipynb,dashboard/src/.yarn/**,dashboard/src/yarn.lock,dashboard/build/**,dashboard/src/src/__tests__/**", "--ignore-words-list=hist,wont,ro"]
- repo: https://github.com/PyCQA/flake8
rev: 5.0.4
hooks:
210 changes: 210 additions & 0 deletions docs/src/tutorials/huggingface.rst
@@ -0,0 +1,210 @@
*****************************************************************************************
Hyperparameter optimization using a HuggingFace model and the Hydra-Orion-Sweeper plugin
*****************************************************************************************

In this tutorial, we will show a simple Orion integration of a HuggingFace translation model
using Hydra, through the `Hydra_Orion_Sweeper <https://github.com/Epistimio/hydra_orion_sweeper>`_
plugin. Hydra is essentially a framework for configuring applications. We will use it to define
our hyperparameters and some of the Orion configuration. We will also use
`Comet <https://www.comet.com/>`_ to monitor our experiments.

Installation
^^^^^^^^^^^^
For this tutorial, everything we need to install can be found in the ``requirements.txt``
file located in the ``examples/huggingface`` directory. You can then install the requirements
with ``pip``:

.. code-block:: bash

    $ pip install -r examples/huggingface/requirements.txt

Imports
^^^^^^^
You will now need to import these modules.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 5-6,13-22


Hydra configuration file
^^^^^^^^^^^^^^^^^^^^^^^^

Notice here how the arguments that are not defined are set to ``None``; they will either be
overridden by default values or not used at all. This serves as a replacement for parsing
arguments on the command line, but it is integrated with Orion, which makes it more practical
for managing hyperparameter search spaces.

.. literalinclude:: /../../examples/huggingface/config.yaml
   :language: yaml
   :lines: 1-
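
The strings in ``params`` use Orion's prior syntax. Below is an illustrative sketch of a few
other priors you could use; ``batch_size`` and ``dropout`` are hypothetical parameters added
only for the example.

.. code-block:: yaml

    hydra:
      sweeper:
        params:
          lr: "loguniform(1e-8, 1.0)"          # log-uniform over a range
          batch_size: "choices([8, 16, 32])"   # categorical choice
          dropout: "uniform(0, 0.5)"           # uniform over a range
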

If you want to change your working and/or logging directory, you can easily do that as well.
From the config file, you can specify:

.. literalinclude:: /../../examples/huggingface/config.yaml
   :language: yaml
   :lines: 29-31

This will change your working directory. With the hydra-orion-sweeper, we can interpolate four
variables from Orion: ``${hydra.sweeper.orion.name}`` for the experiment name,
``${hydra.sweeper.orion.id}`` for the experiment id, ``${hydra.sweeper.orion.uuid}`` for the
experiment ``uuid`` and ``${hydra.sweeper.orion.trial}`` for the trial id.

In the code, you can then pass the output directory to the trainer through the ``output_dir``
parameter; ``os.getcwd()`` returns the current working directory.

Including these options creates a separate folder for each trial, nested under one folder per
experiment and even one per sweep. You do not have to add them all, but they are quite useful
when you want to avoid two trials writing their cache to the same file, which could result in
an error.

.. code-block:: python

    output_dir=str(os.getcwd()) + "/test_trainer",
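
For instance, a sweep layout combining these variables could look like the sketch below. This
is only an illustration; the actual layout used in this tutorial (shown in the full
``config.yaml``) uses ``${hydra.job.id}`` for the last level instead of the trial id.

.. code-block:: yaml

    hydra:
      sweep:
        # one folder per sweep (timestamped), then one per experiment and one per trial
        dir: hydra_log/multirun/translation/${now:%Y-%m-%d}/${now:%H-%M-%S}
        subdir: ${hydra.sweeper.orion.name}/${hydra.sweeper.orion.uuid}/${hydra.sweeper.orion.trial}
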
You can find more about the Hydra-Orion-Sweeper plugin by looking directly at its GitHub
repository, `Hydra_Orion_Sweeper <https://github.com/Epistimio/hydra_orion_sweeper>`_, or find
out more about Hydra in general here: `Hydra <https://hydra.cc/docs/intro/>`_.


Adapting the code for Hydra
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``config_path`` and ``config_name`` arguments here specify the location and name of your
Hydra config file. ``cfg.args`` refers to the ``args`` section of the config file.
We are also going to use this main function as the entry point of the program.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 141-146, 310-311

With the ``hydra-orion-sweeper``, this function needs to return the objective. There are three
choices for what you can return:

.. code-block:: python

    if cfg.return_type == "float":
        return out
    if cfg.return_type == "dict":
        return dict(name="objective", type="objective", value=out)
    if cfg.return_type == "list":
        return [dict(name="objective", type="objective", value=out)]

For the purposes of this tutorial, we are going to keep it simple and return a float:
the objective we want to minimize.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 304

Comet
^^^^^
We are going to use Comet to track our experiments. It is quite simple to use. First,
install ``comet-ml``:

.. code-block:: bash

    $ pip install comet-ml

Now that it is installed, we simply have to set some environment variables, such as:

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 8-13

You can also set them in your working environment instead. If you set them in Python, however,
make sure to set them before importing ``transformers``.
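
For example, setting them in your shell instead could look like this (a minimal sketch; the
values below are placeholders to replace with your own):

.. code-block:: bash

    # placeholder values -- replace with your own Comet credentials
    export COMET_API_KEY="your_api_key"
    export COMET_WORKSPACE="your_workspace"
    export COMET_PROJECT_NAME="your_project"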

For ``COMET_API_KEY``, use the token you are given when creating your Comet account.

And that is it! If the variables are set and ``comet-ml`` is installed, HuggingFace will
automatically upload your data to Comet; you simply have to go to your profile on their site
to see your experiments.

It is important to note that the Comet logger can be swapped for many others, such as WandB,
MLflow, Neptune and ClearML. You can see the complete list in the HuggingFace documentation:
`HF callbacks <https://huggingface.co/docs/transformers/main_classes/callback#callbacks>`_.
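
By default, HuggingFace reports to every supported integration that is installed. If you want
to restrict this explicitly, one option is the ``report_to`` training argument; a minimal
sketch:

.. code-block:: python

    # only report to Comet, even if other logger integrations are installed
    training_args = Seq2SeqTrainingArguments(
        output_dir="test_trainer",
        report_to=["comet_ml"],
    )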

Example code
^^^^^^^^^^^^
For this example, we are fine-tuning a pretrained translation model,
``Helsinki-NLP/opus-mt-en-ro``. We start by setting the training arguments.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 165-174

For our purposes, we will be using a ``Seq2SeqTrainer``, so the training arguments are going
to be ``Seq2SeqTrainingArguments``. The ``set_training_args`` function copies the Hydra
arguments into the training arguments.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 114-118

For the data, we are going to use the ``wmt16`` dataset. We can set a ``cache_dir`` to control
where the dataset cache is stored.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 179,182-184

We then prepare our training and evaluation datasets. In this example, we want to evaluate our
model on both the validation dataset and the training dataset.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 196-229

For the metric, we are going to use ``sacrebleu``. We can also set a ``cache_dir`` here for the
metric cache files. The ``compute_metrics`` function goes as follows:

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 238-240, 247-268

Now we have to create the actual Trainer, a ``Seq2SeqTrainer`` as mentioned previously.
It is very much like a classic ``Trainer`` from HuggingFace.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 284-292

HuggingFace will log the evaluation on the ``eval_dataset`` to Comet. Since we also want the
evaluation on the training dataset, we have to implement a ``CustomCallback``. The one written
for this tutorial takes the ``trainer`` and the dataset we want to add (in our case, the
training dataset) as parameters.
We can then override some callback methods, such as ``on_epoch_end()``.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 270-282,294

All that is left to do now is to train the model and, once it has finished training, send the
objective to Orion by returning it.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 295-297, 304

For more details, feel free to look at the full code in ``examples/huggingface/main.py``.

Execution
^^^^^^^^^
We simply have to run the ``main.py`` file with the ``-m`` (multirun) flag, which makes sure
the Hydra-Orion-Sweeper plugin is used.

.. code-block:: bash

    $ python3 main.py -m
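
You can also override individual configuration values from the command line, using Hydra's
standard override syntax. For example, a quick (hypothetical) run with fewer epochs:

.. code-block:: bash

    # override a value from config.yaml for this run only
    $ python3 main.py -m args.num_train_epochs=5
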
Visualizing results
^^^^^^^^^^^^^^^^^^^
With Orion, after your experiment has finished running, you can easily visualize your results
using `regret plots <https://orion.readthedocs.io/en/stable/auto_examples/plot_1_regret.html>`_
and `partial dependencies plots
<https://orion.readthedocs.io/en/stable/auto_examples/plot_4_partial_dependencies.html>`_.
These are very helpful for seeing what happened during the optimization and what could be
adjusted if necessary.
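
For example, loading the finished experiment from the pickled database defined in
``config.yaml`` and producing both plots could look roughly like this (a sketch based on the
Orion plotting examples linked above; adjust the path to wherever ``orion_db.pkl`` was
written):

.. code-block:: python

    from orion.client import get_experiment

    # point the client at the pickled database used by the sweeper
    storage = dict(type="legacy", database=dict(type="pickleddb", host="orion_db.pkl"))

    experiment = get_experiment(name="translationexp", storage=storage)
    experiment.plot.regret().show()
    experiment.plot.partial_dependencies().show()
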
49 changes: 49 additions & 0 deletions examples/huggingface/config.yaml
@@ -0,0 +1,49 @@
defaults:
  - override hydra/sweeper: orion

hydra:
  sweeper:
    params:
      lr: "loguniform(1e-8, 1.0)"
      wd: "loguniform(1e-10, 1)"
    orion:
      name: 'translationexp'
      version: '1'

    algorithm:
      type: random
      config:
        seed: 1

    worker:
      n_workers: 1
      max_broken: 3
      max_trials: 1

    storage:
      type: legacy
      database:
        type: pickleddb
        host: 'orion_db.pkl'

  sweep:
    dir: hydra_log/multirun/translation/${now:%Y-%m-%d}/${now:%H-%M-%S}
    subdir: ${hydra.sweeper.orion.name}/${hydra.sweeper.orion.uuid}/${hydra.job.id}

#Default value
lr: 0.01
wd: 0.00

args:
  size_train_dataset: 5000
  size_eval_dataset:
  freeze_base_model:
  per_device_train_batch_size:
  optim:
  weight_decay: ${wd}
  adam_beta1:
  adam_beta2:
  adam_epsilon:
  logfile:
  learning_rate: ${lr}
  num_train_epochs: 20
311 changes: 311 additions & 0 deletions examples/huggingface/main.py
@@ -0,0 +1,311 @@
# [markdown]
# # Fine-tune a pretrained model from Hugging Face
#
# source tutorial: https://huggingface.co/docs/transformers/training
import logging
import os

os.environ["COMET_API_KEY"] = "comet_token"
os.environ["COMET_WORKSPACE"] = "workspace"
os.environ["COMET_PROJECT_NAME"] = "project"
os.environ["COMET_MODE"] = "ONLINE"
os.environ["COMET_LOG_ASSETS"] = "True"
os.environ["COMET_AUTO_LOG_METRICS"] = "True"
import argparse
from copy import deepcopy

import hydra
import numpy as np
import torch
from datasets import load_dataset, load_metric
from omegaconf import DictConfig
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    TrainerCallback,
)


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-st",
        "--size-train-dataset",
        help="Number of samples to use from training data set. If not specified, use complete dataset",
        type=int,
        required=False,
    )
    parser.add_argument(
        "-se",
        "--size-eval-dataset",
        help="Number of samples to use from evaluation data set. If not specified, use complete dataset",
        type=int,
        required=False,
    )
    parser.add_argument(
        "-f",
        "--freeze-base-model",
        help="Freeze parameters of base model during training",
        action="store_true",
        required=False,
    )
    parser.add_argument(
        "-lr", "--learning-rate", help="Learning rate", type=float, required=False
    )
    parser.add_argument(
        "-e",
        "--num_train_epochs",
        help="Number of training epochs",
        type=int,
        required=False,
    )
    parser.add_argument(
        "-b",
        "--per_device_train_batch_size",
        help="Per device batch size",
        type=int,
        required=False,
    )
    parser.add_argument(
        "-opt",
        "--optim",
        help="Optimizer (one of: adamw_hf, adamw_torch, adamw_apex_fused, or adafactor)",
        type=str,
        required=False,
    )
    parser.add_argument(
        "-wd",
        "--weight_decay",
        help="Weight decay for AdamW optimizer",
        type=float,
        required=False,
    )
    parser.add_argument(
        "-b1",
        "--adam_beta1",
        help="beta1 hyperparameter for AdamW optimizer",
        type=float,
        required=False,
    )
    parser.add_argument(
        "-b2",
        "--adam_beta2",
        help="beta2 hyperparameter for AdamW optimizer",
        type=float,
        required=False,
    )
    parser.add_argument(
        "-eps",
        "--adam_epsilon",
        help="epsilon hyperparameter for AdamW optimizer",
        type=float,
        required=False,
    )
    parser.add_argument(
        "-log", "--logfile", help="Log file name and path", type=str, required=False
    )
    args = parser.parse_args()
    return vars(args)


def set_training_args(training_args, args):
    for argname, argvalue in args.items():
        if argvalue is not None:
            setattr(training_args, argname, argvalue)
    return training_args


class GPUMemoryCallback(TrainerCallback):
    def on_epoch_end(self, args, state, control, **kwargs):
        print(
            "GPU mem: Tot - ",
            torch.cuda.get_device_properties(0).total_memory,
            "res - ",
            torch.cuda.memory_reserved(0),
            "used - ",
            torch.cuda.memory_allocated(0),
        )


def get_free_gpu():
    for i in range(torch.cuda.device_count()):
        gpu_procs_str = torch.cuda.list_gpu_processes(i)
        if "no processes are running" in gpu_procs_str:
            return i
    return None


@hydra.main(config_path=".", config_name="config")
def main(cfg: DictConfig) -> float:
    print("args", cfg)

    # Get command line arguments and apply hyperparameters to training arguments
    args = cfg.args

    # Logger setup
    logfile = args["logfile"] or "translation_hf.log"

    logging.basicConfig(filename=logfile, level=logging.INFO)
    logger = logging.getLogger()

    # Get a GPU if available
    if torch.cuda.is_available():
        device = f"cuda:{get_free_gpu()}"
    else:
        device = "cpu"

    # We only use the device to print out what HF should be using by default
    logger.info("Compute device: %s", device)

    batch_size = 16

    # Get hyperparameters
    training_args = Seq2SeqTrainingArguments(
        output_dir=str(os.getcwd()) + "/test_trainer",
        save_total_limit=2,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        evaluation_strategy="epoch",
        predict_with_generate=True,
    )

    training_args = set_training_args(training_args, args)
    print("Training arguments:", training_args)

    # Load a dataset
    dataset_name = "wmt16"
    logger.info("Dataset: %s", dataset_name)

    raw_dataset = load_dataset(
        dataset_name, "ro-en", cache_dir="hydra_log/multirun/translation/dataset"
    )

    model_checkpoint = "Helsinki-NLP/opus-mt-en-ro"

    # Create tokenizer and tokenize the dataset
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

    max_input_length = 128
    max_target_length = 128
    source_lang = "en"
    target_lang = "ro"

    def preprocess_function(examples):
        inputs = [ex[source_lang] for ex in examples["translation"]]
        targets = [ex[target_lang] for ex in examples["translation"]]
        model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

        # Setup the tokenizer for targets
        with tokenizer.as_target_tokenizer():
            labels = tokenizer(targets, max_length=max_target_length, truncation=True)

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    tokenized_datasets = raw_dataset.map(preprocess_function, batched=True)

    # Use only a subset of the available data, if desired
    size_train_dataset = len(tokenized_datasets["train"])
    size_eval_dataset = len(tokenized_datasets["validation"])

    if args["size_train_dataset"] is not None:
        size_train_dataset = args["size_train_dataset"]
    if args["size_eval_dataset"] is not None:
        size_eval_dataset = args["size_eval_dataset"]
    train_dataset = (
        tokenized_datasets["train"].shuffle(seed=42).select(range(size_train_dataset))
    )
    eval_train_dataset = (
        tokenized_datasets["train"].shuffle(seed=42).select(range(size_eval_dataset))
    )
    eval_dataset = (
        tokenized_datasets["validation"]
        .shuffle(seed=42)
        .select(range(size_eval_dataset))
    )

    # Create model
    model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

    model_name = model_checkpoint.split("/")[-1]

    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

    # Train the model
    metric = load_metric(
        "sacrebleu", cache_dir="hydra_log/multirun/translation/dataset"
    )

    def postprocess_text(preds, labels):
        preds = [pred.strip() for pred in preds]
        labels = [[label.strip()] for label in labels]
        return preds, labels

    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        if isinstance(preds, tuple):
            preds = preds[0]
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

        # Replace -100 in the labels as we can't decode them.
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        # Some simple post-processing
        decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

        result = metric.compute(predictions=decoded_preds, references=decoded_labels)
        result = {"sacrebleu": result["score"]}

        prediction_lens = [
            np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds
        ]
        result["gen_len"] = np.mean(prediction_lens)
        result = {k: round(v, 4) for k, v in result.items()}
        return result

    class CustomCallback(TrainerCallback):
        def __init__(self, trainer, dataset) -> None:
            super().__init__()
            self._trainer = trainer
            self.dataset = dataset

        def on_epoch_end(self, args, state, control, **kwargs):
            if control.should_evaluate:
                control_copy = deepcopy(control)
                self._trainer.evaluate(
                    eval_dataset=self.dataset, metric_key_prefix="train"
                )
                return control_copy

    trainer = Seq2SeqTrainer(
        model,
        training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    trainer.add_callback(CustomCallback(trainer, eval_train_dataset))
    trainer.train()
    # Evaluate model
    eval_metrics = trainer.evaluate()

    # Print out memory stats
    print("Total GPU memory:", torch.cuda.get_device_properties(0).total_memory)
    print("GPU memory reserved:", torch.cuda.memory_reserved(0))
    print("GPU memory allocated:", torch.cuda.memory_allocated(0))

    return -eval_metrics["eval_sacrebleu"]


# =======================================================================
# Main
# =======================================================================
if __name__ == "__main__":
    main()
5 changes: 5 additions & 0 deletions examples/huggingface/requirements.txt
@@ -0,0 +1,5 @@
datasets
transformers
hydra
hydra-orion-sweeper
comet-ml