Adding HuggingFace Tutorial #1040

Merged · merged 9 commits · Jan 17, 2023
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -12,7 +12,7 @@ repos:
rev: v2.2.1
hooks:
- id: codespell
args: ["--skip", "*.html,*.ipynb,dashboard/src/.yarn/**,dashboard/src/yarn.lock,dashboard/build/**,dashboard/src/src/__tests__/**", "--ignore-words-list=hist,wont"]
args: ["--skip", "*.html,*.ipynb,dashboard/src/.yarn/**,dashboard/src/yarn.lock,dashboard/build/**,dashboard/src/src/__tests__/**", "--ignore-words-list=hist,wont,ro"]
- repo: https://github.com/PyCQA/flake8
rev: 5.0.4
hooks:
210 changes: 210 additions & 0 deletions docs/src/tutorials/huggingface.rst
@@ -0,0 +1,210 @@
*****************************************************************************************
Hyperparameter optimization using a HuggingFace model and the Hydra-Orion-Sweeper plugin
*****************************************************************************************

In this tutorial, we will show a simple Orion integration of a HuggingFace translation model
using Hydra, through the `Hydra_Orion_Sweeper <https://github.com/Epistimio/hydra_orion_sweeper>`_
plugin. Hydra is essentially a framework for configuring applications. We will use it to define
our hyperparameters and some of the Orion configuration. We will also use
`Comet <https://www.comet.com/>`_ to monitor our experiments.

Installation
^^^^^^^^^^^^
For this tutorial, everything we need to install can be found in the ``requirements.txt``
file located in the ``examples/huggingface`` directory. You can then install the requirements
with ``pip``:

.. code-block:: bash

    $ pip install -r examples/huggingface/requirements.txt

Imports
^^^^^^^
You will now need to import these modules.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 5-6,13-22


Hydra configuration file
^^^^^^^^^^^^^^^^^^^^^^^^

Notice here how the arguments that are not defined are set to ``None``; they will either be
overridden by default values or not used at all. This serves as a replacement for parsing
arguments on the command line, but it is integrated with Orion, which makes it more practical
for managing hyperparameter search spaces.

.. literalinclude:: /../../examples/huggingface/config.yaml
   :language: yaml
   :lines: 1-
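
The strings in ``params`` use Orion's prior syntax. Below is an illustrative sketch of a few
other priors you could use; ``batch_size`` and ``dropout`` are hypothetical parameters added
only for the example.

.. code-block:: yaml

    hydra:
      sweeper:
        params:
          lr: "loguniform(1e-8, 1.0)"          # log-uniform over a range
          batch_size: "choices([8, 16, 32])"   # categorical choice
          dropout: "uniform(0, 0.5)"           # uniform over a range
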

If you want to change your working and/or logging directory, you can easily do that as well.
From the config file, you can specify:

.. literalinclude:: /../../examples/huggingface/config.yaml
   :language: yaml
   :lines: 29-31

This will change your working directory. With the hydra-orion-sweeper, we can interpolate four
variables from Orion: ``${hydra.sweeper.orion.name}`` for the experiment name,
``${hydra.sweeper.orion.id}`` for the experiment id, ``${hydra.sweeper.orion.uuid}`` for the
experiment ``uuid`` and ``${hydra.sweeper.orion.trial}`` for the trial id.

In the code, you can then pass the output directory to the trainer through the ``output_dir``
parameter; ``os.getcwd()`` returns the current working directory.

Including these options creates a separate folder for each trial, nested under one folder per
experiment and even one per sweep. You do not have to add them all, but they are quite useful
when you want to avoid two trials writing their cache to the same file, which could result in
an error.

.. code-block:: python

    output_dir=str(os.getcwd()) + "/test_trainer",
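
For instance, a sweep layout combining these variables could look like the sketch below. This
is only an illustration; the actual layout used in this tutorial (shown in the full
``config.yaml``) uses ``${hydra.job.id}`` for the last level instead of the trial id.

.. code-block:: yaml

    hydra:
      sweep:
        # one folder per sweep (timestamped), then one per experiment and one per trial
        dir: hydra_log/multirun/translation/${now:%Y-%m-%d}/${now:%H-%M-%S}
        subdir: ${hydra.sweeper.orion.name}/${hydra.sweeper.orion.uuid}/${hydra.sweeper.orion.trial}
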
You can find more about the Hydra-Orion-Sweeper plugin by looking directly at its GitHub
repository, `Hydra_Orion_Sweeper <https://github.com/Epistimio/hydra_orion_sweeper>`_, or find
out more about Hydra in general here: `Hydra <https://hydra.cc/docs/intro/>`_.


Adapting the code for Hydra
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``config_path`` and ``config_name`` arguments here specify the location and name of your
Hydra config file. ``cfg.args`` refers to the ``args`` section of the config file.
We are also going to use this main function as the entry point of the program.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 141-146, 310-311

With the ``hydra-orion-sweeper``, this function needs to return the objective. There are three
choices for what you can return:

.. code-block:: python

    if cfg.return_type == "float":
        return out
    if cfg.return_type == "dict":
        return dict(name="objective", type="objective", value=out)
    if cfg.return_type == "list":
        return [dict(name="objective", type="objective", value=out)]

For the purposes of this tutorial, we are going to keep it simple and return a float:
the objective we want to minimize.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 304

Comet
^^^^^
We are going to use Comet to track our experiments. It is quite simple to use. First,
install ``comet-ml``:

.. code-block:: bash

    $ pip install comet-ml

Now that it is installed, we simply have to set some environment variables, such as:

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 8-13

You can also set them in your working environment instead. If you set them in Python, however,
make sure to set them before importing ``transformers``.
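
For example, setting them in your shell instead could look like this (a minimal sketch; the
values below are placeholders to replace with your own):

.. code-block:: bash

    # placeholder values -- replace with your own Comet credentials
    export COMET_API_KEY="your_api_key"
    export COMET_WORKSPACE="your_workspace"
    export COMET_PROJECT_NAME="your_project"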

For ``COMET_API_KEY``, use the token you are given when creating your Comet account.

And that is it! If the variables are set and ``comet-ml`` is installed, HuggingFace will
automatically upload your data to Comet; you simply have to go to your profile on their site
to see your experiments.

It is important to note that the Comet logger can be swapped for many others, such as WandB,
MLflow, Neptune and ClearML. You can see the complete list in the HuggingFace documentation:
`HF callbacks <https://huggingface.co/docs/transformers/main_classes/callback#callbacks>`_.
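
By default, HuggingFace reports to every supported integration that is installed. If you want
to restrict this explicitly, one option is the ``report_to`` training argument; a minimal
sketch:

.. code-block:: python

    # only report to Comet, even if other logger integrations are installed
    training_args = Seq2SeqTrainingArguments(
        output_dir="test_trainer",
        report_to=["comet_ml"],
    )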

Example code
^^^^^^^^^^^^
For this example, we are fine-tuning a pretrained translation model,
``Helsinki-NLP/opus-mt-en-ro``. We start by setting the training arguments.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 165-174

For our purposes, we will be using a ``Seq2SeqTrainer``, so the training arguments are going
to be ``Seq2SeqTrainingArguments``. The ``set_training_args`` function copies the Hydra
arguments into the training arguments.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 114-118

For the data, we are going to use the ``wmt16`` dataset. We can set a ``cache_dir`` to control
where the dataset cache is stored.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 179,182-184

We then prepare our training and evaluation datasets. In this example, we want to evaluate our
model on both the validation dataset and the training dataset.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 196-229

For the metric, we are going to use ``sacrebleu``. We can also set a ``cache_dir`` here for the
metric cache files. The ``compute_metrics`` function goes as follows:

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 238-240, 247-268

Now we have to create the actual Trainer, a ``Seq2SeqTrainer`` as mentioned previously.
It is very much like a classic ``Trainer`` from HuggingFace.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 284-292

HuggingFace will log the evaluation on the ``eval_dataset`` to Comet. Since we also want the
evaluation on the training dataset, we have to implement a ``CustomCallback``. The one written
for this tutorial takes the ``trainer`` and the dataset we want to add (in our case, the
training dataset) as parameters.
We can then override some callback methods, such as ``on_epoch_end()``.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 270-282,294

All that is left to do now is to train the model and, once it has finished training, send the
objective to Orion by returning it.

.. literalinclude:: /../../examples/huggingface/main.py
   :language: python
   :lines: 295-297, 304

For more details, feel free to look at the full code in ``examples/huggingface/main.py``.

Execution
^^^^^^^^^
We simply have to run the ``main.py`` file with the ``-m`` (multirun) flag, which makes sure
the Hydra-Orion-Sweeper plugin is used.

.. code-block:: bash

    $ python3 main.py -m
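
You can also override individual configuration values from the command line, using Hydra's
standard override syntax. For example, a quick (hypothetical) run with fewer epochs:

.. code-block:: bash

    # override a value from config.yaml for this run only
    $ python3 main.py -m args.num_train_epochs=5
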
Visualizing results
^^^^^^^^^^^^^^^^^^^
With Orion, after your experiment has finished running, you can easily visualize your results
using `regret plots <https://orion.readthedocs.io/en/stable/auto_examples/plot_1_regret.html>`_
and `partial dependencies plots
<https://orion.readthedocs.io/en/stable/auto_examples/plot_4_partial_dependencies.html>`_.
These are very helpful for seeing what happened during the optimization and what could be
adjusted if necessary.
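
For example, loading the finished experiment from the pickled database defined in
``config.yaml`` and producing both plots could look roughly like this (a sketch based on the
Orion plotting examples linked above; adjust the path to wherever ``orion_db.pkl`` was
written):

.. code-block:: python

    from orion.client import get_experiment

    # point the client at the pickled database used by the sweeper
    storage = dict(type="legacy", database=dict(type="pickleddb", host="orion_db.pkl"))

    experiment = get_experiment(name="translationexp", storage=storage)
    experiment.plot.regret().show()
    experiment.plot.partial_dependencies().show()
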
49 changes: 49 additions & 0 deletions examples/huggingface/config.yaml
@@ -0,0 +1,49 @@
defaults:
  - override hydra/sweeper: orion

hydra:
  sweeper:
    params:
      lr: "loguniform(1e-8, 1.0)"
      wd: "loguniform(1e-10, 1)"
    orion:
      name: 'translationexp'
      version: '1'

    algorithm:
      type: random
      config:
        seed: 1

    worker:
      n_workers: 1
      max_broken: 3
      max_trials: 1

    storage:
      type: legacy
      database:
        type: pickleddb
        host: 'orion_db.pkl'

  sweep:
    dir: hydra_log/multirun/translation/${now:%Y-%m-%d}/${now:%H-%M-%S}
    subdir: ${hydra.sweeper.orion.name}/${hydra.sweeper.orion.uuid}/${hydra.job.id}

#Default value
lr: 0.01
wd: 0.00

args:
  size_train_dataset: 5000
  size_eval_dataset:
  freeze_base_model:
  per_device_train_batch_size:
  optim:
  weight_decay: ${wd}
  adam_beta1:
  adam_beta2:
  adam_epsilon:
  logfile:
  learning_rate: ${lr}
  num_train_epochs: 20
311 changes: 311 additions & 0 deletions examples/huggingface/main.py
@@ -0,0 +1,311 @@
# [markdown]
# # Fine-tune a pretrained model from Hugging Face
#
# source tutorial: https://huggingface.co/docs/transformers/training
import logging
import os

os.environ["COMET_API_KEY"] = "comet_token"
os.environ["COMET_WORKSPACE"] = "workspace"
os.environ["COMET_PROJECT_NAME"] = "project"
os.environ["COMET_MODE"] = "ONLINE"
os.environ["COMET_LOG_ASSETS"] = "True"
os.environ["COMET_AUTO_LOG_METRICS"] = "True"
import argparse
from copy import deepcopy

import hydra
import numpy as np
import torch
from datasets import load_dataset, load_metric
from omegaconf import DictConfig
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    TrainerCallback,
)


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-st",
        "--size-train-dataset",
        help="Number of samples to use from training data set. If not specified, use complete dataset",
        type=int,
        required=False,
    )
    parser.add_argument(
        "-se",
        "--size-eval-dataset",
        help="Number of samples to use from evaluation data set. If not specified, use complete dataset",
        type=int,
        required=False,
    )
    parser.add_argument(
        "-f",
        "--freeze-base-model",
        help="Freeze parameters of base model during training",
        action="store_true",
        required=False,
    )
    parser.add_argument(
        "-lr", "--learning-rate", help="Learning rate", type=float, required=False
    )
    parser.add_argument(
        "-e",
        "--num_train_epochs",
        help="Number of training epochs",
        type=int,
        required=False,
    )
    parser.add_argument(
        "-b",
        "--per_device_train_batch_size",
        help="Per device batch size",
        type=int,
        required=False,
    )
    parser.add_argument(
        "-opt",
        "--optim",
        help="Optimizer (one of: adamw_hf, adamw_torch, adamw_apex_fused, or adafactor)",
        type=str,
        required=False,
    )
    parser.add_argument(
        "-wd",
        "--weight_decay",
        help="Weight decay for AdamW optimizer",
        type=float,
        required=False,
    )
    parser.add_argument(
        "-b1",
        "--adam_beta1",
        help="beta1 hyperparameter for AdamW optimizer",
        type=float,
        required=False,
    )
    parser.add_argument(
        "-b2",
        "--adam_beta2",
        help="beta2 hyperparameter for AdamW optimizer",
        type=float,
        required=False,
    )
    parser.add_argument(
        "-eps",
        "--adam_epsilon",
        help="epsilon hyperparameter for AdamW optimizer",
        type=float,
        required=False,
    )
    parser.add_argument(
        "-log", "--logfile", help="Log file name and path", type=str, required=False
    )
    args = parser.parse_args()
    return vars(args)


def set_training_args(training_args, args):
    for argname, argvalue in args.items():
        if argvalue is not None:
            setattr(training_args, argname, argvalue)
    return training_args


class GPUMemoryCallback(TrainerCallback):
    def on_epoch_end(self, args, state, control, **kwargs):
        print(
            "GPU mem: Tot - ",
            torch.cuda.get_device_properties(0).total_memory,
            "res - ",
            torch.cuda.memory_reserved(0),
            "used - ",
            torch.cuda.memory_allocated(0),
        )


def get_free_gpu():
    for i in range(torch.cuda.device_count()):
        gpu_procs_str = torch.cuda.list_gpu_processes(i)
        if "no processes are running" in gpu_procs_str:
            return i
    return None


@hydra.main(config_path=".", config_name="config")
def main(cfg: DictConfig) -> float:
    print("args", cfg)

    # Get command line arguments and apply hyperparameters to training arguments
    args = cfg.args

    # Logger setup
    logfile = args["logfile"] or "translation_hf.log"

    logging.basicConfig(filename=logfile, level=logging.INFO)
    logger = logging.getLogger()

    # Get a GPU if available
    if torch.cuda.is_available():
        device = f"cuda:{get_free_gpu()}"
    else:
        device = "cpu"

    # We only use the device to print out what HF should be using by default
    logger.info("Compute device: %s", device)

    batch_size = 16

    # Get hyperparameters
    training_args = Seq2SeqTrainingArguments(
        output_dir=str(os.getcwd()) + "/test_trainer",
        save_total_limit=2,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        evaluation_strategy="epoch",
        predict_with_generate=True,
    )

    training_args = set_training_args(training_args, args)
    print("Training arguments:", training_args)

    # Load a dataset
    dataset_name = "wmt16"
    logger.info("Dataset: %s", dataset_name)

    raw_dataset = load_dataset(
        dataset_name, "ro-en", cache_dir="hydra_log/multirun/translation/dataset"
    )

    model_checkpoint = "Helsinki-NLP/opus-mt-en-ro"

    # Create tokenizer and tokenize the dataset
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

    max_input_length = 128
    max_target_length = 128
    source_lang = "en"
    target_lang = "ro"

    def preprocess_function(examples):
        inputs = [ex[source_lang] for ex in examples["translation"]]
        targets = [ex[target_lang] for ex in examples["translation"]]
        model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

        # Setup the tokenizer for targets
        with tokenizer.as_target_tokenizer():
            labels = tokenizer(targets, max_length=max_target_length, truncation=True)

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    tokenized_datasets = raw_dataset.map(preprocess_function, batched=True)

    # Use only a subset of the available data, if desired
    size_train_dataset = len(tokenized_datasets["train"])
    size_eval_dataset = len(tokenized_datasets["validation"])

    if args["size_train_dataset"] is not None:
        size_train_dataset = args["size_train_dataset"]
    if args["size_eval_dataset"] is not None:
        size_eval_dataset = args["size_eval_dataset"]
    train_dataset = (
        tokenized_datasets["train"].shuffle(seed=42).select(range(size_train_dataset))
    )
    eval_train_dataset = (
        tokenized_datasets["train"].shuffle(seed=42).select(range(size_eval_dataset))
    )
    eval_dataset = (
        tokenized_datasets["validation"]
        .shuffle(seed=42)
        .select(range(size_eval_dataset))
    )

    # Create model
    model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

    model_name = model_checkpoint.split("/")[-1]

    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

    # Train the model
    metric = load_metric(
        "sacrebleu", cache_dir="hydra_log/multirun/translation/dataset"
    )

    def postprocess_text(preds, labels):
        preds = [pred.strip() for pred in preds]
        labels = [[label.strip()] for label in labels]
        return preds, labels

    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        if isinstance(preds, tuple):
            preds = preds[0]
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

        # Replace -100 in the labels as we can't decode them.
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        # Some simple post-processing
        decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

        result = metric.compute(predictions=decoded_preds, references=decoded_labels)
        result = {"sacrebleu": result["score"]}

        prediction_lens = [
            np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds
        ]
        result["gen_len"] = np.mean(prediction_lens)
        result = {k: round(v, 4) for k, v in result.items()}
        return result

    class CustomCallback(TrainerCallback):
        def __init__(self, trainer, dataset) -> None:
            super().__init__()
            self._trainer = trainer
            self.dataset = dataset

        def on_epoch_end(self, args, state, control, **kwargs):
            if control.should_evaluate:
                control_copy = deepcopy(control)
                self._trainer.evaluate(
                    eval_dataset=self.dataset, metric_key_prefix="train"
                )
                return control_copy

    trainer = Seq2SeqTrainer(
        model,
        training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    trainer.add_callback(CustomCallback(trainer, eval_train_dataset))
    trainer.train()
    # Evaluate model
    eval_metrics = trainer.evaluate()

    # Print out memory stats
    print("Total GPU memory:", torch.cuda.get_device_properties(0).total_memory)
    print("GPU memory reserved:", torch.cuda.memory_reserved(0))
    print("GPU memory allocated:", torch.cuda.memory_allocated(0))

    return -eval_metrics["eval_sacrebleu"]


# =======================================================================
# Main
# =======================================================================
if __name__ == "__main__":
    main()
5 changes: 5 additions & 0 deletions examples/huggingface/requirements.txt
@@ -0,0 +1,5 @@
datasets
transformers
hydra
hydra-orion-sweeper
comet-ml