Commit 369bb38

[UPD] The documentation of the signwriting2text example has been improved.

GerrySant committed Jan 20, 2025
1 parent f6aa03f

Showing 4 changed files with 142 additions and 169 deletions.

examples/multimodal_translation/signwriting2text_translation/README.md (74 additions, 144 deletions)

# SignWriting2Text Translation Example

This directory provides an example of preparing and training a **SignWriting2Text** translation model using the **MultimodalHugs** framework. The workflow includes the following steps:

1. **Dataset Preparation**
2. **Setting Up the Training Environment**
3. **Launching the Training Process**

> **Note**: This example uses MultimodalHugs' specialized classes (e.g., `SignWritingProcessor`, `MultiModalEmbedderModel`) alongside Hugging Face’s `Seq2SeqTrainer` to manage dataset loading, tokenization, preprocessing, and training.

---

## 1. Dataset Preparation

**Goal**: Convert raw CSV files into `.tsv` metadata files for SignWriting2Text translation.

- **Script**: [`signbankplus_dataset_preprocessing_script.py`](./example_scripts/signbankplus_dataset_preprocessing_script.py)
- **Input**: A CSV file containing fields like `source`, `target`, `src_lang`, and `tgt_lang` (see the sample row after this list).
- **Output**: A `.tsv` file formatted for training.
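
For reference, a raw input row could look like the following (all values are invented for illustration; real Sign Bank+ files may include additional columns):

```plaintext
source,target,src_lang,tgt_lang
M518x529S14c20481x471S27106503x489,Hello!,ase,en
```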

Run the script for each data partition (train, dev, test):
```bash
python signbankplus_dataset_preprocessing_script.py /path/to/input.csv /path/to/output.tsv
```

### Metadata File Requirements

The `metadata.tsv` files must contain the following fields:

- `input`: The SignWriting source sequence.
- `source_prompt`: A text string (e.g., `__signwriting__ __en__`) to guide modality and language processing.
- `generation_prompt`: A prompt for the target language.
- `output_text`: The corresponding text translation.

These fields ensure compatibility with the **SignWriting2Text** processing pipeline.
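
For illustration, a single row of such a file might look like this (columns are tab-separated; the SignWriting string and language tags are invented for this example):

```plaintext
input	source_prompt	generation_prompt	output_text
M518x529S14c20481x471S27106503x489	__signwriting__ __ase__	__en__	Hello!
```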

---

## 2. Setting Up the Training Environment

**Goal**: Initialize tokenizers, preprocessors, and models, and save their paths for later use.

- **Script**: [`signbankplus_training_setup.py`](./example_scripts/signbankplus_training_setup.py)
- **Input**: A configuration file (e.g., `configs/example_config.yaml`) specifying:
  - Model parameters
  - Tokenizer paths
  - Dataset paths
  - Additional vocabulary (e.g., `other/new_languages_sign_bank_plus.txt`)
- **Output**:
  - `MODEL_PATH`: Directory with the initialized model (to be trained).
  - `PROCESSOR_PATH`: Directory with the saved processor.
  - `DATA_PATH`: Directory where the processed dataset is stored.
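
Under the hood, the setup stage builds and saves these components. A condensed sketch, reusing the MultimodalHugs calls that appear in this repository's examples (the actual script may differ in detail):

```python
# Condensed sketch of the setup stage; not a verbatim copy of the script.
from omegaconf import OmegaConf
from transformers import M2M100Tokenizer
from multimodalhugs.data import load_tokenizer_from_vocab_file
from multimodalhugs.models import MultiModalEmbedderModel

cfg = OmegaConf.load("configs/example_config.yaml")

# Tokenizers: a pretrained text tokenizer plus a vocab-file tokenizer
# for the source-language tags
tokenizer_m2m = M2M100Tokenizer.from_pretrained(cfg.data.text_tokenizer_path)
src_tokenizer = load_tokenizer_from_vocab_file(cfg.data.src_lang_tokenizer_path)

# Build the model from the config and save it for run_translation.py
model = MultiModalEmbedderModel.build_model(cfg.model, src_tokenizer, tokenizer_m2m)
model.save_pretrained(f"{cfg.training.output_dir}/trained_model")
```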

Run the setup script:

```bash
python signbankplus_training_setup.py --config_path /path/to/signwriting_config.yaml
```

The script outputs environment variables (`MODEL_PATH`, `PROCESSOR_PATH`, `DATA_PATH`) for downstream usage.
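
The relevant lines are plain `KEY=VALUE` pairs, which is what the training script parses; for example (paths are illustrative):

```plaintext
MODEL_PATH=/path/to/output/trained_model
PROCESSOR_PATH=/path/to/output/signwriting_processor
DATA_PATH=/path/to/output/datasets
```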

---

## 3. Launching the Training Process

**Goal**: Train the **SignWriting2Text** model using the prepared data and configurations.

- **Script**: [`signbankplus_training.sh`](./example_scripts/signbankplus_training.sh)
- **Process**:
  1. (Optional) Activate your conda or virtual environment.
  2. Define environment variables (`MODEL_NAME`, `CONFIG_PATH`, `OUTPUT_PATH`).
  3. Preprocess datasets using the `signbankplus_dataset_preprocessing_script.py` script.
  4. Run `signbankplus_training_setup.py` to initialize paths and settings.
  5. Start training with `run_translation.py`.

Run the script:

```bash
bash signbankplus_training.sh
```

### Hyperparameter Tuning

Modify the `signbankplus_training.sh` script or pass arguments via the command line to adjust hyperparameters such as batch size, learning rate, and evaluation steps.
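
For example, a quick smoke test could shrink the batch size and step budget by overriding a few flags (illustrative values, not tuned settings):

```bash
# Assumes MODEL_PATH, PROCESSOR_PATH, DATA_PATH, and OUTPUT_PATH are already set
python ${REPO_PATH}/examples/multimodal_translation/run_translation.py \
    --model_name_or_path $MODEL_PATH \
    --processor_name_or_path $PROCESSOR_PATH \
    --dataset_dir $DATA_PATH \
    --output_dir $OUTPUT_PATH \
    --do_train True \
    --remove_unused_columns False \
    --per_device_train_batch_size 4 \
    --learning_rate 1e-4 \
    --max_steps 1000
```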

---

## Directory Overview

```plaintext
signwriting2text_translation
├── README.md                        # Documentation (current file)
├── configs
│   └── example_config.yaml          # Example configuration file
├── example_scripts
│   ├── signbankplus_dataset_preprocessing_script.py
│   ├── signbankplus_training.sh
│   └── signbankplus_training_setup.py
└── other
    └── new_languages_sign_bank_plus.txt  # Additional tokens for the tokenizer
```

### Directory Details
- **configs**: Contains YAML configuration files for model, dataset, and training parameters.
- **example_scripts**: Includes scripts for dataset preprocessing, training setup, and execution.
- **other**: Additional resources such as vocabulary files for new tokens.

examples/multimodal_translation/signwriting2text_translation/example_scripts/signbankplus_dataset_preprocessing_script.py (excerpt)

```python
args = parser.parse_args()

# Placeholder functions for constructing new fields
def construct_input(row):
    return ""  # Replace with the actual implementation or keep empty

def construct_source_prompt(row):
    ...

# [lines collapsed in the diff view, including the definition of
#  map_column_to_new_field(original_column, new_column_name, data)]

data['generation_prompt'] = data.apply(construct_generation_prompt, axis=1)

# Example of mapping original columns to new ones
map_column_to_new_field('source', 'input', data)
map_column_to_new_field('target', 'output_text', data)

# Select the desired columns for the new dataset
output_columns = [
    'input',
    'source_prompt',
    'generation_prompt',
    'output_text'
]
# [remainder of file collapsed in the diff view]
```
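
The prompt builders in this script are intentionally left as placeholders. A minimal sketch of how they could be filled in, assuming the raw CSV carries `src_lang`/`tgt_lang` columns and following the `__signwriting__ __en__`-style tags described in the README (the exact tag layout is an assumption):

```python
# Hypothetical implementations; align the tag format with your tokenizer's
# special tokens (e.g., entries from other/new_languages_sign_bank_plus.txt).
def construct_source_prompt(row):
    # Modality tag followed by a language tag, e.g. "__signwriting__ __ase__"
    return f"__signwriting__ __{row['src_lang']}__"

def construct_generation_prompt(row):
    # Target-language tag, e.g. "__en__"
    return f"__{row['tgt_lang']}__"
```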

examples/multimodal_translation/signwriting2text_translation/example_scripts/signbankplus_training.sh

```bash
#!/usr/bin/env bash
# ----------------------------------------------------------
# Sample Script for Signwriting2Text Translation (Generalized)
# ----------------------------------------------------------
# This script demonstrates a more general approach to:
#   1. Setting up environment variables.
#   2. Preprocessing data.
#   3. Configuring and starting a multimodal translation run.
# ----------------------------------------------------------

# (Optional) Activate your conda environment
# source /path/to/your/anaconda/bin/activate <YOUR_ENV_NAME>

# ----------------------------------------------------------
# 1. Define environment variables
# ----------------------------------------------------------

export MODEL_NAME="src_prompt_signwriting"
export REPO_PATH="/path/to/your/repository"

# Path to the example config shipped with this repository
export CONFIG_PATH="${REPO_PATH}/examples/multimodal_translation/signwriting2text_translation/configs/example_config.yaml"
export OUTPUT_PATH="/path/to/your/experiments/output_directory"
export CUDA_VISIBLE_DEVICES=0
export EVAL_STEPS=250

# ----------------------------------------------------------
# 2. Preprocess Data
# ----------------------------------------------------------
python ${REPO_PATH}/examples/multimodal_translation/signwriting2text_translation/example_scripts/signbankplus_dataset_preprocessing_script.py \
    /path/to/your/data/parallel/cleaned/dev.csv \
    /path/to/your/data/parallel/cleaned/dev_processed.tsv

python ${REPO_PATH}/examples/multimodal_translation/signwriting2text_translation/example_scripts/signbankplus_dataset_preprocessing_script.py \
    /path/to/your/data/parallel/cleaned/train_toy.csv \
    /path/to/your/data/parallel/cleaned/train_processed.tsv

python ${REPO_PATH}/examples/multimodal_translation/signwriting2text_translation/example_scripts/signbankplus_dataset_preprocessing_script.py \
    /path/to/your/data/parallel/test/all.csv \
    /path/to/your/data/parallel/test/test_processed.tsv

# ----------------------------------------------------------
# 3. Prepare Training Environment
# ----------------------------------------------------------
# This Python script sets up environment variables for the model, processor, etc.
output=$(python ${REPO_PATH}/examples/multimodal_translation/signwriting2text_translation/example_scripts/signbankplus_training_setup.py \
    --config_path $CONFIG_PATH)

# Extract environment variables from the Python script's output
export MODEL_PATH=$(echo "$output" | grep 'MODEL_PATH' | cut -d '=' -f 2)
export PROCESSOR_PATH=$(echo "$output" | grep 'PROCESSOR_PATH' | cut -d '=' -f 2)
export DATA_PATH=$(echo "$output" | grep 'DATA_PATH' | cut -d '=' -f 2)

# Display the environment variables (for debugging)
echo "MODEL_NAME: $MODEL_NAME"
echo "REPO_PATH: $REPO_PATH"
echo "CONFIG_PATH: $CONFIG_PATH"
echo "OUTPUT_PATH: $OUTPUT_PATH"
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
echo "MODEL_PATH: $MODEL_PATH"
echo "PROCESSOR_PATH: $PROCESSOR_PATH"
echo "DATA_PATH: $DATA_PATH"
echo "EVAL_STEPS: $EVAL_STEPS"

# ----------------------------------------------------------
# 4. Train the Model
# ----------------------------------------------------------
python ${REPO_PATH}/examples/multimodal_translation/run_translation.py \
    --model_name_or_path $MODEL_PATH \
    --processor_name_or_path $PROCESSOR_PATH \
    --run_name $MODEL_NAME \
    --dataset_dir $DATA_PATH \
    --output_dir $OUTPUT_PATH \
    --do_train True \
    --do_eval True \
    --fp16 \
    --label_smoothing_factor 0.1 \
    --logging_steps 100 \
    --remove_unused_columns False \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 8 \
    --evaluation_strategy "steps" \
    --eval_steps $EVAL_STEPS \
    --save_strategy "steps" \
    --save_steps $EVAL_STEPS \
    --save_total_limit 3 \
    --load_best_model_at_end true \
    --metric_for_best_model 'bleu' \
    --overwrite_output_dir \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-05 \
    --warmup_steps 20000 \
    --max_steps 200000 \
    --predict_with_generate True \
    --lr_scheduler_type "inverse_sqrt" \
    --report_to none
```