Commit 369bb38

[UPD] The documentation of the signwriting2text example has been improved.

GerrySant committed Jan 20, 2025
1 parent f6aa03f

Showing 4 changed files with 142 additions and 169 deletions.

examples/multimodal_translation/signwriting2text_translation/README.md (74 additions, 144 deletions)

# SignWriting2Text Translation Example

This directory provides an example of preparing and training a **SignWriting2Text** translation model using the **MultimodalHugs** framework. The workflow includes the following steps:

1. **Dataset Preparation**
2. **Setting Up the Training Environment**
3. **Launching the Training Process**

> **Note**: This example uses MultimodalHugs' specialized classes (e.g., `SignWritingProcessor`, `MultiModalEmbedderModel`) alongside Hugging Face’s `Seq2SeqTrainer` to manage dataset loading, tokenization, preprocessing, and training.

---

## 1. Dataset Preparation

**Goal**: Convert raw CSV files into `.tsv` metadata files for SignWriting2Text translation.

- **Script**: [`signbankplus_dataset_preprocessing_script.py`](./example_scripts/signbankplus_dataset_preprocessing_script.py)
- **Input**: A CSV file containing fields like `source`, `target`, `src_lang`, and `tgt_lang` (see the sample row after this list).
- **Output**: A `.tsv` file formatted for training.
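
For reference, a raw input row could look like the following (all values are invented for illustration; real Sign Bank+ files may include additional columns):

```plaintext
source,target,src_lang,tgt_lang
M518x529S14c20481x471S27106503x489,Hello!,ase,en
```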

Run the script for each data partition (train, dev, test):
```bash
python signbankplus_dataset_preprocessing_script.py /path/to/input.csv /path/to/output.tsv
```

### Metadata File Requirements

The `metadata.tsv` files must contain the following fields:

- `input`: The SignWriting source sequence.
- `source_prompt`: A text string (e.g., `__signwriting__ __en__`) to guide modality and language processing.
- `generation_prompt`: A prompt for the target language.
- `output_text`: The corresponding text translation.

These fields ensure compatibility with the **SignWriting2Text** processing pipeline.
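
For illustration, a single row of such a file might look like this (columns are tab-separated; the SignWriting string and language tags are invented for this example):

```plaintext
input	source_prompt	generation_prompt	output_text
M518x529S14c20481x471S27106503x489	__signwriting__ __ase__	__en__	Hello!
```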

---

## 2. Setting Up the Training Environment

**Goal**: Initialize tokenizers, preprocessors, and models, and save their paths for later use.

- **Script**: [`signbankplus_training_setup.py`](./example_scripts/signbankplus_training_setup.py)
- **Input**: A configuration file (e.g., `configs/example_config.yaml`) specifying:
  - Model parameters
  - Tokenizer paths
  - Dataset paths
  - Additional vocabulary (e.g., `other/new_languages_sign_bank_plus.txt`)
- **Output**:
  - `MODEL_PATH`: Directory with the initialized model (to be trained).
  - `PROCESSOR_PATH`: Directory with the saved processor.
  - `DATA_PATH`: Directory where the processed dataset is stored.
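
Under the hood, the setup stage builds and saves these components. A condensed sketch, reusing the MultimodalHugs calls that appear in this repository's examples (the actual script may differ in detail):

```python
# Condensed sketch of the setup stage; not a verbatim copy of the script.
from omegaconf import OmegaConf
from transformers import M2M100Tokenizer
from multimodalhugs.data import load_tokenizer_from_vocab_file
from multimodalhugs.models import MultiModalEmbedderModel

cfg = OmegaConf.load("configs/example_config.yaml")

# Tokenizers: a pretrained text tokenizer plus a vocab-file tokenizer
# for the source-language tags
tokenizer_m2m = M2M100Tokenizer.from_pretrained(cfg.data.text_tokenizer_path)
src_tokenizer = load_tokenizer_from_vocab_file(cfg.data.src_lang_tokenizer_path)

# Build the model from the config and save it for run_translation.py
model = MultiModalEmbedderModel.build_model(cfg.model, src_tokenizer, tokenizer_m2m)
model.save_pretrained(f"{cfg.training.output_dir}/trained_model")
```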

Run the setup script:

```bash
python signbankplus_training_setup.py --config_path /path/to/signwriting_config.yaml
```

The script outputs environment variables (`MODEL_PATH`, `PROCESSOR_PATH`, `DATA_PATH`) for downstream usage.
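
The relevant lines are plain `KEY=VALUE` pairs, which is what the training script parses; for example (paths are illustrative):

```plaintext
MODEL_PATH=/path/to/output/trained_model
PROCESSOR_PATH=/path/to/output/signwriting_processor
DATA_PATH=/path/to/output/datasets
```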

---

## 3. Launching the Training Process

**Goal**: Train the **SignWriting2Text** model using the prepared data and configurations.

- **Script**: [`signbankplus_training.sh`](./example_scripts/signbankplus_training.sh)
- **Process**:
  1. (Optional) Activate your conda or virtual environment.
  2. Define environment variables (`MODEL_NAME`, `CONFIG_PATH`, `OUTPUT_PATH`).
  3. Preprocess datasets using the `signbankplus_dataset_preprocessing_script.py` script.
  4. Run `signbankplus_training_setup.py` to initialize paths and settings.
  5. Start training with `run_translation.py`.

Run the script:

```bash
bash signbankplus_training.sh
```

### Hyperparameter Tuning

Modify the `signbankplus_training.sh` script or pass arguments via the command line to adjust hyperparameters such as batch size, learning rate, and evaluation steps.
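
For example, a quick smoke test could shrink the batch size and step budget by overriding a few flags (illustrative values, not tuned settings):

```bash
# Assumes MODEL_PATH, PROCESSOR_PATH, DATA_PATH, and OUTPUT_PATH are already set
python ${REPO_PATH}/examples/multimodal_translation/run_translation.py \
    --model_name_or_path $MODEL_PATH \
    --processor_name_or_path $PROCESSOR_PATH \
    --dataset_dir $DATA_PATH \
    --output_dir $OUTPUT_PATH \
    --do_train True \
    --remove_unused_columns False \
    --per_device_train_batch_size 4 \
    --learning_rate 1e-4 \
    --max_steps 1000
```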

---

## Directory Overview

```plaintext
signwriting2text_translation
├── README.md                        # Documentation (current file)
├── configs
│   └── example_config.yaml          # Example configuration file
├── example_scripts
│   ├── signbankplus_dataset_preprocessing_script.py
│   ├── signbankplus_training.sh
│   └── signbankplus_training_setup.py
└── other
    └── new_languages_sign_bank_plus.txt  # Additional tokens for the tokenizer
```

### Directory Details
- **configs**: Contains YAML configuration files for model, dataset, and training parameters.
- **example_scripts**: Includes scripts for dataset preprocessing, training setup, and execution.
- **other**: Additional resources such as vocabulary files for new tokens.

examples/multimodal_translation/signwriting2text_translation/example_scripts/signbankplus_dataset_preprocessing_script.py (excerpt)

```python
args = parser.parse_args()

# Placeholder functions for constructing new fields
def construct_input(row):
    return ""  # Replace with the actual implementation or keep empty

def construct_source_prompt(row):
    ...

# [lines collapsed in the diff view, including the definition of
#  map_column_to_new_field(original_column, new_column_name, data)]

data['generation_prompt'] = data.apply(construct_generation_prompt, axis=1)

# Example of mapping original columns to new ones
map_column_to_new_field('source', 'input', data)
map_column_to_new_field('target', 'output_text', data)

# Select the desired columns for the new dataset
output_columns = [
    'input',
    'source_prompt',
    'generation_prompt',
    'output_text'
]
# [remainder of file collapsed in the diff view]
```
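
The prompt builders in this script are intentionally left as placeholders. A minimal sketch of how they could be filled in, assuming the raw CSV carries `src_lang`/`tgt_lang` columns and following the `__signwriting__ __en__`-style tags described in the README (the exact tag layout is an assumption):

```python
# Hypothetical implementations; align the tag format with your tokenizer's
# special tokens (e.g., entries from other/new_languages_sign_bank_plus.txt).
def construct_source_prompt(row):
    # Modality tag followed by a language tag, e.g. "__signwriting__ __ase__"
    return f"__signwriting__ __{row['src_lang']}__"

def construct_generation_prompt(row):
    # Target-language tag, e.g. "__en__"
    return f"__{row['tgt_lang']}__"
```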

examples/multimodal_translation/signwriting2text_translation/example_scripts/signbankplus_training.sh

```bash
#!/usr/bin/env bash
# ----------------------------------------------------------
# Sample Script for Signwriting2Text Translation (Generalized)
# ----------------------------------------------------------
# This script demonstrates a more general approach to:
#   1. Setting up environment variables.
#   2. Preprocessing data.
#   3. Configuring and starting a multimodal translation run.
# ----------------------------------------------------------

# (Optional) Activate your conda environment
# source /path/to/your/anaconda/bin/activate <YOUR_ENV_NAME>

# ----------------------------------------------------------
# 1. Define environment variables
# ----------------------------------------------------------

export MODEL_NAME="src_prompt_signwriting"
export REPO_PATH="/path/to/your/repository"

# Path to the example config shipped with this repository
export CONFIG_PATH="${REPO_PATH}/examples/multimodal_translation/signwriting2text_translation/configs/example_config.yaml"
export OUTPUT_PATH="/path/to/your/experiments/output_directory"
export CUDA_VISIBLE_DEVICES=0
export EVAL_STEPS=250

# ----------------------------------------------------------
# 2. Preprocess Data
# ----------------------------------------------------------
python ${REPO_PATH}/examples/multimodal_translation/signwriting2text_translation/example_scripts/signbankplus_dataset_preprocessing_script.py \
    /path/to/your/data/parallel/cleaned/dev.csv \
    /path/to/your/data/parallel/cleaned/dev_processed.tsv

python ${REPO_PATH}/examples/multimodal_translation/signwriting2text_translation/example_scripts/signbankplus_dataset_preprocessing_script.py \
    /path/to/your/data/parallel/cleaned/train_toy.csv \
    /path/to/your/data/parallel/cleaned/train_processed.tsv

python ${REPO_PATH}/examples/multimodal_translation/signwriting2text_translation/example_scripts/signbankplus_dataset_preprocessing_script.py \
    /path/to/your/data/parallel/test/all.csv \
    /path/to/your/data/parallel/test/test_processed.tsv

# ----------------------------------------------------------
# 3. Prepare Training Environment
# ----------------------------------------------------------
# This Python script sets up environment variables for the model, processor, etc.
output=$(python ${REPO_PATH}/examples/multimodal_translation/signwriting2text_translation/example_scripts/signbankplus_training_setup.py \
    --config_path $CONFIG_PATH)

# Extract environment variables from the Python script's output
export MODEL_PATH=$(echo "$output" | grep 'MODEL_PATH' | cut -d '=' -f 2)
export PROCESSOR_PATH=$(echo "$output" | grep 'PROCESSOR_PATH' | cut -d '=' -f 2)
export DATA_PATH=$(echo "$output" | grep 'DATA_PATH' | cut -d '=' -f 2)

# Display the environment variables (for debugging)
echo "MODEL_NAME: $MODEL_NAME"
echo "REPO_PATH: $REPO_PATH"
echo "CONFIG_PATH: $CONFIG_PATH"
echo "OUTPUT_PATH: $OUTPUT_PATH"
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
echo "MODEL_PATH: $MODEL_PATH"
echo "PROCESSOR_PATH: $PROCESSOR_PATH"
echo "DATA_PATH: $DATA_PATH"
echo "EVAL_STEPS: $EVAL_STEPS"

# ----------------------------------------------------------
# 4. Train the Model
# ----------------------------------------------------------
python ${REPO_PATH}/examples/multimodal_translation/run_translation.py \
    --model_name_or_path $MODEL_PATH \
    --processor_name_or_path $PROCESSOR_PATH \
    --run_name $MODEL_NAME \
    --dataset_dir $DATA_PATH \
    --output_dir $OUTPUT_PATH \
    --do_train True \
    --do_eval True \
    --fp16 \
    --label_smoothing_factor 0.1 \
    --logging_steps 100 \
    --remove_unused_columns False \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 8 \
    --evaluation_strategy "steps" \
    --eval_steps $EVAL_STEPS \
    --save_strategy "steps" \
    --save_steps $EVAL_STEPS \
    --save_total_limit 3 \
    --load_best_model_at_end true \
    --metric_for_best_model 'bleu' \
    --overwrite_output_dir \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-05 \
    --warmup_steps 20000 \
    --max_steps 200000 \
    --predict_with_generate True \
    --lr_scheduler_type "inverse_sqrt" \
    --report_to none
```