MultimodalHugs is a streamlined extension of Hugging Face designed for training, evaluating, and deploying multimodal AI models. Built atop Hugging Face’s powerful ecosystem, MultimodalHugs integrates seamlessly with standard pipelines while providing additional functionalities to handle multilingual and multimodal inputs—reducing boilerplate and simplifying your codebase.
- Unified Framework: Train and evaluate multimodal models (e.g., image-to-text, pose-to-text, signwriting-to-text) using a consistent API.
- Minimal Code Changes: Leverage Hugging Face’s pipelines with only minor modifications.
- Data in TSV: Avoid the complexity of numerous hyperparameters by maintaining data splits in `.tsv` files—easily specify prompts, languages, targets, or other attributes in dedicated columns (a minimal sketch of such a file follows this list).
- Modular Design: Use or extend any of the components (datasets, models, modules, processors) to suit your custom tasks.
- Examples Included: Refer to the `examples/` directory for guided scripts, configurations, and best practices.
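As an illustration of the TSV-driven setup, the sketch below writes a tiny training split with Python's standard `csv` module. The column names (`source_prompt`, `source_signal`, `target_text`) and values are illustrative placeholders, not the exact schema MultimodalHugs expects; see the per-task documentation under `examples/` for the real columns.

```python
import csv

# Hypothetical columns for illustration only; each task documents its own
# schema under examples/multimodal_translation/.
rows = [
    {"source_prompt": "Translate into English:", "source_signal": "clips/sample_0001.pose", "target_text": "Hello, how are you?"},
    {"source_prompt": "Translate into German:", "source_signal": "clips/sample_0002.pose", "target_text": "Guten Morgen."},
]

with open("train.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()), delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```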
- Clone the repository:

  ```bash
  git clone https://github.com/GerrySant/multimodalhugs.git
  ```

- Navigate into the repository and install the package:

  - Standard installation:

    ```bash
    cd multimodalhugs
    pip install .
    ```

  - Developer installation:

    ```bash
    cd multimodalhugs
    pip install -e .[dev]
    ```
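After installing, a quick import check confirms the package is visible from the active environment (this assumes the distribution exposes the top-level module `multimodalhugs`, matching the repository name):

```python
# Verify the installation by importing the package and printing its location.
import multimodalhugs

print(multimodalhugs.__file__)
```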
Explore the `examples/multimodal_translation/` directory for an end-to-end workflow demonstrating how to:
- Preprocess Data: Convert raw data into `.tsv` format with columns for prompts, languages, target labels, etc. (a loading sketch appears after the note below).
- Configure Training: Tune model hyperparameters via YAML or Python script.
- Train & Evaluate: Utilize the included training scripts and Hugging Face's Trainer for effortless experimentation.
- Extend & Adapt: Incorporate custom datasets, tokenizers, or specialized processing modules.
Note: Each example folder (e.g., `image2text_translation`, `pose2text_translation`) contains its own detailed documentation. Refer there for more specifics.
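To give a feel for how these splits plug into the Hugging Face ecosystem, here is a generic sketch that loads a `.tsv` split with the `datasets` library. The file paths and columns are placeholders, and the snippet deliberately stops short of MultimodalHugs' own processors and training entry points; the example folders show the actual configuration for each task.

```python
from datasets import load_dataset

# Placeholder paths; point these at your own .tsv splits.
data_files = {"train": "data/train.tsv", "validation": "data/dev.tsv"}

# .tsv files load through the generic "csv" builder by setting the delimiter.
dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

print(dataset["train"].column_names)  # e.g. prompt, language, and target columns
print(dataset["train"][0])            # inspect the first training example
```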
```
multimodalhugs
├── examples
│   └── multimodal_translation
│       ├── image2text_translation
│       ├── pose2text_translation
│       └── signwriting2text_translation
├── multimodalhugs
│   ├── custom_datasets
│   ├── data
│   ├── models
│   ├── modules
│   ├── multimodalhugs_cli
│   ├── processors
│   ├── tasks
│   ├── training_setup
│   └── utils
└── tests
```
- `examples/`: Contains ready-to-run demos for various multimodal tasks.
- `multimodalhugs/`: Core library code (datasets, models, modules, etc.).
- `tests/`: Automated tests to ensure code integrity.
All contributions—bug reports, feature requests, or pull requests—are welcome. Please see our GitHub repository to get involved.
This project is licensed under the terms of the MIT License.
If you use MultimodalHugs in your research or applications, please cite:
```bibtex
@misc{multimodalhugs2024,
  title={MultimodalHugs: Extending HuggingFace for Generalized Multimodal AI Model Training and Evaluation},
  author={Sant, Gerard and Moryossef, Amit},
  howpublished={\url{https://github.com/GerrySant/multimodalhugs}},
  year={2024}
}
```