ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO
Daechul Ahn*1, Yura Choi*1,2, San Kim1, Youngjae Yu2, Dongyeop Kang3, Jonghyun Choi1,† (*Equal Contribution)
1Seoul National University, 2Yonsei University, 3University of Minnesota
†Corresponding Author
Abstract: Iterative self-improvement, a concept extending beyond personal growth, has found powerful applications in machine learning, particularly in transforming weak models into strong ones. While recent advances in natural language processing have shown its efficacy through iterative preference optimization, applying this approach to Video Large Multimodal Models (VLMMs) remains challenging due to modality misalignment: during iterative preference modeling, the self-judge model often prioritizes linguistic knowledge over visual information. Additionally, iterative preference optimization can lead to visually hallucinated, verbose responses due to length bias within the self-rewarding cycle. To address these issues, we propose Iterative Self-Retrospective Direct Preference Optimization (ISR-DPO), a method that uses self-retrospection to enhance preference modeling. This approach strengthens the self-judge's focus on informative video regions, resulting in more visually grounded preferences. In extensive empirical evaluations across diverse video question answering benchmarks, ISR-DPO significantly outperforms the state of the art.
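For intuition only, below is a minimal, hypothetical Python sketch of the iterative self-retrospective preference loop described in the abstract. Every function name, the placeholder scoring, and the default of 9 rounds (chosen only to mirror the name of the released 9th-iteration checkpoint) are illustrative assumptions, not the released training code.

```python
# Illustrative sketch of an iterative self-retrospective preference loop.
# All functions below are hypothetical stand-ins, not the released training code.

def generate_candidates(model, video, question, n=2):
    """Sample n candidate responses from the current policy model (placeholder)."""
    return [f"response_{i}" for i in range(n)]

def self_retrospective_judge(model, video, question, candidates, prev_context):
    """Rank candidates; self-retrospection conditions the judge on the previous
    iteration's context so it stays focused on informative video regions
    rather than relying on language priors alone (placeholder scoring)."""
    scores = [len(c) for c in candidates]
    ranked = sorted(zip(scores, candidates), reverse=True)
    return ranked[0][1], ranked[-1][1]  # (chosen, rejected)

def dpo_update(model, preference_pairs):
    """One round of Direct Preference Optimization on the collected pairs (placeholder)."""
    return model

def isr_dpo(model, dataset, num_iterations=9):
    prev_context = None
    for _ in range(num_iterations):
        pairs = []
        for video, question in dataset:
            cands = generate_candidates(model, video, question)
            chosen, rejected = self_retrospective_judge(
                model, video, question, cands, prev_context)
            pairs.append((video, question, chosen, rejected))
        model = dpo_update(model, pairs)  # policy improves on its own preferences
        prev_context = pairs              # retrospection context for the next round
    return model
```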
- [12/10] Our paper has been accepted to AAAI 2025!
- [07/02] Uploaded model checkpoint & evaluation code
- [06/17] Created repository, updated README
- Using the script from LLaVA-Hound-DPO:
TEST_VIDEO_DIR=YOUR_PATH bash setup/setup_test_data.sh
- Or, download the test videos manually from this link (a quick sanity check follows below).
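If you download the videos manually, the following is a small, hypothetical sanity check, assuming only that TEST_VIDEO_DIR points at a directory containing the test video files:

```python
# Count downloaded test videos (assumes TEST_VIDEO_DIR holds common video formats).
import os

video_dir = os.environ.get("TEST_VIDEO_DIR", "test_videos")
exts = (".mp4", ".mkv", ".avi", ".webm")
videos = [f for _, _, files in os.walk(video_dir)
          for f in files if f.lower().endswith(exts)]
print(f"Found {len(videos)} video files under {video_dir}")
```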
# out-domain video question answering
bash Evaluation/pipeline/outdomain_test_pipeline.sh \
results \
SNUMPR/isrt_video_llava_7b_9th
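The evaluation pipeline references the checkpoint by its repo id. If you want to fetch it ahead of time, here is a small sketch using huggingface_hub's snapshot_download, assuming the SNUMPR/isrt_video_llava_7b_9th checkpoint is hosted as a standard Hugging Face Hub model repo (the local_dir path is just an example):

```python
# Pre-download the checkpoint used by the evaluation script.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="SNUMPR/isrt_video_llava_7b_9th",
    local_dir="checkpoints/isrt_video_llava_7b_9th",  # example destination
)
print(f"Checkpoint downloaded to {local_path}")
```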
- Coming soon
- Coming soon
GNU GENERAL PUBLIC LICENSE
- LLaVA-Hound-DPO: our code is built upon the LLaVA-Hound-DPO codebase