
🔥 Updates

  • 2024/06/29: The instruction data for VideoChat2_HD is updated in VideoChat2-IT, which is helpful for more detailed and accurate responses.
  • 2024/06/25: We release a branch of VideoChat2 that uses vLLM to speed up inference.
  • 2024/06/19: 🎉🎉 Our VideoChat2 achieves the best performance among open-source VideoLLMs on MLVU, a multi-task long video understanding benchmark.
  • 2024/06/13: Fixed some bugs and added testing scripts.
  • 2024/06/07: 🔥🔥🔥 We release VideoChat2_HD, which is fine-tuned with high-resolution data and is capable of handling more diverse tasks. It showcases better performance on different benchmarks, especially for detailed captioning. Furthermore, it achieves 54.8% on Video-MME, the best score among 7B MLLMs. Have a try! 🏃🏻‍♀️🏃🏻
  • 2024/06/06: We release VideoChat2_phi3, a faster model with robust performance.
  • 2024/05/22: We release VideoChat2_mistral, which shows better capacity on diverse tasks (60.4% on MVBench, 78.6% on NExT-QA, 63.8% on STAR, 46.4% on TVQA, 54.4% on EgoSchema-full and 80.5% on IntentQA). More details have been updated in the paper.
  • 2024/04/05: MVBench is selected as Poster (Highlight)! 🎉🎉
  • 2024/02/27: MVBench is accepted by CVPR2024! 🎉🎉
  • 2023/12/17: Online Leaderboard:
  • 2023/12/04: Brief introduction:
  • 2023/11/29: Release VideoChat2 and MVBench:

🦜 VideoChat2

Progressive Training

Stage1 aligns UMT-L, the visual encoder, with QFormer to efficiently compress extensive visual inputs. Stage2 extends this connection to incorporate the LLM, while Stage3 focuses on effective instruction tuning to enhance model performance.

We build a diverse instruction dataset with 2M samples from 34 distinct sources. Check DATA for more details.

Model

| Stage | ViT | QFormer | LLM | LoRA | Shell (Vicuna) | Model (Vicuna) | Shell (Mistral) | Model (Mistral) | Shell (Phi3) | Model (Phi3) |
|---|---|---|---|---|---|---|---|---|---|---|
| Stage1 | ❄️ | 🔥 | 🚫 | 🚫 | config & run | 🤗ckpt | SAME | SAME | SAME | SAME |
| Stage2 | 🔥 | 🔥 | ❄️ | 🚫 | config & run | 🤗ckpt | config & run | 🤗ckpt | config & run | 🤗ckpt |
| Stage3 | 🔥 | 🔥 | ❄️ | 🔥 | config & run | 🤗ckpt | config & run | 🤗ckpt | config & run | 🤗ckpt |
| Stage4_HD | 🔥 | 🔥 | ❄️ | 🔥 | - | - | config & run | 🤗ckpt | - | - |
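
The freeze/train pattern in the table can be written down as a small lookup. This is only an illustrative sketch: `StagePlan`, `STAGE_PLANS`, and `trainable_modules` are hypothetical names and are not part of this repository. The actual behavior is defined by the linked configs and run scripts.

```python
from dataclasses import dataclass

# Hypothetical helper types for illustration; not part of the VideoChat2 codebase.
# "train" = 🔥, "frozen" = ❄️, "unused" = 🚫, mirroring the table.


@dataclass(frozen=True)
class StagePlan:
    vit: str
    qformer: str
    llm: str
    lora: str


STAGE_PLANS = {
    "stage1": StagePlan(vit="frozen", qformer="train", llm="unused", lora="unused"),
    "stage2": StagePlan(vit="train", qformer="train", llm="frozen", lora="unused"),
    "stage3": StagePlan(vit="train", qformer="train", llm="frozen", lora="train"),
    "stage4_hd": StagePlan(vit="train", qformer="train", llm="frozen", lora="train"),
}


def trainable_modules(stage: str) -> list:
    """Return the names of the modules that are updated in the given stage."""
    plan = STAGE_PLANS[stage]
    return [name for name in ("vit", "qformer", "llm", "lora")
            if getattr(plan, name) == "train"]


if __name__ == "__main__":
    for stage in STAGE_PLANS:
        print(stage, trainable_modules(stage))
```

Reading the table this way makes the progression explicit: the LLM weights themselves stay frozen in every stage that uses them, and LoRA adapters are only trained from Stage3 onward.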

Inference

You can refer to demo.ipynb

Results

| Model | MVBench | Video-MME | Video-MME (w/ subtitles) | VideoChatGPT | NExT-QA (in-domain) | STAR (zero-shot) | TVQA (zero-shot) | EgoSchema (full) | EgoSchema (subset) | IntentQA (in-domain Val) | IntentQA (in-domain Test) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VideoChat2 (Vicuna) | 51.1 | - | - | 2.98 | 68.6 | 59.0 | 40.6 | - | - | - | - |
| VideoChat2 (Phi3) | 55.1 | - | - | 2.91 | 73.1 | 63.3 | 40.1 | 56.7 | 59.8 | 69.0 | 71.6 |
| VideoChat2 (Mistral) | 60.4 | 42.3 | 54.6 | 2.95 | 78.6 | 63.8 | 46.4 | 54.4 | 63.6 | 80.5 | 81.9 |
| VideoChat2_HD (Mistral) | 62.3 | 45.3 | 55.7 | 3.10 | 79.5 | 63.9 | 50.6 | 55.8 | 65.6 | 81.1 | 83.4 |

  • (2024/06/07) For Video-MME, our current version has some missing videos and subtitles, see issue
    • Missing videos: Short (2), Medium (3), Long (11)
    • Missing subtitles: Short (93), Medium (52), Long (10)
  • For VideoChatGPT, VideoChat2_mistral and VideoChat2_phi3 are evaluated with gpt-3.5-turbo-0125, while VideoChat2_vicuna was evaluated with gpt-3.5-turbo-1106.
  • For NExT-QA, we report in-domain results since the training set is used as instruction data.
  • For STAR, we input 32 frames; for all other datasets, we input 16 frames.
  • For IntentQA, we report results on both the validation and test splits.
  • For testing EgoSchema and Video-MME, please check the demo_mistral.ipynb and demo_mistral_hd.ipynb.
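
The frame counts noted above (32 for STAR, 16 for the other datasets) imply picking a fixed number of frames from each video. Below is a minimal sketch of one common way to do that, taking the middle frame of each equal-length segment. The function names and the sampling strategy are assumptions for illustration and may differ from what the notebooks actually do.

```python
# Illustrative sketch only: names and sampling strategy are assumptions,
# not taken from the VideoChat2 notebooks.


def frames_for_dataset(name: str) -> int:
    """Frame count per the notes above: 32 for STAR, 16 otherwise."""
    return 32 if name.upper() == "STAR" else 16


def sample_frame_indices(total_frames: int, num_segments: int = 16) -> list:
    """Split the video into equal segments and take the middle frame of each."""
    if total_frames <= 0 or num_segments <= 0:
        raise ValueError("total_frames and num_segments must be positive")
    seg_len = total_frames / num_segments
    return [
        min(total_frames - 1, int(seg_len * i + seg_len / 2))
        for i in range(num_segments)
    ]


if __name__ == "__main__":
    n = frames_for_dataset("STAR")
    print(n, sample_frame_indices(320, n)[:4])
```

For videos with fewer frames than segments, this sketch repeats indices rather than failing, which keeps the input length fixed.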

Training

  • Prepare the environment:

    conda create -n videochat2 python=3.9
    conda activate videochat2
    pip install -r requirements.txt
  • Stage1 training:

    bash scripts/videochat_vicuna/run_7b_stage1.sh
  • Stage2 training:

    # Vicuna
    bash scripts/videochat_vicuna/run_7b_stage2.sh
    # Mistral
    bash scripts/videochat_mistral/run_7b_stage2.sh
  • Stage3 training:

    # Vicuna
    bash scripts/videochat_vicuna/run_7b_stage3.sh
    # Mistral
    bash scripts/videochat_mistral/run_7b_stage3.sh
  • Running the demo:

    # Set the related model path in configs/config.json and demo/demo.py
    python demo/demo.py
  • Evaluation:

    • MVBench: mvbench.ipynb. This notebook targets the Vicuna model; for Mistral, adapt it following demo_mistral.ipynb.
    • For the VideoChatGPT benchmark, we follow the original repo and use ChatGPT-3.5 to evaluate performance.
    • For NExT-QA, STAR and TVQA, we follow SeViLA to prepare the data. We then make a simple modification to mvbench.ipynb so that it directly outputs the options, and calculate the accuracy from those.
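
Computing accuracy from directly output options, as described for NExT-QA, STAR and TVQA, can be sketched as follows. This is a hedged illustration: `extract_option` and `accuracy` are hypothetical helpers, and the assumed answer format (a leading uppercase letter such as `(A)` or `A.`) may not match the exact strings produced by mvbench.ipynb.

```python
import re

# Hypothetical helpers for illustration; the assumed output format is
# a leading option letter such as "(A) ..." or "A.".
_OPTION_RE = re.compile(r"\s*\(?([A-Z])\)?(?:[\s.:,]|$)")


def extract_option(answer: str):
    """Return the leading option letter, or None if no option is found."""
    match = _OPTION_RE.match(answer)
    return match.group(1) if match else None


def accuracy(predictions, answers) -> float:
    """Percentage of predictions whose option letter equals the ground truth."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        extract_option(pred) == gt.upper()
        for pred, gt in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


if __name__ == "__main__":
    preds = ["(A) The man opens the door.", "B.", "no idea"]
    print(accuracy(preds, ["A", "C", "B"]))
```

An unparseable prediction counts as wrong here. Note that the pattern also accepts free text that begins with a single capital letter (for example "A dog ..."), so a stricter parser may be needed if the model does not reliably answer with an option first.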

📊 MVBench

We propose a comprehensive video understanding benchmark with 20 challenging video tasks, where our VideoChat2 secures the top ranking on 15 tasks. More details can be found here.

The online leaderboard is hosted on 🤗 Hugging Face.

📄 Citation

If you find this project useful in your research, please consider citing:

@article{2023videochat,
  title={VideoChat: Chat-Centric Video Understanding},
  author={KunChang Li and Yinan He and Yi Wang and Yizhuo Li and Wenhai Wang and Ping Luo and Yali Wang and Limin Wang and Yu Qiao},
  journal={arXiv preprint arXiv:2305.06355},
  year={2023}
}

@misc{li2023mvbench,
      title={MVBench: A Comprehensive Multi-modal Video Understanding Benchmark}, 
      author={Kunchang Li and Yali Wang and Yinan He and Yizhuo Li and Yi Wang and Yi Liu and Zun Wang and Jilan Xu and Guo Chen and Ping Luo and Limin Wang and Yu Qiao},
      year={2023},
      eprint={2311.17005},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

💫 Acknowledgement

Thanks to the following open-source projects:

InternVid, UMT, MiniGPT-4, LLaVA, BLIP2, VideoChatGPT, Vicuna, M3-IT.