Updated results of VideoChat2 and some problems #7

Closed
Andy1621 opened this issue Jun 7, 2024 · 1 comment


Andy1621 commented Jun 7, 2024

Hi! Thanks for your efforts in building more comprehensive video benchmarks.

VideoChat2-Mistral

We recently updated VideoChat2 with Mistral-7B and better instruction data. Since no official Video-MME testing script is provided, I tested VideoChat2 following the evaluation logic of MVBench.

Results

Here are the results. For VideoChat2_text, we input fake videos filled with zeros:

| Model | Short w/o subs | Short w/ subs | Medium w/o subs | Medium w/ subs | Long w/o subs | Long w/ subs | Overall w/o subs | Overall w/ subs |
|---|---|---|---|---|---|---|---|---|
| VideoChat2 (Mistral) | 49.6 | 56.0 | 38.4 | 55.1 | 39.0 | 52.6 | 42.3 | 54.6 |
| VideoChat2-HD (Mistral) | 54.7 | 59.5 | 41.3 | 53.6 | 39.8 | 53.9 | 45.3 | 55.7 |
| VideoChat2_text (Mistral) | 23.5 | 39.3 | 25.2 | 47.7 | 24.2 | 45.6 | 24.6 | 45.6 |

Here are the JSON files:

⚠️ Our current version has some missing videos and subtitles:

  • Missing videos: Short (2), Medium (3), Long (11)
  • Missing subtitles: Short (93), Medium (52), Long (10)

Problem

About missing videos and subtitles

Since some of the videos are no longer available, could you provide the raw videos?

#######Missing Videos#######
----------short: 2----------
['6YhlYu70uNA', 'j_9Fa9Uk8cA']
----------medium: 3----------
['Vo6HXsHgizc', 'azZZZbSwLQght', '3fGlUydVjRE']
----------long: 11----------
['Ry2dJuJ-9UE', 'cL27iZTOSNU', 'LvrcYs00wLk', 'mIk8MyVMLt0', 'k9n7kGQmFzI', '_eJsOYC8SVU', 'TYmHe20p_zU', 'r1PoSdFWvQg', 'N62mrplIygc', 'JLg4Rd6ukcg', 'azJ5pk5reX0']
#######Missing Subtitles#######
----------short: 93----------
['HwnB8aCn8yE', 'sUDY-SMREtA', 'OE5S-NbNsro', '5wLv3pCqZ9o', 'O0qVPW1fUn4', 'tX8a00l_Dfs', 'iZYLeIJwe4w', 'ebzbKa32kuk', '6YhlYu70uNA', 'fo0Hmch2YS0', 'M7WOXFvwbSY', 'RvnC--JjDBw', 'DF_J3vCcbBA', 'm4qhFFdHTCc', 'aBdQQxgxDrY', 'tF4DML7FIWk', 'M69Sn3OERZo', 'PU-XOFIJMlg', 'rQhLWHtHyiM', '2Qno7H4BwAU', 'a0AGwUACt7E', 'qu9ImFMLYxw', 'Qgr4dcsY-60', 'tGdL-34L-GE', 's50vvwTystA', 'y6ReUXtm_VE', 'ykJ7pyr87Qc', '4V6G0qYVoBg', 'uF3zNOthLAg', '6Z_XNM_iT4g', 'fJ-hp5Jlbv0', 'of62s85uMMs', '80p80ynsZ78', 'Kv1JXuOkAfk', 'lsfEHOtYGyk', 'rj6rJzs029A', 'QUmkzhUQoEA', 'KZu5iE7yrPI', 'JP67IM1LX-M', '_KU4X06VNiE', 'BkiTScinYOQ', '-qTAeVGl_e8', 'CHlJdMVLV2s', 'E895PNqSgEI', 'FsLaTZmP6Uw', 'tbKTfX5Az6w', 'TblFD8H4j94', '8np5YKYx3sU', 'NjxJY7P-Qpo', 'HiJb_2dvuHc', 'QhL6ICNQ_So', 'iOOseoPiw8E', 'PSC_HUeqaUk', 'jznsxDcKSnE', 'QGHVpr8FIN8', 'lNKtsi2Cu0E', 'V_skpmEXebM', 'QtKT3q7xB4M', '026dzf-vc5g', 'KEy2iCzpce4', 'dH8l--46j6s', 'VmNBt1tzC6k', '5Knkqo-lYF0', 'BrrDT_uYQsg', 'jmE9y0vv2aM', 'fSDq8CPXHQM', 'D61jenC5oBA', 'OCTekmo_szs', 'g1MCVp5xICM', 'BiJtPU9uP0c', 'OQAf_rTw7n8', 'ufZMrlZRn8o', 'IKtnfFHjERg', 'zPx3EibuO_w', 'J2rlJV7zKZw', 'v0m3E_uSyFg', 'NAahAX62MZ0', 'YZ11C-U7S8I', 'jb-To6qJcxU', 'frX8GujpmkI', 'j_9Fa9Uk8cA', 'ZfNSxRiYfZQ', 'PaC3CEkCD6k', '8e05W-wi38g', 'Atf_Af1q_5w', 'Kn10Jf1x24Q', 'vszWsYOdnPg', 'ZHWZf1Z4B5k', '6Cr_8tvvQ0k', 'H8fbAFOMTp4', 'tGdL-34L-GE', 'MiPw-RZMHCQ', 'UcL6WlcyBcI']
----------medium: 52----------
['1pHkv4KUiFY', 'z2APU5ob9Og', '7Hk9jct2ozY', 'AmrrSfiMxGA', 'Vo6HXsHgizc', 'QI9VIulqTCA', '2za5RwplXdI', 'ZN6kyy2SXnk', 'uMIohuKRq58', 'OApAF--FqLA', 'Gokw6n0qf-w', '0FM64MrRuZE', 'Rh8sz6ZnXM4', 'erjwCQ-UZyw', 'H54zMD-9Q-8', 'QnGwAUTM1fY', 'ACvLYz9nHvg', '2NOcM1HoRRg', '6ksaVHtfBG8', 'UmIYanq5gH8', 'SBssmtYTJpM', 'u_f0697yLBI', 'zW87tVnDKIU', 'vf3sPa2W0sA', 'e-XGSYnhUjg', 'ZFtxvf72NxA', 'vgWlRdNOGx4', 'G-mmtUxSt5k', 'hQYRDNl-lGI', 'awHc8IySU-I', 'CwzjlmBLfrQ', 'BEIVOKz4zXw', 'PB06auioy0Y', '_w4XRiUVfY4', 'azZZZbSwLQght', 'qNWL4S5bDt4', 'a8IGDMrohnY', '1sTNqJVrqx8', 'zbvamKv81o0', 'VfSIqKSiguc', 'dy8AvyRlXpQ', 'hZ8DmOpRTVc', 'EFD9BLgMVK8', 'HZJXPm0nhgY', '3EQzRz-V7uE', '_CqKv0Y1FB0', 'BVJGvKaFG7U', 'cFqLEwAvaHI', 'VoJ-Ey6q8uM', 'l73rmrLTHQc', 'p9uIBCDhyr0', '3fGlUydVjRE']
----------long: 10----------
['Sp2nxlrQ89w', 'cL27iZTOSNU', 'LvrcYs00wLk', 'mIk8MyVMLt0', 'k9n7kGQmFzI', 'yh-EHgkFci4', '7TydWUguPRU', 'JLg4Rd6ukcg', 't23Zi0DBSiI', '4IenX7OHumk']
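
For reference, a report like the one above can be produced with a short script. This is only a minimal sketch: the function names, the `<video_id>.mp4` / `<video_id>.srt` file naming, and the per-split directory layout are assumptions for illustration, not the actual layout of my local copy or of the official release.

```python
from pathlib import Path


def find_missing(expected_ids, directory, suffix):
    """Return the IDs in expected_ids that have no `<id><suffix>` file in directory."""
    # Collect the IDs that are actually present on disk (file name minus suffix).
    present = {p.name[: -len(suffix)] for p in Path(directory).glob(f"*{suffix}")}
    # Keep the order of expected_ids so the report is reproducible.
    return [vid for vid in expected_ids if vid not in present]


def print_report(title, missing_by_split):
    """Print the missing IDs per split in the same layout as the report above."""
    print(f"#######{title}#######")
    for split, ids in missing_by_split.items():
        print(f"----------{split}: {len(ids)}----------")
        print(ids)
```

For example, `print_report("Missing Videos", {"short": find_missing(short_ids, "videos/short", ".mp4")})` would print the first block, where `short_ids` and the directory name are hypothetical.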

About the processing of long subtitles

During testing, I found that some subtitles are extremely long in terms of tokens. How do you process them? For now, I simply keep the beginning and the end of the subtitles.
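
A minimal sketch of this head-and-tail truncation is shown below. The function name, the `head_ratio` parameter, and the use of a plain token list are assumptions for illustration; in practice the tokens would come from the LLM tokenizer, and the exact budget split may differ.

```python
def truncate_head_tail(tokens, max_tokens, head_ratio=0.5):
    """Keep the first and last parts of `tokens` so that at most `max_tokens` remain.

    `head_ratio` is the fraction of the budget given to the beginning;
    the rest of the budget goes to the end. The middle is dropped.
    """
    if max_tokens <= 0:
        return []
    if len(tokens) <= max_tokens:
        # Short enough already: return everything unchanged.
        return list(tokens)
    head = int(max_tokens * head_ratio)
    tail = max_tokens - head
    kept_tail = list(tokens[-tail:]) if tail > 0 else []
    return list(tokens[:head]) + kept_tail


# Whitespace-split words stand in for tokenizer tokens here.
words = "one two three four five six seven eight nine ten".split()
print(truncate_head_tail(words, 4))  # ['one', 'two', 'nine', 'ten']
```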

About the performances

It is interesting that most methods achieve decent performance with only 10 frames, even though the videos are much longer than those in previous benchmarks.
I ran some experiments with fake videos, shown as VideoChat2_text in the results. Without any visual information or subtitles, the model answers essentially at random. With subtitles alone, however, it reaches 45.6 overall, so the subtitles provide enough information to answer many of the questions, which suggests that the current benchmark may still have some language bias.
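
As a rough illustration of the subtitle effect, the sketch below computes the gain from adding subtitles using the Overall columns of the results reported above. The dictionary and function names are only illustrative.

```python
# Overall accuracy (w/o subs, w/ subs), copied from the results reported above.
overall = {
    "VideoChat2 (Mistral)": (42.3, 54.6),
    "VideoChat2-HD (Mistral)": (45.3, 55.7),
    "VideoChat2_text (Mistral)": (24.6, 45.6),
}


def subtitle_gain(scores):
    """Return the accuracy gain from adding subtitles for each model."""
    return {
        model: round(with_subs - without_subs, 1)
        for model, (without_subs, with_subs) in scores.items()
    }


for model, gain in subtitle_gain(overall).items():
    print(f"{model}: +{gain}")
```

The text-only model gains the most from subtitles (+21.0), compared with +12.3 and +10.4 for the models that actually see the frames.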
Moreover, the results suggest that larger LLMs lead to better performance:

  • 7B => 40+
  • 20B => 50+
  • Closed-source => 60+

BradyFU (Owner) commented Jun 7, 2024

Hi, thanks for your interest in our work!

We have run the corresponding experiments and found that the subtitles always contain some useful information, which is fundamentally hard to eliminate, especially for longer videos and some video classes such as News.

We have tried to make the questions as relevant as possible to the visual content, but, as mentioned above, the model will inevitably get some information from the subtitles.

We find that video frames + subtitles are always better than frames only or subtitles only, which indicates the multi-modal nature of our dataset: both video frames and subtitles provide some information, but they need to be combined to obtain the best results.

In addition, if you sample ten frames from the video, you should feed only the subtitles corresponding to those ten frames, instead of all subtitles. We will give a detailed explanation of how to use the subtitles soon.
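
Until the detailed explanation is available, one possible reading of this is sketched below: sample frame timestamps uniformly and keep only the subtitle entries whose time span covers a sampled timestamp. The function names, the `(start_sec, end_sec, text)` subtitle representation, and the midpoint sampling rule are all assumptions for illustration, not the official procedure.

```python
def sample_timestamps(duration, num_frames):
    """Return the midpoints (in seconds) of `num_frames` equal segments of the video."""
    segment = duration / num_frames
    return [segment * (i + 0.5) for i in range(num_frames)]


def subtitles_for_frames(subtitles, timestamps):
    """Keep the subtitle texts whose [start, end] span covers a sampled timestamp.

    `subtitles` is a list of (start_sec, end_sec, text) tuples in time order.
    Each subtitle entry is kept at most once, in order of first match.
    """
    picked = []
    seen = set()
    for t in timestamps:
        for idx, (start, end, text) in enumerate(subtitles):
            if start <= t <= end:
                if idx not in seen:
                    seen.add(idx)
                    picked.append(text)
                break  # use only the first subtitle covering this timestamp
    return picked


subs = [(0.0, 2.0, "Hello."), (2.5, 5.0, "Welcome back."), (8.0, 10.0, "Goodbye.")]
frames = sample_timestamps(duration=10.0, num_frames=5)
print(subtitles_for_frames(subs, frames))
```

Frames that fall in a gap between subtitle entries simply contribute no text under this reading.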

For the question about the dataset, you could send an email to [email protected].

BradyFU closed this as completed Jun 7, 2024