Hi! Thanks for your efforts in building more comprehensive video benchmarks.
VideoChat2-Mistral
Recently, we updated VideoChat2 with Mistral-7B and better instruction data. Since an official testing script for Video-MME is not provided yet, I tested our VideoChat2 following the evaluation logic of MVBench.
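For readers unfamiliar with that setup, here is a minimal sketch of MVBench-style multiple-choice evaluation: build a prompt from the question and lettered options, generate a short answer, and match it back to an option letter. The prompt wording, the option format, and the helper names below are assumptions for illustration, not the exact VideoChat2 or Video-MME code.

```python
# Hypothetical sketch of MVBench-style multiple-choice scoring.

OPTION_LETTERS = ["A", "B", "C", "D"]

def build_prompt(question: str, options: list[str]) -> str:
    # Present the question followed by lettered options, then ask for a letter only.
    lines = [f"Question: {question}"]
    for letter, option in zip(OPTION_LETTERS, options):
        lines.append(f"({letter}) {option}")
    lines.append("Answer with the option's letter from the given choices directly.")
    return "\n".join(lines)

def parse_answer(response: str) -> str:
    # Take the first option letter that appears in the model's response.
    for ch in response.strip().upper():
        if ch in OPTION_LETTERS:
            return ch
    return ""  # unparsable responses count as wrong

def is_correct(pred_letter: str, gt_letter: str) -> bool:
    return pred_letter == gt_letter.strip().upper()
```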
Results
Here are the results. For VideoChat2_text, we input fake videos that are filled with zeros:
Problem
About missing videos and subtitles
Considering that some of the videos are no longer available, could you provide the raw videos?
About the processing of long subtitles
During testing, I found that some subtitles contain extremely long token sequences. How do you process them? For now, I simply sample the beginning and the end of the subtitles.
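For reference, a minimal sketch of that head-and-tail sampling, assuming a generic tokenizer with encode/decode and an illustrative token budget; this is not the official preprocessing.

```python
def truncate_subtitles(subtitle_text: str, tokenizer, max_tokens: int = 2048) -> str:
    """Keep only the beginning and the end of an overly long subtitle track.

    `tokenizer` is any tokenizer exposing encode/decode (e.g. the LLM's own);
    the 50/50 head-tail split and the max_tokens budget are illustrative choices.
    """
    token_ids = tokenizer.encode(subtitle_text)
    if len(token_ids) <= max_tokens:
        return subtitle_text
    head = max_tokens // 2
    tail = max_tokens - head
    kept = token_ids[:head] + token_ids[-tail:]
    return tokenizer.decode(kept)
```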
About the performances
It is interesting that most methods achieve decent performance with only 10 frames, even though the videos are much longer than those in previous benchmarks.
I conducted some experiments with fake videos, shown as VideoChat2_text above. The results show that without any visual information or subtitles, the model answers essentially at random, while the subtitles alone provide enough information to answer many of the questions, which suggests the current benchmark may still carry some language bias.
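To make the fake-video probe concrete, here is a minimal sketch of an all-zero clip, assuming a (frames, channels, height, width) layout at 224x224 resolution; the exact shape your vision encoder expects may differ.

```python
import torch

def make_fake_video(num_frames: int = 10, size: int = 224) -> torch.Tensor:
    # All-zero frames: the model receives no visual information, so any
    # above-chance accuracy must come from the text (question and/or subtitles).
    return torch.zeros(num_frames, 3, size, size)
```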
Moreover, the results show that larger LLMs lead to better performance:
7B => 40+
20B => 50+
Closed-source => 60+
We have conducted corresponding experiments and find that the subtitles always contain some useful information, which is fundamentally hard to eliminate, especially for longer videos and certain video categories such as News.
We have tried to make the questions as relevant as possible to the visual content, but as mentioned above, the model will inevitably get some information from the subtitles.
We find that video frames + subtitles are always better than frames only or subtitles only, which reflects the multi-modal nature of our dataset: both video frames and subtitles provide useful information, but they need to be integrated to obtain the best results.
In addition, if you sample ten frames of the video, you should only feed the subtitles corresponding to those ten frames, instead of all subtitles. We will give a detailed explanation of the use of subtitles soon.
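A minimal sketch of that frame-aligned subtitle selection, assuming subtitles are available as (start, end, text) spans in seconds and frames are sampled uniformly; the official script may differ.

```python
def sample_frame_timestamps(duration: float, num_frames: int = 10) -> list[float]:
    # Uniformly spaced timestamps (in seconds), one per sampled frame.
    step = duration / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]

def select_subtitles(subtitles: list[tuple[float, float, str]],
                     timestamps: list[float]) -> str:
    # Keep a subtitle line only if some sampled timestamp falls inside its span.
    kept = []
    for start, end, text in subtitles:
        if any(start <= t <= end for t in timestamps):
            kept.append(text)
    return " ".join(kept)
```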
For the question about the dataset, you could send an email to [email protected].