Hi! Thanks for your efforts in building more comprehensive video benchmarks.
VideoChat2-Mistral
Recently, we updated VideoChat2 with Mistral-7B and better instruction data. Since an official testing script for Video-MME is not provided yet, I tested our VideoChat2 following the evaluation logic of MVBench.
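For readers unfamiliar with that setup, here is a minimal sketch of MVBench-style multiple-choice evaluation: build a prompt from the question and lettered options, generate a short answer, and match it back to an option letter. The prompt wording, the option format, and the helper names below are assumptions for illustration, not the exact VideoChat2 or Video-MME code.

```python
# Hypothetical sketch of MVBench-style multiple-choice scoring.

OPTION_LETTERS = ["A", "B", "C", "D"]

def build_prompt(question: str, options: list[str]) -> str:
    # Present the question followed by lettered options, then ask for a letter only.
    lines = [f"Question: {question}"]
    for letter, option in zip(OPTION_LETTERS, options):
        lines.append(f"({letter}) {option}")
    lines.append("Answer with the option's letter from the given choices directly.")
    return "\n".join(lines)

def parse_answer(response: str) -> str:
    # Take the first option letter that appears in the model's response.
    for ch in response.strip().upper():
        if ch in OPTION_LETTERS:
            return ch
    return ""  # unparsable responses count as wrong

def is_correct(pred_letter: str, gt_letter: str) -> bool:
    return pred_letter == gt_letter.strip().upper()
```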
Results
Here are the results. For VideoChat2_text, we input fake videos that are filled with zeros:
Problem
About missing videos and subtitles
Considering that some of the videos are no longer available, could you provide the raw videos?
About the processing of long subtitles
During testing, I found that some subtitles contain extremely long token sequences. How do you process them? For now, I simply sample the beginning and the end of the subtitles.
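For reference, a minimal sketch of that head-and-tail sampling, assuming a generic tokenizer with encode/decode and an illustrative token budget; this is not the official preprocessing.

```python
def truncate_subtitles(subtitle_text: str, tokenizer, max_tokens: int = 2048) -> str:
    """Keep only the beginning and the end of an overly long subtitle track.

    `tokenizer` is any tokenizer exposing encode/decode (e.g. the LLM's own);
    the 50/50 head-tail split and the max_tokens budget are illustrative choices.
    """
    token_ids = tokenizer.encode(subtitle_text)
    if len(token_ids) <= max_tokens:
        return subtitle_text
    head = max_tokens // 2
    tail = max_tokens - head
    kept = token_ids[:head] + token_ids[-tail:]
    return tokenizer.decode(kept)
```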
About the performances
It is interesting that most methods achieve decent performance with only 10 frames, even though the videos are much longer than those in previous benchmarks.
I conducted some experiments with fake videos, shown as VideoChat2_text above. The results show that without any visual information or subtitles, the model answers essentially at random, while the subtitles alone provide enough information to answer many of the questions, which suggests the current benchmark may still carry some language bias.
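To make the fake-video probe concrete, here is a minimal sketch of an all-zero clip, assuming a (frames, channels, height, width) layout at 224x224 resolution; the exact shape your vision encoder expects may differ.

```python
import torch

def make_fake_video(num_frames: int = 10, size: int = 224) -> torch.Tensor:
    # All-zero frames: the model receives no visual information, so any
    # above-chance accuracy must come from the text (question and/or subtitles).
    return torch.zeros(num_frames, 3, size, size)
```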
Moreover, the results show that larger LLMs lead to better performance:
7B => 40+
20B => 50+
Closed-source => 60+
We have conducted corresponding experiments and find that the subtitles always contain some useful information, which is fundamentally hard to eliminate, especially for longer videos and certain video categories such as News.
We have tried to make the questions as relevant as possible to the visual content, but as mentioned above, the model will inevitably get some information from the subtitles.
We find that video frames + subtitles are always better than frames only or subtitles only, which reflects the multi-modal nature of our dataset: both video frames and subtitles provide useful information, but they need to be integrated to obtain the best results.
In addition, if you sample ten frames of the video, you should only feed the subtitles corresponding to those ten frames, instead of all subtitles. We will give a detailed explanation of the use of subtitles soon.
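A minimal sketch of that frame-aligned subtitle selection, assuming subtitles are available as (start, end, text) spans in seconds and frames are sampled uniformly; the official script may differ.

```python
def sample_frame_timestamps(duration: float, num_frames: int = 10) -> list[float]:
    # Uniformly spaced timestamps (in seconds), one per sampled frame.
    step = duration / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]

def select_subtitles(subtitles: list[tuple[float, float, str]],
                     timestamps: list[float]) -> str:
    # Keep a subtitle line only if some sampled timestamp falls inside its span.
    kept = []
    for start, end, text in subtitles:
        if any(start <= t <= end for t in timestamps):
            kept.append(text)
    return " ".join(kept)
```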
For the question about the dataset, you could send an email to [email protected].