Request for Complete Test Script for Qwen2-Audio on AIR Bench #3
Comments
First, you need to download the evaluation dataset that I made public in the issue. |
Thank you for your reply. Could you please share the complete script used to obtain these metrics? |
I tried writing the metrics code myself and used Assistant 2's answer as the score.
Following Table 2 of the paper, I attempted to merge the 8 categories into 4 categories as follows.
However, I found that the results do not align with those in the paper, mainly due to a significant difference in the Mixed Audio category (5.95 vs. 6.77). The differences in the other categories are more acceptable.
Could you help me check whether there is an issue with the way I merged the categories? Or do you have any suggestions? |
As you did, the final score is the average over the datasets. Thanks for your tip; I have added a simple summary script. The difference in the Qwen2-Audio score you tested (mainly Mixed Audio) comes from the performance degradation caused by converting the model to Hugging Face format. In the official Qwen2-Audio GitHub repository, next to the table from the paper, there is a table showing the scores after the HF conversion. Of course, this is still different from your score: you forgot to swap the positions of Assistant 1 and Assistant 2 (noted in this repository, for fairness); averaging the two runs gives the final result. |
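The swap-and-average step described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: `judge` stands in for whatever function queries the GPT judge, and is assumed to return a `(score_assistant1, score_assistant2)` pair.

```python
def judge_with_swap(judge, question, ref_answer, model_answer):
    """Query the GPT judge twice, swapping the assistant positions,
    to cancel out the judge's position bias, then average the two
    scores given to the model under test.

    `judge(question, assistant1=..., assistant2=...)` is a hypothetical
    callable assumed to return (score_assistant1, score_assistant2).
    """
    # Run 1: reference answer as Assistant 1, model answer as Assistant 2.
    _, model_score_a = judge(question, assistant1=ref_answer,
                             assistant2=model_answer)
    # Run 2: positions swapped.
    model_score_b, _ = judge(question, assistant1=model_answer,
                             assistant2=ref_answer)
    # The final score for this sample is the average of the two runs.
    return (model_score_a + model_score_b) / 2
```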
Thank you for your response. You mentioned that the positions of Assistant 1 and Assistant 2 should be swapped and the average taken as the final result. Are you referring to the GPT judge phase, or to the calculation of the final score?
I used your cal_score.py script, and the results matched my calculations, so I believe the order should be correct. However, in your script, "sound" includes 'sound_generation_QA', and "music" includes 'music_generation_analysis_QA', which seems inconsistent with the paper. Additionally, does "Mixed Audio" refer to the combination of speech_and_music_QA and speech_and_sound_QA, or just speech_and_sound_QA? The latter seems closer to the "Mixed Audio" results you provided. |
In the GPT judge phase, the two model responses should be swapped to eliminate GPT's position bias. At line 24 of my script, "sound" includes 'sound_QA' and 'sound_generation_QA'; Mixed Audio is the average of speech_and_music_QA and speech_and_sound_QA.
For reference, here are my per-category results:

| Category | Sum | Win_Rate | gpt4_avg_score | llm_avg_score |
|---|---|---|---|---|
| speech | 772 | 0.2085 | 8.2280 | 7.2241 |
| sound | 487 | 0.1910 | 8.1766 | 6.7823 |
| music | 490 | 0.2265 | 8.1041 | 6.6857 |
| speech_and_sound | 193 | 0.2021 | 8.4404 | 6.4974 |
| speech_and_music | 196 | 0.1378 | 8.5663 | 5.4286 |
|
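Putting the clarification together, the eight task datasets can be merged into the four reported categories roughly as below. This is a minimal sketch: `'speech_QA'` and `'music_QA'` are assumed key names for illustration; the authoritative keys and grouping live in the repository's `cal_score.py`.

```python
# Mapping from reported AIR-Bench category to underlying chat datasets,
# following the maintainer's clarification in this thread.
# NOTE: 'speech_QA' and 'music_QA' are hypothetical key names; check
# cal_score.py for the exact dataset identifiers.
CATEGORY_MAP = {
    "Speech": ["speech_QA"],
    "Sound": ["sound_QA", "sound_generation_QA"],
    "Music": ["music_QA", "music_generation_analysis_QA"],
    "Mixed Audio": ["speech_and_music_QA", "speech_and_sound_QA"],
}

def merged_scores(per_dataset_scores):
    """Average the per-dataset scores within each reported category.

    `per_dataset_scores` maps dataset name -> average judge score.
    """
    return {
        category: sum(per_dataset_scores[d] for d in datasets) / len(datasets)
        for category, datasets in CATEGORY_MAP.items()
    }
```

For example, if `speech_and_music_QA` scored 5.0 and `speech_and_sound_QA` scored 6.0, the reported Mixed Audio score would be 5.5.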
Hi,
I'm currently trying to replicate the performance of Qwen2-Audio on the AIR Bench. However, I noticed that the repository at AIR-Bench doesn't provide the complete test script. It only includes the inference script and the GPT-4 evaluation generation script.
Could you please clarify how the scores for the Speech, Sound, Music, and Mixed Audio metrics are obtained? It would be very helpful if you could provide the complete test script for these metrics.
Thank you for your assistance!