Official repository for "Boosting Audio Visual Question Answering via Key Semantic-Aware Cues" in ACM MM 2024.
ArXiv: https://arxiv.org/abs/2407.20693
Authors: Guangyao Li, Henghui Du, Di Hu
Python 3.6+
PyTorch 1.6.0
tensorboardX
ffmpeg
numpy
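Before running anything, it can help to confirm that the requirements above are visible from the current environment. The following is a minimal sketch, not part of the repository: the function name `check_requirements` is hypothetical, and the import names (`torch` for PyTorch, and so on) are assumptions. It only checks that each item is present, not that the installed PyTorch version is 1.6.0.

```python
import importlib.util
import shutil
import sys

# Assumed mapping from the listed requirement to its Python import name.
REQUIRED_MODULES = {
    "pytorch": "torch",
    "tensorboardX": "tensorboardX",
    "numpy": "numpy",
}


def check_requirements():
    """Return a dict mapping each requirement to True if it appears available."""
    status = {"python>=3.6": sys.version_info >= (3, 6)}
    for name, module in REQUIRED_MODULES.items():
        # find_spec returns None when a top-level module cannot be located.
        status[name] = importlib.util.find_spec(module) is not None
    # ffmpeg is a command-line tool, so look for it on PATH instead.
    status["ffmpeg"] = shutil.which("ffmpeg") is not None
    return status


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Anything reported as MISSING should be installed before moving on to feature extraction.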
-
Clone this repo
git clone https://github.com/GeWu-Lab/TSPM.git
-
Download data
MUSIC-AVQA: https://gewu-lab.github.io/MUSIC-AVQA/
-
Feature extraction
cd feat_script/extract_clip_feat
python extract_qst_ViT-L14@336px.py
python extract_qaPrompt_ViT-L14@336px.py
python extract_token-level_feat.py
python extract_frames_ViT-L14@336px.py
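The extraction scripts can also be launched one after another from a small driver. This is a sketch under stated assumptions: `build_commands` and `run_all` are hypothetical helpers, not repository code. It assumes the scripts take no command-line arguments, as in the commands above, and that running them in the listed order is acceptable.

```python
import subprocess
import sys
from pathlib import Path

# Directory and script names are copied from the README; the order is the listed order.
EXTRACT_DIR = Path("feat_script/extract_clip_feat")
SCRIPTS = [
    "extract_qst_ViT-L14@336px.py",
    "extract_qaPrompt_ViT-L14@336px.py",
    "extract_token-level_feat.py",
    "extract_frames_ViT-L14@336px.py",
]


def build_commands(python=sys.executable):
    """Return one argument list per extraction script."""
    return [[python, script] for script in SCRIPTS]


def run_all(workdir=EXTRACT_DIR):
    """Run each script from inside workdir, stopping at the first failure."""
    for cmd in build_commands():
        print("running:", " ".join(cmd))
        # check=True raises CalledProcessError if a script exits with a non-zero status.
        subprocess.run(cmd, cwd=str(workdir), check=True)
```

Calling `run_all()` from the repository root runs the four scripts inside `feat_script/extract_clip_feat`. Passing argument lists rather than a shell string avoids any quoting issues with the `@` in the file names.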
-
Training
python -u main_train.py --Temp_Selection --top_k 10 \
    --Spatio_Perception \
    --batch-size 64 --epochs 30 --lr 1e-4 \
    --num_workers 12 --gpu 0,1 \
    --checkpoint TSPM \
    --model_save_dir models
-
Testing
python -u main_test.py --Temp_Selection --top_k 10 \
    --Spatio_Perception \
    --batch-size 1 --gpu 1 \
    --checkpoint TSPM \
    --model_save_dir models \
    --result_dir results
If you find this work useful, please consider citing it.
The citation entry is coming soon.
This research was supported by Public Computing Cloud, Renmin University of China.