How to Use Temporal Pooling Layer? #5
I assume you meant: whisper-at/src/whisper_at_train/models.py, lines 113 to 167 (commit fc32775).

What is your accuracy on ESC-50? We did test on Whisper. Which temporal pooling layer do you mean?

-Yuan
Thank you for your quick reply. I use the "last_mlp" method in whisper-at/src/whisper_at_train/models.py, and the test accuracy is 82% on the original Whisper large, not large-v1 or large-v2. I will try large-v1; I appreciate the reminder.

For the temporal pooling in Figure 4 of your paper: the last Whisper encoder layer's output feature has shape [bs, 500, 1280], and after temporal pooling it becomes [bs, 25, 1280]. Did you use torch.nn.AvgPool2d? I did not use your released ESC-50 features, so I don't know whether I am using AvgPool2d correctly.

-YunFei
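(For reference, a minimal sketch of the pooling described above, assuming the 500 → 25 reduction is a plain 20x average pool over the time axis; the kernel size of 20 is inferred from the shapes, not taken from the paper.)

```python
import torch
import torch.nn as nn

# Hypothetical shape check: pool the time axis of a [bs, 500, 1280] feature
# down to [bs, 25, 1280]. The 20x factor is an assumption derived from 500 / 25.
bs = 2
feat = torch.randn(bs, 500, 1280)                 # last encoder layer output, per the comment above

time_pooling = nn.AvgPool2d(kernel_size=(20, 1))  # pool over time only, keep the 1280 feature dims
pooled = time_pooling(feat)                       # AvgPool2d treats dim 0 as channels for 3D input

print(pooled.shape)                               # torch.Size([2, 25, 1280])
```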
Hi there, sorry, I have a hard deadline this week, so I need to follow up on this later. We released the code for the actual feature extraction: whisper-at/src/noise_robust_asr/intermediate_feat_extract/whisper_feat_extracrt/whisper/model.py, lines 194 to 201 (commit fc32775).
Note: please change the hard-coded 1000 to 500 for ESC-50 (lines 45 to 46, commit fc32775).
The actual pooling layer uses 10x pooling for ESC-50 and 20x for AudioSet; since ESC-50 clips are 5 s and AudioSet clips are 10 s, the output is 25 frames in time for both. When you call

-Yuan
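(A hedged reading of the 10x / 20x remark above: the pooling factor is just the encoder frame count divided by the 25 target frames. The frame counts below are my inference from the clip lengths and Whisper's ~50 encoder frames per second, not values from the released code.)

```python
import torch
import torch.nn as nn

# Assumed encoder frame counts: ESC-50 (5 s) -> ~250 frames, AudioSet (10 s) -> ~500 frames.
TARGET_FRAMES = 25

def make_time_pooling(n_frames: int) -> nn.AvgPool2d:
    """Build an average-pooling layer that reduces n_frames to TARGET_FRAMES."""
    factor = n_frames // TARGET_FRAMES       # 10 for ESC-50, 20 for AudioSet
    return nn.AvgPool2d(kernel_size=(factor, 1))

esc_pool = make_time_pooling(250)            # 10x pooling
as_pool = make_time_pooling(500)             # 20x pooling

print(esc_pool(torch.randn(2, 250, 1280)).shape)  # torch.Size([2, 25, 1280])
print(as_pool(torch.randn(2, 500, 1280)).shape)   # torch.Size([2, 25, 1280])
```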
I use time_pooling = nn.AvgPool2d((60, 1)) as the temporal pooling layer for the Whisper large pre-trained model (encoder output size [batch, 1500, 1280]), but with the 'last_mlp' and 'last_tr' methods I cannot reach the accuracy reported in the paper in my tests on the ESC-50 dataset. So I would like to ask whether my settings are correct.
The detailed code is as follows:

```python
def forward(self, x):
    outputs = self.model.encoder(x)        # large: [bs, 1500, 1280]
    outputs = self.time_pooling(outputs)   # AvgPool2d((60, 1)) -> large: [bs, 25, 1280]
    outputs = self.time_tr(outputs)        # temporal transformer, shape unchanged
    outputs = torch.mean(outputs, dim=1)   # mean over time -> large: [bs, 1280]
    logits = self.mlp_layer(outputs)       # classification head -> large: [bs, 50]
    return logits
```
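(A standalone shape check of the forward pass above, with nn.Identity() as a stand-in for the temporal transformer and a plain Linear as the MLP head; the module names mirror the snippet and none of this is taken from the repo.)

```python
import torch
import torch.nn as nn

# Verify that AvgPool2d((60, 1)) maps the padded 30 s Whisper-large encoder
# output [bs, 1500, 1280] to [bs, 25, 1280], matching the comments above.
time_pooling = nn.AvgPool2d(kernel_size=(60, 1))
time_tr = nn.Identity()                      # placeholder for the temporal transformer
mlp_layer = nn.Linear(1280, 50)              # placeholder classification head

encoder_out = torch.randn(4, 1500, 1280)     # dummy encoder output
pooled = time_pooling(encoder_out)           # [4, 25, 1280]
pooled = time_tr(pooled)                     # [4, 25, 1280]
clip = torch.mean(pooled, dim=1)             # [4, 1280]
logits = mlp_layer(clip)                     # [4, 50]
print(pooled.shape, clip.shape, logits.shape)
```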
Another question: what does "Linear Projection" mean? Is it nn.Linear? I didn't find this part in the models.py file in your released code.
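(If "Linear Projection" does simply mean a learned affine map, it would be a single nn.Linear; this is a guess at what the figure label refers to, not a confirmation of the paper's design, and the output dimension below is only an example.)

```python
import torch
import torch.nn as nn

# A linear projection in the nn.Linear sense: one affine map from the 1280-dim
# Whisper feature space to another dimension (50 here, e.g. ESC-50 classes).
proj = nn.Linear(in_features=1280, out_features=50)

x = torch.randn(4, 25, 1280)    # pooled features
y = proj(x)                     # applied to the last dim -> [4, 25, 50]
print(y.shape)
```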